About two decades ago,
I spent a year "polishing" the documentation
of A/UX, an early effort
by Apple at a Unix-based version
of Mac OS.
Using mechanized tools,
I was able to find numerous problems in the "man" pages.
Using similar tools,
I have been able to detect the same sorts of problems
in recent Mac OS X man pages.
Although it's tempting to say that some things never change,
Mac OS X is clearly quite different from A/UX.
For one thing,
Mac OS X installs about 100 times as many files as A/UX did.
Fortunately, hardware performance has gone up even faster.
Today's low-end Macs have about 1000 times the speed and storage
capacity of a Mac II.
Over the same period,
a wealth of interesting and useful tools has appeared
as Open Source offerings.
Some of these tools
(e.g., documentation generators, operating system monitors)
are clearly relevant to providing detailed information
on operating systems.
Unfortunately, these tools are not being used
on most operating systems.
As William Gibson put it:
"The future has arrived; it's just not evenly distributed."
This essay suggests ways that these tools could be used
to improve the state of Unix documentation.
It begins by covering some historic and existing work,
then goes on to speculate about ways to apply MBD (model-based documentation)
to the static and dynamic documentation of Unix systems.
Executive Summary:
Mechanized generation of software documentation is not a new idea.
Donald Knuth's literate programming dates back to the late 1970's.
It was inspired, in turn,
by Pierre-Arnoul de Marneffe's
work in "Holon Programming".
Dozens of documentation generators have been written,
supporting a variety of
programming languages,
RDBMSs (Relational Database Management Systems), etc.
However, I have not been able to find much material
(aside from my own :-)
on the mechanized generation of documentation
for entire operating systems.
After working on A/UX,
I began thinking about ways to compare and contrast
the documentation for various Unix systems.
To promote these ideas,
I eventually started the Meta Project,
creating a wiki and a "proof of concept" demo of its basic ideas.
The Meta Demo (written in 2001)
demonstrated that it is possible
to capture and document a large number of static relationships
(e.g., between files, man pages, and programs).
Given an entity (e.g., "/usr/bin/vi")
or keyword (e.g., "signal"),
the demo can find dozens of related items.
The demo code uses a combination of mechanically-harvested facts,
manually-entered hints, and hard-coded rules.
Specifically, it does a constrained depth-first search of a relationship graph,
starting at a specified node.
As a debugging aid, it can provide an
explanation of its reasons for displaying a given item.
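The search described above might be sketched as follows. This is not the demo's code (which is written in Perl); it is a minimal Python illustration of a depth-limited, depth-first walk that records why each item was found. The graph contents, the relation labels, and the names related_items and explain are all assumptions made for the example.

```python
# Hypothetical relationship graph: node -> list of (relation, target) pairs.
# The node names and relation labels are illustrative, not the demo's data.
GRAPH = {
    "/usr/bin/vi": [("documented_by", "vi(1)"), ("reads", "~/.exrc")],
    "vi(1)": [("see_also", "ex(1)")],
    "ex(1)": [("documents", "/usr/bin/ex")],
}


def related_items(graph, start, max_depth=2):
    """Depth-limited DFS from start.

    Returns {node: path}, where path is the list of
    (source, relation, target) steps that led to the node.
    """
    found = {}

    def visit(node, path):
        if len(path) >= max_depth:      # the "constraint": stop at max_depth
            return
        for relation, target in graph.get(node, []):
            new_path = path + [(node, relation, target)]
            if target == start:
                continue
            # Keep only the shortest known path, so a long detour cannot
            # hide items that are reachable within the depth limit.
            if target in found and len(found[target]) <= len(new_path):
                continue
            found[target] = new_path
            visit(target, new_path)

    visit(start, [])
    return found


def explain(found, node):
    """Render the chain of relationships that justified showing node."""
    return " -> ".join(f"{s} {r} {t}" for s, r, t in found[node])


if __name__ == "__main__":
    hits = related_items(GRAPH, "/usr/bin/vi")
    for item in sorted(hits):
        print(item, "because", explain(hits, item))
```

Keeping the path alongside each hit is what makes the "explanation" nearly free: the reason for showing an item is simply the chain of relationships that reached it.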
Despite its simple design,
the demo functions quite well.
It responds quickly,
finding most of the items that an expert user would
(and some that s/he might not),
with very few "false positives".
Although the demo is a satisfactory "proof of concept",
it has a number of limitations
which should be addressed in any "real" support tool:
The demo doesn't have any direct way for users to contribute
(e.g., by posting comments or questions).
It should be easy for users to interact
with developers, other users, and the system.
In a distributed documentation system,
this interaction could cross site boundaries.
Indeed, the documentation systems could collaborate,
generating "baseline" information to aid in problem analysis, etc.
The demo code only harvests static information
(e.g., file content and metadata).
By tracking dynamic system activity
(e.g., with DTrace),
a production version could detect relationships
that must be hand-entered for the demo.
A production system could also track and display
information on local changes, updates, etc.
Given the changing and eclectic nature of installed systems,
this could be quite valuable.
The demo only deals with files, ignoring processes, users, etc.
Also, differences between file types
(e.g., commands, libraries, man pages)
are handled in an ad hoc manner.
An ontology-driven
knowledge base framework such as
Protégé would be more powerful and less brittle.
The demo does not peruse source code
to find relationships between data structures, files, and functions.
Nor does it examine binary libraries, bug databases, control files,
or other plausible sources of information.
Input filters such as Doxygen could greatly extend the reach of a production version.
The demo has multiple instances,
each of which corresponds to a single OS version.
A production system should allow the user
to compare and contrast versions, etc.
Although no production system has been built,
we can speculate on the capabilities and design
that such a system might have.
Not surprisingly, this will be guided
by the limitations noted in the demo system.
Various sorts of input filters are possible.
Documentation generators such as
Doxygen and
Synopsis can be used to analyze source code and extract comments.
SNMP (Simple Network Management Protocol)
can be used to track network activity.
Command-line tools can collect information
on static and dynamic aspects of the system.
DTrace can be used to log "important" system calls
(e.g., exec, fork, link, open).
This log could be used to track run-time dependencies
between files, processes, etc.
Currently, DTrace is only available
on Solaris,
but it is being ported to FreeBSD and thus might appear in Mac OS X and other BSD-based systems.
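As a rough illustration of how such a log could be turned into run-time dependencies, here is a small Python sketch. The log format (time, process ID, program name, system call, target, separated by whitespace) is purely an assumption for this example; it is not DTrace's native output, and a real tracing script would have to be written to emit something like it. The function names are likewise hypothetical.

```python
from collections import Counter

# System calls treated as "important", following the list in the text.
IMPORTANT_CALLS = ("exec", "fork", "link", "open")


def parse_line(line):
    """Parse one line of the ASSUMED format: time pid program call target."""
    fields = line.split(None, 4)        # the target may contain spaces
    if len(fields) != 5:
        return None
    time, pid, program, call, target = fields
    if not pid.isdigit():
        return None
    return {"time": time, "pid": int(pid), "program": program,
            "call": call, "target": target}


def dependencies(lines, calls=IMPORTANT_CALLS):
    """Count (program, call, target) edges seen in the log."""
    edges = Counter()
    for line in lines:
        event = parse_line(line.strip())
        if event and event["call"] in calls:
            edges[(event["program"], event["call"], event["target"])] += 1
    return edges


if __name__ == "__main__":
    sample = [
        "01:23:45 12345 vi open /home/rdm/.exrc",
        "01:23:46 12345 vi open /etc/termcap",
        "01:24:10 12399 vi open /home/rdm/.exrc",
        "01:24:11 12399 vi stat /tmp",          # ignored: not in the list
    ]
    for edge, count in dependencies(sample).items():
        print(count, *edge)
```

Edges that recur across many processes are good candidates for promotion into the static model; one-off edges are more likely to describe local activity.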
Our goal is to support documentation and system administration,
rather than formal simulation,
so the modeling can be at a fairly coarse level:
files, groups, libraries, processes, programs, threads, and users.
Even so, a real (if tiny)
ontology will be needed.
Most of the entity classes will relate to files,
but we'll also need to cover processes and threads,
signals and sockets, users and groups, etc.
A few dozen types should get us started,
but this number should be expected to grow
(e.g., to several dozen types) over time.
The demo uses a Perl "tied hash"
(via Berkeley DB)
for its persistent storage.
Although a single hash can be used
to store assorted kinds of information,
this is not robust.
An RDBMS such as PostgreSQL can provide a wide range of features,
as well as the very desirable ACID (Atomicity, Consistency, Isolation, and Durability) characteristics.
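To make the contrast with a single hash concrete, here is a minimal sketch of a relational layout for entities and relationships. It uses Python's built-in sqlite3 module as a stand-in for PostgreSQL, simply so that the example is self-contained; the table names, column names, and helper functions are assumptions, not a proposed schema.

```python
import sqlite3

# Illustrative schema: one table of typed entities, one of relationships.
SCHEMA = """
CREATE TABLE entity (
    id   INTEGER PRIMARY KEY,
    kind TEXT NOT NULL,
    name TEXT NOT NULL,
    UNIQUE (kind, name)
);
CREATE TABLE relation (
    subject_id INTEGER NOT NULL REFERENCES entity(id),
    predicate  TEXT    NOT NULL,
    object_id  INTEGER NOT NULL REFERENCES entity(id),
    UNIQUE (subject_id, predicate, object_id)
);
"""


def open_store(path=":memory:"):
    conn = sqlite3.connect(path)
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(SCHEMA)
    return conn


def entity_id(conn, kind, name):
    """Return the id for (kind, name), creating the row if needed."""
    conn.execute("INSERT OR IGNORE INTO entity (kind, name) VALUES (?, ?)",
                 (kind, name))
    row = conn.execute("SELECT id FROM entity WHERE kind = ? AND name = ?",
                       (kind, name)).fetchone()
    return row[0]


def add_fact(conn, subject, predicate, obj):
    """Store one relationship; subject and obj are (kind, name) pairs."""
    with conn:  # one transaction: commit on success, roll back on error
        s = entity_id(conn, *subject)
        o = entity_id(conn, *obj)
        conn.execute("INSERT OR IGNORE INTO relation VALUES (?, ?, ?)",
                     (s, predicate, o))


def related(conn, kind, name):
    """List (predicate, kind, name) for everything the entity points at."""
    return conn.execute(
        """SELECT r.predicate, o.kind, o.name
             FROM relation r
             JOIN entity s ON s.id = r.subject_id
             JOIN entity o ON o.id = r.object_id
            WHERE s.kind = ? AND s.name = ?
            ORDER BY r.predicate, o.name""",
        (kind, name)).fetchall()
```

The uniqueness and foreign-key constraints are the point of the exercise: the database, rather than ad hoc code, rejects duplicate or dangling facts.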
If dynamic information is being captured,
data mining tools such as
Weka may become worthwhile.
Although this analysis would be performed by developers,
the results (e.g., improved models) could benefit all users.
As discussed in Mechanically-augmented wikis,
wikis such as MediaWiki have many features that would be useful in documentation systems.
For example, each wiki page has an associated "discussion" page,
providing a place for comments and suggestions.
The wiki can also notify users of changes,
allowing discussions to take place.
Although it isn't obvious how to do this,
the presentation mechanism should allow users
to make decisions about the "view"
(e.g., details, navigation options) they see.
More generally, presentation is a ripe area for experimentation.
Three fairly distinct levels of modeling
will be involved in this system.
Although each level deals with entities and relationships,
the levels' content, methods, and purposes vary substantially:
Program, Process, and File are abstract classes of entities.
If we say that "a Process runs a Program"
or that "a Process can read or write a File",
we define abstract classes of relationships.
These classes will form the project's high-level ontology.
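A tiny sketch may help to show what "abstract classes of relationships" means in practice. The following Python fragment encodes only the three entity classes and the relationships named above, then checks whether a proposed statement fits. The representation and the names RelationClass and check are assumptions for illustration; a real ontology tool would offer far more.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RelationClass:
    subject: str      # entity class on the left-hand side
    predicate: str    # e.g., "runs", "reads", "writes"
    object: str       # entity class on the right-hand side


# Only the classes and relationships mentioned in the text.
ENTITY_CLASSES = frozenset({"Program", "Process", "File"})
RELATION_CLASSES = frozenset({
    RelationClass("Process", "runs", "Program"),
    RelationClass("Process", "reads", "File"),
    RelationClass("Process", "writes", "File"),
})


def check(subject_class, predicate, object_class):
    """Return a list of problems; an empty list means the statement fits."""
    problems = []
    for cls in (subject_class, object_class):
        if cls not in ENTITY_CLASSES:
            problems.append(f"unknown entity class: {cls}")
    candidate = RelationClass(subject_class, predicate, object_class)
    if not problems and candidate not in RELATION_CLASSES:
        problems.append(
            f"no relationship class: {subject_class} {predicate} {object_class}")
    return problems


if __name__ == "__main__":
    print(check("Process", "reads", "File"))    # fits the ontology
    print(check("File", "runs", "Program"))     # known classes, bad relation
    print(check("Socket", "reads", "File"))     # class not yet modeled
```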
I am currently working on a
"first cut" at this.
Feel free to take a look, make comments and/or suggestions, etc.
Based on the high-level ontology,
we can say things about concrete classes, such as:
"The (Process running the) Program vi
will attempt to read the File ~/.exrc at startup."
This statement is consistent with the high-level ontology,
but it says things about particular classes
of entities and relationships.
This level is very relevant to (and visible in)
the documentation system.
If the user goes to the page for vi,
this information might show up as a link and/or
part of an image-mapped context diagram.
Based on the classes defined above,
we can look for instances that follow (or break!)
the defined relationships.
This can be used to verify or extend the model
and to support "dynamic" forms of documentation.
Running DTrace, we might determine that Process 12345,
running the vi Program,
read ~rdm/.exrc at 01:23:45 this morning.
This may not sound very interesting,
but it can be used to add or verify a relationship class.
In addition, this type of information can be used
to generate dynamic documentation about a specific system.
For example, it might be used to answer questions such as
"What Program created this File?" or
"Which Programs have the most File activity?".
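The two questions above can be answered with very little code once instance-level events have been captured. The sketch below assumes each event is a record with program, action, file, and time fields, and that times are zero-padded strings from a single day (so that they sort correctly as text). Both the record layout and the function names are hypothetical.

```python
from collections import Counter


def creator_of(events, path):
    """Answer "What Program created this File?" (earliest create wins)."""
    creates = [e for e in events
               if e["action"] == "create" and e["file"] == path]
    if not creates:
        return None
    return min(creates, key=lambda e: e["time"])["program"]


def busiest_programs(events, top=3):
    """Answer "Which Programs have the most File activity?"."""
    counts = Counter(e["program"] for e in events)
    return counts.most_common(top)


if __name__ == "__main__":
    # Illustrative events only; none of this is measured data.
    events = [
        {"time": "01:23:45", "program": "vi", "action": "read",
         "file": "/home/rdm/.exrc"},
        {"time": "01:25:02", "program": "vi", "action": "create",
         "file": "/home/rdm/notes"},
        {"time": "01:25:30", "program": "vi", "action": "write",
         "file": "/home/rdm/notes"},
        {"time": "01:26:00", "program": "cp", "action": "read",
         "file": "/home/rdm/notes"},
    ]
    print(creator_of(events, "/home/rdm/notes"))
    print(busiest_programs(events, top=1))
```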
In summary, the model drives the system,
but the system helps to refine and extend the model.
This "virtuous cycle" should (it says here :-)
allow us to start with a simple documentation system
and grow it as our interests and resources allow.
We have the necessary knowledge and tools;
we just need to apply them!
Background
The Meta Demo
/usr/bin/vi")
or keyword (e.g., "signal"),
the demo can find dozens of related items.
Limitations
Speculation
Extraction
Analysis
Presentation
Levels of Modeling