In the meantime, although I am able to perform in-person demos of the site (contact me for details), I am unable to give out the password. I apologize for (and sympathize with) any difficulty this may cause.
In a recent contract,
I was asked to create a comprehensive
web site,
providing both overview and detailed documentation
for a scientific software development project:
The FSW (Flight Software) group is creating software
to operate the LAT (Large Area Telescope) instrument
in the GLAST (Gamma-ray Large Area Space Telescope) satellite,
process the data it collects,
and output the resulting scientific information.
The "production" FSW code is being written in C and assembler.
It will be run on multiple, radiation-hardened PowerPC processors
under the VxWorks operating system.
However, development and test code must run in assorted environments,
including Intel-based Linux, SPARC-based Solaris, etc.
Aside from a few dozen hand-written web pages (e.g., tutorials),
the site's content is entirely computer-generated.
About half of the pages are generated by
Doxygen, a well-known
documentation generator;
the rest are generated by custom Perl scripts.
The input data comes from a variety of sources,
including databases, XML files, (electronic and printed) documents,
web pages, and assorted file formats
(e.g., configuration files, object libraries).
Information can be requested of the development engineers,
but there is no guarantee that they will have the time to reply.
In short, a fairly typical mechanized documentation scenario.
A number of goals influenced the design:
These goals are typical of many mechanized documentation projects.
Equally typical, but worth noting, are some omissions:
As discussed in the "Data Flow" section of the
concepts page,
MBD techniques can be implemented by means of a
DAG (Directed Acyclic Graph) of processing routines.
Each routine accepts one or more input data sets and, in traditional
batch processing fashion,
generates one or more output data sets.
This produces a very modular system,
because each routine interacts only with its input and output data sets.
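As a rough illustration of this idea (a sketch in Python rather than the Perl used by the suite, with made-up routine names), a DAG of routines can be ordered so that each routine runs only after the routines whose output it reads. The standard graphlib module (Python 3.9 or later) also reports cycles, which a data flow of this kind must not contain:

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical routines, each mapped to the routines whose output it reads.
DEPENDS_ON = {
    "extract": set(),
    "index": {"extract"},
    "render": {"extract", "index"},
}


def run_order(depends_on):
    """Return routine names in an order that respects the data flow."""
    try:
        return list(TopologicalSorter(depends_on).static_order())
    except CycleError as err:
        # args[1] holds the list of nodes that form the cycle.
        raise ValueError(f"data flow contains a cycle: {err.args[1]}") from err


order = run_order(DEPENDS_ON)
```

Because each routine touches only its declared inputs and outputs, the ordering is the only coupling between them.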
This implementation consists of several dozen smallish Perl scripts,
supplemented by assorted command-line utilities
(e.g., dot, make, troff).
YAML files are used
for both the hand-edited and intermediate data sets.
Because YAML is a simple and powerful data serialization format,
each file can be a nicely-formatted textual representation
of a loadable data structure.
After loading the input structure(s),
some scripts create "helper" data structures
(e.g., additional indexes into the data).
All text files, whether hand-edited or machine-generated,
are thoroughly commented.
Each generated file receives an informative header,
indicating the file's format, origins, purpose, etc.
Section comments provide context and ease navigation.
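A header of this kind might be produced by a small helper. The sketch below is in Python (the suite itself is Perl), and the field names and layout are assumptions chosen for illustration, not the project's actual header format:

```python
from datetime import date


def file_header(path, fmt, source, purpose, comment="#"):
    """Build a comment header describing a generated file (hypothetical layout)."""
    lines = [
        f"{path} - {purpose}",
        "",
        f"Format:  {fmt}",
        f"Origin:  generated by {source} on {date.today().isoformat()}",
        "Note:    machine-generated; hand edits will be overwritten",
    ]
    # Prefix every line with the comment marker; blank lines become a bare marker.
    return "\n".join(f"{comment} {line}".rstrip() for line in lines) + "\n"
```

Since a "#" prefix is also a YAML comment, the same helper could serve both generated YAML files and generated makefiles.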
For a variety of (mostly pragmatic) reasons,
the project uses Open Source tools whenever possible.
The ability to modify code is important,
as are cost factors
and the ability to share applications with other institutions.
In any case, many Open Source standbys
(e.g., CMT, CVS, Doxygen, GCC, Graphviz, Groff, ImageMagick,
Linux, MySQL, Perl, Python, Swish-e) are in use.
That said, some proprietary software is also being used.
This includes database systems (e.g., Oracle),
operating systems (e.g., Solaris, VxWorks, Windows),
and applications (e.g., Adobe's Acrobat and Frame;
Microsoft's Excel, Outlook, and Word).
The decision to use proprietary software
is generally based on either familiarity
or the lack of an acceptable Open Source alternative.
Finally, quite a bit of the software infrastructure
has been developed from scratch or adapted from Open Source tools.
Aside from my own contributions,
there are a couple of "test executives",
an interactive packet specification application, etc.
The group also uses a version of CMT
which has been extensively modified to handle local requirements.
For more information, see this
introduction.
The documentation suite consists
of several dozen Perl scripts (~20K lines) and hand-edited
YAML reference files (~30K lines).
The suite is run early each morning,
by means of cron and make.
Several "tricks" are employed to ease maintenance,
increase reliability, and optimize performance.
Large, hand-edited makefiles are tedious and error-prone to edit.
They are also difficult to document in an automated manner.
To resolve this problem, I went to a two-stage process.
A small, hand-edited makefile causes and controls the generation
of the "real", machine-generated makefile.
Each hand-edited file (whether data or script)
contains a set of YAML
"#DDF"
(Documentation Data Flow) declarations,
encoded as specially-formatted comments.
These are harvested at the beginning of each run
and used to generate both the production makefile
and a set of data flow diagrams.
Some scripts (e.g., extraction routines) are always run,
because there is no way to know whether their input has changed.
Consequently, the content of their output files may be unchanged.
Unfortunately, make just looks at file time stamps
(not file content),
so this could cause unnecessary processing.
Although a script may generate thousands of files,
there is no reason to bother make with this level of detail.
The DDF declarations, collectively,
provide an abstract model of the system's data flow.
Specifically, programs and (sets of) data files
are represented as nodes in a DAG.
Connections between nodes (e.g., read or write access, "include" usage)
are represented as edges.
Each hand-edited file "knows" its relative path, description,
label (for diagrams), and type.
Scripts, in addition, know which files they use or create.
Generated files are described by their originating scripts.
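To make the model concrete, here is a minimal sketch (in Python rather than the suite's Perl) of how such declarations could be turned into makefile rules once they have been harvested. The dictionary layout, paths, and script names are hypothetical stand-ins; the real DDF syntax is not shown here:

```python
# Hypothetical harvested model: each script lists the files it reads and writes.
SCRIPTS = {
    "bin/extract": {"reads": ["in/raw.xml"], "writes": ["tmp/raw.yml"]},
    "bin/render": {"reads": ["tmp/raw.yml"], "writes": ["out/index.html"]},
}


def make_rules(scripts):
    """Emit one make rule per script: its output depends on the script and its inputs."""
    rules = []
    for script, io in sorted(scripts.items()):
        targets = " ".join(io["writes"])
        prereqs = " ".join([script] + io["reads"])
        # A make recipe line must begin with a tab character.
        rules.append(f"{targets}: {prereqs}\n\t{script}\n")
    return "\n".join(rules)


makefile_text = make_rules(SCRIPTS)
```

Listing the script itself as a prerequisite means that editing a script also triggers regeneration of its output. The same model could be walked to emit nodes and edges for a data flow diagram.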
As odd as all this may sound,
this is only a slight variation on a traditional Unix-style
batch processing system.
The use of cron and make is commonplace,
as is the use of textual files for data interchange.
One interesting aspect of this implementation
is that data structures are "first-class citizens" of the design process.
Given that
OOP (Object-Oriented Programming) techniques
are based on hiding data structures,
this may seem odd.
However, this approach appears to provide a great deal of modularity,
which is one of the major goals of the OOP approach.
It would be fairly trivial to convert this design
into an event-based system.
Given that the scripts are written in Perl,
I would probably turn to
POE (Perl Object Environment),
which supports a very flexible approach to event-based programming.
C++, Python, and Ruby have roughly equivalent systems,
known respectively as
ACE (Adaptive Communication Environment),
Twisted, and
dRuby (Distributed Ruby).
Although an event-based approach wouldn't need a generated makefile,
it would still be necessary to have a setup script,
in order to "program" the event distribution
and check for cycles in the data flow.
It would also be a good idea to use atomic file writes
(e.g., write a temporary file, then rename it to the "final" name)
to eliminate incomplete data transmissions.
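The write-then-rename technique can be sketched as follows (in Python, not the suite's Perl; the function name is an assumption). The temporary file is created in the destination directory, because a rename is only atomic within a single file system:

```python
import os
import tempfile


def atomic_write(path, text):
    """Write text to path so that readers never observe a partial file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            handle.write(text)
        os.replace(tmp_path, path)  # atomic when both paths share a file system
    except BaseException:
        # Clean up the temporary file if anything went wrong before the rename.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```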
The Unix community is rife with
"little languages"
such as dot, eqn, grap, grep, lex, pic, tbl, and yacc.
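Aside from the DDF entries (described above),
I created a language for generating data-flow animation sequences,
a couple of "mini-templating" systems, etc.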
Most of these "languages", to be sure,
consist of simple, special-purpose macro expansions.
For example, I make frequent use of brace expansion (à la the shell)
to handle expressions such as "e_cat/{cat,html}.yml".
In the case of the data flow animation sequences,
the "base" data flow diagrams are defined
by hand-coded dot files.
Even with the dense encoding that its little language affords,
the animation specification is several hundred lines of intricate YAML.
Without this compression, it would be ridiculously large,
completely impractical to edit, and far more subject to error.
In short, I believe that YAML-based little languages
are a powerful addition to the MBD developer's repertoire.
Design Goals
Implementation
The supporting command-line utilities include dot, make, and troff.
Most of the scripts are under 500 lines in length,
but a couple exceed 1000 lines.
An ancillary library of perhaps 1500 lines
supplies all of the "generic" code.
Few of the routines, in any case, are difficult to comprehend.
Base Technologies
Data and Control Flow
Run by means of cron and make,
the suite produces (as needed) tens of thousands of files,
in a variety of formats (e.g., dot, grap, troff).
file_update()
make just looks at file time stamps
(not file content),
so this could cause unnecessary processing.
The file_update() function is a simple workaround.
It is called with a file name and the new content.
If the new content differs from the existing content,
the file is updated.
Otherwise, the function simply returns.
This ensures that an updated time stamp always reflects changed content.
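The behavior described above can be sketched in a few lines. This version is in Python rather than the suite's Perl, and the boolean return value is an addition for illustration (the description says only that the function returns):

```python
def file_update(path, new_content):
    """Rewrite path only if new_content differs; return True if it was written."""
    try:
        with open(path, "r", encoding="utf-8") as handle:
            if handle.read() == new_content:
                return False  # unchanged: leave the time stamp alone
    except FileNotFoundError:
        pass  # no existing file, so fall through and create it
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(new_content)
    return True
```

A script that always runs can call this for every output file; only files whose content actually changed will look newer to make.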
There is no reason to bother make
with the details of every generated file.
So, when a script finishes successfully,
it may generate a "flag file" to indicate the fact.
The time stamp on this file is then used as a proxy
for the time stamps of the "real" output files.
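A minimal sketch of the flag-file idea, in Python rather than the suite's Perl. The function name and the callable argument are assumptions; the point is only that the flag is touched after the work succeeds, never before:

```python
from pathlib import Path


def run_step(generate, flag_path):
    """Run generate(); touch flag_path only if it completes without raising."""
    generate()  # may raise, in which case no flag is written
    Path(flag_path).touch()  # refreshes the time stamp that make will compare
```

A makefile rule can then name the flag file as its target, so that make tracks one time stamp per script instead of one per generated file.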
cron and make are commonplace,
as is the use of textual files for data interchange.
The only unusual aspects, really,
lie in the makefile generation technique
and the use of YAML as an encoding format for data structures.
Little Languages
Familiar "little languages" include dot, eqn, grap,
grep, lex, pic,
tbl, and yacc.
Although limited in scope,
they perform their (specialized) functions very well.
Unfortunately, creation of little languages generally requires the use
of tools such as lex and yacc.
By using YAML to handle the first-order parsing issues,
I was able to avoid this hassle
and create a number of declarative "little languages".
Aside from the DDF entries (described above),
I created a language for generating data-flow animation sequences,
a couple of "mini-templating" systems, etc.
These languages are primarily used with hand-edited files,
where they dramatically reduce the amount of typing
(and in some cases, thinking).
Brace expansion handles expressions such as "e_cat/{cat,html}.yml".
In the case of DDF entries,
this is extended to create multiple path entries
(in the generated file_sets file)
for any patt entries containing brace expressions.
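Brace expansion of this kind is simple to implement. The sketch below is in Python rather than the suite's Perl and makes some simplifying assumptions: groups may not be nested, and there is no support for escapes or ranges:

```python
import re

_BRACES = re.compile(r"\{([^{}]*)\}")


def brace_expand(pattern):
    """Expand non-nested shell-style {a,b} groups; no escapes or ranges."""
    match = _BRACES.search(pattern)
    if match is None:
        return [pattern]
    head, tail = pattern[:match.start()], pattern[match.end():]
    results = []
    for option in match.group(1).split(","):
        # Expand the leftmost group, then recurse to handle any later groups.
        results.extend(brace_expand(head + option + tail))
    return results
```

With this helper, brace_expand("e_cat/{cat,html}.yml") yields the two paths "e_cat/cat.yml" and "e_cat/html.yml".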
The "base" data flow diagrams are defined by hand-coded dot files.
The YAML file defines highlighting sequences, inter-sequence pauses, etc.
After reading this file, a script edits each dot file
into a sequence of modified files (one for each "frame").
These files are then turned into images
which can be concatenated into a QuickTime
movie.
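The frame-generation step might look something like the following sketch (Python, not the suite's Perl). The animation language itself is not reproduced here, so each frame is assumed to be a plain list of node names to highlight; the function name and the highlight color are likewise hypothetical:

```python
def highlight_frames(dot_source, frames, color="yellow"):
    """Return one modified copy of dot_source per frame."""
    body, brace, _ = dot_source.rpartition("}")
    if not brace:
        raise ValueError("dot source has no closing brace")
    outputs = []
    for nodes in frames:
        # A later node statement adds attributes to an already-declared node.
        extra = "".join(
            f'  "{name}" [style=filled, fillcolor={color}];\n' for name in nodes
        )
        outputs.append(body + extra + "}\n")
    return outputs
```

Each returned string can be written to its own file and rendered by dot, giving one image per frame.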