MBD: Case Study (FSW)

Rich Morin, rdm@cfcl.com

Due to spacecraft-related security concerns, the GLAST Flight Software (FSW) web site now requires a password. Given SLAC's historic policy of open research, we can hope that the concerns may eventually be handled in a less intrusive manner.

In the meantime, although I am able to perform in-person demos of the site (contact me for details), I am unable to give out the password. I apologize for (and sympathize with) any difficulty this may cause.

In a recent contract, I was asked to create a comprehensive web site, providing both overview and detailed documentation for a scientific software development project:

    The FSW (Flight Software) group is creating software to operate the LAT (Large Area Telescope) instrument in the GLAST (Gamma-ray Large Area Space Telescope) satellite, process the data it collects, and output the resulting scientific information.

    - Introduction to Flight Software

The "production" FSW code is being written in C and assembler. It will be run on multiple, radiation-hardened PowerPC processors under the VxWorks operating system. However, development and test code must run in assorted environments, including Intel-based Linux, SPARC-based Solaris, etc.

Aside from a few dozen hand-written web pages (e.g., tutorials), the site's content is entirely computer-generated. About half of the pages are generated by Doxygen, a well-known documentation generator; the rest are generated by custom Perl scripts.

The input data comes from a variety of sources, including databases, XML files, (electronic and printed) documents, web pages, and assorted file formats (e.g., configuration files, object libraries).

Information can be requested of the development engineers, but there is no guarantee that they will have the time to reply. In short, this is a fairly typical mechanized documentation scenario.

Design Goals

A number of goals influenced the design:

  • development speed - It should be easy to try out new ideas, modify them, and/or bring them into production.

  • documentation - The code should be well structured and documented. Ideally, the same information should be used for both structure and documentation (self-documenting code is a Good Thing).

  • flexibility - It should be easy to add new data sources and sinks.

  • licensing - Open Source tools are strongly preferred, as the suite must be easy to share with other institutions.

  • maintenance - The suite should require little maintenance:

    • The hand-edited "knowledge base" should only need "touching up" when underlying information changes.

    • Machine-based harvesting should "Just Work", dealing gracefully with missing and/or inconsistent information.

    • The code should be able to survive (mostly) on its own.

  • portability - Other institutions (using suitably "Unixish" operating systems) should encounter minimal installation problems.

  • responsiveness - Web pages (even complex ones) should display quickly, even on a heavily-loaded server.

  • robustness - The code should not fail to produce reasonable output, even in the face of missing data, etc.

These goals are typical of many mechanized documentation projects. Equally typical, but worth noting, are some omissions:

  • efficiency - Barring issues affecting responsiveness, computer resources (e.g., processing, storage) are so cheap as to be inconsequential in most cases.

  • timeliness - The information need not be updated instantaneously. Once a day, in fact, turned out to be "just fine", though finer granularity would have been easy to provide.

Implementation

As discussed in the "Data Flow" section of the concepts page, MBD techniques can be implemented by means of a DAG (Directed Acyclic Graph) of processing routines. Each routine accepts one or more input data sets and, in traditional batch processing fashion, generates one or more output data sets. This produces a very modular system, because each routine interacts only with its input and output data sets.
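
As a rough illustration of the DAG idea (in Python rather than the suite's Perl), the sketch below derives a run order from a table of dependencies and rejects a graph that contains a cycle. The routine names are hypothetical and do not come from the FSW suite.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical routines, each mapped to the routines whose output it reads.
DEPENDS_ON = {
    "extract_db": set(),
    "extract_xml": set(),
    "analyze": {"extract_db", "extract_xml"},
    "present_html": {"analyze"},
}


def run_order(depends_on):
    """Return an order in which every routine runs after its inputs exist."""
    try:
        return list(TopologicalSorter(depends_on).static_order())
    except CycleError as err:
        # err.args[1] holds the nodes that form the cycle.
        raise ValueError(f"data flow is not acyclic: {err.args[1]}") from err


order = run_order(DEPENDS_ON)
print(order)
```

Because each routine touches only its declared inputs and outputs, the dependency table is all a scheduler needs to know about it.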

This implementation consists of several dozen smallish Perl scripts, supplemented by assorted command-line utilities (e.g., dot, make, troff). Most of the scripts are under 500 lines in length, but a couple exceed 1000 lines. An ancillary library of perhaps 1500 lines supplies all of the "generic" code. Few of the routines, in any case, are difficult to comprehend.

YAML files are used for both the hand-edited and intermediate data sets. Because YAML is a simple and powerful data serialization format, each file can be a nicely-formatted textual representation of a loadable data structure. After loading the input structure(s), some scripts create "helper" data structures (e.g., additional indexes into the data).
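
A "helper" structure of this kind might look like the following Python sketch. The list of records stands in for a structure that has already been loaded from a YAML file; the field names are assumptions made for illustration only.

```python
from collections import defaultdict

# Stand-in for a structure loaded from a YAML file (hypothetical fields).
packages = [
    {"name": "pkg_a", "language": "C", "owner": "alice"},
    {"name": "pkg_b", "language": "assembler", "owner": "bob"},
    {"name": "pkg_c", "language": "C", "owner": "bob"},
]


def index_by(records, key):
    """Build an additional index: value of `key` -> list of matching records."""
    index = defaultdict(list)
    for record in records:
        index[record[key]].append(record)
    return dict(index)


by_language = index_by(packages, "language")
by_owner = index_by(packages, "owner")
print(sorted(by_language))
```

The loaded structure stays untouched; the indexes only add alternative ways to reach the same records.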

All text files, whether hand-edited or machine-generated, are thoroughly commented. Each generated file receives an informative header, indicating the file's format, origins, purpose, etc. Section comments provide context and ease navigation.

Base Technologies

For a variety of (mostly pragmatic) reasons, the project uses Open Source tools whenever possible. The ability to modify code is important, as are cost factors and the ability to share applications with other institutions. In any case, many Open Source standbys (e.g., CMT, CVS, Doxygen, GCC, Graphviz, Groff, ImageMagick, Linux, MySQL, Perl, Python, Swish-e) are in use.

That said, some proprietary software is also being used. This includes database systems (e.g., Oracle), operating systems (e.g., Solaris, VxWorks, Windows), and applications (e.g., Adobe's Acrobat and Frame; Microsoft's Excel, Outlook, and Word). The decision to use proprietary software is generally based on either familiarity or the lack of an acceptable Open Source alternative.

Finally, quite a bit of the software infrastructure has been developed from scratch or adapted from Open Source tools. Aside from my own contributions, there are a couple of "test executives", an interactive packet specification application, etc. The group also uses a version of CMT which has been extensively modified to handle local requirements. For more information, see this introduction.

Data and Control Flow

The documentation suite consists of several dozen Perl scripts (~20K lines) and hand-edited YAML reference files (~30K lines). The suite is run early each morning, by means of cron and make. It produces (as needed) tens of thousands of files, in a variety of formats:

  • Intermediate data files are encoded in YAML.

  • Intermediate documents are encoded in whatever format is needed by the output formatting tools (e.g., dot, grap, troff).

  • Output documents are encoded in web-friendly formats (e.g., HTML, PDF, and PNG) and served by Apache.

Several "tricks" are employed to ease maintenance, increase reliability, and optimize performance.

  • makefile generation

    Large, hand-edited makefiles are tedious and error-prone to edit. They are also difficult to document in an automated manner. To resolve this problem, I went to a two-stage process. A small, hand-edited makefile causes and controls the generation of the "real", machine-generated makefile.

    Each hand-edited file (whether data or script) contains a set of YAML "#DDF" (Documentation Data Flow) declarations, encoded as specially-formatted comments. These are harvested at the beginning of each run and used to generate both the production makefile and a set of data flow diagrams.

  • file_update()

    Some scripts (e.g., extraction routines) are always run, because there is no way to know whether their input has changed. Consequently, they may regenerate output files whose content has not changed. Unfortunately, make just looks at file time stamps (not file content), so this could cause unnecessary processing.

    The file_update() function is a simple workaround. It is called with a file name and the new content. If the new content differs from the existing content, the file is updated. Otherwise, the function simply returns. This ensures that an updated time stamp always reflects changed content.

  • flag files

    Although a script may generate thousands of files, there is no reason to bother make with this level of detail. So, when a script finishes successfully, it may generate a "flag file" to indicate the fact. The time stamp on this file is then used as a proxy for the time stamps of the "real" output files.
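
The file_update() and flag file tricks can be sketched in a few lines. This is a Python approximation, not the suite's Perl code: the function name file_update comes from the text above, but its signature, its return value, and the touch_flag helper are assumptions.

```python
from pathlib import Path


def file_update(path, new_content):
    """Write new_content to path only if it differs from what is there.

    Returns True if the file was written, False if it was left alone.
    """
    target = Path(path)
    if target.exists() and target.read_text() == new_content:
        return False  # unchanged: leave the time stamp as it was
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(new_content)
    return True


def touch_flag(flag_path):
    """Create (or refresh) a flag file after a script finishes successfully."""
    Path(flag_path).touch()
```

A script would call file_update() for each output file and touch_flag() once at the end, so that make only has to track the flag file's time stamp.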

The DDF declarations, collectively, provide an abstract model of the system's data flow. Specifically, programs and (sets of) data files are represented as nodes in a DAG. Connections between nodes (e.g., read or write access, "include" usage) are represented as edges. Each hand-edited file "knows" its relative path, description, label (for diagrams), and type. Scripts, in addition, know which files they use or create. Generated files are described by their originating scripts.
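
To make the harvesting step concrete, here is a deliberately simplified Python sketch. The real DDF syntax is YAML embedded in comments and is not shown on this page, so the one-line "#DDF key: value" form, the key names (path, label, type, uses, creates), and the "perl" recipe are all assumptions.

```python
# A hypothetical script header carrying DDF declarations as comments.
SCRIPT_TEXT = """#!/usr/bin/env perl
#DDF path: bin/x_cat
#DDF label: Extract catalog
#DDF type: script
#DDF uses: in/cat.db
#DDF creates: out/cat.yml
print "working\\n";
"""


def harvest_ddf(text):
    """Collect '#DDF key: value' comment lines into a dict of lists."""
    declarations = {}
    for line in text.splitlines():
        if line.startswith("#DDF "):
            key, _, value = line[len("#DDF "):].partition(":")
            declarations.setdefault(key.strip(), []).append(value.strip())
    return declarations


def make_rule(declarations):
    """Turn one script's declarations into a makefile rule."""
    script = declarations["path"][0]
    targets = " ".join(declarations["creates"])
    prereqs = " ".join([script] + declarations.get("uses", []))
    return f"{targets}: {prereqs}\n\tperl {script}\n"


rule = make_rule(harvest_ddf(SCRIPT_TEXT))
print(rule)
```

The same harvested declarations could feed a diagram generator, since "uses" and "creates" entries are exactly the edges of the data flow graph.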

As odd as all this may sound, this is only a slight variation on a traditional Unix-style batch processing system. The use of cron and make is commonplace, as is the use of textual files for data interchange. The only unusual aspects, really, lie in the makefile generation technique and the use of YAML as an encoding format for data structures.

One interesting aspect of this implementation is that data structures are "first-class citizens" of the design process. Given that OOP (Object-oriented programming) techniques are based on hiding data structures, this may seem odd. However, this approach appears to provide a great deal of modularity, which is one of the major goals of the OOP approach.

It would be fairly trivial to convert this design into an event-based system. Given that the scripts are written in Perl, I would probably turn to POE (Perl Object Environment), which supports a very flexible approach to event-based programming. C++, Python, and Ruby have roughly equivalent systems, known respectively as ACE (Adaptive Communication Environment), Twisted, and dRuby (Distributed Ruby).

Although an event-based approach wouldn't need a generated makefile, it would still be necessary to have a setup script, in order to "program" the event distribution and check for cycles in the data flow. It would also be a good idea to use atomic file writes (e.g., write a temporary file, then rename it to the "final" name) to eliminate incomplete data transmissions.
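
The atomic write technique mentioned above (write a temporary file, then rename it) might be sketched as follows in Python. The function name atomic_write is hypothetical. The temporary file is created in the destination directory so that the rename stays on one file system.

```python
import os
import tempfile


def atomic_write(path, content):
    """Write content to path so that readers never see a partial file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "w") as handle:
            handle.write(content)
        # The rename replaces any existing file in a single step.
        os.replace(tmp_path, path)
    except BaseException:
        os.unlink(tmp_path)  # do not leave a stray temporary file behind
        raise
```

A consumer that reacts to the appearance of the "final" name can then assume that the file is complete.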

Little Languages

The Unix community is rife with "little languages" such as dot, eqn, grap, grep, lex, pic, tbl, and yacc. Although limited in scope, they perform their (specialized) functions very well. Unfortunately, creation of little languages generally requires the use of tools such as lex and yacc. By using YAML to handle the first-order parsing issues, I was able to avoid this hassle and create a number of declarative "little languages".

Aside from the DDF entries (described above), I created a language for generating data-flow animation sequences, a couple of "mini-templating" systems, etc. These languages are primarily used with hand-edited files, where they dramatically reduce the amount of typing (and in some cases, thinking).

Most of these "languages", to be sure, consist of simple, special-purpose macro expansions. For example, I make frequent use of brace expansion (à la the shell) to handle expressions such as "e_cat/{cat,html}.yml". In the case of DDF entries, this is extended to create multiple path entries (in the generated file_sets file) for any patt entries containing brace expressions.
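
Brace expansion of this sort takes only a few lines. The Python sketch below is an approximation of the idea, not the suite's code; the function name is hypothetical, and it handles only non-nested brace groups.

```python
import re

# Matches one brace group that contains no nested braces.
BRACES = re.compile(r"\{([^{}]*)\}")


def expand_braces(pattern):
    """Expand shell-style brace groups, e.g. 'x/{a,b}.yml' -> two paths."""
    match = BRACES.search(pattern)
    if match is None:
        return [pattern]
    head, tail = pattern[:match.start()], pattern[match.end():]
    results = []
    for choice in match.group(1).split(","):
        # Recurse so that any later groups in the pattern are expanded too.
        results.extend(expand_braces(head + choice + tail))
    return results


print(expand_braces("e_cat/{cat,html}.yml"))
# → ['e_cat/cat.yml', 'e_cat/html.yml']
```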

In the case of the data flow animation sequences, the "base" data flow diagrams are defined by hand-coded dot files. The YAML file defines highlighting sequences, inter-sequence pauses, etc. After reading this file, a script edits each dot file into a sequence of modified files (one for each "frame"). These files are then turned into images which can be concatenated into a QuickTime movie.
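
The frame generation step could work roughly as in this Python sketch. The dot graph, the node names, the highlight colors, and the shape of the sequence data are all assumptions; the real specification is a YAML file whose layout is not shown here.

```python
# A hypothetical "base" data flow diagram in dot format.
BASE_DOT = """digraph flow {
  extract -> analyze;
  analyze -> present;
}
"""

# Hypothetical highlighting sequence: the nodes to highlight in each frame.
FRAMES = [["extract"], ["extract", "analyze"], ["analyze", "present"]]


def make_frame(base_dot, highlighted):
    """Return a copy of base_dot with extra attributes on highlighted nodes."""
    body, brace, rest = base_dot.rpartition("}")
    extra = "".join(
        f"  {node} [style=filled, fillcolor=yellow];\n" for node in highlighted
    )
    # Node statements placed before the closing brace add attributes
    # to nodes that the edges above have already introduced.
    return body + extra + brace + rest


frames = [make_frame(BASE_DOT, nodes) for nodes in FRAMES]
print(frames[0])
```

Each resulting string would be written to its own file and rendered by dot into one image of the sequence.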

Even with the dense encoding that its little language affords, the animation specification is several hundred lines of intricate YAML. Without this compression, it would be ridiculously large, completely impractical to edit, and far more subject to error. In short, I believe that YAML-based little languages are a powerful addition to the MBD developer's repertoire.
