The concepts page presents a data flow "model" for an
MBD (Model-based Documentation) suite.
In this model,
the extraction phase is responsible
for accessing data sources,
selecting and organizing the desired data, etc.
Its output should be encoded
in a convenient and reliable representation for follow-on
analysis or
presentation.
This page looks at various kinds of data sources,
including system data and hand-edited files,
offering hints about how to extract data from them.
The question of how this data should be stored
is a subject for another page (available RSN :-),
but some general comments may be useful at this point.
MBD's extraction phase corresponds roughly to the first part of
data warehousing.
The "live" data gets collected, filtered, and saved in a manner
that eases follow-on analysis (e.g.,
data mining,
OLAP) and presentation.
The storage representation should handle structural issues,
retaining "interesting" structures from the input data
and allowing the addition of structures "discovered"
in the extraction or analysis phase.
The actual data will normally be stored
in (collections of) files and/or a relational database system.
In general, data access in the warehouse
should be far faster, easier, and more consistent
than it was in the "live" system.
If it isn't, you're doing something wrong (:-).
Computer-based systems maintain large amounts of data.
MBD systems can "harvest" this data,
extracting information on specific entities: facts, relationships, etc.
The resulting information can be used to generate detailed reports,
summary plots and diagrams, etc.
The first step, however, is data extraction.
The extraction code must access the incoming data,
parse it, reject noise, correct errors,
and pass the result along in a "convenient" format.
The specific tasks are defined by
(a) the format and content of the incoming data and
(b) the current and expected data needs.
The incoming data may be simple or complex in structure
and may arrive in any of a variety of manners
(e.g., files, APIs) and formats (e.g., binary, text).
The following sections cover some common cases.
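As a rough illustration of the steps described above (access, parse, reject noise, correct errors, pass the result along), here is a minimal Python sketch. The "name,value" input layout, the `extract` function, and the choice of JSON as the "convenient" output format are all assumptions made for the example, not part of any particular MBD suite.

```python
import json

def extract(lines):
    """Yield cleaned records from raw 'name,value' lines (a made-up layout)."""
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue                      # reject noise: blanks and comments
        parts = [part.strip() for part in line.split(",")]
        if len(parts) != 2:
            continue                      # reject lines with the wrong field count
        name, value = parts
        try:
            number = float(value)
        except ValueError:
            continue                      # reject values that are not numeric
        # Correct a simple inconsistency: normalize names to lower case.
        yield {"name": name.lower(), "value": number}

raw = ["# sample input", "CPU, 0.75", "", "garbage line", "disk ,12"]
records = list(extract(raw))
encoded = json.dumps(records)             # pass the result along as JSON text
```

Real extraction code would normally report what it rejects rather than dropping it silently.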
Line-oriented "flat file" formats (e.g.,
CSV files,
Unix control files)
are usually easy to parse and understand.
Each line (including any "continuation" lines) is a record.
Individual fields can usually be extracted by
regular expressions;
special handling may be required for particularly complex formats.
If you have a lot of flat files to parse,
consider writing a parameterized input filter.
Declarative control files are far easier to maintain
than hand-crafted (and nearly identical) "input filters".
If your control files are cleanly formatted and well commented,
they can serve as documentation for the input files.
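A parameterized input filter might look something like the following Python sketch. The `SPECS` table stands in for a declarative control file (it could just as easily be loaded from one); its keys, the field names, and the `parse_flat` function are hypothetical names chosen for the example.

```python
import re

# Hypothetical declarative descriptions, one entry per flat-file format.
SPECS = {
    "hosts": {
        "comment": "#",
        "pattern": r"^(?P<address>\S+)\s+(?P<name>\S+)",
        "types": {},
    },
    "passwd": {
        "comment": "#",
        "pattern": r"^(?P<user>[^:]*):(?P<pw>[^:]*):(?P<uid>\d+):(?P<gid>\d+):",
        "types": {"uid": int, "gid": int},
    },
}

def parse_flat(lines, spec):
    """Parse lines according to one spec; return a list of field dictionaries."""
    pattern = re.compile(spec["pattern"])
    records = []
    for line in lines:
        # Drop anything after the comment marker, then surrounding whitespace.
        line = line.split(spec["comment"], 1)[0].strip()
        if not line:
            continue
        match = pattern.match(line)
        if match is None:
            continue                      # line does not fit the declared layout
        record = match.groupdict()
        for field, convert in spec["types"].items():
            record[field] = convert(record[field])
        records.append(record)
    return records

hosts = parse_flat(["# local", "127.0.0.1 localhost  # loopback", ""], SPECS["hosts"])
users = parse_flat(["root:x:0:0:root:/root:/bin/sh"], SPECS["passwd"])
```

Adding support for another file then means adding one more entry to the table, not another script.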
If a data source is used by more than one program,
it is likely to be available
via a well-documented data exchange mechanism.
For example, it might be a text file,
encoded in a documented dialect
of a standard data serialization format (e.g.,
XML,
YAML).
Alternatively, it might be offered via an
API (Application Programming Interface)
such as SQL or
Apple's
Core Data framework.
The exchange mechanism can be expected to handle the "syntax" level
(e.g., dividing the data into fields).
It may also help with structural or even semantic issues,
but there is no guarantee of this.
The good news is that you can often extract the needed data
without understanding the entire structure and semantics
of an arbitrarily complex data source.
Generally, use of an exchange mechanism
follows a well-trodden path:
study the documentation, decide what data you want,
use a query to extract it.
That said, there's nothing wrong
with speculative data collection:
if capturing some extra data is easy,
storing it can be a very worthwhile gamble.
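To make the "decide what you want, then query for it" path concrete, here is a small sketch using Python's built-in sqlite3 module. The `hosts` table, its columns, and its contents are invented for the example; a real data source would have its own documented schema.

```python
import sqlite3

# Build a throwaway in-memory database standing in for the "live" source.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE hosts (name TEXT, os TEXT, cpus INTEGER, notes TEXT)")
conn.executemany(
    "INSERT INTO hosts VALUES (?, ?, ?, ?)",
    [
        ("alpha", "FreeBSD", 4, "build server"),
        ("gamma", "Linux", 2, "spare"),
        ("beta", "Linux", 8, "database"),
    ],
)

# Extract only the fields of interest; the engine handles the "syntax" level.
rows = conn.execute(
    "SELECT name, cpus FROM hosts WHERE cpus >= ? ORDER BY name", (4,)
).fetchall()
conn.close()
```

Note that the query needs to know about only two columns; the rest of the table can remain a mystery.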
It's also useful to look at the way the data source
is used by the system's own code.
A program's SQL statements and surrounding code,
for example, can provide quite a bit of information
on the purpose and interaction of database fields and tables.
If you get lost, look for a support infrastructure
(e.g., books, forums, mailing lists, web sites).
Asking authors and "resident experts",
you may be able to get answers to specific questions
(e.g., "What is this field used for?").
If you develop any documentation,
be sure to share it with the appropriate community!
Wikipedia defines
"screen scraping" as:
Because the "display" was not intended for use by a program,
it may not provide simple (or even reliable) indications
of its structure.
After all, humans are much better than programs
at discerning structure from context, formatting, etc.
Fortunately, machine-generated documents tend
to have reasonably predictable structures.
Once you can handle the base layout and the common variations,
you're mostly done.
By coding for robustness (e.g., "print a diagnostic and continue"),
you can deal with new variations as they appear.
Of course, if the generating program gets changed
in a way that modifies the document structure,
your extraction program will "break".
So, it's a good idea to push for machine-friendly (e.g., XML) forms
of any documents you rely upon.
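The "print a diagnostic and continue" approach mentioned above can be sketched as follows. The log-line layout, the `LINE` pattern, and the `scrape` function are assumptions made up for the example.

```python
import re
import sys

# Assumed base layout: "YYYY-MM-DD LEVEL message"
LINE = re.compile(r"^(?P<date>\d{4}-\d{2}-\d{2}) (?P<level>[A-Z]+) (?P<msg>.*)$")

def scrape(lines, diag=sys.stderr):
    """Return (records, skipped); unrecognized lines are reported, not fatal."""
    records, skipped = [], 0
    for number, line in enumerate(lines, start=1):
        match = LINE.match(line.rstrip("\n"))
        if match is None:
            print(f"line {number}: unrecognized layout: {line!r}", file=diag)
            skipped += 1
            continue
        records.append(match.groupdict())
    return records, skipped

sample = ["2024-01-05 INFO started", "*** banner ***", "2024-01-06 WARN disk low"]
records, skipped = scrape(sample)
```

The diagnostics double as a to-do list: each reported line is a variation you may want to handle.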
These problems aside,
screen scraping may involve assorted low-level formats:
"plain text" (e.g., log files,
nroff output),
HTML, PDF, PostScript, etc.
Access methods are often available for binary formats,
but you'll have to handle most text-based formats yourself.
Here are some hints...
If you only need a few items from a web page,
you may be able to extract them using special-purpose code.
For example,
if a particular snippet of HTML always appears in a line by itself,
or in a particular place in a table,
you may be able to recognize it and handle it as a special case.
Be aware that this "hack" may be brittle
in the face of even minor formatting variations.
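Here is a sketch of this kind of special-purpose code, assuming a page in which the item of interest always sits on a line by itself. The `uptime` table cell and the `find_uptime` function are hypothetical.

```python
import re

# Matches only the exact markup we expect, alone on its line.
CELL = re.compile(r'^\s*<td class="uptime">(\d+) days</td>\s*$')

def find_uptime(html):
    """Return the uptime in days, or None if the expected line is absent."""
    for line in html.splitlines():
        match = CELL.match(line)
        if match:
            return int(match.group(1))
    return None

page = '<table>\n<tr>\n  <td class="uptime">12 days</td>\n</tr>\n</table>'
uptime = find_uptime(page)

# A trivial change in quoting style is enough to defeat the pattern.
variant = page.replace('"', "'")
```

Calling `find_uptime(variant)` returns None, which shows how little it takes to break the hack.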
If you need to extract a lot of data,
consider transforming the page into
XHTML,
a formalized dialect of
HTML.
Because XHTML files comply with XML syntax,
you can load them with your favorite XML-handling tools.
HTML Tidy will perform this conversion,
as well as cleaning up ugly (e.g., Microsoft Word) HTML.
Although the resulting document will be valid XML,
you'll still have some "detective work" to do.
Unlike XML that was generated for information exchange,
converted HTML is likely to have structural irregularities
(e.g., interleaved tags and text).
Worse, you won't have a schema to guide you
in figuring out the variations.
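As an example of loading XHTML with an ordinary XML tool, the sketch below uses Python's standard xml.etree.ElementTree module. The document text is a made-up stand-in for converted output; note that element names carry the XHTML namespace and that one cell interleaves tags and text.

```python
import xml.etree.ElementTree as ET

XHTML = "{http://www.w3.org/1999/xhtml}"   # namespace prefix used by ElementTree

doc = """<html xmlns="http://www.w3.org/1999/xhtml">
  <head><title>Hosts</title></head>
  <body>
    <table>
      <tr><th>Name</th><th>OS</th></tr>
      <tr><td>alpha</td><td>FreeBSD</td></tr>
      <tr><td>beta <b>(new)</b></td><td>Linux</td></tr>
    </table>
  </body>
</html>"""

root = ET.fromstring(doc)
rows = []
for tr in root.iter(XHTML + "tr"):
    # itertext() gathers text from nested tags, so "beta <b>(new)</b>" survives.
    cells = ["".join(td.itertext()).strip() for td in tr.findall(XHTML + "td")]
    if cells:                               # the header row has no <td> cells
        rows.append(cells)
```

The header row is skipped here only because it uses th cells; a messier page would need more "detective work" than this.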
PostScript is a text-based "page description language".
Although it is actually an
imperative,
Turing-complete programming language,
this is seldom a real problem for parsing, etc.
RPN syntax aside,
most PostScript commands are used as formatting declarations.
So, it is generally possible
to examine a representative PostScript document,
determine what "idioms" are being used,
and write a specialized script to extract desired information.
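For instance, if inspection shows that a document places its text with the common "(string) show" idiom, a few lines of Python can pull the strings out. This is only a sketch under that assumption: the sample document is invented, and real generators often wrap show in their own procedures or use other operators, so the pattern would need adjusting.

```python
import re

# A parenthesized string (backslash escapes allowed, no nested parentheses)
# followed by the "show" operator.
SHOW = re.compile(r"\(((?:\\.|[^()\\])*)\)\s*show\b")

def shown_strings(ps_text):
    """Return the raw (still escaped) strings passed to show."""
    return [match.group(1) for match in SHOW.finditer(ps_text)]

sample = r"""%!PS
/Helvetica findfont 12 scalefont setfont
72 720 moveto (Host: alpha) show
72 700 moveto (Load \(avg\): 0.75) show
showpage
"""

strings = shown_strings(sample)
```

The strings come back with their PostScript escapes intact; decoding them is left as a follow-on step.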
An alternative approach uses the fact that PostScript files are,
in fact, programs.
By editing the PostScript code (and/or overriding selected operators),
it is possible to make a document log information about itself
(e.g., to standard output or a designated file).
If the idea of hacking PostScript doesn't appeal to you,
however, read on.
PDF (Portable Document Format)
is a binary,
declarative translation of PostScript.
Although a binary format may seem daunting,
there are libraries and other tools which can help in parsing PDF.
There are also reliable tools for PostScript translation.
So, you may want to convert all your incoming PostScript documents to PDF,
then parse them all with the same tool(s).
Here are some useful tools for dealing
with PDF and PostScript documents.
Although there is some overlap in their capabilities,
all three are well worth having.
Ghostscript is a powerful and flexible set of tools
for processing PDF and PostScript files.
It can be used to render documents, translate between formats, etc.
pdftk (PDF ToolKit) is a command-line tool for manipulating PDF files.
It performs a number of specialized functions (e.g.,
applying watermarks,
encrypting and decrypting documents,
merging and splitting documents,
updating PDF metadata).
Although Xpdf is billed as a "PDF viewer" for the
X Window System,
it is far more than this.
The suite includes tools to extract images and text,
translate PDF to PostScript, etc.
Parts of Xpdf are used by other utilities,
such as search engines
(e.g., Swish-e)
and PDF viewers.
For more tools (and tricks), see
"PDF Hacks: 100 Industrial-Strength Tips & Tools"
(Sid Steward; O'Reilly, 2004).
It's an easy read and will give you some ideas
about unconventional ways to use PDF.
If you get serious, you'll also want to get a copy of
Adobe's
"PDF Reference" (sixth edition).
It contains well over 1000 pages of definitive and detailed information.
There are dozens of books on PostScript,
ranging from introductions to reference manuals.
Again, the Adobe books are definitive,
but you may want to look at some others as a way to get started,
explore undocumented areas, etc.
The data you need may be stored in some arcane format.
Examples include binary libraries, spreadsheets,
word processing documents, and source code files.
Rather than researching
(or worse, reverse-engineering) the format,
look around for a library or command-line tool
that already knows how to parse it.
For example, you might want to extract linkage information
from binary library files.
Perl's
CPAN (Comprehensive Perl Archive Network)
has modules that import spreadsheets and many other oddball files.
Other scripting languages (e.g., Python, Ruby)
have similar online collections.
Finally,
you may be able to find a tool that reads the file(s)
and exports the data into XML or some other parsable format.
Doxygen, for example,
will peruse source code files in several languages,
dumping its accumulated knowledge as XML.
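Once a tool has exported XML, extraction becomes routine. The sketch below walks a small fragment shaped like the index.xml file that Doxygen writes when its XML output is enabled; the fragment itself, including the refid values, is made up for the example, so check it against your own output before relying on it.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment in the style of Doxygen's index.xml.
index_xml = """<doxygenindex>
  <compound refid="main_8c" kind="file"><name>main.c</name>
    <member refid="main_8c_1a01" kind="function"><name>main</name></member>
    <member refid="main_8c_1a02" kind="variable"><name>verbose</name></member>
  </compound>
</doxygenindex>"""

# Map each source file to the names of the functions it defines.
functions = {}
for compound in ET.fromstring(index_xml).iter("compound"):
    if compound.get("kind") != "file":
        continue
    names = [
        member.findtext("name")
        for member in compound.findall("member")
        if member.get("kind") == "function"
    ]
    functions[compound.findtext("name")] = names
```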
There are times when you'll need
to augment your machine-harvested data
with hand-edited information.
The supplementary data may have been collected from interviews,
copied and pasted from a PDF document,
or obtained in some other manner.
Regardless, the current objective
is to encode it for convenient use
by your documentation scripts.
The ideal file format for this purpose would be
flexible and powerful,
supported by an active user community,
easy to read and edit,
and a good "fit" for scripting languages
such as Perl, Python, and Ruby.
Flat files (e.g., CSV) fail the first test:
a two-dimensional array is neither flexible nor powerful.
Any format that you might cobble up on your own
fails the second test (user base).
XML fails the last two tests;
nobody but a masochist likes to edit XML
or traverse its data structures.
Fortunately,
YAML (YAML Ain't Markup Language)
meets all of these criteria quite handily:
Finally, I'll let you judge YAML's flexibility and readability
for yourself:
Assuming that this text was loaded
into a Perl data structure referenced by
The ability to create and edit data structures in one window,
then test them out in another, is incredibly seductive.
I have edited tens of thousands of lines of YAML,
using it to store a wide variety of data.
By post-processing the encoded text strings,
I have even created (declarative)
"little languages"
of various sorts,
using a YAML loader to handle the first-order parsing.
The FSW case study page
and this tutorial contain more information on YAML.
System Data
Flat Files
Exchange Mechanisms
Screen Scraping
... the act of capturing data from a system or program
by capturing and interpreting the contents of some display
that is not actually intended for data transport
or inspection by programs.
Web Pages
PostScript, PDF, etc.
Arcane Formats
The nm command (and variations)
does a fine job of generating reports
on library and object files.
These reports have a regular format and are easy to parse.
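As a sketch of how little code such a report needs, the Python below parses nm's default (BSD-style) output, in which each line holds an optional hexadecimal value, a one-letter symbol type, and a name. The sample report and the `parse_nm` function are invented for the example; symbol names that contain spaces (e.g., demangled C++ names) would need more careful handling.

```python
def parse_nm(report):
    """Parse default-format nm output into a list of symbol dictionaries."""
    symbols = []
    for line in report.splitlines():
        fields = line.split()
        if len(fields) == 3:                 # value, type, name
            value, kind, name = fields
            symbols.append({"name": name, "type": kind, "value": int(value, 16)})
        elif len(fields) == 2:               # undefined symbols have no value
            kind, name = fields
            symbols.append({"name": name, "type": kind, "value": None})
        # Anything else (blank lines, "file.o:" headers) is skipped.
    return symbols

report = """\
0000000000000000 T main
                 U printf
0000000000000004 C counter
"""

symbols = parse_nm(report)
undefined = [s["name"] for s in symbols if s["type"] == "U"]
```

The list of undefined symbols is exactly the kind of linkage information mentioned above.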
Hand-edited Files
YAML
# This is a comment
abc:
  - 123
  - def: 'fed'
    ghi: 'ihg'
$r,
the expression $r->{abc}[0] yields 123.
The expression $r->{abc}[1]{def} yields 'fed'.
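The same kind of access works in the other scripting languages. As a sketch, here is the Python counterpart, written against the structure that a YAML loader (e.g., PyYAML's `yaml.safe_load`) would be expected to return for the sample. The literal below is typed in by hand to keep the example self-contained, and it assumes that "ghi" belongs to the same mapping as "def".

```python
# Hand-written equivalent of the loaded YAML sample (an assumption; see above).
r = {"abc": [123, {"def": "fed", "ghi": "ihg"}]}

first = r["abc"][0]            # counterpart of $r->{abc}[0]
second = r["abc"][1]["def"]    # counterpart of $r->{abc}[1]{def}
```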