Third IEEE International Conference on Data Mining (ICDM'03)
Parsing Without a Grammar: Making Sense of Unknown File Formats
Melbourne, Florida
November 19-22, 2003
ISBN: 0-7695-1978-4
The thousands of specialized structured file formats in use today present a substantial barrier to the free exchange of information between application programs. We consider the problem of deducing basic features of a given file format, such as its whitespace characters, bracketing delimiter symbols, and self-delimiter characters, from one or more example files. We demonstrate that, for sufficiently large example files, we can typically identify these basic features.