Fifth IEEE Symposium on Bioinformatics and Bioengineering (BIBE'05)
Using Data Mining Techniques to Learn Layouts of Flat-File Biological Datasets
Minneapolis, Minnesota
October 19-21, 2005
ISBN: 0-7695-2476-1
Kaushik Sinha, Ohio State University
Xuan Zhang, Ohio State University
Ruoming Jin, Ohio State University
Gagan Agrawal, Ohio State University

One of the major problems in biological data integration is that many data sources are stored as flat files with a variety of different layouts. Integrating data from such sources can be extremely time-consuming.

We have been developing data mining techniques to help learn the layout of a dataset semi-automatically. In this paper, we focus on the problem of identifying delimiters for optional fields. Because these fields do not occur in every record, frequency-based methods cannot identify the corresponding delimiters. We present a method that uses contrast analysis on the frequency of sequences to identify such delimiters and help complete the layout descriptions. We demonstrate the effectiveness of this technique using three flat-file biological datasets.
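
To illustrate why a pure frequency threshold misses optional-field delimiters, here is a minimal Python sketch. It is not the authors' algorithm, which the abstract does not specify. It assumes that a delimiter is the first whitespace-separated token on a line, and it substitutes a simple positional contrast (how often a token starts a line versus how often it appears anywhere) for the paper's contrast analysis. The function name, thresholds, and sample records are all hypothetical.

```python
from collections import Counter

def find_delimiters(records, mandatory_support=0.9, min_contrast=0.9, min_records=2):
    """Split line-leading tokens into mandatory and optional delimiter candidates.

    records: list of records, each a list of text lines (assumed layout).
    All thresholds are illustrative assumptions, not values from the paper.
    """
    lead_records = Counter()  # number of records in which a token starts a line
    lead_total = Counter()    # total occurrences of a token at the start of a line
    any_total = Counter()     # total occurrences of a token anywhere

    for record in records:
        leads_here = set()
        for line in record:
            tokens = line.split()
            if not tokens:
                continue
            leads_here.add(tokens[0])
            lead_total[tokens[0]] += 1
            any_total.update(tokens)
        lead_records.update(leads_here)

    mandatory, optional = [], []
    n = len(records)
    for token, rec_count in lead_records.items():
        support = rec_count / n                          # plain frequency criterion
        contrast = lead_total[token] / any_total[token]  # positional contrast
        if support >= mandatory_support:
            mandatory.append(token)
        elif rec_count >= min_records and contrast >= min_contrast:
            # Too rare for the frequency test, but almost always line-leading.
            optional.append(token)
    return sorted(mandatory), sorted(optional)

# Made-up records: "CC" is an optional field, "kinase" is ordinary data.
sample_records = [
    ["ID alpha", "DE kinase binding", "kinase regulation", "CC note about kinase", "SQ ACGT"],
    ["ID beta", "DE transport protein", "SQ GGCA"],
    ["ID gamma", "DE kinase", "CC note on transport", "SQ TTAA"],
    ["ID delta", "DE binding kinase", "SQ CCGT"],
]

mandatory, optional = find_delimiters(sample_records)
print("mandatory:", mandatory)
print("optional:", optional)
```

On this toy input, the tags present in every record pass the frequency test, while `CC` appears in only half of the records and is recovered by the contrast criterion alone. The data word `kinase` starts one line but mostly occurs mid-line, so it is rejected.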

Citation:
Kaushik Sinha, Xuan Zhang, Ruoming Jin, Gagan Agrawal, "Using Data Mining Techniques to Learn Layouts of Flat-File Biological Datasets," Fifth IEEE Symposium on Bioinformatics and Bioengineering (BIBE'05), pp. 177-184, 2005.