Inferring Correlation Between Database Queries: Analysis of Protein Sequence Patterns
October 1993 (vol. 15 no. 10)
pp. 1030-1041

Given a subset P of a database, this paper addresses the problem of finding the query phi over a given database attribute whose extension is closest to P. In the particular case outlined, P is the set of sequences in a protein sequence database that match a given protein sequence pattern, and phi is a query on the database annotation. Ideally, phi is the description of a biological function. If the extension of phi is very similar to P, an association between the pattern and the biological function described by the query may be inferred. An algorithm is developed that efficiently searches the query space when negation is not considered. Since the query language is a first-order language, the query space may be mapped into a set algebra in which a measure of stochastic dependence (an asymptotic approximation of the correlation coefficient) is used as a measure of set similarity. The algorithm uses the algebraic properties of this measure to reduce the time required to search the query space. A prototype implementation of the algorithm has been tested on different collections of protein sequence patterns.
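To illustrate the idea of scoring a query extension against P, here is a minimal sketch. The abstract does not give the paper's exact measure or its search procedure, so this sketch substitutes the standard correlation coefficient between set-membership indicators (the phi coefficient) and a plain exhaustive scan over single-term queries. The names `set_correlation`, `best_query`, and the `extensions` mapping are hypothetical and chosen only for this example.

```python
from math import sqrt


def set_correlation(p, q, n):
    """Correlation between the membership indicators of sets p and q,
    both drawn from a universe of n records (the phi coefficient).

    Assumption: this stands in for the paper's measure, which the
    abstract describes only as an asymptotic approximation of the
    correlation coefficient.
    """
    a, b, both = len(p), len(q), len(p & q)
    denom = sqrt(a * (n - a) * b * (n - b))
    if denom == 0:
        # One of the sets is empty or covers the whole universe,
        # so the correlation is undefined; treat it as no association.
        return 0.0
    return (n * both - a * b) / denom


def best_query(p, extensions, n):
    """Return the term whose extension correlates most strongly with p.

    `extensions` maps each candidate query term to the set of record
    identifiers it retrieves (a hypothetical data layout). This is an
    exhaustive scan, not the pruned search the paper develops.
    """
    return max(extensions, key=lambda term: set_correlation(p, extensions[term], n))


# Toy data: 10 records, 4 of which match the sequence pattern.
pattern_matches = {1, 2, 3, 4}
extensions = {
    "kinase": {1, 2, 3, 4},
    "membrane": {4, 5, 6},
    "zinc": {7, 8},
}
top_term = best_query(pattern_matches, extensions, 10)
```

A value near 1 means the annotation term retrieves almost exactly the records that match the pattern. A value near 0 means the two sets are statistically independent.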

Index Terms:
correlation inference; annotation query; protein sequence pattern analysis; stochastic dependence measurement; set similarity measure; database queries; protein sequence database; query language; first-order language; query space; set algebra; asymptotic approximation; correlation coefficient; algebra; biology computing; database theory; proteins; query processing; set theory
R. Guigó, T.F. Smith, "Inferring Correlation Between Database Queries: Analysis of Protein Sequence Patterns," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 15, no. 10, pp. 1030-1041, Oct. 1993, doi:10.1109/34.254060