In recent years, rapid developments in genomics and proteomics have generated a large amount of data. Often, drawing conclusions from these data requires sophisticated computational analyses. Bioinformatics, or computational biology, is the interdisciplinary science of interpreting biological data using information technology and computer science. The importance of this new field of inquiry will grow as we continue to generate and integrate large quantities of genomic, proteomic, and other data.
A particularly active area of research in bioinformatics is the development and application of machine learning techniques for biological problems. Making sense of large biological data sets requires inferring structure or generalizations from them. Examples of this type of analysis include protein structure prediction, gene classification, cancer classification based on microarray data, clustering of gene expression data, and statistical modeling of protein-protein interactions. Each of these tasks can be framed as a problem in machine learning. We therefore see great potential to increase the interaction between machine learning and bioinformatics.
This special issue is aimed at facilitating that interaction. We believe that machine learning can provide powerful tools for analyzing, predicting, and understanding data from emerging genomic and proteomic technologies. The papers submitted to this special issue provide strong evidence that this is the case.
In total, more than 50 papers were submitted to the special issue. After extensive review and revision, 13 papers were accepted for the special issue, which will be published in two parts. The accepted papers are of high quality and significance and cover a wide variety of topics. Below, we summarize the papers published in this journal issue.
- In the context of gene expression analysis, effective attribute selection and classification methods have been a focus of study in recent years. In the paper "Attribute Clustering for Grouping, Selection, and Classification of Gene Expression Data" by Wai-Ho Au, Keith C.C. Chan, Andrew K.C. Wong, and Yang Wang, a new attribute clustering-based method for attribute selection and classification is proposed. Genes (attributes) in the same cluster are more strongly correlated with one another than with genes in other clusters. Top-ranked genes from each cluster are then chosen to build classifiers for prediction. Experiments on two real-world cancer gene expression data sets show that this clustering-based gene selection and classification method predicts better (often much better) than other attribute selection methods or classifiers built on the entire gene set.
- The paper "Joint Classification and Pairing of Human Chromosomes" by Pravesh Biyani, Xiaolin Wu, and Abhijit Sinha addresses the problem of identifying and pairing human chromosomes from karyotype images. The authors propose heuristic methods for solving this NP-hard problem and demonstrate the validity of their approach using a large clinical data set.
- In "Semisupervised Learning for Molecular Profiling," Cesare Furlanello, Maria Serafini, Stefano Merler, and Giuseppe Jurman explore a novel approach to identifying patterns in microarray expression data. They exploit the two-level cross-validation structure of the typical expression classifier experiment and show how the classifier validation runs in which a particular sample is not included can be used to identify special, borderline, or outlier samples.
- The paper "Essential Latent Knowledge for Protein-Protein Interactions: Analysis by an Unsupervised Learning Approach" by Hiroshi Mamitsuka presents a new probabilistic model of protein-protein interactions, together with an EM-based algorithm for learning the model.
- The paper "Markov Encoding for Detecting Signals in Genomic Sequences" by Jagath C. Rajapakse and Loi Sy Ho shows how a Markov encoding of genomic sequences can enable a neural network to predict higher-order features in the sequences, and reports promising results.
- "The Latent Process Decomposition of cDNA Microarray Data Sets" by Simon Rogers, Mark Girolami, Colin Campbell, and Rainer Breitling presents a probabilistic clustering scheme for interpreting large expression data sets. Using a common set of latent variables, the method objectively assesses the number of classes in the data while allowing a given example to be assigned probabilistically to multiple classes.
- The paper "Fold Recognition by Predicted Alignment Accuracy" by Jinbo Xu addresses the problem of ranking templates for a protein sequence using a support vector regression method, which is shown to outperform the traditional Z-score-based method.
- The paper "Dimension Reduction-Based Penalized Logistic Regression for Cancer Classification Using Microarray Data" by Li Shen and Eng Chong Tan presents a method that combines dimension reduction with penalized logistic regression for cancer classification using microarray data. Dimensionality reduction is an important task in many bioinformatics problems because of the sheer size of the data.
Qiang Yang would like to thank Hong Kong RGC 6187/04E and CA03/04.EG01.HKBU2/03C for their support.