2017 International Conference on Intelligent Informatics and Biomedical Sciences (ICIIBMS) (2017)
Nov. 24, 2017 to Nov. 26, 2017
Yoichi Murakami, Department of Informatics, Tokyo University of Information Sciences, Chiba, Japan
Kenji Mizuguchi, Laboratory of Bioinformatics, National Institutes of Biomedical Innovation, Health and Nutrition, Osaka, Japan
A better understanding of biological processes, pathways and functions requires reliable information about protein-protein interactions (PPIs). However, it is still difficult to identify complete PPI networks experimentally in a cell or organism. To compensate for the limitations of current experimental techniques, we have proposed PSOPIA, a computational method to predict whether two proteins interact or not [1]. Dataset selection is a major issue in PPI prediction [2, 3]. It is generally believed that increasing the size and diversity of examples makes a dataset more representative and reduces the effects of noise; however, for many algorithms, it is impractical to use a large-scale dataset at the proteome level because of the memory and CPU time requirements. In this study, PSOPIA was retrained on a highly imbalanced large-scale dataset with a diverse set of examples at the proteome level. The dataset consisted of 43,060 high-confidence direct physical PPIs obtained from TargetMine [4] (the positives, only 0.13% of the total) and 33,098,951 negative PPIs. As a result, the new prediction model achieved a higher AUC, 0.89 (pAUC at FPR < 0.5% = 0.24), than the previous PSOPIA model. Furthermore, it was applied to the problem of filtering out protein pairs incorrectly determined to be interacting (false positives) from a low-confidence human PPI dataset. Here, we suggest that a diverse set of large-scale examples is key to more reliable PPI prediction, demonstrating the performance of PSOPIA at the proteome level.
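The class imbalance and the two metrics quoted in the abstract can be illustrated with a short Python sketch. It is not the PSOPIA implementation: the function names `positive_fraction`, `roc_points` and `partial_auc` are hypothetical, the toy labels and scores are invented for illustration, and dividing the partial AUC by the FPR cutoff is an assumption, since the abstract does not state how pAUC is scaled.

```python
def positive_fraction(n_pos, n_neg):
    """Share of positive examples in a dataset."""
    return n_pos / (n_pos + n_neg)


def roc_points(labels, scores):
    """ROC points (FPR, TPR) from binary labels and scores.

    Examples are ranked by descending score; tied scores share one point.
    """
    pairs = sorted(zip(scores, labels), key=lambda p: -p[0])
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(pairs):
        j = i
        # Consume every example that shares the current score.
        while j < len(pairs) and pairs[j][0] == pairs[i][0]:
            if pairs[j][1] == 1:
                tp += 1
            else:
                fp += 1
            j += 1
        points.append((fp / n_neg, tp / n_pos))
        i = j
    return points


def partial_auc(points, max_fpr=1.0, normalize=True):
    """Trapezoidal area under the ROC curve for FPR in [0, max_fpr].

    With normalize=True the area is divided by max_fpr (an assumed scaling).
    """
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if x0 >= max_fpr:
            break
        if x1 > max_fpr:
            # Cut the last segment at the FPR limit by linear interpolation.
            y1 = y0 + (y1 - y0) * (max_fpr - x0) / (x1 - x0)
            x1 = max_fpr
        area += (x1 - x0) * (y0 + y1) / 2
    return area / max_fpr if normalize else area


# Imbalance reported in the abstract: 43,060 positives, 33,098,951 negatives.
frac = positive_fraction(43_060, 33_098_951)
print(f"positives: {100 * frac:.2f}% of all pairs")

# Toy data (hypothetical), only to exercise the metric functions.
toy_labels = [1, 1, 0, 1, 0, 0]
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4]
toy_roc = roc_points(toy_labels, toy_scores)
print("AUC :", partial_auc(toy_roc))
print("pAUC:", partial_auc(toy_roc, max_fpr=0.5))
```

With so few positives, the low-FPR region of the ROC curve matters most in practice, which is presumably why the abstract reports a partial AUC alongside the full AUC.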
Proteins, Training, Reliability, Predictive models, Prediction algorithms, Computational modeling, Big Data

