2017 IEEE International Conference on Bioinformatics and Biomedicine (BIBM) (2017)
Kansas City, MO, USA
Nov. 13, 2017 to Nov. 16, 2017
ISBN: 978-1-5090-3051-4
pp: 1227-1232
Yao-zhong Zhang, Institute of Medical Science, the University of Tokyo, Tokyo, Japan
Seiya Imoto, Institute of Medical Science, the University of Tokyo, Tokyo, Japan
Satoru Miyano, Institute of Medical Science, the University of Tokyo, Tokyo, Japan
Rui Yamaguchi, Institute of Medical Science, the University of Tokyo, Tokyo, Japan
ABSTRACT
Motivation: Next-generation sequencing (NGS) technologies using DNA, RNA, or methylation sequencing are the prevailing tools in modern genome research. For DNA sequencing, whole genome sequencing (WGS) and whole exome sequencing (WES) are two typical applications that strike different trade-offs between sequencing depth and base coverage. Although sequencing costs have fallen greatly, the sequencing depth used in WGS is still lower than in WES (e.g., ∼35× vs. 100× or more). In addition, biases and batch effects may arise at different stages of an NGS experiment. Downstream analyses that use low-depth, biased WGS data are more sensitive to these biases, which makes it even more difficult to uncover real biological signals in the data. In this work, we focus on reconstructing high read-depth signals from low-depth WGS data. We use a pair of WGS data sets with different read depths for the same sample and learn a mapping from low-depth to high-depth signals on the given platform.
Results: We explored three reconstruction models, ranging from shallow to deep. Our experimental results show that, when only read-depth information is used, deeper models do not perform much better than a linear regression model. Incorporating additional information, such as GC content, mappability, and nucleotide sequence information, further improves the performance of convolutional neural network (CNN) models. We used the reconstructed read-depth signals in a downstream analysis to identify copy number variation segments for a single sample. The experimental results show that segments that are not detected using low-depth data can be detected with the signals reconstructed by the CNN model using the extra biological information.
Availability: The source code will be available at https://github.com/yaozhong/DLRec
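The linear regression baseline mentioned in the Results can be illustrated with a minimal sketch. This is not the authors' code and does not use their data: the per-bin read depths are synthetic, the low-depth signal is produced by randomly keeping a fraction of the high-depth reads, and every name here (`simulate_pair`, `fit_linear`, `keep_frac`, the bin counts and depths) is a hypothetical choice made for illustration only.

```python
import math
import random


def simulate_pair(n_bins=1000, high_mean=100.0, keep_frac=0.35, seed=0):
    """Simulate paired per-bin read depths for one sample (synthetic, assumed setup)."""
    rng = random.Random(seed)
    low, high = [], []
    for i in range(n_bins):
        # Expected high-depth coverage drifts smoothly along the genome.
        lam = high_mean * (1.0 + 0.3 * math.sin(20.0 * i / n_bins))
        # Normal approximation to count noise, clipped at zero.
        h = max(0, round(rng.gauss(lam, math.sqrt(lam))))
        # Low-depth signal: keep each read independently with probability keep_frac.
        l = sum(1 for _ in range(h) if rng.random() < keep_frac)
        high.append(float(h))
        low.append(float(l))
    return low, high


def fit_linear(x, y):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def rmse(pred, truth):
    """Root mean squared error between two equal-length sequences."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, truth)) / len(truth))


low, high = simulate_pair()

# Interleaved split: even bins for fitting, odd bins for evaluation.
train_low, train_high = low[0::2], high[0::2]
test_low, test_high = low[1::2], high[1::2]

slope, intercept = fit_linear(train_low, train_high)
pred = [slope * v + intercept for v in test_low]

model_rmse = rmse(pred, test_high)
# Reference point: always predict the mean high depth seen during fitting.
train_mean = sum(train_high) / len(train_high)
baseline_rmse = rmse([train_mean] * len(test_high), test_high)

print(f"slope={slope:.2f} intercept={intercept:.2f}")
print(f"model RMSE={model_rmse:.2f} baseline RMSE={baseline_rmse:.2f}")
```

In this toy setting the fitted line only rescales and shifts the low-depth signal. The abstract's point is that deeper models gain little over such a mapping unless extra inputs such as GC content, mappability, or sequence context are supplied.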
INDEX TERMS
Bioinformatics, Sequential analysis, Genomics, Biological system modeling, Data models, Training, Machine learning
CITATION

Y. Zhang, S. Imoto, S. Miyano and R. Yamaguchi, "Reconstruction of high read-depth signals from low-depth whole genome sequencing data using deep learning," 2017 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), Kansas City, MO, USA, 2017, pp. 1227-1232.
doi:10.1109/BIBM.2017.8217832