TY - JOUR

T1 - Caveats to Deep Learning Approaches to RNA Secondary Structure Prediction

AU - Flamm, Christoph

AU - Wielach, Julia

AU - Wolfinger, Michael T

AU - Badelt, Stefan

AU - Lorenz, Ronny

AU - Hofacker, Ivo L

N1 - Copyright © 2022 Flamm, Wielach, Wolfinger, Badelt, Lorenz and Hofacker.

PY - 2022/7/11

Y1 - 2022/7/11

N2 - Machine learning (ML) and, in particular, deep learning techniques have gained popularity for predicting structures from biopolymer sequences. An interesting case is the prediction of RNA secondary structures, where well-established biophysics-based methods exist. The accuracy of these classical methods is limited by a lack of experimental parameters and by certain simplifying assumptions, and has seen little improvement over the last decade. This makes RNA folding an attractive target for machine learning, and consequently several deep learning models have been proposed in recent years. However, for ML approaches to be competitive for de novo structure prediction, the models must not just demonstrate good phenomenological fits, but be able to learn a (complex) biophysical model. In this contribution we discuss limitations of current approaches, in particular due to biases in the training data. Furthermore, we propose to study capabilities and limitations of ML models by first applying them to synthetic data (obtained from a simplified biophysical model) that can be generated in arbitrary amounts and where all biases can be controlled. We assume that a deep learning model that performs well on these synthetic data would also perform well on real data, and vice versa. We apply this idea by testing several ML models of varying complexity. Finally, we show that the best models are capable of capturing many, but not all, properties of RNA secondary structures. Most severely, the number of predicted base pairs scales quadratically with sequence length, even though a secondary structure can only accommodate a linear number of pairs.

AB - Machine learning (ML) and, in particular, deep learning techniques have gained popularity for predicting structures from biopolymer sequences. An interesting case is the prediction of RNA secondary structures, where well-established biophysics-based methods exist. The accuracy of these classical methods is limited by a lack of experimental parameters and by certain simplifying assumptions, and has seen little improvement over the last decade. This makes RNA folding an attractive target for machine learning, and consequently several deep learning models have been proposed in recent years. However, for ML approaches to be competitive for de novo structure prediction, the models must not just demonstrate good phenomenological fits, but be able to learn a (complex) biophysical model. In this contribution we discuss limitations of current approaches, in particular due to biases in the training data. Furthermore, we propose to study capabilities and limitations of ML models by first applying them to synthetic data (obtained from a simplified biophysical model) that can be generated in arbitrary amounts and where all biases can be controlled. We assume that a deep learning model that performs well on these synthetic data would also perform well on real data, and vice versa. We apply this idea by testing several ML models of varying complexity. Finally, we show that the best models are capable of capturing many, but not all, properties of RNA secondary structures. Most severely, the number of predicted base pairs scales quadratically with sequence length, even though a secondary structure can only accommodate a linear number of pairs.

KW - biophysical model

KW - dataset biases

KW - deep learning model

KW - folding prediction

KW - RNA secondary structure

UR - http://www.scopus.com/inward/record.url?scp=85163167670&partnerID=8YFLogxK

U2 - 10.3389/fbinf.2022.835422

DO - 10.3389/fbinf.2022.835422

M3 - Article

C2 - 36304289

VL - 2

JO - Frontiers in Bioinformatics

JF - Frontiers in Bioinformatics

SN - 2673-7647

M1 - 835422

ER -