SELFIES and the future of molecular string representations

  • Mario Krenn (Corresponding author)
  • Qianxiang Ai
  • Senja Barthel
  • Nessa Carson
  • Angelo Frei
  • Nathan C. Frey
  • Pascal Friederich
  • Théophile Gaudin
  • Alberto Alexander Gayle
  • Kevin Maik Jablonka
  • Rafael F. Lameiro
  • Dominik Lemm
  • Alston Lo
  • Seyed Mohamad Moosavi
  • José Manuel Nápoles-Duarte
  • Akshat Kumar Nigam
  • Robert Pollice
  • Kohulan Rajan
  • Ulrich Schatzschneider
  • Philippe Schwaller
  • Marta Skreta
  • Berend Smit
  • Felix Strieth-Kalthoff
  • Chong Sun
  • Gary Tom
  • Guido Falk von Rudorff
  • Andrew Wang
  • Andrew D. White
  • Adamo Young
  • Rose Yu
  • Alán Aspuru-Guzik (Corresponding author)

Publications: Contribution to journal, Article, Peer Reviewed

Abstract

Artificial intelligence (AI) and machine learning (ML) are expanding in popularity for broad applications to challenging tasks in chemistry and materials science. Examples include the prediction of properties, the discovery of new reaction pathways, or the design of new molecules. The machine needs to read and write fluently in a chemical language for each of these tasks. Strings are a common tool to represent molecular graphs, and the most popular molecular string representation, SMILES, has powered cheminformatics since the late 1980s. However, in the context of AI and ML in chemistry, SMILES has several shortcomings—most pertinently, most combinations of symbols lead to invalid results with no valid chemical interpretation. To overcome this issue, a new language for molecules was introduced in 2020 that guarantees 100% robustness: SELF-referencing embedded string (SELFIES). SELFIES has since simplified and enabled numerous new applications in chemistry. In this perspective, we look to the future and discuss molecular string representations, along with their respective opportunities and challenges. We propose 16 concrete future projects for robust molecular representations. These involve the extension toward new chemical domains, exciting questions at the interface of AI and robust languages, and interpretability for both humans and machines. We hope that these proposals will inspire several follow-up works exploiting the full potential of molecular string representations for the future of AI in chemistry and materials science.
Original language: English
Article number: 100588
Number of pages: 27
Journal: Patterns
Volume: 3
Issue number: 10
DOIs
Publication status: Published - 14 Oct 2022

Funding

The authors thank Greg Landrum, Daniel Flam-Shepherd, Suliman Sharif, and Bettina Lier for valuable comments on the manuscript. The authors also thank Sara Bebbington of IOP Publishing and Zamyla Chan and Erin Warner of the University of Toronto Acceleration Consortium for helping to organize the SELFIES workshop. M.K. acknowledges support from the FWF (Austrian Science Fund) via the Erwin Schrödinger fellowship no. J4309. R.F.L. received a PhD Scholarship from the São Paulo Research Foundation (FAPESP) – grant #2021/01633-3. This study was financed in part by CAPES – Finance Code 001. R.P. acknowledges funding through a Postdoc.Mobility fellowship by the Swiss National Science Foundation (SNSF; project no. 191127). A.W. would like to thank the Natural Sciences and Engineering Council of Canada (NSERC) for financial support via a CGS-M scholarship. G.T. acknowledges financial support from NSERC via the PGS-D scholarship. R.Y. acknowledges support from the US Department of Energy, Office of Science, AWS Machine Learning Research Award, and NSF grant #2037745. D.L. and G.F.v.R. were supported by the von Lilienfeld lab at the University of Vienna. A.D.W. was supported by the National Institute of General Medical Sciences of the National Institutes of Health under award number R35GM137966. K.M.J. and B.S. acknowledge funding from the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme (grant agreement no. 666983, MaGic). J.M.N.-D. acknowledges support by the National Council for Science and Technology (CONACYT) under award number CVU 105568. P.S. acknowledges support from the NCCR Catalysis (grant number 180544), a National Centre of Competence in Research funded by the Swiss National Science Foundation. S.M.M. was supported by the Swiss National Science Foundation (SNSF) under grant P2ELP2_195155. U.S. acknowledges support from the Deutsche Forschungsgemeinschaft (DFG) within NFDI4Chem (grant no. NFDI4-1). Q.A. acknowledges support from the National Science Foundation (grant no. DMR-1928882). A.A.G. acknowledges support from the Canada 150 Research Chairs Program, the Google Focused Award, and Dr. Anders G. Frøseth.

Austrian Fields of Science 2012

  • 103006 Chemical physics
  • 102019 Machine learning

Keywords

  • DSML 3: Development/pre-production: Data science output has been rolled out/validated across multiple domains/problems