Issue 9, 2024

Extracting structured data from organic synthesis procedures using a fine-tuned large language model

Abstract

The popularity of data-driven approaches and machine learning (ML) techniques in organic chemistry and its subfields has increased the value of structured reaction data. Most chemical data is recorded as unstructured text, and despite the vastness of the organic chemistry literature (papers, patents), converting that text into structured data remains a largely manual endeavor. Software tools for this task would facilitate downstream applications such as reaction prediction and condition recommendation. In this study, we fine-tune a large language model (LLM) to extract reaction information from organic synthesis procedure text into structured data following the Open Reaction Database (ORD) schema, a comprehensive data structure designed for organic reactions. The fine-tuned model produces syntactically correct ORD records with an average accuracy of 91.25% for ORD “messages” (e.g., full compound, workup, or condition definitions) and 92.25% for individual data fields (e.g., compound identifiers, mass quantities), and it can recognize compound-referencing tokens and infer reaction roles. We investigate its failure modes and evaluate performance on specific subtasks such as reaction role classification.
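To make the distinction between structured records and field-level accuracy concrete, the following is a minimal Python sketch. It is not the authors' evaluation code and does not use the actual ORD schema: the nested dictionaries, the key names (`inputs`, `role`, `mass`, and so on), the example values, and the helpers `flatten` and `field_accuracy` are all hypothetical, chosen only to illustrate how a predicted record might be compared with a reference record one leaf field at a time.

```python
from typing import Any, Iterator, Tuple


def flatten(record: Any, prefix: str = "") -> Iterator[Tuple[str, Any]]:
    """Yield (path, value) pairs for every leaf field in a nested record."""
    if isinstance(record, dict):
        for key, value in record.items():
            yield from flatten(value, f"{prefix}.{key}" if prefix else key)
    elif isinstance(record, list):
        for index, value in enumerate(record):
            yield from flatten(value, f"{prefix}[{index}]")
    else:
        yield prefix, record


def field_accuracy(predicted: Any, reference: Any) -> float:
    """Fraction of reference leaf fields that the prediction reproduces exactly.

    Assumption: list items are matched by position, which is a simplification.
    """
    ref_fields = dict(flatten(reference))
    pred_fields = dict(flatten(predicted))
    if not ref_fields:
        return 1.0
    correct = sum(
        1
        for path, value in ref_fields.items()
        if path in pred_fields and pred_fields[path] == value
    )
    return correct / len(ref_fields)


# Hypothetical, simplified record (NOT the real ORD message layout).
reference = {
    "inputs": [
        {"name": "benzaldehyde", "role": "REACTANT",
         "mass": {"value": 1.06, "units": "GRAM"}},
        {"name": "methanol", "role": "SOLVENT",
         "volume": {"value": 20.0, "units": "MILLILITER"}},
    ],
    "conditions": {"temperature": {"value": 25.0, "units": "CELSIUS"}},
}

# A prediction that mislabels the reaction role of the solvent.
predicted = {
    "inputs": [
        {"name": "benzaldehyde", "role": "REACTANT",
         "mass": {"value": 1.06, "units": "GRAM"}},
        {"name": "methanol", "role": "REACTANT",
         "volume": {"value": 20.0, "units": "MILLILITER"}},
    ],
    "conditions": {"temperature": {"value": 25.0, "units": "CELSIUS"}},
}

print(f"field-level accuracy: {field_accuracy(predicted, reference):.2f}")
```

In this toy comparison a single wrong reaction role lowers the score for one field out of ten. A message-level metric in the same spirit would instead count a whole compound or condition entry as wrong if any of its fields differ; how the paper defines and aggregates both metrics is described in the full text, not here.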

Article information

Article type: Paper
Submitted: 06 Apr 2024
Accepted: 30 Jul 2024
First published: 31 Jul 2024
This article is Open Access (Creative Commons BY license)

Digital Discovery, 2024, 3, 1822-1831

Q. Ai, F. Meng, J. Shi, B. Pelkie and C. W. Coley, Digital Discovery, 2024, 3, 1822 DOI: 10.1039/D4DD00091A

This article is licensed under a Creative Commons Attribution 3.0 Unported Licence. You can use material from this article in other publications without requesting further permissions from the RSC, provided that the correct acknowledgement is given.
