Qianxiang Ai,a Fanwang Meng,a Jiale Shi,a Brenden Pelkieb and Connor W. Coley*a
aDepartment of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, MA, USA. E-mail: ccoley@mit.edu
bDepartment of Chemical Engineering, University of Washington, Seattle, WA, USA
First published on 31st July 2024
The popularity of data-driven approaches and machine learning (ML) techniques in the field of organic chemistry and its various subfields has increased the value of structured reaction data. Most data in chemistry is represented by unstructured text, and despite the vastness of the organic chemistry literature (papers, patents), conversion from unstructured text to structured data remains a largely manual endeavor. Software tools for this task would facilitate downstream applications such as reaction prediction and condition recommendation. In this study, we fine-tune a large language model (LLM) to extract reaction information from organic synthesis procedure text into structured data following the Open Reaction Database (ORD) schema, a comprehensive data structure designed for organic reactions. The fine-tuned model produces syntactically correct ORD records with an average accuracy of 91.25% for ORD “messages” (e.g., full compound, workup, or condition definitions) and 92.25% for individual data fields (e.g., compound identifiers, mass quantities), with the ability to recognize compound-referencing tokens and to infer reaction roles. We investigate its failure modes and evaluate performance on specific subtasks such as reaction role classification.
As an information extraction task, structured data extraction from text can be considered a combination of named entity recognition (NER) and relation extraction (RE) between named entities. Challenges in chemical NER include the pervasive use of abbreviations and aliases, deviations from standard nomenclature, and ambiguity over where the boundaries of a chemical entity lie (e.g., when multiple words describe a single species).4,5 A variety of methods have been applied to chemical NER tasks. Rule-based or dictionary-based methods, such as LeadMine6 and ChemicalTagger,7 have been used to annotate reaction procedure texts or within text parsing pipelines for constructing synthesis datasets such as SureCHEMBL,8 Pistachio,9 and ZeoSyn.10 While these algorithms are usually computationally efficient, the scope of their rules and dictionary items limits their generalizability to new datasets. Various statistical NER models have also been proposed, often framing the task as a sequence labeling problem in which the tokens in a sentence are assigned their most likely tags based on token features. A popular strategy is the use of conditional random fields11 in combination with expert-selected features12 or contextualized word embeddings from neural networks (recurrent networks13–15 or transformers16–19).
Traditionally, RE is formulated as a downstream task to NER and is solved as an ensemble of classification problems over entity pairs.20,21 More recent efforts aim to solve NER and RE simultaneously by building end-to-end models.22–25 This trend has persisted as pretrained large language models (LLMs) have become more accessible. LLMs have been used for NER/RE tasks in biomedicine,26 materials,27 and clinical trials,28 showing promise as tools for structured data extraction. For example, Dagdelen et al. developed a training pipeline for GPT-3 to extract information about crystalline materials from scientific texts as structured JSON,29 and Walker et al. presented an iterative scheme to fine-tune LLMs for extracting structured data on gold nanorod synthesis.30 Recent studies by Zhong et al. explored fine-tuned LLMs for reaction data extraction from literature in PDF format.31,32 The output of these models provides reasonable coverage of reaction information, with the exception of quantity information. Pretrained LLMs can also be used for this task directly without fine-tuning. For example, a recent preprint by Patiny and Godin explores extracting analytical experiment results from literature solely through prompt engineering.33 While this method can extract structured data by including the data schema in the prompt, it relies on closed-source LLMs and performs poorly when numerical values are involved.
One important use case for extracting structured reaction data is the production of procedural instructions that can be used to reproduce experiments. For example, Vaucher et al. developed a transformer-based model to translate sentences of experimental procedures into action sequences.34 While these action sequences contain detailed information for execution, their evaluations focus more on the type of action than on the parameters or objects of that action. SynthReader, a rule-based translator developed by Mehr et al.,35 converts natural language procedures to χDL, a data schema designed for chemical operations. Such a rule-based method, despite being computationally efficient, has to be expanded or modified to adapt to a different data distribution, e.g., a change in writing style. Various submissions to the Cheminformatics Elsevier Melbourne University (ChEMU) evaluation lab36–38 also aim to solve NER/RE tasks involving reaction/workup steps. Since these campaigns evaluate individual NER/RE tasks, they do not constitute an end-to-end solution for structured data extraction into a specific output data schema.
In this study, we fine-tune an open-source large language model to extract structured reaction information from unstructured text from US patents (Fig. 1). To structure the desired outputs, we adopt the Open Reaction Database (ORD) data format, a comprehensive data schema tailored to organic reactions.40 The 100000-reaction dataset we use for fine-tuning is part of a collection originally published by Lowe in Chemical Markup Language (CML) format,39 so the fine-tuned model essentially pursues the same goal as Lowe's expert natural language processing pipeline, albeit using a different data schema. Extracted records cover information on reactants, products, conditions, and workup steps. We demonstrate that the fine-tuned model produces syntactically correct ORD records from the USPTO with an average accuracy of 91.25% for chemical messages (compounds, workups, conditions) and 92.25% for individual data fields. We also investigate its failure modes and evaluate performance on reaction role classification. We note that a preliminary version of this study was previously disclosed as part of a Perspective article on opportunities for LLMs in chemistry.42
Fig. 1 Overview of this study's approach to structured reaction data extraction from text. A 100k reaction subset of the United States Patent and Trademark Office (USPTO) reaction data39 as represented in the Open Reaction Database (ORD)40 is used to fine-tune and evaluate LLaMA-2-7B. An example of the structured ORD record is included in Section 2.1. The data pipeline (top left) is detailed in Section 2.2. The fine-tuning procedure is described in Section 2.3. The llama with a cap was generated using Craiyon AI.41
Fig. 2 (Top) The original text description of a reaction procedure and (bottom) example messages within the structured ORD reaction record.43
• Each of its ReactionInput messages has non-empty values for its components field. This usually means the reaction input is not the crude product of another reaction and that the chemical information of this reaction's inputs is present in the reaction procedure text.
• The reaction includes an associated procedure text, i.e., the notes.procedure_details field of this reaction is a paragraph describing the reaction.
Reaction records satisfying these criteria were exported to JSON and deduplicated using OpenAI's data preparation tools (openai tools fine_tunes.prepare_data), which are free to use and were employed here solely for convenient prompt deduplication, producing 1339260 unique records. The procedure text and structured JSON are combined using a prompt template (see ESI†) modified from Stanford Alpaca.46 A sequence length limit of 2048 tokens, based on the LLaMA tokenizer, is imposed due to memory considerations in fine-tuning the language models. This limit reduces the number of records to 1300613 (97.1% of 1339260). The cumulative distribution function of sequence lengths is shown in Fig. S1.† A subset of 100K records, hereinafter referred to as USPTO-ORD-100K, is randomly selected from the 1300613 records. Unless otherwise specified, a random 8:1:1 train:validation:test split of USPTO-ORD-100K is used to train and evaluate models throughout this study. This data pipeline is shown schematically in Fig. 1.
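The filtering and formatting steps can be reproduced along the following lines. This is a minimal sketch assuming the ord-schema Python package; the filename and prompt template are placeholder stand-ins for the Alpaca-style template in the ESI, and deduplication and token-length filtering are omitted.

```python
# Minimal sketch of record filtering/formatting (assumes the ord-schema package;
# field names follow the ORD Reaction message, filename and template are placeholders).
import json
import random
from google.protobuf import json_format
from ord_schema import message_helpers
from ord_schema.proto import dataset_pb2

PROMPT_TEMPLATE = (  # simplified stand-in for the Alpaca-style template in the ESI
    "Below is a description of a reaction procedure.\n"
    "### Procedure:\n{procedure}\n### ORD-JSON:\n"
)

def reaction_to_example(reaction):
    """Return a prompt/completion pair, or None if the reaction fails the filters."""
    text = reaction.notes.procedure_details
    if not text:
        return None  # no associated procedure paragraph
    if not all(len(inp.components) > 0 for inp in reaction.inputs.values()):
        return None  # a ReactionInput lacks compound information
    completion = json_format.MessageToJson(reaction)
    return {"prompt": PROMPT_TEMPLATE.format(procedure=text), "completion": completion}

dataset = message_helpers.load_message("uspto_dataset.pb.gz", dataset_pb2.Dataset)
examples = [ex for r in dataset.reactions if (ex := reaction_to_example(r)) is not None]

# Random 8:1:1 train:validation:test split
random.seed(42)
random.shuffle(examples)
n = len(examples)
splits = {"train": examples[: int(0.8 * n)],
          "val": examples[int(0.8 * n): int(0.9 * n)],
          "test": examples[int(0.9 * n):]}
for name, rows in splits.items():
    with open(f"{name}.jsonl", "w") as f:
        f.writelines(json.dumps(row) + "\n" for row in rows)
```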
The information in a structured ORD record is not guaranteed to be a proper subset of its free-text description, as some information in the structured ORD record is derived from elsewhere; we refer to such content as “implicit information”. For example, the reaction roles of compounds are rarely stated in a reaction's text description. As another example, the text description may indicate a filtration step (mapping to a ReactionWorkup of type FILTRATION in its ORD record) without including “filter” or “filtration” explicitly, e.g., “passing through celite”. We consider this kind of implicit information learnable and therefore do not exclude it from the ORD records. On the other hand, some implicit information is considered unlearnable and is thus excluded from the ORD records. Specifically (both checks are sketched in code after this list):
• Unspecified outcome: if the name of a product is present in the ORD record and is not explicitly stated in the reaction text, this name is removed from the ORD record. This could happen when the product name is defined only in the title of the corresponding patent and not mentioned explicitly in the procedure text. This can also happen for reactants when they are referred to by compound identifiers or generic names.
• Calculated yield: if the yield value of a product is present in the ORD record and its integer value is not explicitly stated in the reaction text, this value is removed from the ORD record. This can occur when the calculated yield is different from the yield reported in the procedure text.
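A minimal sketch of these two checks, assuming a dict-like ORD reaction JSON; the field names follow the ORD schema, but the traversal and key casing are illustrative rather than the exact pipeline code.

```python
# Illustrative removal of "unlearnable" implicit information from one record.
def scrub_unlearnable(reaction_json: dict, procedure_text: str) -> dict:
    for outcome in reaction_json.get("outcomes", []):
        for product in outcome.get("products", []):
            # Unspecified outcome: drop product names that are absent from the text.
            product["identifiers"] = [
                ident for ident in product.get("identifiers", [])
                if ident.get("type") != "NAME" or ident.get("value", "") in procedure_text
            ]
            # Calculated yield: drop yields whose integer value is not stated in the text.
            kept = []
            for measurement in product.get("measurements", []):
                if measurement.get("type") == "YIELD":
                    value = measurement.get("percentage", {}).get("value")
                    if value is not None and str(int(round(value))) not in procedure_text:
                        continue
                kept.append(measurement)
            product["measurements"] = kept
    return reaction_json
```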
To avoid tuning all 7 billion parameters in LLaMA-2-7B, we adopt LLaMA-Adapter in our fine-tuning procedure.49 LLaMA-Adapter achieves parameter-efficient fine-tuning using learnable adaption prompts: for each of the topmost L transformer layers, a learnable prompt of length K is prepended to the (embedded) word tokens. This procedure reduces the total number of trainable parameters to K × L × C, where C is the token embedding dimension, set to 4096 by default in LLaMA. Throughout this study, K = 10 and L = 30, giving 1.2 million trainable parameters that fit on a single GPU with 24 GB of memory in half precision.
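For concreteness, the quoted parameter count follows directly from these settings:

```python
# Trainable parameter count for the adapter settings used in this study.
K = 10    # adaption prompt length
L = 30    # number of topmost transformer layers receiving prompts
C = 4096  # LLaMA-2-7B token embedding dimension
print(K * L * C)  # 1228800, i.e., ~1.2 million trainable parameters
```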
The train and validation datasets from the aforementioned random split are used for fine-tuning LLaMA-2-7B. The validation set is used to monitor the training process and to determine the number of training epochs with early stopping. Fine-tuning LLaMA-2-7B for 15 epochs with an initial learning rate of 7 × 10−5 was completed in approximately 70 hours using 2 NVIDIA RTX 4090 GPUs. In contrast, preparing the ORD datasets (in .pb.gz format) to obtain USPTO-ORD-100K took approximately 4 hours using our scripts on a 16-core 4.70 GHz CPU (Intel® i7-1260P). The average inference speed was roughly 37 tokens per second, as estimated over 100 generations on one RTX 4090 GPU with a batch size of 1. This model is referred to as “the fine-tuned model” throughout this study. The hyperparameters for fine-tuning were not optimized.
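The settings reported above can be summarized in a hypothetical configuration block; the key names are illustrative and do not correspond to our training script.

```python
# Hedged summary of the reported fine-tuning settings (illustrative key names).
finetune_config = {
    "base_model": "LLaMA-2-7B",
    "method": "LLaMA-Adapter",
    "adapter_prompt_length": 10,   # K
    "adapter_layers": 30,          # L, topmost transformer layers
    "epochs": 15,
    "initial_learning_rate": 7e-5,
    "max_sequence_length": 2048,
    "precision": "fp16",
    "gpus": 2,                     # NVIDIA RTX 4090
}
```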
Fig. 3 shows an example of Evaluation Metric 1 when comparing two ReactionInput messages given the message type of Compound messages. To distinguish the three failure modes, we first define a distance function for the given message type based on DeepDistance,50 an edit distance for nested objects similar to Levenshtein distance. When comparing two lists of messages (the shorter list is padded with empty messages so that the two lists are of equal size), a bijective mapping between messages from the two lists is found by minimizing the sum of distances over all pairs, which is then used to identify the aforementioned failure modes.
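This optimal matching is an instance of the linear assignment problem. The following is a minimal sketch assuming a message_distance function standing in for DeepDistance; it is illustrative, not the exact evaluation code.

```python
# Match extracted messages to ground-truth messages by minimizing total distance.
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_messages(true_msgs, pred_msgs, message_distance):
    """Return an optimal bijective mapping between two (padded) message lists."""
    n = max(len(true_msgs), len(pred_msgs))
    true_msgs = true_msgs + [{}] * (n - len(true_msgs))   # pad with empty messages
    pred_msgs = pred_msgs + [{}] * (n - len(pred_msgs))
    cost = np.array([[message_distance(t, p) for p in pred_msgs] for t in true_msgs])
    rows, cols = linear_sum_assignment(cost)               # minimizes the distance sum
    return [(true_msgs[i], pred_msgs[j]) for i, j in zip(rows, cols)]
```

Under this matching, a pair involving a padded empty message corresponds to an “Addition” or “Removal”, a non-empty pair with zero distance is counted as accurate, and a non-empty pair with non-zero distance as an “Alteration”.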
Since a message always has a tree structure, we can also define evaluation tasks at the leaf level, where a leaf corresponds to an unstructured, literal field. Evaluation Metric 2: for a given message type, how many leaf fields of messages of this type are accurately extracted, or erroneously added, removed, or altered?
We note that Evaluation Metric 1 is defined at a lower (coarser) granularity and is more stringent than Evaluation Metric 2, as summarized in Table 1. For example, in the case shown in Fig. 3, an entire Compound message (blue) is marked as altered, while only two leaf fields (underscored) are counted, as “Alteration” (value) and “Addition” (reaction_role), respectively. Assigning “Addition” and “Removal” to leaf fields also depends on the assignment at the message level: for example, when a message is assigned “Removal”, all of its leaf fields are assigned “Removal”.
Table 1 Summary of the two evaluation metrics

Metric | Specific to a message type | Specific to a field type | What is being counted? | Granularity
---|---|---|---|---
1 | Yes | No | Added/removed/altered messages | Low
2 | Yes | Yes | Added/removed/altered leaf fields | High
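As a concrete illustration of how leaf fields are enumerated for Evaluation Metric 2, here is a minimal sketch over a dict-like ORD message; it is illustrative, not the exact evaluation code.

```python
# Enumerate the leaf fields of a nested (JSON-like) ORD message.
def iter_leaf_fields(message, path=()):
    """Yield (path, value) pairs for every unstructured, literal field."""
    if isinstance(message, dict):
        for key, value in message.items():
            yield from iter_leaf_fields(value, path + (key,))
    elif isinstance(message, (list, tuple)):
        for index, value in enumerate(message):
            yield from iter_leaf_fields(value, path + (index,))
    else:
        yield path, message  # a leaf: string, number, bool, or enum value

# Toy example: comparing leaves of a true vs. extracted Compound message
true_leaves = dict(iter_leaf_fields({"identifiers": [{"type": "NAME", "value": "THF"}]}))
pred_leaves = dict(iter_leaf_fields({"identifiers": [{"type": "NAME", "value": "toluene"}]}))
altered = [p for p in true_leaves if p in pred_leaves and true_leaves[p] != pred_leaves[p]]
```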
It could be reasonable to use a numerical error measure to evaluate field-level extraction: for certain downstream tasks, such as reaction condition recommendation, one could argue that mis-extracted floating-point fields are less harmful to performance when they are close to the true value. However, we prefer the strict exact-match accuracy for the information extraction task considered here, as missing or misplacing a number can occur more frequently than extracting a numerically incorrect value. This is reflected in an analysis of extracting reaction temperature values (ESI Section S7†).
Table 2 summarizes the evaluation results at the message level (Evaluation Metric 1). The fine-tuned model is able to extract compound information for ReactionInput entries reliably with an accuracy of 85.6%. Compared with missing compound information in ReactionInput (5.0%, failure mode “Removal”), it is relatively rare (2.3%) for the model to include excess compounds (failure mode “Addition”), and almost all of the excess compounds come from misplacement (e.g., a ProductCompound is placed in ReactionInput) instead of hallucination.
Table 2 Evaluation results at the message level (Evaluation Metric 1). Starred values use the more lenient message-equivalence routine described in the text.

Message type | Path | Accurate | Removal | Addition | Alteration | Total
---|---|---|---|---|---|---
Compound | Inputs | 38470 (85.6%) | 2242 (5.0%) | 1015 (2.3%) | 4242 (9.4%) | 44954
 | | 41138* (91.5%) | | | 1574* (3.5%) |
ProductCompound | Outcomes | 7450 (71.3%) | 345 (3.3%) | 58 (0.6%) | 2656 (25.4%) | 10451
 | | 9105* (87.1%) | | | 1001* (9.6%) |
ReactionConditions | Conditions | 9524 (95.7%) | N/A | N/A | 433 (4.4%) | 9957
ReactionWorkup | Workups | 44165 (90.7%) | 1713 (3.5%) | 1719 (3.5%) | 2807 (5.8%) | 48685
Errors in extracting ProductCompound entries are more frequent, as indicated by the lower accuracy of 71.3%. Upon inspection, we found that the errors mainly originate from implicit information: some fields of a ProductCompound message are not explicitly stated in the text description and are instead derived or inferred. One example is the “calculated” reaction yield, in contrast to the “reported” reaction yield, which the model can capture successfully (Table S2†). To alleviate this effect, we also report the accuracy using a more lenient routine for identifying equivalent ProductCompound messages, which considers two ProductCompound messages identical if all of their identifiers and amount fields are identical. These fields often capture all important chemical information about reaction outcomes. After applying this less strict equivalence definition, the accuracy for extracting ProductCompound messages increases from 71.3% to 87.1%, indicating that the model is capable of chemical entity/relation extraction even if it struggles with the implicit calculation of yields. This routine also results in an increased accuracy (91.5%) for Compound messages in ReactionInput by excluding errors in reaction role classification (vide infra).
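A minimal sketch of this lenient equivalence check, assuming dict-like messages and comparing only the identifier and amount fields; it is illustrative, not the exact evaluation code.

```python
# Lenient equivalence: two ProductCompound messages are considered identical
# if their identifiers and amount fields match exactly.
def lenient_equal(true_product: dict, pred_product: dict) -> bool:
    keys = ("identifiers", "amount")
    return all(true_product.get(k) == pred_product.get(k) for k in keys)
```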
High accuracies of 95.7% and 90.7% are measured for ReactionConditions and ReactionWorkup, respectively. Since the ORD schema defines ReactionConditions as one single message rather than a list of messages, no “Addition” or “Removal” of this type of message is applicable.
To further understand how the fine-tuned model performs in extracting different types of chemical information, the completions are examined at a finer granularity at the leaf level (Evaluation Metric 2), as shown in Table 3. The fine-tuned model shows excellent recognition capability for chemical entities such as compound identifiers (accuracy 93.5%) and amounts (95.2%), and it can infer reaction roles that are usually not explicitly stated in procedure texts (Section 3.3). Errors at the field level mainly come from implicit information in ProductCompound messages, such as calculated yields (Table S1†).
Table 3 Evaluation results at the leaf (field) level (Evaluation Metric 2)

Message type | Field type | Accurate | Removal | Addition | Alteration | Total
---|---|---|---|---|---|---
ProductCompound & Compound | Identifiers | 100958 (93.5%) | 5490 (5.1%) | 2590 (2.4%) | 1566 (1.5%) | 108014
 | Amount | 74209 (95.2%) | 3434 (4.4%) | 2182 (2.8%) | 300 (0.4%) | 77943
 | Reaction role | 48262 (89.3%) | 2797 (5.2%) | 1264 (2.3%) | 2978 (5.5%) | 54037
ReactionConditions | Condition | 26782 (98.3%) | 298 (1.1%) | 391 (1.4%) | 176 (0.7%) | 27256
ReactionWorkup | Workup | 178733 (94.0%) | 8360 (4.4%) | 10189 (5.4%) | 3156 (1.7%) | 190249
 | Other* | 31794 (84.8%) | 5261 (14.0%) | 2240 (6.0%) | 439 (1.2%) | 37494
As an alternative approach and point of comparison, we explored extracting structured data with pretrained LLMs directly using chain-of-thought prompting,52 a few-shot method in which the prompts are engineered to mimic the thought process of a human solving a complicated task. This method is easier to deploy than fine-tuning; however, it produced syntactically correct ORD data in only 408 out of 500 cases after repair, with accuracies of 61.2% and 31.3% for Compound and ProductCompound, respectively, indicating that chain-of-thought prompting without fine-tuning is likely insufficient for this task. This prompting method is also limited by the human-crafted instructions and the context window of the model, and, considering that there are more than 600 different fields defined in the ORD schema, preparing examples and steps to extract a full Reaction record seems impractical. Enabling JSON mode through the OpenAI API in this process does not improve the model performance (Table S4†). Details of our implementation and evaluation can be found in the ESI.†
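For illustration, a minimal sketch of prompting a hosted chat model with JSON mode enabled; the model name and system prompt are placeholders, and the actual chain-of-thought prompts and model choices used in our comparison are described in the ESI.

```python
# Hedged sketch of extraction with a pretrained chat model (illustrative only).
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

SYSTEM = (
    "You extract organic reaction data as JSON following the Open Reaction "
    "Database schema. Think step by step: identify compounds, then amounts, "
    "then conditions and workups, then assemble the JSON."
)

def extract_reaction(procedure_text: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",                      # placeholder model choice
        messages=[
            {"role": "system", "content": SYSTEM},
            {"role": "user", "content": procedure_text},
        ],
        response_format={"type": "json_object"},  # the "JSON mode" mentioned above
    )
    return response.choices[0].message.content
```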
Model | Accurate | Removal | Addition | Alteration | Total
---|---|---|---|---|---
Fine-tuned | 94.9% | 4.1% | 2.2% | 1.0% | 78408
ChemDataExtractor | 76.1% | 16.0% | 22.7% | 8.0% |
MatSciBert | 96.6% | 2.2% | 2.4% | 1.2% |
We further test the fine-tuned model on uniproduct reactions from the ChemRxnExtractor16 dataset, a set of 123 records with labeled tokens for compound names. All records in this dataset were collected from individual literature passages. These passages can be considered an out-of-distribution challenge for our fine-tuned model: they tend to describe general chemical transformations (e.g., “oxidation of A gave B” or “cyclization of A afforded B”) rather than specific actions in synthesis procedures, chemical amount information is rarely present, and named entities in these passages are frequently represented by externally referencing tokens. As expected, the fine-tuned model performs poorly on this dataset, with an accuracy of 62.6% and a tendency to include unwanted tokens (Table S1†). This tendency often results from prioritizing chemical entities over referencing tokens. For example, in “by heating tryptophan methyl ester (9) at 140 °C for 3 h”, the token “9” is the correct token to extract, while the fine-tuned model only recognizes “tryptophan methyl ester”, which is a chemical entity in a more general sense. These results suggest that the ChemRxnExtractor dataset differs significantly from USPTO-ORD-100K, which would justify fine-tuning the base LLaMA-2-7B model specifically for the ChemRxnExtractor dataset. Unfortunately, the small size of the ChemRxnExtractor dataset makes it insufficient for fine-tuning and subsequent evaluation (ESI Section S2†).
Fig. 4A shows the confusion matrix of reaction role assignment from the fine-tuned model for all compounds in ReactionInput from the test dataset. The classification accuracy decreases from REACTANT to SOLVENT to CATALYST, with a tendency to mislabel SOLVENT or CATALYST as REACTANT, as expected based on class populations. Compared to extracting compounds of other roles (2.6% for REACTANT, 1.4% for SOLVENT), the model failed more frequently (4.2%) when extracting catalysts. Fig. 4B shows the results from the popularity baseline, which achieves similar accuracies for SOLVENT and CATALYST but lower accuracy for REACTANT compared to the fine-tuned model. A macro-averaged F1 score of 86.1% is calculated for the fine-tuned model, while the popularity baseline gives 63.5%. For compounds whose reaction role in the dataset varies from reaction to reaction, the difference between the fine-tuned model (Fig. 4C) and the popularity baseline (Fig. 4D) becomes more pronounced: the former exhibits better performance for both REACTANT and CATALYST. These results suggest that, through fine-tuning, the model learned to make role classifications based on reaction context.
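A minimal sketch of how such a confusion matrix and macro-averaged F1 score can be computed, assuming aligned lists of true and predicted labels with “MISSING”/“ERROR” used when the compound itself was not extracted correctly; the data below are toy values, not our evaluation script.

```python
# Toy illustration of the role-classification metrics.
import numpy as np
from sklearn.metrics import confusion_matrix, f1_score

ROLES = ["REACTANT", "SOLVENT", "CATALYST"]
LABELS = ROLES + ["MISSING", "ERROR"]  # extraction failures appear only as predictions

y_true = ["REACTANT", "SOLVENT", "CATALYST", "REACTANT"]  # toy ground truth
y_pred = ["REACTANT", "REACTANT", "CATALYST", "MISSING"]  # toy predictions

cm = confusion_matrix(y_true, y_pred, labels=LABELS)
row_totals = cm.sum(axis=1, keepdims=True)
cm_normalized = cm / np.where(row_totals == 0, 1, row_totals)  # normalize by true instances

macro_f1 = f1_score(y_true, y_pred, labels=ROLES, average="macro", zero_division=0)
```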
Fig. 4 Confusion matrices of reaction role classification for the compounds in the test dataset using (A) the fine-tuned model and (B) the popularity baseline. The results for compounds whose role in the dataset varies from reaction to reaction are shown for (C) the fine-tuned model and (D) the baseline model. Percentage values were normalized using the number of true instances. In addition to the three reaction role classes, prediction results can also be labeled as “MISSING” – when the corresponding compound is absent in the extracted ORD record – and “ERROR” – when the name of the extracted compound is incorrect. Note that because reaction role classification depends on correct extraction of compound names, the first two rows of Fig. 4A and B share identical values. The same applies to the first two rows of Fig. 4C and D.
As reaction data can include additional non-textual elements, such as reaction schemes and tables reporting conditions and yields, multimodal models will be needed to fully organize unstructured data. For reaction schemes, recent developments in the field of optical chemical structure recognition have enabled open-source tools to accurately capture chemical entities from raster images. Notable examples include MolScribe59 and RxnScribe60 developed by Barzilay and coworkers, as well as ReactionDataExtractor61,62 by Wilary and Cole. Table parsing/extraction tools have also been developed for the chemistry literature, such as the table parsing module in ChemDataExtractor54 and OpticalTable-SQA,63 a fine-tuned question-answering language model for table extraction. As multimodal foundation models become increasingly available in fields beyond chemistry, it will be worth exploring their suitability for reaction data extraction.
The obvious use of the fine-tuned model is to support reaction data import to ORD, with proper expert validation of the LLM-generated output. For example, it could serve as a postprocessing tool to convert unstructured ELN reports to structured data, or as a reviewing/proofreading tool to expose as structured data what would otherwise be unsearchable, such as the procedure details buried in the supplementary materials of a journal article. The tools presented in this study should contribute to answering the call for standardization in reaction informatics.1,64 As aligning reaction text with molecular representations has been demonstrated to be helpful in prediction tasks, the tool developed in this study could also serve as an auxiliary source of information for reaction prediction models.65
Footnote
† Electronic supplementary information (ESI) available. See DOI: https://doi.org/10.1039/d4dd00091a