DOI: 10.1039/D4TA04665J
(Paper)
J. Mater. Chem. A, 2024, Advance Article

Long-Fei Lv^{a},
Cai-Rong Zhang*^{a},
Rui Cao^{a},
Xiao-Meng Liu^{a},
Mei-Ling Zhang^{a},
Ji-Jun Gong^{a},
Zi-Jiang Liu^{b},
You-Zhi Wu^{c} and
Hong-Shan Chen^{d}
^{a}Department of Applied Physics, Lanzhou University of Technology, Lanzhou, Gansu 730050, China. E-mail: zhcrxy@lut.edu.cn
^{b}School of Mathematics and Physics, Lanzhou Jiaotong University, Lanzhou 730070, China
^{c}School of Materials Science and Engineering, Lanzhou University of Technology, Lanzhou, Gansu 730050, China
^{d}College of Physics and Electronic Engineering, Northwest Normal University, Lanzhou, Gansu 730070, China

Received 5th July 2024, Accepted 31st July 2024

First published on 2nd August 2024

In organic solar cells (OSCs), electron donor–acceptor materials are key factors influencing device performance. However, traditional experimental methods for developing new, high-performance materials are often time-consuming, costly and inefficient. To accelerate the development of novel OSC donor–acceptor materials, we constructed a database of 547 donor–acceptor pairs and derived 30 easily obtainable molecular structure descriptors through transformation and screening. We trained a long short-term memory (LSTM) network, a deep learning model, tuned its hyperparameters by grid search, and used it to predict the power conversion efficiency (PCE), open-circuit voltage, short-circuit current density and fill factor. The SHapley Additive exPlanations analysis revealed that the number of rotatable bonds and the presence of two or more rings in acceptor molecules positively impact PCE. We then systematically fragmented and recombined molecules in the constructed database, creating 142560 donor molecules and 61732 acceptor molecules. The tuned LSTM model predicted photovoltaic parameters for these new donor–acceptor pairs. After excluding the donor–acceptor pairs already in the database, we identified 7632 novel pairs with a predicted PCE greater than 18.00%, including five pairs exceeding 18.50%, with a maximum PCE of 18.52%. This method facilitates the cost-effective design and rapid, accurate prediction of OSC material performance, enabling efficient screening of high-performance candidates.

Traditional fullerene materials, such as PC_{70}BM, PC_{71}BM, and C_{60}, had achieved relatively high power conversion efficiencies (PCEs), making them the mainstream acceptor materials in the OSC field.^{9–13} However, the synthesis of fullerene materials is costly, and their electronic structure leads to poor light absorption in the UV-vis region. This limits their light harvesting efficiency and photovoltaic performance, thereby restricting the further development of fullerenes in OSCs. In contrast, non-fullerene acceptors (NFAs) exhibit broader absorption spectra and more easily tunable energy levels, as well as narrower optical band gaps and greater carrier mobility, which are beneficial for improving OSC performance.^{14–17} Therefore, OSCs using NFAs are considered to have a very promising application prospect.^{18–20}

In recent years, OSCs have developed rapidly, with significant improvements in PCE. The PCE of binary or ternary OSCs using NFAs has reached 19%,^{21–24} and the PCE of tandem OSCs has exceeded 20%.^{25} Layer-by-layer OSCs have also advanced significantly.^{26–29} However, since the PCE of OSCs is still relatively low compared to that of currently commercialized silicon-based and perovskite solar cells, improving the PCE of OSCs remains the primary research goal.

Designing new donor and acceptor materials, particularly those with high PCE, using traditional experimental methods is very challenging. Due to the complexity of chemical composition, conventional methods are time-consuming and labor-intensive. Consequently, scientists have proposed using machine learning to accelerate molecular design.^{30–34} Researchers have utilized machine learning algorithms to analyze a series of performance data from OSCs and discovered that optimizing certain key descriptors can improve the accuracy of prediction models.^{35–37} Sahu et al. introduced methods for predicting the PCE of OSCs using machine learning and the improved descriptors.^{38–40} Han and Yi proposed the singlet–triplet energy gap (ΔE_{ST}) as a key molecular descriptor for predicting PCE, achieving a Pearson correlation coefficient (r) of 0.81 in their predictions.^{41} Saeki and Nagasawa screened conjugated molecules for polymer-fullerene OSC applications through supervised learning methods.^{42} Sun et al. used a database containing actual donor materials collected from the literature, and employed images, ASCII strings, two types of descriptors and seven molecular fingerprints as inputs for machine learning models to predict PCE.^{43} David et al. proposed a machine learning method for extracting data information from OSCs, utilizing a database composed of 1850 device characteristics, performance, and stability data, and employed the Sequential Minimal Optimization Regression (SMOreg) model to identify the factors that have the greatest impact on OSC stability and PCE.^{44} Min et al. applied machine learning analysis to find the optimal donor–acceptor pairs for OSCs. 
They predicted PCE using five machine learning models—Linear Regression (LR), Multiple Logistic Regression (MLR), Random Forest (RF), Artificial Neural Network (ANN), and Boosted Regression Tree (BRT)—on a dataset of 565 polymer donor non-fullerene acceptor OSC pairs, achieving an r of 0.71 and 0.70 for the BRT and RF models, respectively.^{45}

In recent years, deep learning, as a branch of machine learning, has developed rapidly, achieving remarkable results in natural language processing, image recognition, handling complex data and fitting intricate functions.^{46–52} It has been applied in the field of OSCs as well. Peng and Zhao used convolutional neural networks (CNNs), widely applied in deep learning, to build a model that used molecular simplified molecular input line entry system (SMILES) strings as inputs to predict PCE^{53} and the highest occupied molecular orbital (HOMO) and lowest unoccupied molecular orbital (LUMO) energy levels, and to generate new NFA molecules.^{54} They later used CNNs to build another model that used molecular graphs as inputs to predict the HOMO and LUMO energy levels of new molecules.^{55} Moore developed the quantitative structure–property relationship based on a deep learning model in the form of a CNN to predict the HOMO and LUMO energy levels of organic molecules usable in OSCs. The model used the SMILES strings of molecules as inputs, converted them into 2D RGB images, extracted features from the images using the network's convolutional layers, and then used a deep dense neural network to convert the features into energy levels.^{56}

Long short-term memory (LSTM) networks, as one of the important methods in the field of deep learning, are a special type of recurrent neural network (RNN) designed to address the gradient vanishing and exploding problems that a standard RNN may encounter during learning.^{57} By introducing a “gate” mechanism (including an input gate, forget gate and output gate) and memory cells, LSTM can effectively control the input, retention and output of information. In previous studies, some descriptors used for predicting photovoltaic performance parameters through machine learning methods were too costly to compute for high-throughput screening, and the amount of data required for deep learning methods was excessively large. To address this issue, this research uses an LSTM-based deep learning prediction model with easily accessible molecular structure descriptors and a relatively small database. With the aid of the LSTM model, it is possible to simulate and evaluate material performance in a virtual environment, significantly reducing experimental costs and cycles, and accelerating the discovery and application of novel OSC materials.

In the process of constructing an LSTM-based deep learning prediction model to accelerate the discovery of novel OSC materials, relying solely on the predictive capability of the model is often insufficient. To make the model's decision-making process more transparent and to enhance its application value, the SHapley Additive exPlanations (SHAP) analysis method is employed to identify and interpret the importance of various structural descriptors within the model.^{58–63} SHAP is a model interpretation method developed based on the Shapley value from game theory. The Shapley value is a mathematical concept used to quantify each player's marginal contribution to the overall success in a cooperative game. In machine learning models, each feature (such as the structural descriptors of materials) can be viewed as a “player,” and the model's predictive outcome is analogous to the “overall success” of the game. By calculating the Shapley value for each feature, it is possible to quantify the contribution of that feature to the model's predictive outcome, thereby understanding its importance. The use of SHAP analysis not only provides interpretability for the LSTM model but, more importantly, reveals the impact of different structural descriptors on the performance of OSC materials. This is highly valuable for scientists and researchers. For instance, if the analysis shows that a particular molecular structural feature has a significantly positive impact on the PCE of OSCs, this feature can be prioritized in future material design and optimization, thereby more efficiently screening and discovering high-performance OSC materials.
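The Shapley value described above can be illustrated with an exact calculation on a toy function. The sketch below is not the SHAP library and not the model used in this work: `toy_model`, its coefficients and the all-zero baseline are hypothetical, chosen only so that the result can be checked by hand. Each feature's value is its marginal contribution averaged over all orders in which features can join the coalition.

```python
from itertools import permutations
from math import factorial


def shapley_values(predict, x, baseline):
    """Exact Shapley values of predict at x relative to baseline.

    Features outside the coalition keep their baseline value.
    The cost grows factorially, so this is only usable for tiny examples.
    """
    n = len(x)
    phi = [0.0] * n
    for order in permutations(range(n)):
        current = list(baseline)
        prev = predict(current)
        for j in order:
            current[j] = x[j]        # feature j joins the coalition
            new = predict(current)
            phi[j] += new - prev     # marginal contribution of feature j
            prev = new
    return [p / factorial(n) for p in phi]


def toy_model(d):
    # Hypothetical score: three descriptors plus one interaction term.
    return 2.0 * d[0] + 1.0 * d[1] - 0.5 * d[2] + 0.5 * d[0] * d[1]


x = [1.0, 2.0, 4.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(toy_model, x, baseline)
```

The interaction term is split equally between the first two features, and the values sum to the difference between the prediction at `x` and at the baseline, which is the additivity property that SHAP plots rely on.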

To find high-performance novel OSC donor–acceptor materials, in this work, 547 completely different donor–acceptor pair molecular structures and the corresponding OSC performance parameters were collected. The collected molecular structures were converted into structural descriptors, which were screened and used as inputs for the LSTM model to predict OSC performance parameters. After tuning the hyperparameters, a model with good predictive performance was obtained, and the importance of the input descriptors was interpreted using the SHAP analysis method. Next, molecular design and virtual screening were conducted. The molecules in the database were systematically fragmented to create a fragment library. These fragments were then recombined to generate new OSC donor–acceptor materials. The tuned model was used to predict PCE, open-circuit voltage (V_{OC}), short-circuit current density (J_{SC}), and fill factor (FF) of these new OSC materials. Finally, high-performance novel OSC donor–acceptor materials were screened out.

Not all molecular descriptors and the corresponding device performance parameters could be used to build the database, so data preprocessing was required. Since some generated structural descriptors were zero for most molecules, including them would significantly affect the predictive performance of the LSTM model. If more than 80% of the values of a structural descriptor in the database were zero, that descriptor was removed, leaving 51 structural descriptors, 27 related to acceptors and 24 related to donors. Excessive redundant descriptors could introduce noise and affect the LSTM model's predictive performance. To minimize this impact, descriptors for donor and acceptor molecules were screened separately, and their correlations with PCE were analyzed using Pearson correlation coefficients (r). If several descriptors had r values greater than 0.9 with one another, only the descriptor with the highest r value with respect to PCE was retained. In total, 30 structural descriptors were obtained, 19 related to acceptors and 11 related to donors. The meanings of the final 30 descriptors are shown in Tables S1 and S2 in the ESI.† After the structural descriptors were obtained, some molecular pairs had different structures but identical descriptor values due to molecular similarity. In such cases, only the data with the highest PCE were retained, resulting in a final database of 465 donor–acceptor pairs.
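The three preprocessing steps above (zero-fraction filter, correlation-based pruning, removal of duplicate descriptor vectors) can be sketched in plain Python. This is an illustration under stated assumptions, not the authors' script: the function names and toy data are hypothetical, the 0.9 threshold is applied to the absolute mutual correlation between descriptors, and descriptors are visited in order of decreasing absolute correlation with PCE.

```python
from math import sqrt


def pearson(a, b):
    """Pearson correlation coefficient; returns 0.0 for a constant column."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    if va == 0 or vb == 0:
        return 0.0
    return cov / sqrt(va * vb)


def screen_descriptors(columns, pce, zero_frac=0.8, r_max=0.9):
    # Step 1: drop descriptors whose values are zero in more than 80% of rows.
    kept = {name: vals for name, vals in columns.items()
            if sum(v == 0 for v in vals) / len(vals) <= zero_frac}
    # Step 2: visit descriptors from most to least PCE-correlated and keep one
    # only if it is not highly correlated with a descriptor already kept.
    ranked = sorted(kept, key=lambda k: abs(pearson(kept[k], pce)), reverse=True)
    selected = []
    for name in ranked:
        if all(abs(pearson(kept[name], kept[s])) <= r_max for s in selected):
            selected.append(name)
    return selected


def deduplicate(rows, pce):
    """Step 3: for identical descriptor vectors keep only the highest PCE."""
    best = {}
    for vec, p in zip(rows, pce):
        key = tuple(vec)
        if key not in best or p > best[key]:
            best[key] = p
    return best


# Hypothetical toy data: six "pairs" and four candidate descriptors.
pce = [8.0, 10.0, 12.0, 14.0, 16.0, 18.0]
columns = {
    "rings": [1, 2, 3, 4, 5, 6],
    "rings_x2": [2, 4, 6, 8, 10, 13],     # nearly collinear with "rings"
    "mostly_zero": [0, 0, 0, 0, 0, 3],    # zero in 5 of 6 rows
    "halogen": [2, 0, 1, 2, 0, 1],
}
selected = screen_descriptors(columns, pce)
unique = deduplicate([(1, 2), (1, 2), (3, 0)], [10.0, 12.5, 9.0])
```

In this toy case `mostly_zero` is removed by the zero filter and `rings_x2` by the correlation filter, leaving `rings` and `halogen`.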

The forget gate f_{t} is a sigmoid function with the previous cell's hidden output h_{t−1} and the current cell's input x_{t} as inputs, generating a value between [0,1] (which can be considered a probability) for each item in the previous cell's memory state C_{t−1} to control the extent of forgetting the previous cell's state, as shown in eqn (1):

f_{t} = σ(W_{f}[h_{t−1}, x_{t}] + b_{f})
| (1) |

i_{t} = σ(W_{i}[h_{t−1}, x_{t}] + b_{i})
| (2) |

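\tilde{C}_{t} = tanh(W_{C}[h_{t−1}, x_{t}] + b_{C})
| (3) |

C_{t} = f_{t} × C_{t−1} + i_{t} × \tilde{C}_{t}
| (4) |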

The output gate o_{t} controls how much of the current cell state is filtered. The cell state is activated, and the output gate generates a value between [0,1] for each item, controlling the degree of filtering the cell state, as shown in eqn (5) and (6):

o_{t} = σ(W_{o}[h_{t−1}, x_{t}] + b_{o})
| (5) |

h_{t} = o_{t} × tanh(C_{t})
| (6) |
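The gate equations can be traced numerically with a single LSTM step. The sketch below uses plain Python lists and assumes the standard LSTM form for the candidate state and the cell-state update; the names `lstm_step`, `affine` and `params` are hypothetical and the weights are set to zero purely so the result is easy to verify.

```python
from math import exp, tanh


def sigmoid(z):
    return 1.0 / (1.0 + exp(-z))


def affine(W, v, b):
    """Return W @ v + b for a matrix W stored as a list of rows."""
    return [sum(w * x for w, x in zip(row, v)) + bi for row, bi in zip(W, b)]


def lstm_step(x_t, h_prev, c_prev, params):
    hx = h_prev + x_t  # concatenation [h_{t-1}, x_t]
    f_t = [sigmoid(z) for z in affine(params["W_f"], hx, params["b_f"])]  # forget gate
    i_t = [sigmoid(z) for z in affine(params["W_i"], hx, params["b_i"])]  # input gate
    c_hat = [tanh(z) for z in affine(params["W_C"], hx, params["b_C"])]   # candidate state (standard form, assumed)
    c_t = [f * c + i * g for f, c, i, g in zip(f_t, c_prev, i_t, c_hat)]  # cell-state update (standard form, assumed)
    o_t = [sigmoid(z) for z in affine(params["W_o"], hx, params["b_o"])]  # output gate
    h_t = [o * tanh(c) for o, c in zip(o_t, c_t)]                         # hidden output
    return h_t, c_t


# Toy sizes: hidden size 2, input size 3, all weights and biases zero.
hidden, inputs = 2, 3
zeros_W = [[0.0] * (hidden + inputs) for _ in range(hidden)]
zeros_b = [0.0] * hidden
params = {k: zeros_W for k in ("W_f", "W_i", "W_C", "W_o")}
params.update({k: zeros_b for k in ("b_f", "b_i", "b_C", "b_o")})

h_t, c_t = lstm_step([0.3, -0.1, 0.7], [0.0, 0.0], [1.0, -2.0], params)
```

With zero weights every gate evaluates to 0.5 and the candidate state to 0, so the new cell state is half of the previous one and the hidden output is 0.5 × tanh of that value.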

In the model used in this study, a gated linear unit (glu) layer was defined,^{69} and an LSTM model was created, which included LSTM layers, glu layers, dropout layers, and linear layers. The parameters and hyperparameters of the model were set, and the loss function and optimizer were defined: the mean squared error (MSE) was used as the loss function, together with the Adam optimizer. Early stopping was employed: the model's performance on the validation set was monitored during training, and if no improvement was observed within a specified number of training iterations, the training was terminated early. During the training loop, the model underwent forward propagation, loss calculation, backpropagation, and optimization.
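The early-stopping rule can be sketched independently of any deep learning framework. The class below is a hypothetical helper, not the authors' implementation; the patience of three epochs and the list of validation losses are made-up values for illustration.

```python
class EarlyStopping:
    """Signal a stop when the validation loss has not improved for `patience` epochs."""

    def __init__(self, patience=10, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Record one validation loss; return True when training should stop."""
        if val_loss < self.best - self.min_delta:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


# Made-up validation MSE per epoch; in practice each value would come from
# evaluating the model on the validation set after one training epoch.
val_losses = [3.2, 2.9, 2.7, 2.8, 2.75, 2.71, 2.6]
stopper = EarlyStopping(patience=3)
stopped_at = None
for epoch, loss in enumerate(val_losses):
    if stopper.step(loss):
        stopped_at = epoch
        break
```

Here the best loss (2.7) occurs at epoch 2, the next three epochs fail to improve on it, and training stops at epoch 5 before the final value is seen. In a real run the weights from the best epoch would normally be saved and restored.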

A detailed network structure diagram of the model used to predict PCE is provided in Fig. 2, which illustrates the backpropagation process of the LSTM. The main elements include network parameters (weights and biases), computation nodes, and the gradient accumulation process. The detailed explanations of each element are as follows. The lstm.weight_hh_l0, lstm.bias_hh_l0, and lstm.weight_ih_l0 are the parameters of the LSTM layer, and their shapes are (600, 150), (600), and (600, 30), respectively, indicating the dimensions of the parameter matrices. AccumulateGrad: this is a gradient accumulation node, indicating that the gradients corresponding to the parameters are incrementally accumulated during backpropagation. glu.linear1.weight and glu.linear1.bias: these are the weights and biases of the first fully connected linear layer, with shapes (150, 150) and (150), respectively. CudnnRnnBackward: this represents the gradient calculation of the RNN layer implemented using the cuDNN library in CUDA. SelectBackward and TBackward: these are backpropagation nodes for the select and transpose operations. AddmmBackward and SigmoidBackward: these represent the gradient calculation nodes for matrix multiplication and the sigmoid activation function. fc.weight and fc.bias: these are the weights and biases of the final fully connected layer (fc), with shapes (1, 150) and (1), respectively. MulBackward: this represents the backpropagation of the multiplication operation. NativeDropoutBackward: this represents the backpropagation of the dropout regularization layer. Ultimately, all these computations and gradient accumulations converge to a final output.
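The quoted LSTM shapes follow from the stacked-gate layout: with 30 input descriptors and a hidden size of 150, the four gate blocks give 4 × 150 = 600 rows. A minimal arithmetic check, assuming the parameter layout of a single unidirectional PyTorch `nn.LSTM` layer (the helper name is hypothetical):

```python
def lstm_param_shapes(input_size, hidden_size):
    """Parameter shapes of one unidirectional LSTM layer in the PyTorch layout."""
    gates = 4  # input, forget, cell-candidate and output blocks stacked row-wise
    return {
        "weight_ih_l0": (gates * hidden_size, input_size),
        "weight_hh_l0": (gates * hidden_size, hidden_size),
        "bias_ih_l0": (gates * hidden_size,),
        "bias_hh_l0": (gates * hidden_size,),
    }


shapes = lstm_param_shapes(input_size=30, hidden_size=150)

# Total number of trainable values in this one layer.
n_params = sum(
    shape[0] * (shape[1] if len(shape) == 2 else 1) for shape in shapes.values()
)
```

The input size of 30 matches the number of retained structural descriptors, and the hidden size of 150 matches the (150, 150) glu layer and the (1, 150) final layer.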

For hyperparameter tuning, the grid search method was used to find the optimal hyperparameters, which were tuned separately for each of the four device performance parameters. After the optimal model was identified, it was evaluated using the MSE, the root mean squared error (RMSE), the mean absolute error (MAE), the Pearson correlation coefficient (r), and the coefficient of determination (R^{2}), defined as follows:

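MSE = \frac{1}{N}\sum_{i=1}^{N}(R_{i} - P_{i})^{2}
| (7) |

RMSE = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(R_{i} - P_{i})^{2}}
| (8) |

MAE = \frac{1}{N}\sum_{i=1}^{N}|R_{i} - P_{i}|
| (9) |

r = \frac{\sum_{i=1}^{N}(R_{i} - \bar{R})(P_{i} - \bar{P})}{\sqrt{\sum_{i=1}^{N}(R_{i} - \bar{R})^{2}\sum_{i=1}^{N}(P_{i} - \bar{P})^{2}}}
| (10) |

R^{2} = 1 - \frac{MSE}{var(R_{i})}
| (11) |

Eqn (7)–(11) define the evaluation metrics, where N is the number of data points in the dataset; R_{i} and P_{i} represent the actual values and predicted values, respectively; \bar{R} and \bar{P} represent the means of the actual values and predicted values, respectively; and var(R_{i}) is the variance of the sample data. These metrics are used to assess the accuracy of the trained models in predicting the performance of OSC devices.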
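The five metrics can be computed in a few lines of plain Python. The sketch below assumes R^{2} is taken as one minus the ratio of the MSE to the population variance of the actual values; the function name is hypothetical. The example inputs are the experimental and predicted PCE values of the five out-of-database pairs listed in Table 3.

```python
from math import sqrt


def regression_metrics(actual, predicted):
    """Return MSE, RMSE, MAE, Pearson r and R^2 for paired value lists."""
    n = len(actual)
    mean_r = sum(actual) / n
    mean_p = sum(predicted) / n
    mse = sum((r - p) ** 2 for r, p in zip(actual, predicted)) / n
    mae = sum(abs(r - p) for r, p in zip(actual, predicted)) / n
    cov = sum((r - mean_r) * (p - mean_p) for r, p in zip(actual, predicted))
    ss_r = sum((r - mean_r) ** 2 for r in actual)
    ss_p = sum((p - mean_p) ** 2 for p in predicted)
    return {
        "MSE": mse,
        "RMSE": sqrt(mse),
        "MAE": mae,
        "r": cov / sqrt(ss_r * ss_p),
        "R2": 1.0 - mse / (ss_r / n),  # assumes population variance of actual values
    }


# Experimental and predicted PCE (%) of the five pairs in Table 3.
experimental = [16.30, 11.25, 11.00, 10.75, 3.43]
predicted = [15.36, 11.98, 12.73, 10.27, 4.10]
metrics = regression_metrics(experimental, predicted)
```

For these five pairs the MAE is the mean of the tabulated absolute errors, 0.91%, and the RMSE is by construction the square root of the MSE.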

All model building and training were performed using PyTorch,^{70} which is a Python-based scientific computing package primarily designed to meet the needs of deep learning. It is one of the most popular tools in the field of deep learning. PyTorch offers a rich library of deep learning algorithms and a flexible design mechanism, supporting features such as automatic differentiation, dynamic computation graphs, and model visualization, enabling users to build and train models more easily and efficiently. All programming and execution were completed using PyTorch 1.12 and RDKit 2023.03.2 within the Python 3.9 environment on the Anaconda platform.
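The grid search used for hyperparameter tuning amounts to an exhaustive loop over all combinations of candidate values. The sketch below is framework-agnostic and hypothetical throughout: the grid entries and `mock_validation_mse` are stand-ins (the tuned values are reported in Table S3), and in a real run the scoring function would train the LSTM with early stopping and return its validation MSE.

```python
from itertools import product


def grid_search(grid, evaluate):
    """Try every combination in `grid`; return (best_params, best_score), lower is better."""
    names = list(grid)
    best_params, best_score = None, float("inf")
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = evaluate(params)
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Hypothetical search space, not the grid used in the paper.
grid = {
    "hidden_size": [50, 100, 150],
    "dropout": [0.1, 0.3],
    "lr": [1e-3, 1e-2],
}


def mock_validation_mse(p):
    # Stand-in for "train the model with these settings and return validation MSE".
    return (abs(p["hidden_size"] - 150) / 100
            + abs(p["dropout"] - 0.3)
            + abs(p["lr"] - 1e-3) * 10)


best_params, best_score = grid_search(grid, mock_validation_mse)
n_combinations = len(list(product(*grid.values())))
```

Running the search once per target (PCE, V_{OC}, J_{SC} and FF) gives a separate set of hyperparameters for each of the four models.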

For donor molecules, 36 D, 22 π, and 33 A were obtained. For acceptor molecules, 44 D, 23 π, and 61 A were obtained. In recent years, researchers have been dedicated to exploring new molecular design and synthesis methods to improve the PCE and stability of OSCs. Among them, donor molecules with a D–π–A–π structure have been widely used in OSCs due to their broad absorption range, which helps in exciton dissociation and reduces electron and hole recombination, thereby improving PCE and charge carrier mobility.^{71} On the other hand, acceptor molecules with an A–π–D–π–A structure exhibit high design flexibility, allowing the tuning of optical absorption properties and energy levels by modifying the chemical structure, thereby optimizing device performance.^{72} After segmentation, donor molecules were combined according to the D–π–A–π format, resulting in 142560 donor molecules, while acceptor molecules were symmetrically combined according to the A–π–D–π–A format, resulting in 61732 symmetrical acceptor molecules. This design approach allows systematic exploration and generation of a large number of potential novel OSC materials.
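The symmetric acceptor enumeration can be reproduced as a Cartesian product over the fragment library. In the sketch below the fragment labels are placeholders rather than real SMILES fragments, and the function name is hypothetical; only the fragment counts (44 D, 23 π and 61 A for acceptors) are taken from the text.

```python
from itertools import product

# Placeholder labels standing in for the fragment SMILES of the acceptor library.
donor_units = [f"D{i}" for i in range(44)]
pi_bridges = [f"pi{i}" for i in range(23)]
acceptor_units = [f"A{i}" for i in range(61)]


def symmetric_acceptors(d_units, pi_units, a_units):
    """Yield A-pi-D-pi-A fragment tuples with identical arms on both sides."""
    for a, p, d in product(a_units, pi_units, d_units):
        yield (a, p, d, p, a)


acceptors = list(symmetric_acceptors(donor_units, pi_bridges, acceptor_units))
```

The product 44 × 23 × 61 gives 61732 combinations, matching the reported number of symmetric acceptor molecules. Turning each tuple into a valid molecule additionally requires joining the fragments at their attachment points, which is not shown here.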

Heat maps analyzing the descriptors of input models for acceptor and donor molecules are shown in Fig. 4 and 5, respectively. In the heat maps, the color intensity represents the strength of the correlation, with red indicating a positive correlation, blue indicating a negative correlation, and deeper colors representing stronger correlations. The correlation coefficient values range from −1 (completely negative correlation) to +1 (completely positive correlation). In Heat Map_A, the correlation between acceptor descriptors and organic photovoltaic performance parameters is illustrated. The number of halogen groups in the acceptor (fr_halogen_A), the number of two or more rings in the acceptor (fr_bicyclic_A), and the number of ketone groups in the acceptor (fr_ketone_A) show strong positive correlations with PCE, and also notable positive correlations with J_{SC} and FF, suggesting that these molecular features may significantly impact photovoltaic performance. In Heat Map_D, the correlation between donor descriptors and photovoltaic performance parameters is depicted. PCE and J_{SC} have the strongest positive correlations with the number of rings contained in the donor molecule (RingCount_D). V_{OC} does not show significant correlation with any donor or acceptor descriptors, with the highest correlation descriptor being the number of rings in the donor molecule (RingCount_D).

Fig. 4 Heatmap of acceptor descriptors and their correlations with PCE, V_{OC}, J_{SC}, and FF in the database. |

Fig. 5 Heatmap of donor descriptors and their correlations with PCE, V_{OC}, J_{SC}, and FF in the database. |

After the database was divided into a training set (374 data points) and a test set (91 data points), models were trained and tested for the four device performance parameters: PCE, V_{OC}, J_{SC}, and FF. Early stopping and grid search were used to select the optimal model while avoiding overfitting, giving the model with the best predictive performance and generalization ability. The hyperparameters of the LSTM model tuned for PCE, V_{OC}, J_{SC}, and FF are shown in Table S3.† The MSE, RMSE, MAE, r, and R^{2} of the tuned LSTM model for these parameters are shown in Table 1. A smaller MSE value indicates higher prediction accuracy of the model. Since RMSE shares the same units as the prediction target, its results are easier to interpret. MAE is the average absolute difference between predicted and actual values. Compared to MSE or RMSE, MAE is less sensitive to outliers because it does not square the errors. MAE provides an intuitive understanding of the magnitude of prediction errors, with smaller values indicating more accurate predictions. The correlation coefficient r measures the strength and direction of the linear relationship between two variables. In regression tasks, it can be used to assess the degree of correlation between predicted and actual values, ranging from −1 to 1, with values close to 1 or −1 indicating strong correlation and values close to 0 indicating no linear correlation. R^{2} reflects the goodness of fit of the model predictions to actual values. It is calculated from the ratio of the prediction error to the variance of the original data and can be interpreted as the proportion of the variance explained by the model. R^{2} is at most 1 and can become negative for a model that predicts worse than the mean of the data; values closer to 1 indicate higher explanatory power and better predictive performance. For PCE, the high r values of 0.9446 in the training set and 0.9179 in the test set indicate a strong correlation between observed and predicted PCE values, with R^{2} values of 0.8916 in the training set and 0.8414 in the test set indicating good model accuracy. Low values of RMSE, MAE, and MSE further demonstrate the model's precision. For J_{SC}, the accuracy is similar to that of PCE, with only a slight drop in precision. The predictive ability for FF and V_{OC} (test-set r of 0.7801 and 0.7239, respectively) is lower than that for PCE and J_{SC} but remains reasonable. Compared to previous work by other researchers, who used the RF model to predict PCE with an R^{2} level close to 0.7 and an r level around 0.8,^{36,37} this study demonstrates superior results.

Device parameters | Evaluation metrics | Training set value | Test set value |
---|---|---|---|
PCE | r | 0.9446 | 0.9179 |
PCE | R^{2} | 0.8916 | 0.8414 |
PCE | RMSE | 1.4815 | 1.8105 |
PCE | MAE | 1.0434 | 1.4189 |
PCE | MSE | 2.1949 | 3.2778 |
J_{SC} | r | 0.9438 | 0.9040 |
J_{SC} | R^{2} | 0.8885 | 0.8138 |
J_{SC} | RMSE | 2.2724 | 3.0389 |
J_{SC} | MAE | 1.6955 | 2.2719 |
J_{SC} | MSE | 5.1639 | 9.2346 |
V_{OC} | r | 0.7190 | 0.7239 |
V_{OC} | R^{2} | 0.5108 | 0.5159 |
V_{OC} | RMSE | 0.0903 | 0.0981 |
V_{OC} | MAE | 0.0625 | 0.0745 |
V_{OC} | MSE | 0.0082 | 0.0096 |
FF | r | 0.7949 | 0.7801 |
FF | R^{2} | 0.6235 | 0.5937 |
FF | RMSE (in %) | 7.9367 | 8.4015 |
FF | MAE (in %) | 6.0239 | 6.5947 |
FF | MSE (in %) | 58.3198 | 70.5858 |

As shown in the scatter plots in Fig. 6, the relationship between experimental values and predicted values in the training and test sets can be visually compared. Blue triangles represent data points from the training set, while red circles represent data points from the test set. The x-axis denotes experimental values, and the y-axis denotes predicted values. Additionally, fitted lines shown in blue and red illustrate the fit between predicted and experimental values for the training and test sets, respectively. If the points tend to fall along a straight line with a slope close to 1, it indicates accurate predictions, visually representing the model's performance on these datasets. The scatter plots also label the R^{2} and r values for both the training and test sets, which are crucial for evaluating the model's overall performance. This approach not only quantitatively evaluates the model's performance on the datasets but also provides a visual understanding of the relationship between predicted and experimental values. The close alignment of training and test set performance across the four plots indicates that the model does not overfit and generalizes well to new data. The high R^{2} and r values for all four parameters suggest that the model effectively captures the relationship between predicted and actual values, with closely clustered points around the best-fit line, particularly in the PCE and J_{SC} plots, demonstrating the model's strong predictive capability.

D:A | Experimental PCE (%) | Predicted PCE (%) | Absolute error (%) |
---|---|---|---|
PM6:L8-BO | 18.50 | 16.88 | 1.62 |
PB[N][F]:Y6 | 14.10 | 13.29 | 0.81 |
PM6:Y18 | 16.02 | 16.28 | 0.26 |
PTB7-Th:DTC-F-F | 7.53 | 6.67 | 0.86 |
PBDB-T:sp-mOEh-ITIC | 6.44 | 6.40 | 0.04 |

To validate the model's generalization ability, five reported donor–acceptor pairs outside the database were selected as shown in Table 3, including D18:L8-BO,^{81} PTQ10:ITIC-4F,^{82} PTB7-Th:Y6,^{83} PM6:ID-C6Ph-4F^{84} and PffBT4T-2OD:P(4CF8CH-PDI-TT).^{85} The PCE prediction model was used to predict these donor–acceptor pairs, and the absolute errors were 0.94%, 0.73%, 1.73%, 0.48%, and 0.67%, respectively, indicating that the trained model has good generalization ability. The prediction results of the V_{OC}, J_{SC}, and FF for the five donor–acceptor pairs both inside and outside the database are provided in Tables S4 and S5.† The validation results indicate that the trained model has high accuracy in predicting the performance parameters of OSC devices and exhibits good generalization capability.

D:A | Experimental PCE (%) | Predicted PCE (%) | Absolute error (%) |
---|---|---|---|
D18:L8-BO | 16.30 | 15.36 | 0.94 |
PTQ10:ITIC-4F | 11.25 | 11.98 | 0.73 |
PTB7-Th:Y6 | 11.00 | 12.73 | 1.73 |
PM6:ID-C6Ph-4F | 10.75 | 10.27 | 0.48 |
PffBT4T-2OD:P(4CF8CH-PDI-TT) | 3.43 | 4.10 | 0.67 |

Fig. 7 SHAP importance analysis of the 30 molecular structure descriptors used in the LSTM model for PCE prediction: (a) bar chart and (b) scatter plot. |

From Fig. 7(a), it can be seen that the eight descriptors with the most significant impact on PCE prediction are fr_bicyclic_A (number of two or more rings in the acceptor), NumRotatableBonds_A (number of rotatable bonds in the acceptor molecule), fr_unbrch_alkane_A (number of unbranched aliphatic groups in the acceptor), NumAromaticCarbocycles_A (number of aromatic carbocyclic rings in the acceptor molecule), fr_halogen_A (number of halogen groups in the acceptor molecule), NumAliphaticCarbocycles_A (number of alicyclic alkyl rings in the acceptor molecule), fr_unbrch_alkane_D (number of unbranched aliphatic groups in the donor molecule), and fr_halogen_D (number of halogen groups in the donor molecule). Fig. 7(b) shows that fr_bicyclic_A, NumRotatableBonds_A, fr_halogen_A, NumAliphaticCarbocycles_A, and fr_halogen_D are positively correlated with PCE, while fr_unbrch_alkane_A, NumAromaticCarbocycles_A, and fr_unbrch_alkane_D are negatively correlated with PCE. The descriptor with the most significant impact on J_{SC} is fr_bicyclic_A, showing a clear positive correlation. Suthar's study pointed out that the number of bicyclic structures in a molecule has a significant positive impact on both PCE and J_{SC},^{86} which is consistent with the results of this study. Zhang and He et al. reported that, by changing the linear configuration of the alkyl substituents on the thiophene ring and using the polymer donor PBDB-TF, the PCE of BTIC-TCl-b with branched side chains reached 16.17%, significantly higher than that of BTIC-TCl-l with unbranched aliphatic chains.^{87} Their result indirectly supports the finding of this study that fr_unbrch_alkane_A has a negative effect on PCE. For V_{OC}, the descriptor with the most significant impact is fr_allylic_oxid_A (number of allylic oxide groups in the acceptor molecule), showing a negative correlation. For the FF, the descriptor with the most significant impact is NumRotatableBonds_A, showing a positive correlation.

After prediction, high-performance donor–acceptor pairs were selected, and donor–acceptor pairs from the database were then deleted. For PCE, 7632 donor–acceptor pairs with PCE greater than 18.00% were obtained, with the highest PCE being 18.52%. There were five donor–acceptor pairs with PCE greater than 18.50%, and their structures are shown in Fig. 8. Each molecule of the donor contains halogen atoms, and each molecule of the acceptor has a fused ring structure and also includes halogen atoms, consistent with the obtained SHAP analysis. For V_{OC}, 888 donor–acceptor pairs with V_{OC} greater than 1.40 V were obtained, with the highest V_{OC} being 1.43 V. For J_{SC}, 17767 donor–acceptor pairs with J_{SC} greater than 25.50 mA cm^{−2} were obtained, with the highest J_{SC} being 25.95 mA cm^{−2}. For the FF, 150 donor–acceptor pairs with FF greater than 81.00% were obtained, with the highest FF being 82.22%. The structures of these donor–acceptor pairs with the highest predicted values are shown in Fig. 9.
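The selection step described above is a threshold filter followed by removal of pairs already present in the database. A minimal sketch, in which the pair labels and predicted values are made-up placeholders and `screen_pairs` is a hypothetical helper:

```python
def screen_pairs(predictions, known_pairs, threshold):
    """Keep predicted pairs above `threshold` that are not already in the database."""
    hits = [(pair, value) for pair, value in predictions.items()
            if value > threshold and pair not in known_pairs]
    return sorted(hits, key=lambda item: item[1], reverse=True)


# Made-up predictions keyed by (donor, acceptor) identifiers.
predicted_pce = {
    ("donor_001", "acceptor_007"): 18.4,
    ("donor_002", "acceptor_003"): 18.1,
    ("donor_003", "acceptor_009"): 17.4,
    ("donor_004", "acceptor_001"): 18.3,
}
database_pairs = {("donor_004", "acceptor_001")}  # already reported, so excluded

novel_hits = screen_pairs(predicted_pce, database_pairs, threshold=18.00)
```

The same function applies to V_{OC}, J_{SC} and FF by swapping in the corresponding predictions and thresholds (1.40 V, 25.50 mA cm^{−2} and 81.00% in the text).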

The SMILES strings of the donor–acceptor pairs and the corresponding prediction results for PCE, V_{OC}, J_{SC}, and FF are stored in the attached files. pre_PCE.csv contains the SMILES strings and PCE prediction results, pre_Voc.csv contains the SMILES strings and V_{OC} prediction results, pre_Jsc.csv contains the SMILES strings and J_{SC} prediction results, and pre_FF.csv contains the SMILES strings and FF prediction results.

- N. Armaroli and V. Balzani, Angew. Chem., Int. Ed., 2006, 46, 52–66.
- K. A. Mazzio and C. K. Luscombe, Chem. Soc. Rev., 2015, 44, 78–90.
- P. Cheng, G. Li, X. Zhan and Y. Yang, Nat. Photonics, 2018, 12, 131–142.
- Y. Cui, P. Zhu, X. Liao and Y. Chen, J. Mater. Chem. C, 2020, 8, 15920–15939.
- O. Inganäs, Adv. Mater., 2018, 30, 1800388.
- L. Lu, T. Zheng, Q. Wu, A. M. Schneider, D. Zhao and L. Yu, Chem. Rev., 2015, 115, 12666–12731.
- H. Chen, Y. Zou, H. Liang, T. He, X. Xu, Y. Zhang, Z. Ma, J. Wang, M. Zhang, Q. Li, C. Li, G. Long, X. Wan, Z. Yao and Y. Chen, Sci. China: Chem., 2022, 65, 1362–1373.
- H. Liu, Y. Geng, Z. Xiao, L. Ding, J. Du, A. Tang and E. Zhou, Adv. Mater., 2024, DOI: 10.1002/adma.202404660.
- J. Hachmann, R. Olivares-Amaya, A. Jinich, A. L. Appleton, M. A. Blood-Forsythe, L. R. Seress, C. Román-Salgado, K. Trepte, S. Atahan-Evrenk, S. Er, S. Shrestha, R. Mondal, A. Sokolov, Z. Bao and A. Aspuru-Guzik, Energy Environ. Sci., 2014, 7, 698–704.
- I. Y. Kanal, S. G. Owens, J. S. Bechtel and G. R. Hutchison, J. Phys. Chem. Lett., 2013, 4, 1613–1623.
- A. Mishra and P. Bäuerle, Angew. Chem., Int. Ed., 2012, 51, 2020–2067.
- M. C. Scharber, D. Mühlbacher, M. Koppe, P. Denk, C. Waldauf, A. J. Heeger and C. J. Brabec, Adv. Mater., 2006, 18, 789–794.
- T. Yagi, R. Satoh, Y. Yamada, H. Kang, H. Miyao and K. Sawa, J. Soc. Inf. Disp., 2012, 20, 526–532.
- X. Jiaxuan, Cluster Comput., 2018, 22, 4829–4835.
- Y. Q. Pan and G. Y. Sun, ChemSusChem, 2019, 12, 4570–4600.
- C. Yan, S. Barlow, Z. Wang, H. Yan, A. K. Y. Jen, S. R. Marder and X. Zhan, Nat. Rev. Mater., 2018, 3, 18003.
- J. Zhang, H. S. Tan, X. Guo, A. Facchetti and H. Yan, Nat. Energy, 2018, 3, 720–731.
- L. Ma, C. R. Zhang, M. L. Zhang, X. M. Liu, J. J. Gong, Y. H. Chen, Z. J. Liu, Y. Z. Wu and H. S. Chen, Adv. Theory Simul., 2023, 7, 2300624.
- H.-Y. Yu, C.-R. Zhang, M.-L. Zhang, X.-M. Liu, J.-J. Gong, Z.-J. Liu, Y.-Z. Wu and H.-S. Chen, New J. Chem., 2022, 46, 20204–20216.
- M. Zhao, C. R. Zhang, M. L. Zhang, X. M. Liu, J. J. Gong, Z. J. Liu, Y. H. Chen and H. S. Chen, Int. J. Quantum Chem., 2022, 123, e27047.
- Z. Gan, L. Wang, J. Cai, C. Guo, C. Chen, D. Li, Y. Fu, B. Zhou, Y. Sun, C. Liu, J. Zhou, D. Liu, W. Li and T. Wang, Nat. Commun., 2023, 14, 6297.
- J. Song, C. Zhang, C. Li, J. Qiao, J. Yu, J. Gao, X. Wang, X. Hao, Z. Tang, G. Lu, R. Yang, H. Yan and Y. Sun, Angew. Chem., Int. Ed., 2024, 63, e202404297.
- P. Wang, J. Zhang, D. Luo, J. Xue, L. Zhang, H. Mao, Y. Wang, C. Yu, W. Ma and Y. Chen, Adv. Funct. Mater., 2024, DOI: 10.1002/adfm.202402680.
- Q. Xie, X. Deng, C. Zhao, J. Fang, D. Xia, Y. Zhang, F. Ding, J. Wang, M. Li, Z. Zhang, C. Xiao, X. Liao, L. Jiang, B. Huang, R. Dai and W. Li, Angew. Chem., Int. Ed., 2024, 63, e202403015.
- J. Wang, Z. Zheng, P. Bi, Z. Chen, Y. Wang, X. Liu, S. Zhang, X. Hao, M. Zhang, Y. Li and J. Hou, Natl. Sci. Rev., 2023, 10, nwad085.
- H. Tian, Y. Ni, W. Zhang, Y. Xu, B. Zheng, S. Y. Jeong, S. Wu, Z. Ma, X. Du, X. Hao, H. Y. Woo, L. Huo, X. Ma and F. Zhang, Energy Environ. Sci., 2024, 17, 5173–5182.
- W. Xu, H. Tian, Y. Ni, Y. Xu, L. Zhang, F. Zhang, S. Wu, S. Y. Jeong, T. Huang, X. Du, X. Li, Z. Ma, H. Young Woo, J. Zhang, X. Ma, J. Wang and F. Zhang, Chem. Eng. J., 2024, 493, 152558.
- L. Zhang, M. Zhang, Y. Ni, W. Xu, H. Zhou, S. Ke, H. Tian, S. Y. Jeong, H. Y. Woo, W.-Y. Wong, X. Ma and F. Zhang, ACS Mater. Lett., 2024, 6, 2964–2973.
- H. Zhou, Y. Sun, M. Zhang, Y. Ni, F. Zhang, S. Y. Jeong, T. Huang, X. Li, H. Y. Woo, J. Zhang, W. Y. Wong, X. Ma and F. Zhang, Sci. Bull., 2024, DOI: 10.1016/j.scib.2024.07.027.
- X. Cai, Y. Chen, B. Sun, J. Chen, H. Wang, Y. Ni, L. Tao, H. Wang, S. Zhu, X. Li, Y. Wang, J. Lv, X. Feng, S. A. T. Redfern and Z. Chen, Nanoscale, 2019, 11, 8260–8269.
- C. Chen, Y. Zuo, W. Ye, X. Li, Z. Deng and S. P. Ong, Adv. Energy Mater., 2020, 10, 1903242.
- Y. Chen, Z. Lao, B. Sun, X. Feng, S. A. T. Redfern, H. Liu, J. Lv, H. Wang and Z. Chen, ACS Mater. Lett., 2019, 1, 375–382.
- S.-S. Wan, X. Xu, Z. Jiang, J. Yuan, A. Mahmood, G.-Z. Yuan, K.-K. Liu, W. Ma, Q. Peng and J.-L. Wang, J. Mater. Chem. A, 2020, 8, 4856–4867 RSC .
- A. Mahmood and J.-L. Wang, Energy Environ. Sci., 2021, 14, 90–105 RSC .
- J.-H. Li, C.-R. Zhang, M.-L. Zhang, X.-M. Liu, J.-J. Gong, Y.-H. Chen, Z.-J. Liu, Y.-Z. Wu and H.-S. Chen, Org. Electron., 2024, 125, 106988 CrossRef CAS .
- M. Li, C. R. Zhang, M. L. Zhang, J. J. Gong, X. M. Liu, Y. H. Chen, Z. J. Liu, Y. Z. Wu and H. S. Chen, Phys. Status Solidi A, 2024, 221, 2400008 CrossRef CAS .
- C.-R. Zhang, M. Li, M. Zhao, J.-J. Gong, X.-M. Liu, Y.-H. Chen, Z.-J. Liu, Y.-Z. Wu and H.-S. Chen, J. Appl. Phys., 2023, 134, 153104 CrossRef CAS .
- H. Sahu and H. Ma, J. Phys. Chem. Lett., 2019, 10, 7277–7284 CrossRef CAS PubMed .
- H. Sahu, W. Rao, A. Troisi and H. Ma, Adv. Energy Mater., 2018, 8, 1801032 CrossRef .
- H. Sahu, F. Yang, X. Ye, J. Ma, W. Fang and H. Ma, J. Mater. Chem. A, 2019, 7, 17480–17488 RSC .
- G. Han and Y. Yi, Angew. Chem., Int. Ed., 2022, 61, e202213953 CrossRef CAS PubMed .
- S. Nagasawa, E. Al-Naamani and A. Saeki, J. Phys. Chem. Lett., 2018, 9, 2639–2646 CrossRef CAS .
- W. Sun, Y. Zheng, K. Yang, Q. Zhang, A. A. Shah, Z. Wu, Y. Sun, L. Feng, D. Chen, Z. Xiao, S. Lu, Y. Li and K. Sun, Sci. Adv., 2019, 5, eaay4275 CrossRef CAS .
- T. W. David, H. Anizelli, T. J. Jacobsson, C. Gray, W. Teahan and J. Kettle, Nano Energy, 2020, 78, 105342 CrossRef CAS .
- Y. Wu, J. Guo, R. Sun and J. Min, npj Comput. Mater., 2020, 6, 120 CrossRef CAS .
- J. Huang, B. Li, J. Zhu and J. Chen, Multimed. Tool. Appl., 2017, 76, 20231–20247 CrossRef .
- H. Li, P. He, S. Wang, A. Rocha, X. Jiang and A. C. Kot, IEEE Trans. Inf. Forensics Secur., 2018, 13, 2639–2652 Search PubMed .
- Y. Liu, K. Wang, C. Zong and K.-Y. Su, Comput. Speech Lang., 2019, 55, 216 CrossRef .
- T. Lu, Y. Wang, R. Xu, W. Liu, W. Fang and Y. Zhang, Multimed. Tool. Appl., 2022, 81, 6305–6330 CrossRef .
- A. Majumdar, R. Singh and M. Vatsa, IEEE Trans. Pattern Anal. Mach. Intell., 2017, 39, 1273–1280 Search PubMed .
- R. Wadawadagi and V. Pagi, Artif. Intell. Rev., 2020, 53, 6155–6195 CrossRef .
- Z. Zhang, P. Luo, C. C. Loy and X. Tang, IEEE Trans. Pattern Anal. Mach. Intell., 2016, 38, 918–930 Search PubMed .
- D. Weininger, J. Chem. Inf. Comput. Sci., 1988, 28, 31–36 CrossRef CAS .
- S.-P. Peng and Y. Zhao, J. Chem. Inf. Model., 2019, 59, 4993–5001 CrossRef CAS PubMed .
- S.-P. Peng, X.-Y. Yang and Y. Zhao, Int. J. Mol. Sci., 2021, 22, 9099 CrossRef CAS PubMed .
- G. J. Moore, O. Bardagot and N. Banerji, Adv. Theory Simul., 2022, 5, 2100511 CrossRef CAS .
- S. Hochreiter and J. Schmidhuber, Neural Comput., 1997, 9, 1735–1780 CrossRef CAS PubMed .
- A. Datta, S. Sen and Y. Zick, Presented in Part at the 2016 IEEE Symposium on Security and Privacy (SP), 2016 Search PubMed .
- S. Lipovetsky and M. Conklin, Appl. Stoch Model Bus. Ind., 2001, 17, 319–330 CrossRef .
- M. T. Ribeiro, S. Singh and C. Guestrin, Presented in Part at the Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2016 Search PubMed .
- E. Štrumbelj and I. Kononenko, Knowl. Inf. Syst., 2013, 41, 647–665 CrossRef .
- O. D. Suarez, S. Bach, A. Binder, G. Montavon, F. Klauschen, K.-R. Müller and W. Samek, PLoS One, 2015, 10, e0130140 CrossRef .
- A. Shrikumar, P. Greenside and A. Kundaje, arXiv, 2017, preprint, arXiv:1704.02685, DOI:10.48550/arXiv.1704.02685.
- RDKit: Open-source Cheminformatics, https://www.rdkit.org/, accessed March 25, 2024 Search PubMed.
- G. Long, A. Li, R. Shi, Y. C. Zhou, X. Yang, Y. Zuo, W. R. Wu, U. S. Jeng, Y. Wang, X. Wan, P. Shen, H. L. Zhang, T. Yan and Y. Chen, Adv. Electron. Mater., 2015, 1, 1500217 CrossRef .
- G. Long, R. Shi, Y. Zhou, A. Li, B. Kan, W.-R. Wu, U. S. Jeng, T. Xu, T. Yan, M. Zhang, X. Yang, X. Ke, L. Sun, A. Gray-Weale, X. Wan, H. Zhang, C. Li, Y. Wang and Y. Chen, J. Phys. Chem. C, 2017, 121, 5864–5870 CrossRef CAS .
- G. Long, B. Wu, A. Solanki, X. Yang, B. Kan, X. Liu, D. Wu, Z. Xu, W. R. Wu, U. S. Jeng, J. Lin, M. Li, Y. Wang, X. Wan, T. C. Sum and Y. Chen, Adv. Energy Mater., 2016, 6, 1600961 CrossRef .
- Y. Zhou, G. Long, A. Li, A. Gray-Weale, Y. Chen and T. Yan, J. Mater. Chem. C, 2018, 6, 3276–3287 RSC .
- Y. N. Dauphin, A. Fan, M. Auli and D. Grangier, arXiv, 2016, preprint, arXiv:1612.08083, DOI:10.48550/arXiv.1612.08083.
- A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Köpf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai and S. Chintala, arXiv, 2019, preprint, arXiv:1912.01703, DOI:10.48550/arXiv.1912.01703.
- S. E. Ozturk, R. Isci, S. Faraji, B. Sütay, L. A. Majewski and T. Ozturk, Eur. Polym. J., 2023, 191, 112028 CrossRef CAS .
- H. Gao, C. Han, X. Wan and Y. Chen, Ind. Chem. Mater., 2023, 1, 60–78 RSC .
- H. A. Afan, A. Yafouz, A. H. Birima, A. N. Ahmed, O. Kisi, B. Chaplot and A. El-Shafie, Nat. Hazards, 2022, 112, 1527–1545 CrossRef .
- C. Lu, W. Ma, R. Wang, S. Deng and Y. Wu, Complex Intell. Systems, 2022, 9, 2081–2099 CrossRef .
- J. Sadaiyandi, P. Arumugam, A. K. Sangaiah and C. Zhang, Electronics, 2023, 12, 4423 CrossRef .
- Z. Chen, Q. Li, Y. Jiang, H. Lee, T. P. Russell and Y. Liu, J. Mater. Chem. A, 2022, 10, 16163–16170 RSC .
- Z. Cao, J. Chen, S. Liu, X. Jiao, S. Ma, J. Zhao, Q. Li, Y.-P. Cai and F. Huang, ACS Appl. Mater. Interfaces, 2020, 12, 9545–9554 CrossRef CAS PubMed .
- C. Zhang, J. Yuan, K. L. Chiu, H. Yin, W. Liu, G. Zheng, J. K. W. Ho, S. Huang, G. Yu, F. Gao, Y. Zou and S. K. So, J. Mater. Chem. A, 2020, 8, 8566–8574 RSC .
- J. Liao, P. Zheng, Z. Cai, S. Shen, G. Xu, H. Zhao and Y. Xu, Org. Electron., 2021, 89, 106026 CrossRef CAS .
- M. j. Sung, B. Park, J. Y. Choi, J. Kim, C. Sun, H. Kang, S. Kwon, S.-Y. Jang, Y.-H. Kim, K. Lee and S.-K. Kwon, Dyes Pigm., 2020, 180, 108369 CrossRef CAS .
- D. Li, N. Deng, Y. Fu, C. Guo, B. Zhou, L. Wang, J. Zhou, D. Liu, W. Li, K. Wang, Y. Sun and T. Wang, Adv. Mater., 2022, 35, 2208211 CrossRef PubMed .
- F. Feaugas, T. Nicolini, G. H. Roche, L. Hirsch, O. J. Dautel and G. Wantz, Sol. RRL, 2022, 7, 2200815 CrossRef .
- Y. Wang, M. B. Price, R. S. Bobba, H. Lu, J. Xue, Y. Wang, M. Li, A. Ilina, P. A. Hume, B. Jia, T. Li, Y. Zhang, N. J. L. K. Davis, Z. Tang, W. Ma, Q. Qiao, J. M. Hodgkiss and X. Zhan, Adv. Mater., 2022, 34, 2206717 CrossRef CAS PubMed .
- P. Wang, F. Bi, Y. Li, C. Han, N. Zheng, S. Zhang, J. Wang, Y. Wu and X. Bao, Adv. Funct. Mater., 2022, 32, 2200166 CrossRef CAS .
- L. Wang, M. Hu, Y. Zhang, Z. Yuan, Y. Hu, X. Zhao and Y. Chen, Polymer, 2022, 255, 125114 CrossRef CAS .
- R. Suthar, A. T and S. Karak, J. Mater. Chem. A, 2023, 11, 22248–22258 RSC .
- P. Tan, C. Cao, Y. Cheng, H. Chen, H. Lai, Y. Zhu, L. Han, J. Qu, N. Zheng, Y. Zhang and F. He, J. Mater. Chem. A, 2023, 11, 9538–9545 RSC .

## Footnote

† Electronic supplementary information (ESI) available. See DOI: https://doi.org/10.1039/d4ta04665j

This journal is © The Royal Society of Chemistry 2024