AI and Synthetic Organic Chemistry

Artificial intelligence (AI), also known as machine learning, has been a hot topic of discussion in recent years, with much excitement over the application of AI technology in big data analysis for the creation of “smart” homes and even “smart” cities. At the same time, there has been concerns over the implications for personal privacy as AI technology is being applied to facial recognition enabled CCTV cameras in some countries.

With all the hype around AI, it is not surprising that machine learning also holds promise in the field of synthetic organic chemistry, focused on proposing and probing reaction schemes to produce target compounds from commercially available starting reagents. This is especially important in the field of pharmaceutics, where synthesis routes for drug candidates must have sufficiently high yield to be economically viable. Machine learning in synthetic organic chemistry mainly serves four purposes:

Designing a synthetic route to a target molecule

When designing synthetic routes, chemists aim to arrive at the desired target molecule from commercially available starting reagents, with a route that has sufficiently high yield.

Prediction of reaction products

Chemists often want to predict the products of a certain reaction, as it gives information about reaction yield and the possible production of side-products that have to be removed through purification steps.

Optimisation of reaction conditions

As yield is to be maximised in any synthetic route, chemists interested in the optimal reaction conditions to maximise the yield of every reaction step.

New and novel reactivity

Discovery of new chemical reactions expands the toolkit which synthetic chemists have for target molecule synthesis.

Before we go into how artificial intelligence can be applied to these 4 research aims, let us first briefly understand how artificial intelligence works.

Artificial intelligence takes in training data, formulates trends within the training data set, before using the trends obtained to predict outcomes for previously unseen data. An easy example of machine learning is a regression model, where an AI develops an equation to fit the datapoints to a trend. In addition to regression, which derives a mathematical trend based on input training data, there are also two other types of machine learning models, that is classification and clustering, as shown in the figure below:

Taken from reference.

Classification assigns a class to each data, and requires the machine learning to learn how to assign a class to an unseen data point. Clustering is different from classification in that no class is preassigned to the training dataset, instead leaving the machine to figure out its own way of clustering the datasets based on the data structure. Hence clustering is often called “unsupervised learning”, compared to classification and regression which are “supervised learning”, where there is a fixed “correct” response to a data input.

Using AI to design a synthetic route to a target molecule

Currently, computers are used in synthetic route proposals, with an example of a non-AI computational method being a similarity search. A similarity search first identifies the disconnection sites (where the target molecule is to be broken up into the intermediates of the step before), which are usually bonds which are rare within commercially available compounds. The computer then encodes the structural information about the disconnection sites into a series of ones and zeroes (a bit string), and this information about the disconnection sites is matched with a database of known reactions to see what reactions can be used to produce it.

The problem which such a method is that it not only neglects functional group conflicts which could cause unwanted side-reactions (or even prevent the desired reaction from occurring altogether), but also requires the searching through the entire reaction database for each step, taking up much time and computational effort.

The AI method to solving this research problem goes something like this: The entire molecule is encoded into a bit string, and an algorithm designed to mimic brain synapses determines the disconnection sites, and the type of reaction needed to form the disconnection sites. Then, a different algorithm is used to determine the appropriate intermediates to be used to produce the target molecule. This process is continued then to determine how the intermediates are to be synthesised, going on until we arrive at starting compounds which are commercially available, thus completing our synthesis pathway.

Such an AI-driven method is superior to the non-AI similarity search method, as there is no need to search through a large database of reactions for each step, making the entire process much faster, especially for syntheses which require many steps. In addition, algorithms have been designed which can prioritise the use of more selective reactions in the proposed synthetic pathway, and deprioritise the use of strained intermediates which may not be stable.

Using AI to predict the products of a reaction

Currently, the process of predicting the products of a reaction goes something like this:

Computational chemistry methods are used to calculate the electronic properties (e.g. electron density and orbital energies) of the functional groups in the reactants, which is then used to predict which and how the functional groups in the reactants would react. The problem is that such computational chemistry methods are slow and cumbersome, being unfeasible for larger and more complex molecules.

Hence, the use of AI can significantly speed things up, if we use “lower-level” chemistry theories, but enhance accuracy through machine learning. An example of a proposed AI method is to use graphical functions to approximate the orbitals of the reactants, and predict how electrons will flow to arrive at the products of the reaction, with the prediction influenced by input data of known reactions and their products for the system to learn from.

One problem with such a method, though, is the input data which the AI requires for it to learn from. For the most accurate results, the input data must consist of both known reactions and their products, but also reactions which are known to not produce any products in significant amounts. Unfortunately the latter is rare in reported literature, as chemists do not tend to report reactions which do not work.

Another problem is that the reactants involved in a reaction can contain many different functional groups, hence the question of what product will be produced is much more difficult to answer with AI compared to a more limited and narrow research question involving only one functional group or one type of reaction.

The above description of AI in predicting reaction products uses a classification model of AI, where either a compound is or is not a product of the reaction. Yet could we use a regression model to quantitatively predict product yields? It has been done before, but it is not so feasible as large amounts of high quality input is required in order to get a good mathematical model.

Using AI to optimise reaction conditions

Currently, optimisation of reaction conditions is often carried out in a haphazard way which is not justified by statistical theory. In addition, when multiple variables are of interest, often variables are investigated one at a time, which is a problem as we want to simultaneously optimise all the relevant conditions (temperature, solvent, catalyst, etc.) in order to be confident that we have really arrived at the optimal reaction conditions.

An example of how AI can be applied to this research question is as such: Firstly, computational chemistry methods are used to calculate the rate constants when the reaction is carried out under different conditions. This data is then put into a regression model, which identifies the condition which results in the highest rate constant. Such a method has already shown promise in the selection of the best solvent for an organic synthesis reaction, though more research is required to determine the efficacy of the simultaneous optimisation of more than 3 parameters.

Using AI to identify new and novel reactivity

Often, the discovery of new chemical reactions is attributed to “serendipity”, and if “serendipity” is to be interpreted as random strokes of luck, then perhaps it can be formalised as the random search of a database for two reactants, then investigating whether they react in an unexpected fashion.

In this application, active learning is a system of AI that is particularly of interest. An active learning AI system would take in information of known reactions as an input, then probe the database to select unknown reactions that are very different from the known reactions it has been trained on, investigate those reactions, and take in the results of its investigation to expand its input data of “known reactions”. In this way, by expanding the set of data the AI learns from into novel territory, the AI gets smarter and smarter, while avoiding the problem of chemical bias, which is when chemists tend to investigate reactions that follow a certain pattern they are familiar with. In addition, such active learning systems, since they constantly expand their learning input, can start with a smaller initial input dataset while still maintaining accuracy of its predictions.


In conclusion, machine learning has the opportunity to significantly increase the efficiency of chemists’ work in synthetic organic chemistry. While we never expect AI to be able to replace human chemical intuition built up from years of experience, AI can be harnessed as a useful tool for chemists to consider, as well as equip new and less experienced chemists with a systematic way to improve their research.


de Almeida, A.F., Moreira, R. & Rodrigues, T. Synthetic organic chemistry driven by artificial intelligence. Nat Rev Chem 3, 589–604 (2019).