# Computerized Enumeration of the Alkane Series

Back in high school, part of my graduation requirements was to complete a ‘Senior Thesis’. Having a keen interest in computers and chemistry (at the time), I choose to investigate a problem that combined both. Specifically, I wanted to develop an algorithm which could find all structural isomers of a simple type of hydrocarbon – namely alkanes, a class of molecule composed solely of carbon and hydrogen in a specific ratio (C_{x}H_{2x+2}).

**The Chemistry**

Alkanes are simple tree-like structures without any loops (a.k.a. cycles). If you remember your high school chemistry, you’ll recall that carbon atoms like to form 4 bonds, while hydrogen atoms like to form 1. Due to the specific ratio in alkanes, that means that only single bonds will form, so the tree structure is also homogeneous. Alkanes are typically used in combustion processes because they burn easily and (relatively) cleanly – Methane, Ethane, Propane, Butane, and Octane are all Alkanes.

Structural isomers (or just ‘isomers’) are molecules which have the same formula, but a fundamentally different structure. For example, there are three isomers of Pentane (C_{5}H_{12}):

(Due to the nature of alkanes, it turns out that we can ignore the hydrogen atoms – they will simply fill in any unbonded locations on the carbon atoms).

This is a straight-forward concept – but exactly calculating isomers for a given formula quickly becomes very complicated. The reason is that structures which appear different can actually be equivalent by a series of pivots. Consider these renderings of C_{17}H_{36}:

The two configurations appear to be different, but through a series of mirroring and pivoting of bonds we can find that they are actually the same fundamental structure. These are referred to as stereo-isomers; structural isomers cannot be transformed from one to another without breaking any bonds. My goal was to create an algorithm which would find all structural isomers without generating any stereo-isomers.

The naive solution to this problem is fairly obvious: iterate through all possible structures, and for each one compare it to all previously recorded structures to see if it’s unique, and if so record it. However, this algorithm breaks down very quickly due to the incredible rate at which isomers increase with the number of atoms involved: C_{12}H_{26} has 355 isomers, C_{20}H_{42} has 366,376 isomers, C_{24}H_{50} has 14,506,015 isomers, and so on. And that’s *structural* isomers – the number of possible stereo-isomers increases at least an order of magnitude faster. Besides large growth rate, comparing the ‘equality’ of two structures is an expensive operation – you have to consider all possible pivots. The complexity of this operation alone grows significantly with the number of atoms involved. An exact mathematical analysis of the order of growth involved would be an entire project onto itself, but it suffices to say that we need to be more clever in order to come up with useful results – at least if we want them before the end of time.

**The Algorithm**

There are a number of ways of attacking this problem. Prior algorithms have developed novel ways to compare structures more efficiently (e.g. creating stable structural hashes). These algorithms were a significant step up from the naive solution, but they still suffered a flaw: they still have to generate all the possible stereo-isomers. Even if you are able to bring the comparison operation down to linear time, you still have to do the comparison.

I sought to create an algorithm which dispensed with the comparison step entirely – by not generating stereo-isomers in the first place.

There are two parts to my algorithm: molecule building, and validity rules.

The molecule building component iterates through possible structures by placing atoms onto an incomplete molecule until all atoms are placed. This is done using a simple recursive algorithm. Whenever a complete molecule is formed, it is recorded to disk. Whenever the molecule building component wants to place an atom, it asks the rule system whether that placement is legal. So by constructing appropriate rules, it should be possible to prevent ‘illegal’ structures from forming.

But what are appropriate rules? The root cause of redundant structure generation is that identical structures can look different through a series of pivots. So we need to create rules such that any pivoting results in an illegal (or identical) structure.

The key is this: at every atom we prioritize the 4 bonds and enforce that the higher priority bonds have sub-trees which are ‘greater’ (or ‘equal to’) the sub-trees of lower priority bonds. If we enforce that rule, then any pivot at any point will result an illegal (or identical) structure – which means it will not have been generated. Also, any arbitrary structure can be rearranged into its ‘legal’ form, so we can be sure that all possible structures will be generated.

Comparison of sub-trees is done as follows:

- The sub-tree with the longest chain is ‘greater’
- If the longest chains are the same length, we compare the sub-sub-tree of the 1st priority bond on the 1st atom of each sub-tree.
- If they are equal, we compare the 2nd priority bonds
- If all the sub-sub-trees of the 1st atoms are equal, we compare the sub-sub-trees of the 2nd atoms.

This continues recursively until a divergence is found, or both sub-trees are found to be equal.

**The Results**

There are a number of advantages to this algorithm. It’s relatively confined in memory – all molecule manipulations are performed in-place. The only growth in memory is the call-stack for recursive functions. The real advantage of this algorithm is that it is O(n), where n is the number of possible isomers. i.e. finding an isomer takes a constant amount of time. Previous algorithms were O(n^{2}) or worse. (It may seem odd to measure the complexity against the size of the output, and not on an input. I chose this metric because it is more revealing of the performance of the underlying algorithm – measuring the complexity against the input confounds the complexity of the algorithm with the high complexity of the underlying physical problem.)

This algorithm has its’ limitations though – as I mentioned earlier it depends heavily on the call stack. This became an issue with larger molecules, as the program causes a stack overflow. The problem could be mitigated my manually maintaining the call stack, or switching to a non-recursive architecture and maintaining state separately. This level of sophistication is outside the scope of my thesis, but it would be an interesting extension. The larger limitation is that the algorithm isn’t flexible – it only works for non-branching, homogeneous trees. With some research, it may be possible to incorporate non-homogeneous structures (e.g. different types of atoms or double/triple bonds). However, it’s pretty much impossible to incorporate cycles, because doing so breaks down the requisite isolation of sub-trees.

Besides its contributions to computational chemistry, my algorithm may have applications to other areas of computer science. Acyclic trees are a very important and widely used concept, and there may be areas where determining possible structures would be useful (although I’m not personally familiar with a specific case).

I’ve attached a copy of the actual thesis. It goes into quite a bit more detail on the history of this problem, the algorithm itself, and specifics on the results: Computerized Enumeration of the Alkane Series.pdf