Molecular Generation and Long-Term Processing
26 Jun 2021 | 4 minutes to read
I recently came across a paper from 2012 about generating organic molecules: Source Archive
The researchers spent large amounts of time performing an “enumeration of organic molecules up to 17 atoms of C, N, O, S, and halogens, forming the chemical universe database GDB-17 containing 166.4 billion organic molecules.”
Despite producing more than 400 GB of compressed data, the enumeration only captured a small subset of all possible and impossible molecules. It relied on several tricks, such as starting from basic structures, using known molecules as templates, and swapping out atoms for others of similar electron valence.
This dataset was produced in 2012, and since then many papers have used this paper and its data in various studies. Additionally, the authors have continued to update the dataset as recently as 2019. While it represents a large number of possible organic molecules, it is in no way a comprehensive source. Even so, it is still useful today as a “filter” for generating additional molecules or as input data for machine learning models.
This was all nice to read about, but it made me think about what sort of long-running, complex calculations we’ll make in the future. Eventually, we will reach a processor ceiling, where the speed of light and the other laws of the universe limit the amount of computing power possible. We’ve already seen the start of this as chip fabrication processes get finer and finer.
Because of this limit, there could be a point where calculations we wish to perform are so complex or thorough that they take a long time to run, even with top-of-the-line future computers.
Things like neural network training, modeling, and simulation can already take a long time. As things become more complex and start pushing up against limits, calculations could take years to complete, just like the computation of the meaning of life in The Hitchhiker’s Guide to the Galaxy.
Algorithms will be designed to take the long compute time into account. One easy method that we use now is parallelization: we run a workload on many computers at once, spreading out the work.
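As a rough illustration of spreading work out, here is a minimal sketch that splits a range into chunks and hands each chunk to a worker. The job itself (`count_candidates`, which just counts multiples of 7) is a made-up stand-in for an expensive calculation, and all names here are my own. It uses threads to stay simple; for CPU-bound work in CPython you would likely pass `ProcessPoolExecutor` instead, or distribute the chunks across machines.

```python
from concurrent.futures import ThreadPoolExecutor


def count_candidates(start, stop):
    """Hypothetical stand-in for an expensive job: count multiples of 7 in [start, stop)."""
    return sum(1 for n in range(start, stop) if n % 7 == 0)


def split_range(start, stop, parts):
    """Split [start, stop) into at most `parts` contiguous chunks."""
    if stop <= start:
        return []
    step = -(-(stop - start) // parts)  # ceiling division
    return [(lo, min(lo + step, stop)) for lo in range(start, stop, step)]


def parallel_count(start, stop, workers=4, executor_cls=ThreadPoolExecutor):
    """Run count_candidates on each chunk concurrently and add up the partial results."""
    chunks = split_range(start, stop, workers)
    with executor_cls(max_workers=workers) as pool:
        partials = pool.map(
            count_candidates,
            [lo for lo, _ in chunks],
            [hi for _, hi in chunks],
        )
        return sum(partials)
```

The important property is that each chunk is independent, so the partial results can be computed in any order, on any machine, and combined at the end.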
Another method could be to design algorithms that generate an output over time, instead of starting, calculating, and producing a single answer at the end. By splitting the output into chunks emitted during the calculation, we can take advantage of the results sooner, even if they aren’t complete yet.
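One way to picture this is a Python generator, which hands back each result as soon as it is ready instead of building the full list first. The sketch below is only a toy: it enumerates strings over a small alphabet, which are not real molecules, and `enumerate_chains` is a name I made up for the example.

```python
from itertools import islice, product


def enumerate_chains(alphabet, max_length):
    """Yield every string over `alphabet` up to `max_length`, shortest first.

    Each result is emitted as soon as it is built, so a consumer can start
    using the output long before the enumeration finishes.
    """
    for length in range(1, max_length + 1):
        for combo in product(alphabet, repeat=length):
            yield "".join(combo)


# Take only the first few results; the rest of the (much larger) search
# is never computed unless someone asks for it.
first_results = list(islice(enumerate_chains("CNO", 10), 5))
print(first_results)
```

A consumer could just as well write each result to disk or feed it to another job as it arrives, which is what makes partial output useful for calculations that run for years.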
A similar method would be to allow input data to be updated. While this long-running calculation takes place, other computers may start calculating the same thing, and their results could be useful to assist. If a computer is generating organic molecules like above, making the running calculation aware of externally calculated results means it has less work to do, which shortens the calculation time.
The greatest benefit may be to start the calculations as soon as possible. If we can get them started now, then they will finish sooner. Even if it isn’t the most optimized algorithm or hardware, it will at least make progress that could be recycled into the next attempt to improve its performance.
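Here is a minimal sketch of that idea, assuming the external results arrive as `(item, result)` pairs on an in-memory queue. In practice the inbox might be a message broker or a shared database; the queue and the function names are my own placeholders.

```python
import queue


def drain(inbox):
    """Collect everything currently waiting in the inbox without blocking."""
    received = {}
    while True:
        try:
            key, value = inbox.get_nowait()
        except queue.Empty:
            return received
        received[key] = value


def run_with_external_results(candidates, expensive_check, inbox):
    """Process candidates, skipping any that someone else has already solved.

    Returns the combined results and how many items were computed locally.
    """
    results = {}
    computed_locally = 0
    for item in candidates:
        results.update(drain(inbox))  # merge newly published external results
        if item in results:
            continue  # already known, no need to redo the work
        results[item] = expensive_check(item)
        computed_locally += 1
    return results, computed_locally
```

The calculation checks for new input before each unit of work, so results published halfway through a run still save effort on everything that has not been processed yet.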
As algorithms and systems are improved, the data itself will also undergo transformations. As we’ve seen in the past, trying to make a single, perfect format is impossible. Every format has benefits and tradeoffs, such as readability, storage size, or compression. It is better to choose a format that works well for the time being, just to get started. I think it is better to focus on creating good tools to transform, translate, and convert data from a legacy format to the newest and best format.
That could be a format in and of itself. If it was, I think it would be important for it to include a log of the previous transformations the data went through, and the additional data added as a result. This would act the same way as identifying primary, secondary, and tertiary sources in historical research. There’s the original data of course, but often we can take that and draw further conclusions based on it. It would be important to include this metadata, but also to properly record how it was generated.
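To make the transformation log concrete, here is a rough sketch of a record that carries its own history. Every name in it (`Record`, `Step`, `convert`, the tool label) is hypothetical, and the fields are just one guess at what would be worth recording: which tool ran, what formats were involved, a hash of the input, and when it happened.

```python
import hashlib
import json
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Step:
    """One entry in the transformation log."""
    tool: str
    source_format: str
    target_format: str
    input_sha256: str
    timestamp: str


@dataclass(frozen=True)
class Record:
    """A payload plus the history of conversions that produced it."""
    payload: str
    fmt: str
    history: tuple = ()


def convert(record, target_format, tool, func):
    """Apply `func` to the payload and append a log entry describing the step."""
    step = Step(
        tool=tool,
        source_format=record.fmt,
        target_format=target_format,
        input_sha256=hashlib.sha256(record.payload.encode("utf-8")).hexdigest(),
        timestamp=datetime.now(timezone.utc).isoformat(),
    )
    return Record(func(record.payload), target_format, record.history + (step,))


# Example: a toy CSV-to-JSON conversion (the tool name is made up).
original = Record(payload="C,N,O", fmt="csv")
as_json = convert(original, "json", "csv_to_json (hypothetical)",
                  lambda text: json.dumps(text.split(",")))
```

The original record is never modified, so it stays available as the "primary source", while each converted record can be traced back through its `history` to see how it was derived.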
With a proper way to manage data, it becomes easier to integrate that data into the calculations we wish to make, and it sets us up for better performance in the future. That just leaves the question: What calculations should we be beginning now?