Profile picture
Noel O'Boyle @baoilleach
, 29 tweets, 5 min read Read on Twitter
#11thICCS Roger Sayle of @nmsoftware on Recent advances in chemical & biological search systems: Evolution vs. revolution
Databases are growing at rates exceeding Moore's Law. Dbs that are twice as big take twice as long. But sublinear methods will not slow down that fast as dbs get bigger. At 1M mol/s searching, ChemEBL in 2s, PubChem 1.5min, Enamine REAL in 10mins.
Looking first at substructure searching. Use of binary fps to prescreen possible matches improves perf for typical queries. However, the use of fps does not affect the worse case, e.g. [X5], will bring many search systems to their knees.
Shows performance of different tools on Andrew Dalke's benchmark set of smarts queries. Ref to substructure-search-faceoff. OB 2 days, ChemAxon 2h, BIOVIA 50 minutes, SaCHEM (J. Cheminf.) in 16m. Total time is dominated by pathological queries.
Where is the time going and how can be improved? If you search for indole in eMolecules without a prescreen, 1% on file I/O, 59% on ring perception + aromaticity, and very little actually on the search.
To test, store all molecules in memory and do the search. 6s versus 120s for this search. Large memory footprint and there is contention between threads in multiprocessor systems.
"Arthor" Use compact on-disk representation of molecules that can be searched directly without reading into memory or creating molecule object. Can do the search in 3s.
Goes back to the earlier comparison graph. 27m for the brute force search, 46s for prescreen and 12s if 8 threads.
No live demo, but here's a short video. Available as postgres cartridge.
Moving onto similarity search. Traditionally calculated as the Tanimoto coeff between binary vectors. Shows CUDA code from @olexandr with three popcounts. Could be done in two or even one.
Choice of fps. ECFP4 is used these days rather than path-based fps like Daylight used. Baldi bounds make sense for Daylight but no longer valid for ECFP4 as they are very sparse.
Trick#1: Use hardware popcount to speed things up.

Trick#2: Sort fps by the number of bits. Useful for Baldi bounds but also for data storage, these are all the fps with 40bits set.
Trick#3: reciprocal multiplication. Avoid floating point division just can used integer multiplication with a reciprocal table instead.
Trick#4: the bottleneck is not the search thru the db, it's sorting the results NLogN. Avoid this with a counting sort.
Trick#5: Just-in-time complication. Write a program to work out the optimal code on-the-fly. Code-specialization.
Trick#5a: code skip empty words. Benzene only has three words with bits set, so only three popcounts are need.
#5b: For words with a single bit set, can avoid popcount using an AND.
#5c: Coalesce memory reads using Viterbi dynamic programming.
#5d: Popcount combining. If P and Q have no bits in common, can do both at the same time.
Done using graph coloring. Shows abilify. Draw line between words if have a bit in common. Graph coloring to use the minimise number of popcounts by choosing words that don't have a bit in common.
Using this approach, can reduce the number of popcounts from 5K to 3K. across the entire dataset.
Compares to previous work vs MadFast, ChemFP, OEGraphSim, and OpenBabel. Maxes out at around 500M FPs/sec versus 100FPs/s for ChemFP. GPUs faster but cannot support large databases as don't fit into GPU memory.
Future work is direct support for multiple GPU cards (federated search). Can support larger databases in this case. Future work is direct gen of NVidia SASL via cubin binaries. Pushed some fixes up to GCC9 to speed up popcounts.
Now talking about protein seq searching. If you want to name peptides in terms of other ones, you need fast searching of seq dbs. e.g. PDB 1CRN can be named as [L25I]P01542 if you can quickly search UniProt.
Algorithm to search: Longest common prefix. Rather than use BLAST etc use a suffix array. Store all suffixes in a database and sorting. Much bigger, but can be searched with a binary search rather than having to iterate over the whole database.
Now moving onto graph-edit distance search (SmallWorld). "Fight big data with even bigger data". Precalculate all of the substructures of all molecules. Have much bigger db but can search quickly. Sublinear-scaling search method.
Turn the entirety of chemical space into a Facebook or LinkedIn friends-of-my-friends view. Instead of 340M molecules in 68billion subgraphs. But those subgraphs can be searched much more quickly - don't have to look at them all to find nearest nbrs.
Graph-edit distance (GED) is the min no of edit operations to turn one molecular graph into another. Add/remove ring bonds, terminal bonds, link atoms.
Shows demo. Can find matched pairs of the queries, super and sub structures in real-time. One of the outcomes is the atom-mapping between the query and the hits.
Shows ExScientia example and how graph edit distance could have found the changes they used but traditionally FPs could not have.
Rarey asks about counting fps and unfolded fps. Answer: would be a nice functionality to add.
Missing some Tweet in this thread?
You can try to force a refresh.

Like this thread? Get email updates or save it to PDF!

Subscribe to Noel O'Boyle
Profile picture

Get real-time email alerts when new unrolls are available from this author!

This content may be removed anytime!

Twitter may remove this content at anytime, convert it as a PDF, save and print for later use!

Try unrolling a thread yourself!

how to unroll video

1) Follow Thread Reader App on Twitter so you can easily mention us!

2) Go to a Twitter thread (series of Tweets by the same owner) and mention us with a keyword "unroll" @threadreaderapp unroll

You can practice here first or read more on our help page!

Did Thread Reader help you today?

Support us! We are indie developers!


This site is made by just three indie developers on a laptop doing marketing, support and development! Read more about the story.

Become a Premium Member and get exclusive features!

Premium member ($3.00/month or $30.00/year)

Too expensive? Make a small donation by buying us coffee ($5) or help with server cost ($10)

Donate via Paypal Become our Patreon

Thank you for your support!