Noel O'Boyle @baoilleach
#11thICCS Greg Landrum on How do you build and validate 1500 models and what can you learn from them?
Really... "the Monster Model Factory". Have >1500 datasets from ChEMBL that I want to build models for. Needs to be automated. Ideally we can learn something about what makes a model work vs not work.
CRISP-DM - standard process for data mining solutions - see wikipedia
Key steps: Init, Load, Transform, Learn, Score, Evaluate, Deploy. With KNIME, create a workflow that does each step. A meta workflow.
Extracting the data. Filtering ChEMBL. Need at least 50 activities. <100 nM. Finally 2.5 million data points and 1.5 million cmpds. Sidebar: Data is biased towards active compounds. Ratio of actives to inactives not realistic. To fix, add assumed inactives.
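One way the "add assumed inactives" step could look, as a minimal sketch only. The talk notes do not say how the assumed inactives were chosen; here it is assumed that they are sampled at random from a background pool of compounds not measured in the assay. The function name `add_assumed_inactives`, the `ratio` parameter and the default of 10 inactives per active are all hypothetical.

```python
import random


def add_assumed_inactives(measured, background_pool, ratio=10, seed=0):
    """Pad an assay's data with assumed inactives (hypothetical helper).

    measured: dict mapping compound id -> label (1 = active, 0 = inactive).
    background_pool: iterable of compound ids to draw assumed inactives from.
    ratio: target number of inactives per active (an assumption, not from the talk).
    """
    n_actives = sum(measured.values())
    n_inactives = len(measured) - n_actives
    # Only compounds with no measurement in this assay can be assumed inactive.
    candidates = sorted(c for c in set(background_pool) if c not in measured)
    n_wanted = max(0, ratio * n_actives - n_inactives)
    n_sampled = min(n_wanted, len(candidates))
    rng = random.Random(seed)  # fixed seed keeps the sampling reproducible
    sampled = rng.sample(candidates, n_sampled)
    augmented = dict(measured)
    augmented.update({compound: 0 for compound in sampled})
    return augmented
```

Measured labels are never overwritten, so a compound that is active in the assay stays active even if it also appears in the background pool.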
It's a KNIME workflow, so it could be cronjobbed.
Clean up the chemical structures with #rdkit - only allow standard organic subset. Generate fps.
H2O library for gradient boosting. RF, Naive Bayes. 10 different stratified random partitions. Take the best of these models based on EF at 5% (EF5). Model params came from a full param opt on 70 assays. Used to pick a standard set.
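For reference, a minimal sketch of an enrichment factor at 5%, assuming the usual definition: the hit rate among the top-scored 5% of compounds divided by the hit rate over the whole set. The function name and the way the cutoff is rounded are assumptions for illustration, not details from the talk.

```python
def enrichment_factor(scores, labels, fraction=0.05):
    """Enrichment factor at the given fraction of the ranked list.

    scores: predicted scores, higher means more likely active.
    labels: 1 for active, 0 for inactive, in the same order as scores.
    """
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    n_total = len(scores)
    n_actives = sum(labels)
    if n_total == 0 or n_actives == 0:
        raise ValueError("need at least one compound and one active")
    # Size of the top slice; at least one compound (rounding is an assumption).
    n_top = max(1, int(round(fraction * n_total)))
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    top_actives = sum(label for _, label in ranked[:n_top])
    hit_rate_top = top_actives / n_top
    hit_rate_all = n_actives / n_total
    return hit_rate_top / hit_rate_all
```

An EF5 of 1 corresponds to random ranking, values below 1 are worse than random, and 0 means no actives at all were found in the top 5%.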
Surprised to find a fairly low number of trees (100) for gradient boosting. (All slides will be on slideshare right after the talk)
Execution: build/test workflows on laptops. The server is on AWS. The actual running took place on AWS with distributed executors - a new/coming feature of KNIME.
Performance. Mean AUC is 0.958 and s.d. is 0.070. Cohen's kappa not quite so good. Looks too good to be true. Literally.
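A toy illustration of how AUC and Cohen's kappa can disagree on imbalanced data, as a sketch under stated assumptions: AUC only looks at the ranking, while kappa depends on hard class calls at some threshold. The 0.5 threshold and the made-up scores below are assumptions; the notes do not say how the class calls were made in the talk.

```python
def roc_auc(scores, labels):
    """AUC as the probability that a random active outscores a random inactive."""
    actives = [s for s, y in zip(scores, labels) if y == 1]
    inactives = [s for s, y in zip(scores, labels) if y == 0]
    if not actives or not inactives:
        raise ValueError("need at least one active and one inactive")
    wins = 0.0
    for a in actives:
        for i in inactives:
            if a > i:
                wins += 1.0
            elif a == i:
                wins += 0.5  # ties count as half a win
    return wins / (len(actives) * len(inactives))


def cohens_kappa(predicted, labels):
    """Cohen's kappa for binary predictions against binary labels."""
    n = len(labels)
    observed = sum(p == y for p, y in zip(predicted, labels)) / n
    frac_pred_pos = sum(predicted) / n
    frac_true_pos = sum(labels) / n
    # Agreement expected by chance from the marginal class frequencies.
    expected = (frac_pred_pos * frac_true_pos
                + (1 - frac_pred_pos) * (1 - frac_true_pos))
    if expected == 1.0:
        raise ValueError("kappa is undefined when only one class occurs")
    return (observed - expected) / (1 - expected)


# Made-up example: 2 actives, 8 inactives, perfectly ranked.
labels = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
scores = [0.90, 0.45, 0.40, 0.35, 0.30, 0.25, 0.20, 0.15, 0.12, 0.10]
predicted = [1 if s >= 0.5 else 0 for s in scores]  # assumed threshold

auc = roc_auc(scores, labels)
kappa = cohens_kappa(predicted, labels)
```

Here the ranking is perfect, so the AUC is 1.0, but one active falls below the threshold and kappa comes out at roughly 0.62.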
To validate or check model generalizability, use the model built on one assay from a target ID to predict act across the other assays. Like prospective evaluation, or as close as we can do it.
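The cross-assay check described above can be sketched as a list of (target, training assay, test assay) combinations. This is only an illustration of the bookkeeping: the function name and the `assay_to_target` mapping are hypothetical, and the actual implementation was a KNIME workflow.

```python
from itertools import permutations


def cross_assay_pairs(assay_to_target):
    """List (target, train_assay, test_assay) triples for assays sharing a target.

    assay_to_target: dict mapping assay id -> target id (hypothetical input).
    Targets with a single assay yield no pairs, since there is nothing to test on.
    """
    assays_by_target = {}
    for assay, target in assay_to_target.items():
        assays_by_target.setdefault(target, []).append(assay)

    pairs = []
    for target in sorted(assays_by_target):
        assays = sorted(assays_by_target[target])
        # Each assay's model is evaluated on every other assay of the same target.
        for train_assay, test_assay in permutations(assays, 2):
            pairs.append((target, train_assay, test_assay))
    return pairs
```

For each triple, the model built on the training assay would be scored (AUC, EF5) on the test assay's compounds.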
Now can see that AUCs for some targets are close to random but others pretty good. Similarly for EF5 - some worse than random, but others very good.
What's happening? Shows 5-HT6 example. EF5 of 0. Model compounds very different from test. Have overfitted the training data. Or have built a model to predict whether or not a cmpd is taken from a particular paper. Need to consider this and be careful not to fool yourself.
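One simple way to check whether test compounds are very different from the training compounds is each test compound's highest Tanimoto similarity to any training compound. A minimal sketch, assuming fingerprints are represented as Python sets of on-bit indices (a stand-in for real fingerprints; this is not necessarily the analysis shown in the talk):

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto similarity between two fingerprints given as sets of on-bits."""
    if not fp_a and not fp_b:
        return 0.0  # convention chosen here for two empty fingerprints
    shared = len(fp_a & fp_b)
    return shared / (len(fp_a) + len(fp_b) - shared)


def nearest_training_similarity(train_fps, test_fps):
    """For each test fingerprint, the similarity to its closest training fingerprint."""
    if not train_fps:
        raise ValueError("need at least one training fingerprint")
    return [max(tanimoto(test_fp, train_fp) for train_fp in train_fps)
            for test_fp in test_fps]
```

If most of the returned values are low, the model is being asked to predict far outside the chemistry it was trained on, which is consistent with an EF5 of 0.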
Which fps were picked? (ed: nice ring diagram) Which method/fp pair is best for each assay? Random forest doesn't appear at all (!).
Still work in progress in drawing conclusions.
RDKit UGM (Cambridge) and KNIME Fall Summit (Austin, Texas) coming up.