Large quantity of unique features
Really good dimensionality reduction
Ensemble everywhere!
A word on each...
When it comes to modeling, everyone always reaches for their favorite NNs, LSTMs and the like, or LGBMs, and those are great, but everyone has them, and frankly, they aren't that hard to implement! Just look at Kaggle if you want an example of DS students using them everywhere...
For real alpha, you need to focus on the three most ignored areas (there is a fourth, speed, but that's not really modeling, and a fifth which I'm not telling you because I like my alpha unleaked). That sounded super guru-like, but I promise these work and I use them.
Starting with the first: it is usually quantity over quality. Once you have built a massive alpha signal library, sure, go ahead and focus on specific signals. Until then, use TA-Lib, GitHub, the Stefan Jansen repo, Kaggle, Wikipedia, any mass dump of features (preferably with code, so that you can save time). It's about making lots of them, not about having some cool special strategy. Those do work as well, but you need to build on the mass features afterward, not before. You need to get what most others have (usually only in part, but still, most others will have quite a few of these features) before you can decide to become special.
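As an illustration of that "mass dump" approach, here is a minimal sketch, assuming the TA-Lib Python wrapper and a pandas OHLCV frame; the indicator list and period grid are just examples, not my actual library.

```python
# A minimal sketch of mass feature generation with TA-Lib.
# Assumptions: the `talib` wrapper is installed and `df` has
# high/low/close columns; indicators and periods are illustrative.
import pandas as pd
import talib

PERIODS = [5, 10, 20, 50, 100]  # different periods for different timescales

def mass_features(df: pd.DataFrame) -> pd.DataFrame:
    h, l, c = df["high"].to_numpy(), df["low"].to_numpy(), df["close"].to_numpy()
    out = {}
    for p in PERIODS:
        out[f"rsi_{p}"] = talib.RSI(c, timeperiod=p)
        out[f"ema_{p}"] = talib.EMA(c, timeperiod=p)
        out[f"mom_{p}"] = talib.MOM(c, timeperiod=p)
        out[f"atr_{p}"] = talib.ATR(h, l, c, timeperiod=p)
        out[f"adx_{p}"] = talib.ADX(h, l, c, timeperiod=p)
        out[f"cci_{p}"] = talib.CCI(h, l, c, timeperiod=p)
    return pd.DataFrame(out, index=df.index)

# 6 indicators x 5 periods = 30 columns from a dozen lines; scale the lists
# up and you get to hundreds of features quickly.
```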
The next key point is dimensionality reduction. It's simple math: even if you have loads of HFT data, the required sample size scales so badly with dimension that the data won't save you. If I have 100 samples in 1D, then because 2D squares the volume, I now need 100^2 = 10,000 samples to cover the space just as densely. Hopefully that's intuitive; it comes down to how much data sits in each unit of space. High density means lots of effective data, and a larger volume, as anyone knows, means lower density and effectively less data. If you have 100D you now need 1e+200 samples to be equal to 100 samples in 1D... yeah. So that's why we reduce dimensionality. Not because it's fun (although it is).
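To make that arithmetic concrete, a tiny check in plain Python (nothing assumed beyond the 100-samples-in-1D baseline above):

```python
# Samples needed to keep the same sample density when going from 1 to d dims.
samples_1d = 100
for d in (1, 2, 3, 10, 100):
    print(d, "dims ->", f"{float(samples_1d) ** d:.0e}", "samples")
# 100 dims -> 1e+200 samples, which is why we reduce dimensionality first.
```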
Now, don't just use PCA; it's a fucking linear regression at the end of the day! Start with PCA, as I discussed in my last thread. PCA on its own is not really dimensionality reduction, but it does separate the linear, mutual information from the residuals. When people use it as dim reduction they just chuck the residuals and assume they are noise. For us, we'll say they are not noise, but special features that carry the non-linearity. We can then run manifold learning on them, or just feed them into a supervised autoencoder (supervised with either an LSTM or an MLP head) and the SAE will find those non-linearities. A lot of people trash AEs because they effectively just rediscover PCA, but think of it as learning: like anyone, you start with the basics, so of course they pick up the linear mutual relationships first and usually don't get far into the meaty non-linear parts. That's why we do PCA first. We have done that work for it, so it can now spend its precious capacity on the non-linear parts.
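A minimal sketch of that pipeline, assuming scikit-learn and PyTorch, a feature matrix X and a target y; layer sizes, the loss weighting, and every other choice below are illustrative, not a prescription.

```python
# PCA first, then hunt the non-linear leftovers with a supervised autoencoder
# (reconstruction loss + prediction head on the bottleneck). Illustrative only.
import numpy as np
import torch
import torch.nn as nn
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

def pca_split(X, n_components=20):
    """Return the linear PCA factors and the residuals PCA can't explain."""
    Z = StandardScaler().fit_transform(X)
    pca = PCA(n_components=n_components)
    factors = pca.fit_transform(Z)                    # linear, "mutual" part
    residuals = Z - pca.inverse_transform(factors)    # what we feed to the SAE
    return factors, residuals

class SupervisedAE(nn.Module):
    """Autoencoder with a supervised MLP head on the bottleneck code."""
    def __init__(self, n_in, n_code=8):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(n_in, 64), nn.ReLU(), nn.Linear(64, n_code))
        self.dec = nn.Sequential(nn.Linear(n_code, 64), nn.ReLU(), nn.Linear(64, n_in))
        self.head = nn.Sequential(nn.Linear(n_code, 16), nn.ReLU(), nn.Linear(16, 1))

    def forward(self, x):
        code = self.enc(x)
        return self.dec(code), self.head(code), code

def fit_sae(residuals, y, epochs=200, lr=1e-3, alpha=0.5):
    X_t = torch.tensor(residuals, dtype=torch.float32)
    y_t = torch.tensor(y, dtype=torch.float32).reshape(-1, 1)
    model = SupervisedAE(residuals.shape[1])
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    mse = nn.MSELoss()
    for _ in range(epochs):
        opt.zero_grad()
        recon, pred, _ = model(X_t)
        # joint objective: rebuild the residuals AND predict the target
        loss = alpha * mse(recon, X_t) + (1 - alpha) * mse(pred, y_t)
        loss.backward()
        opt.step()
    return model

# usage sketch:
# factors, resid = pca_split(X)
# sae = fit_sae(resid, y)
# _, _, codes = sae(torch.tensor(resid, dtype=torch.float32))
# final_features = np.hstack([factors, codes.detach().numpy()])
```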
Finally, ensembling. This can be done across multiple timeframes with hierarchical modeling (talked about this a lot in other threads), or it can come from stacking algorithms, like using an ARIMA forecast as a feature (this is risky, and you only do it with simple, linear models like AR because they won't overfit; don't put an NN inside an NN, please, for the love of god, otherwise you just reduce the regularization of the original NN. That's why metalabelling can be stupid sometimes). The last form is just using similar models: why not run an LSTM and a WaveNet and ensemble them? It reduces overfitting, it's like having two judges, and pretty much always the whole is greater than the average of the individual predictors; even the best single predictor is usually worse than the ensemble. I'll say it again for those who didn't hear it last time: a GARCH ensemble thrashes every single GARCH-family model, and that's saying something considering how many of them exist and can get lucky.
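A minimal sketch of that GARCH-ensemble point, assuming the `arch` package and a pandas Series of percent returns; the three specs and the equal weights are just examples.

```python
# Average one-step variance forecasts from a few GARCH-family specs
# instead of trusting any single one. Illustrative, equal-weighted.
import numpy as np
from arch import arch_model

SPECS = [
    dict(vol="GARCH", p=1, o=0, q=1),   # vanilla GARCH(1,1)
    dict(vol="GARCH", p=1, o=1, q=1),   # GJR-GARCH (leverage term)
    dict(vol="EGARCH", p=1, o=1, q=1),  # EGARCH
]

def ensemble_variance_forecast(returns):
    forecasts = []
    for spec in SPECS:
        res = arch_model(returns, dist="normal", **spec).fit(disp="off")
        f = res.forecast(horizon=1)
        forecasts.append(f.variance.iloc[-1, 0])  # 1-step-ahead variance
    # equal weights; weighting by out-of-sample loss is the obvious next step
    return float(np.mean(forecasts))
```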
End of thread. Enjoy your day, and remember that the glitzy models like NNs you use at the end aren't really the alpha. I use like 800-1000 features (maybe more like 5k counting the different-period versions for different timescales), btw, in case you are wondering.
For those wondering if they are, I'll give a few comments:
t-SNE is always something I want to apply, but I can never quite figure out the right way to do it. There are certainly some benefits to be had from a basic understanding of what all this means, so you have a better chance at visualizing your features in 2D...
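For the 2D visualization part at least, a minimal sketch, assuming scikit-learn and matplotlib and a feature matrix X; the perplexity and PCA init are just reasonable defaults, not a recommendation.

```python
# Project a feature matrix to 2D with t-SNE and scatter-plot it.
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler

def plot_tsne(X, color=None, perplexity=30):
    Z = StandardScaler().fit_transform(X)
    emb = TSNE(n_components=2, perplexity=perplexity, init="pca",
               random_state=0).fit_transform(Z)
    plt.scatter(emb[:, 0], emb[:, 1], c=color, s=5, cmap="coolwarm")
    plt.title("t-SNE of feature matrix")
    plt.show()
```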
Much like stochastic methods: as much as I would never make them the center of a model, there is always a use for them as a feature or in an ensemble. Ensembling is truly the free lunch of alpha...
A key concept for MMs is how you manage inventory. Avellaneda-Stoikov is basically the model everyone uses for this. Then there is the offset, basically how wide your spreads are. That's your basic model of liquidity provision...
From there we get to have some fun! If you can create multiple forecasts for different timeframes (and, at a super-advanced level, compute them fast enough) you can make spreads asymmetric and intentionally hold inventory...
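A minimal sketch of that base model, using the standard Avellaneda-Stoikov (2008) reservation price and optimal spread; the `alpha_skew` term is my own illustrative way of folding a forecast in, not part of the paper.

```python
# Avellaneda-Stoikov style quoting with an optional forecast skew.
import math

def as_quotes(mid, inventory, gamma, sigma, kappa, time_left, alpha_skew=0.0):
    """Return (bid, ask) for one quoting decision.

    mid        : current mid price
    inventory  : signed inventory q (positive = long)
    gamma      : risk aversion
    sigma      : volatility of the mid price
    kappa      : order-flow intensity parameter
    time_left  : T - t, fraction of the trading horizon remaining
    alpha_skew : optional price forecast; shifts both quotes toward the call
    """
    # inventory-adjusted reservation price
    r = mid - inventory * gamma * sigma ** 2 * time_left
    # optimal total spread
    spread = gamma * sigma ** 2 * time_left + (2.0 / gamma) * math.log(1.0 + gamma / kappa)
    # shift quotes toward the forecast so fills accumulate on the predicted side
    r += alpha_skew
    return r - spread / 2.0, r + spread / 2.0

# usage sketch:
# bid, ask = as_quotes(mid=100.0, inventory=5, gamma=0.1, sigma=0.02,
#                      kappa=1.5, time_left=0.5, alpha_skew=0.001)
```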
Entirely unprompted here, but please check out @FadingRallies. Also @choffstein's Liquidity Cascades paper (link below). The flow between MMs, passive funds, ELS, and generally the effects of reflexive dealer hedging are key to understanding this regime!
Even if you aren't a trader (I certainly am not, although I try to keep up with it all) it is still super important to understand the regime and how it all fits in from a risk perspective. You CANNOT just take the models as your risk! Eigenportfolios decay (I would know, I work with them all the time), so they aren't even a perfect metric (although I do love them). Statistical models will capture some risk, but at the end of the day you choose the parameters, and the distribution you feed in is key. Knowing fat tails exist is incredibly important for this.
Tweeting a question I was asked / my response regarding MM:
(me adding bonus resources):
A great example of C++ HFT MM algorithms. An improvement idea I have suggested to the author, but which can also be attempted by interested algotraders, is to use a fast model like XGBoost (there is a C++ library) alongside some alphas to make spreads asymmetric before traders can trade against you and leave you with negative edge on those trades. A large part of market making is cheaply executing alphas: trying to get inventory on the side of your predictions, and getting out of the way of adverse conditions by making your spreads asymmetrically wide against traders with alpha against you. github.com/hello2all/gamm…
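A minimal sketch of that idea, with XGBoost's Python API standing in for the C++ library; the skew mapping and parameter names are illustrative, not from the linked repo.

```python
# A fast boosted-tree forecast decides which side of the book to widen.
import numpy as np
import xgboost as xgb

def fit_microprice_model(X, y):
    """X: microstructure features per quote update, y: short-horizon mid return."""
    model = xgb.XGBRegressor(n_estimators=200, max_depth=4, learning_rate=0.05)
    model.fit(X, y)
    return model

def asymmetric_half_spreads(model, x_now, base_half, sensitivity=10.0):
    """If an up-move is predicted, push the ask away and pull the bid in,
    so you stop selling into the move and pick up inventory on the right side."""
    pred = float(model.predict(np.asarray(x_now).reshape(1, -1))[0])
    skew = float(np.clip(sensitivity * pred, -0.9, 0.9)) * base_half
    bid_half = base_half - skew   # pred > 0: tighter bid, more buys
    ask_half = base_half + skew   # pred > 0: wider ask, fewer adverse sells
    return bid_half, ask_half
```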
I think you can probably classify modeling and feature engineering into a few areas: ML, statistics, time series, microstructural, some fun extras like entropy (which are super weird to work with), TA, data-mined alphas, and signal processing.
1/who knows lol
I'll probably speak on each of these eventually, but today I think it'd be good to get some publicity on signal processing. It's underhyped compared to ML and just as deserving.
2/
A lot of the literature is exclusive to electrical engineering and CS, but I can tell you there is lots of alpha in the area. As the story usually goes, NN models like LSTMs get a bad rap performance-wise because of their terrible application.
3/
I'm blatantly copying and pasting this from Mephisto, but the fact that some people haven't seen this thread, and otherwise never will be able to since the account is gone, is sad:
OK, picking apart $SKEW. Why is it 'hot garbage'? (h/t @jsoloff). This is for anyone who read @SoberLook (who of course are awesome, but the problem is elsewhere).
"Have you met my bro $SKEW? He's useless but his dad owns a boat."