Danny Groves
Dec 10 · 29 tweets · 12 min read
Machine learning for finance can be tricky!

So let's explore with another step-by-step case study 🧵

This time, let's use ML to find time series patterns across multiple instruments.

The aim - to find like patterns in the market and build a scanner.

Enjoy!
OK, what ML tools are we using here?

This will be a clustering approach = unsupervised.

Unsupervised means we give the algorithm no guidance on what to look for.

I'm interested to see what patterns IT can find for me.

Yes, I'm lazy.
Algos like K-means clustering don't really have a concept of a time series.

They're great at other things, of course, but I want to find time series patterns.

So let's consider a special kind of clustering.

Time-series clustering, using tslearn.
1. Download and Feature Engineer

For now let's keep it simple:
• 3 tickers - just for exploring the idea
• yfinance - free daily data
• 3 features - Close prices, range (%), open to close (%)

You can experiment with other features & instruments later.
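Here's a minimal sketch of the three features. The thread doesn't spell out the exact formulas, so the definitions of range (%) and open to close (%) below are assumptions, and the column names range_pct and oc_pct are made up for illustration. The toy bars stand in for what yfinance would return.

```python
import pandas as pd

def add_features(ohlc: pd.DataFrame) -> pd.DataFrame:
    """Build the 3 features from a frame with Open/High/Low/Close columns."""
    out = pd.DataFrame(index=ohlc.index)
    out["close"] = ohlc["Close"]
    # Assumed definition: daily high-low range as a % of the low
    out["range_pct"] = (ohlc["High"] - ohlc["Low"]) / ohlc["Low"] * 100
    # Assumed definition: open-to-close move as a % of the open
    out["oc_pct"] = (ohlc["Close"] - ohlc["Open"]) / ohlc["Open"] * 100
    return out

# Toy bars; in practice the frame would come from yfinance daily data
bars = pd.DataFrame({
    "Open": [100.0, 102.0],
    "High": [110.0, 104.0],
    "Low": [100.0, 100.0],
    "Close": [105.0, 101.0],
})
feats = add_features(bars)
print(feats)
```

Run it once per ticker and keep the frames separate, so a window never straddles two instruments.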
2. Feature Reshaping

Time series algos require data in the shape (number of examples, time series length, number of features)

So first, let's create new cols in our df, one per feature & position in the time series

Since I want a time series of length 50 over 3 features, that gives 50 × 3 = 150 new cols.
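One way to build those columns is with pandas shift. This is a sketch, not necessarily the code from the original image: the helper add_lag_columns and the _lagN naming are hypothetical, and the toy frame uses a length of 4 instead of 50 so the output is easy to read.

```python
import numpy as np
import pandas as pd

def add_lag_columns(df, features, length):
    """Return one column per (feature, lag); lag 0 is the current row."""
    lagged = {
        f"{feat}_lag{lag}": df[feat].shift(lag)
        for feat in features
        for lag in range(length - 1, -1, -1)  # oldest first, newest last
    }
    # Rows without a full history contain NaN and are dropped
    return pd.DataFrame(lagged, index=df.index).dropna()

# Toy frame standing in for the real feature data
toy = pd.DataFrame({
    "close": np.arange(10, dtype=float),
    "range_pct": np.arange(10, dtype=float) * 0.1,
    "oc_pct": np.arange(10, dtype=float) * -0.1,
})
wide = add_lag_columns(toy, ["close", "range_pct", "oc_pct"], length=4)
print(wide.shape)  # (7, 12)
```

Each remaining row now holds one complete window per feature, read left to right from oldest to newest.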
2a. Feature Reshaping

Also, I assume that adjacent examples split apart by one day will be similar - therefore, let's undersample by selecting every 5th row.

This may save us some training time and still retain most of the variance in the examples.
2b. Feature Reshaping

With all these new cols, we now need to reshape into a 3D array

This is achieved with the attached code

I will say, this whole reshaping nonsense is tricky to wrap your head around! So don't worry if it's not immediately obvious 🙂
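Since the original code was an image, here's a small sketch of the idea in numpy. It assumes the wide table is feature-major (all 50 columns of feature 0, then feature 1, then feature 2); if your columns are ordered differently the reshape changes. The numbers are fake and only there to check the layout.

```python
import numpy as np

n_examples, length, n_features = 20, 50, 3

# Fake "wide" table: columns are feature-major, i.e.
# [f0_t0 ... f0_t49, f1_t0 ... f1_t49, f2_t0 ... f2_t49]
wide = np.arange(n_examples * length * n_features, dtype=float)
wide = wide.reshape(n_examples, n_features * length)

# Keep every 5th example, as in step 2a
wide = wide[::5]

# Split the columns into (feature, time), then swap to (time, feature)
X = wide.reshape(-1, n_features, length).transpose(0, 2, 1)
print(X.shape)  # (4, 50, 3)
```

A quick sanity check is that X[0, :, 0] equals the first 50 columns of the first row. If it doesn't, the features have been scrambled.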
3. Scaling

Two of our features are % changes -> already scaled.

However, closing prices are not scaled!

Why do we need to scale?

We're aiming to say "this time series is similar to that one" - it could be tricky to judge this if they're completely different on price scales.
3a. Scaling

Once prices are on the same scale, seeing if one curve looks like another is easier

Since our goal is to find time-series similarity, scaling is a good idea

How do we scale?

Let's use standard scaling - each time series is rescaled to have a mean of 0 and a standard deviation of 1.
3b. Scaling

Why this approach?

Compared with min-max scaling, it tends to be a little less sensitive to outliers.

But at the end of the day, it's worth trying other ways too - because they could be better for your particular problem.
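The standard scaling from step 3a can be sketched in plain numpy. The helper standard_scale_channel is hypothetical, and scaling each window on its own is my assumption about how it was applied. tslearn also ships a TimeSeriesScalerMeanVariance class in tslearn.preprocessing for this kind of job.

```python
import numpy as np

def standard_scale_channel(X, channel, eps=1e-8):
    """Z-score one feature of every window in an (n, length, features) array."""
    X = np.array(X, dtype=float)  # work on a copy
    series = X[:, :, channel]
    mean = series.mean(axis=1, keepdims=True)
    std = series.std(axis=1, keepdims=True)
    X[:, :, channel] = (series - mean) / (std + eps)  # eps guards flat windows
    return X

# Two windows with the same shape but very different price levels
X = np.zeros((2, 50, 3))
X[0, :, 0] = np.linspace(100, 150, 50)
X[1, :, 0] = np.linspace(10, 15, 50)

scaled = standard_scale_channel(X, channel=0)
print(np.allclose(scaled[0, :, 0], scaled[1, :, 0], atol=1e-6))  # True
```

After scaling, the $100 stock and the $10 stock trace out the same curve, which is exactly what we want before measuring similarity.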
4. Clustering

Clustering is basically a way to sort data into groups.

However, we decide how many groups we want before we start

Therefore, the big question is - how many groups do we need?
4a. Clustering

Enter the elbow method

Basically, we refit the model repeatedly, increasing K by 1 each time.

Each fit gives a value which measures how compact the groupings are.

Lower value = better fit.

Or does it?
4b. Clustering

If we push it too far, every point is its own cluster, which isn't very useful.

Long before that point, having too many clusters might lead to overfitting, and we want our model to be more general.

The kneed package sorts this out for us, choosing the K at the elbow.
4c. Clustering

The whole fitting process took me 153 seconds - not too bad!

The optimal k was 5.

We can also see that cluster 1 is the most popular in both train and test.

However, right now, this tells us nothing - let's view some charts!
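Counting cluster sizes and pulling a few members to chart takes only a few lines of numpy. The labels below are placeholders I made up; in practice they would be the labels returned by the fitted model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder labels; in the thread these would come from the fitted model
labels = np.array([1, 0, 1, 2, 1, 1, 3, 4, 1, 0, 2, 1])

counts = np.bincount(labels)          # examples per cluster
top = int(counts.argmax())            # most popular cluster
members = np.flatnonzero(labels == top)

# Pick up to 4 random members of that cluster to eyeball as charts
picks = rng.choice(members, size=min(4, members.size), replace=False)
print(counts, top, picks)
```

The indices in picks point back into the example array, so each one maps to a window you can plot.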
5. Analyse the Clusters

Let's pick 4 random charts from cluster 0 - the highlight is the clustered point.

Clearly, it's picking some similarity - looks to me as if they're all in strong downtrends.

However, are they strikingly similar?

IMO - not really. Let's improve it.


6. Improvement

My hypothesis is that there are simply more than 5 ways the time series can form in the market.

So perhaps the elbow method isn't too helpful here.

Instead - let's up the granularity and find many more clusters.
6a. Improvement

I chose 50 - you may ask why. I have no intelligent response.

The main reason was to just be more granular, and see if it helps!

After all, it's all exploration for now.

So, did it help?

Let's explore with some charts.
6b. Improvement

Since 10 was the most popular cluster, I decided to take 4 random charts from that.

The results are much better now.

There's definitely more similarity - all in uptrends with a pause in momentum.

Can we do better again?


7. Dynamic Time Warping

Enter dynamic time warping (DTW).

DTW is a way of comparing two time series a little more robustly than other distance measures.

Let's break it down and explain why this is helpful.
7a. DTW

In pointwise matching, we get a distance measure by summing the distances between the two series at each time step.

However, if we shift one of the time series along the time axis, the diffs on the y-axis may get large.

It's the same series, but the cost would be high.
7b. DTW

In DTW the points are matched up differently.

It's done so that their y distance is smaller - two identical time series still compare as similar even though one is shifted.

I won't go fully into the details here, because that's easily another 🧵
7c. DTW

By the way, this also works on time series of different lengths, which is a nice perk (as not all time-series patterns are created equal).

The outcome of the method is a single number, the warping cost.

Smaller cost = more similar time series.
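To make the idea concrete, here's a textbook dynamic-programming DTW for 1-D series in numpy. It's a teaching sketch (the function name dtw_cost is mine), not the optimised implementation tslearn uses, and the bump series are made up.

```python
import numpy as np

def dtw_cost(a, b):
    """Classic O(len(a) * len(b)) dynamic-programming DTW for 1-D series."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    n, m = len(a), len(b)
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = (a[i - 1] - b[j - 1]) ** 2
            # Cheapest way to reach (i, j): repeat a point of a, of b, or advance both
            acc[i, j] = step + min(acc[i - 1, j], acc[i, j - 1], acc[i - 1, j - 1])
    return float(np.sqrt(acc[n, m]))

bump = [0, 0, 1, 2, 1, 0, 0, 0]
shifted = [0, 0, 0, 0, 1, 2, 1, 0]   # same bump, two steps later
stretched = [0, 0, 1, 1, 2, 2, 1, 0]  # similar bump, different pacing

pointwise = float(np.linalg.norm(np.subtract(bump, shifted)))
print(pointwise)                             # about 3.16
print(dtw_cost(bump, shifted))               # 0.0
print(dtw_cost([0, 1, 2, 1, 0], stretched))  # 0.0, lengths differ
```

Pointwise matching punishes the shift, while DTW lines the bumps up and reports no cost at all. The nested loop is also why it gets slow on long series.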
7d. DTW

Why am I explaining all this?

tslearn allows you to select DTW by passing metric="dtw" as a keyword argument.

With this, we should expect a more robust time series comparison!
8. Cluster Analysis - DTW

Cluster 30 is the most popular here, let's check out some charts.

These look even more similar to me - strong uptrend followed by a momentum pause.

IMO - it's super neat that an algo can find this, with absolutely no direction!


8a. Cluster Analysis - DTW

However - this increased performance comes at a cost.

DTW is not a particularly parallelisable algo, meaning it's really slow.

In fact this case took me 3-5 business days to run.

I joke, it was actually 1971 seconds (about 33 minutes...)
9. Conclusions

So what did we learn here?

tslearn can be used to group like patterns in the market, without any direction, and completely on time-series data.

Euclidean distance does a good job and DTW a better one - but at the cost of being really slow.
9a. Conclusions

It's better to ask for more granularity - the elbow method is not particularly helpful for us here.

However, how granular to go still remains a question!

A topic for further research.
9b. Conclusions

Remember, this was on 3 listed stocks, 3 simple features, and to be fair, they're all similar ones at that (i.e. large tech stocks).

It's not the most "robust" approach, but illustrates the idea and should be easy to extend.
Anyway, that's enough rambling for today.

The code has been committed to my git repo, the link is in my bio.

And if you are interested in applying ML/data science to finance, then follow me @DrDanobi for more threads like these!
