Machine learning for finance can be tricky!
So let's explore with another step-by-step case study 🧵
This time, let's use ML to find time series patterns across multiple instruments.
The aim: find similar patterns in the market and build a scanner.
Enjoy!
OK, what ML tools are we using here?
This will be a clustering approach = unsupervised.
Unsupervised means we give the algorithm no guidance on what to look for.
I'm interested to see what patterns IT can find for me.
Yes, I'm lazy.
Algos like K-means clustering don't really have a concept of a time series.
They're great at other things, of course, but I want to find time series patterns.
So let's consider a special kind of clustering.
Time-series clustering, using tslearn.
1. Download and Feature Engineer
For now let's keep it simple:
• 3 tickers - just for exploring the idea
• yfinance - free daily data
• 3 features - Close prices, range (%), open to close (%)
You can experiment with other features & instruments later
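As a sketch of what this step might look like (the tickers, column names, and function name here are my assumptions, not the author's exact code; yfinance returns OHLC columns like "Open", "High", "Low", "Close"):

```python
import pandas as pd

def compute_features(df: pd.DataFrame) -> pd.DataFrame:
    """Add the 3 features to an OHLC DataFrame:
    close price, daily range (%), and open-to-close move (%)."""
    out = df.copy()
    out["close"] = out["Close"]
    out["range_pct"] = (out["High"] - out["Low"]) / out["Low"] * 100
    out["oc_pct"] = (out["Close"] - out["Open"]) / out["Open"] * 100
    return out

# With live data it might look like this (requires network access):
# import yfinance as yf
# data = {t: compute_features(yf.download(t)) for t in ["AAPL", "MSFT", "GOOG"]}
```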
2. Feature Reshaping
Time series algos require data in the shape (number of examples, time series length, number of features)
So first, let's create new cols in our df, one per feature & position in the time series
Since I want a length-50 time series over 3 features, I'll have 150 new cols
2a. Feature Reshaping
Also, I assume that adjacent examples offset by one day will be very similar - so let's undersample by selecting every 5th row.
This may save us some training time and still retain most of the variance in the examples.
2b. Feature Reshaping
With all these new cols, we now need to reshape into a 3D array
This is achieved with the attached code
I will say, this whole reshaping nonsense is tricky to wrap your head around! So don't worry if it's not immediately obvious 🙂
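Since the attached code isn't visible in this export, here's a minimal sketch of the whole step on a toy DataFrame (the function name, window length, and step size are my assumptions; the idea is shifted columns → every-5th-row undersampling → reshape to 3D):

```python
import numpy as np
import pandas as pd

def make_windows(df: pd.DataFrame, features: list,
                 window: int = 50, step: int = 5) -> np.ndarray:
    """Build shifted columns (one per feature & position in the series),
    undersample every `step`-th row, then reshape to
    (examples, window length, features)."""
    cols = {}
    for feat in features:
        for lag in range(window):
            # lag 0 holds the oldest point in the window, window-1 the newest
            cols[f"{feat}_{lag}"] = df[feat].shift(window - 1 - lag)
    wide = pd.DataFrame(cols).dropna().iloc[::step]
    # wide has window * len(features) columns -> reshape into a 3D array
    return (wide.to_numpy()
                .reshape(len(wide), len(features), window)
                .transpose(0, 2, 1))
```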
3. Scaling
Two of our features are % changes -> already scaled.
However, closing prices are not scaled!
Why do we need to scale?
We're aiming to say "this time series is similar to that one" - that's tricky to judge if they're on completely different price scales.
3a. Scaling
Once prices are on the same scale, seeing if one curve looks like another is easier
Since our goal is to find time-series similarity, scaling is a good idea
How do we scale?
Let's use standard scaling - so each time series has a mean of 0 and a standard deviation of 1
3b. Scaling
Why this approach?
It tends to be a little less sensitive to outliers.
But at the end of the day, it's worth trying other ways too - because they could be better for your particular problem.
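tslearn ships a scaler for exactly this; here's a numpy-only sketch of what per-series standard scaling does to our 3D array (function name is mine):

```python
import numpy as np

def scale_per_series(X: np.ndarray) -> np.ndarray:
    """Standard-scale each (example, feature) series independently.
    X has shape (examples, length, features); afterwards every series
    has mean 0 and standard deviation 1."""
    mu = X.mean(axis=1, keepdims=True)
    sigma = X.std(axis=1, keepdims=True)
    return (X - mu) / sigma

# The tslearn equivalent:
# from tslearn.preprocessing import TimeSeriesScalerMeanVariance
# X_scaled = TimeSeriesScalerMeanVariance().fit_transform(X)
```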
4. Clustering
Clustering is basically a way to sort data into groups.
However, we decide how many groups we want before we start
Therefore, the big question is - how many groups do we need?
4a. Clustering
Enter the elbow method
Basically, we refit the model, increasing K by 1 each time
Each fit gives a value which measures how compact the groupings are.
Lower the value = better fit.
Or does it?
4b. Clustering
If we push K too far, every point ends up as its own cluster, which isn't very useful.
In other words, too many clusters means overfitting - and we want our model to generalise.
The kneed package sorts this out for us, choosing the K at the elbow
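A hedged numpy-only sketch of the idea behind the elbow pick (kneed's KneeLocator is the real tool; this toy version just takes the point furthest below the chord joining the curve's endpoints):

```python
import numpy as np

def find_elbow(ks, inertias):
    """Pick the K at the elbow: the point furthest below the straight
    line (chord) joining the first and last points of the inertia curve.
    kneed.KneeLocator implements a more careful version of this idea."""
    ks = np.asarray(ks, dtype=float)
    vals = np.asarray(inertias, dtype=float)
    # interpolate the chord at each k, then find the biggest gap below it
    chord = vals[0] + (vals[-1] - vals[0]) * (ks - ks[0]) / (ks[-1] - ks[0])
    return int(ks[np.argmax(chord - vals)])

# The kneed equivalent:
# from kneed import KneeLocator
# k = KneeLocator(ks, inertias, curve="convex", direction="decreasing").elbow
```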
4c. Clustering
The whole fitting process took me 153 seconds - not too bad!
The optimal k was 5.
We can also see that cluster 1 is the most popular in both train and test.
However, right now, this tells us nothing - let's view some charts!
5. Analyse the Clusters
Let's pick 4 random charts from cluster 0 - the highlighted section is the clustered window.
Clearly, it's picking some similarity - looks to me as if they're all in strong downtrends.
However, are they strikingly similar?
IMO - not really. Let's improve it.
6. Improvement
My hypothesis is that there are simply more than 5 ways the time series can form in the market.
So perhaps the elbow method isn't too helpful here.
Instead - let's up the granularity and find many more clusters.
6a. Improvement
I chose 50 - you may ask why. I have no intelligent response.
The main reason was to just be more granular, and see if it helps!
After all, it's all exploration for now.
So, did it help?
Let's explore with some charts.
6b. Improvement
Since 10 was the most popular cluster, I decided to take 4 random charts from that.
The results are much better now.
There's definitely more similarity - all in uptrends with a pause in momentum.
Can we do better again?
7. Dynamic Time Warping
Enter dynamic time warping (DTW).
DTW is a way of comparing two time series a little more robustly than other distance measures.
Let's break it down and explain why this is helpful.
7a. DTW
In pointwise matching, we would aim to get a distance measure by summing all the distances between each point.
However, if one series is shifted in time, the point-to-point differences on the y-axis can get large.
It's the same shape, but the cost would be high.
7b. DTW
In DTW the points are matched up differently.
Points are matched so that the y-distances stay small - two series with the same shape compare as similar even when one is shifted in time.
I won't go full into the details here, because that's easily another 🧵
7c. DTW
By the way, this also works on time series of different lengths, which is a nice perk (as not all time-series patterns are created equal).
The outcome of the method is a single number, the warping cost.
Smaller cost = more similar the time series.
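For intuition, here's a minimal dynamic-programming implementation of that warping cost, using absolute differences as the pointwise cost (my choice for the sketch; tslearn.metrics.dtw is the optimized library version):

```python
import numpy as np

def dtw_cost(a, b):
    """Minimal DTW: accumulated cost of the cheapest monotonic
    alignment between two 1-D series. Works for series of
    different lengths too."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # each step may repeat a point of either series (the "warping")
            D[i, j] = abs(a[i - 1] - b[j - 1]) + min(
                D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]
```

A time-shifted copy of a series costs far less under DTW than under pointwise matching, which is exactly the robustness described above.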
7d. DTW
Why am I explaining all this?
tslearn lets you pass metric="dtw" as a keyword argument.
With this, we should expect a more robust time series comparison!
8. Cluster Analysis - DTW
Cluster 30 is the most popular here, let's check out some charts.
These look even more similar to me - strong uptrend followed by a momentum pause.
IMO - it's super neat that an algo can find this, with absolutely no direction!
8a. Cluster Analysis - DTW
However - this increased performance comes at a cost.
DTW is far more expensive to compute than Euclidean distance, meaning it's really slow.
In fact this case took me 3-5 business days to run.
I joke, it was actually 1971 seconds (30 minutes...)
9. Conclusions
So what did we learn here?
tslearn can be used to group like patterns in the market, without any direction, and completely on time-series data.
Euclidean does a good job, DTW a better one - but at the cost of being much slower.
9a. Conclusions
It's better to ask for more granularity - the elbow method wasn't particularly helpful for us here.
However, how granular still remains a question!
A topic for further research.
9b. Conclusions
Remember, this was on 3 listed stocks, 3 simple features, and to be fair, they're all similar ones at that (i.e. large tech stocks).
It's not the most "robust" approach, but illustrates the idea and should be easy to extend.
Anyway, that's enough rambling for today.
The code has been committed to my git repo, the link is in my bio.
And if you're interested in applying ML/data science to finance, follow me @DrDanobi for more threads like these!