Applying pruning to InceptionV3, for example, we can make the model 75% sparse and get this:
Accuracy: 78.1% -> 76.1%!!
Model size: 27M -> 6.8M!!!!
Pruning is compatible with Quantization and their benefits add up!
5/6🧵
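If you want to try this yourself, here's a minimal sketch using the TensorFlow Model Optimization Toolkit's prune_low_magnitude wrapper. The 75% target matches the numbers above; the schedule steps and training setup are placeholders you'd tune for your own fine-tuning run.

import tensorflow as tf
import tensorflow_model_optimization as tfmot

# Start from a pretrained InceptionV3 and wrap it for magnitude pruning
base_model = tf.keras.applications.InceptionV3(weights="imagenet")

pruning_schedule = tfmot.sparsity.keras.PolynomialDecay(
    initial_sparsity=0.0,
    final_sparsity=0.75,  # 75% of the weights become zeros
    begin_step=0,
    end_step=10000,       # placeholder: set to your number of fine-tuning steps
)
pruned_model = tfmot.sparsity.keras.prune_low_magnitude(
    base_model, pruning_schedule=pruning_schedule
)
pruned_model.compile(
    optimizer="adam",
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"],
)

# Fine-tune with tfmot.sparsity.keras.UpdatePruningStep() in the callbacks,
# then remove the pruning wrappers before exporting:
# final_model = tfmot.sparsity.keras.strip_pruning(pruned_model)

After stripping the wrappers you can convert the model and, if you want, stack post-training quantization on top, which is where the combined benefits come from.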
Do you use any of these techniques yet?
Do you have any stories with them to share?
I have some more optimization tips, but those are for later.
Don't forget to follow and share so you don't miss them!
6/6🧵
When you have your TensorFlow Model and want to use it on a mobile device, you'll need to convert it to the TFLite format.
This process can be done in two ways:
- Using the Python API
- Using a command line tool
Let's look into some more details…
1/6🧵
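To give a feel for the Python API route, here's a minimal sketch; the paths are placeholders for your own SavedModel and output file.

import tensorflow as tf

# Load a SavedModel, convert it, and write out the .tflite flatbuffer
converter = tf.lite.TFLiteConverter.from_saved_model("path/to/saved_model")
tflite_model = converter.convert()

with open("model.tflite", "wb") as f:
    f.write(tflite_model)

The command line tool does essentially the same thing in one shot, roughly: tflite_convert --saved_model_dir=path/to/saved_model --output_file=model.tflite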
Why do we need to convert?
TFLite uses an optimized format (FlatBuffers) that loads faster.
To keep the framework light and fast, the available operations are optimized for mobile execution, but not all TF operations are supported.
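If your model happens to use an op that isn't in the TFLite builtin set, the converter can fall back to a subset of regular TF ops (at the cost of a larger binary). A minimal sketch, assuming you're converting from a SavedModel:

import tensorflow as tf

converter = tf.lite.TFLiteConverter.from_saved_model("path/to/saved_model")
# Prefer the mobile-optimized builtins, but allow select TF ops as a
# fallback for anything not yet available in TFLite
converter.target_spec.supported_ops = [
    tf.lite.OpsSet.TFLITE_BUILTINS,
    tf.lite.OpsSet.SELECT_TF_OPS,
]
tflite_model = converter.convert()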
Usually we imagine Machine Learning models running on expensive servers with lots of memory and resources.
A change that is enabling completely new types of apps is executing ML models on the edge, on devices like phones and microcontrollers.
Let's find out why that matters
1/6🧵
Why would you want to run an ML model on a phone?
Lower latency: if you want real-time inference, running a model through a cloud API is not going to give a good user experience.
Running locally is much better and potentially much faster ⚡️
2/6🧵
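Running the model on-device is only a handful of calls. Here's a minimal sketch with the Python tf.lite.Interpreter (on Android or iOS you'd use the equivalent TFLite runtime); the model path and dummy input are placeholders.

import numpy as np
import tensorflow as tf

interpreter = tf.lite.Interpreter(model_path="model.tflite")
interpreter.allocate_tensors()

input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

# Feed a placeholder input matching the model's expected shape and dtype
dummy_input = np.zeros(input_details[0]["shape"], dtype=input_details[0]["dtype"])
interpreter.set_tensor(input_details[0]["index"], dummy_input)

interpreter.invoke()  # inference happens locally, no network round trip

prediction = interpreter.get_tensor(output_details[0]["index"])
print(prediction)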
When running ML models on-device, your app will be able to keep working even without network connectivity
For example, if your app translates text, it will still work in another country, when you need it most, and it won't spend your hard-earned money on roaming fees!