From a recent @martin_casado article posted on @a16z:

"Cloud infrastructure is a substantial – and sometimes hidden – cost for AI companies".

I'm sharing the techniques we use at @FloydHub_ to reduce this cost on #AWS and improve our gross margins [Thread] #ML #AI
0/ Not all AWS regions are priced the same. GPU instances can be up to 90% more expensive in some regions than in others. Besides cost, consider these when picking your AWS region: proximity to your geographical location, compliance requirements, and integration with any existing AWS infrastructure.
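A minimal sketch of comparing on-demand GPU prices across regions with the AWS Pricing API via boto3. The instance type and the region names are just examples, not a recommendation.

```python
import json
import boto3

# The Pricing API is only served from a couple of regions, e.g. us-east-1.
pricing = boto3.client("pricing", region_name="us-east-1")

def on_demand_price(instance_type, location):
    """Return the on-demand USD/hour price for a Linux instance, or None."""
    resp = pricing.get_products(
        ServiceCode="AmazonEC2",
        Filters=[
            {"Type": "TERM_MATCH", "Field": "instanceType", "Value": instance_type},
            {"Type": "TERM_MATCH", "Field": "location", "Value": location},
            {"Type": "TERM_MATCH", "Field": "operatingSystem", "Value": "Linux"},
            {"Type": "TERM_MATCH", "Field": "tenancy", "Value": "Shared"},
            {"Type": "TERM_MATCH", "Field": "preInstalledSw", "Value": "NA"},
            {"Type": "TERM_MATCH", "Field": "capacitystatus", "Value": "Used"},
        ],
        MaxResults=1,
    )
    for item in resp["PriceList"]:
        product = json.loads(item)
        for term in product["terms"]["OnDemand"].values():
            for dim in term["priceDimensions"].values():
                return float(dim["pricePerUnit"]["USD"])
    return None

# Example: compare one GPU instance type across a few regions.
for location in ["US East (N. Virginia)", "US West (Oregon)", "South America (Sao Paulo)"]:
    print(location, on_demand_price("p3.2xlarge", location))
```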
1/ Reserve your GPU instances and/or purchase Savings Plans. Review your GPU usage for the last 3-6 months and purchase 1-year plans based on that. This typically gives you 25-30% savings on your GPU bill.
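To size a reservation or Savings Plan, you can pull historical spend per GPU instance type from the Cost Explorer API. A rough sketch with boto3; the instance types and the date range are placeholders you'd swap for your own.

```python
import boto3

# Cost Explorer is a global service served from us-east-1.
ce = boto3.client("ce", region_name="us-east-1")

# Monthly unblended cost for a few GPU instance types over ~6 months.
resp = ce.get_cost_and_usage(
    TimePeriod={"Start": "2020-01-01", "End": "2020-07-01"},
    Granularity="MONTHLY",
    Metrics=["UnblendedCost"],
    Filter={
        "Dimensions": {
            "Key": "INSTANCE_TYPE",
            "Values": ["p2.xlarge", "p3.2xlarge"],  # adjust to your fleet
        }
    },
    GroupBy=[{"Type": "DIMENSION", "Key": "INSTANCE_TYPE"}],
)

for period in resp["ResultsByTime"]:
    for group in period["Groups"]:
        cost = float(group["Metrics"]["UnblendedCost"]["Amount"])
        print(period["TimePeriod"]["Start"], group["Keys"][0], f"${cost:.2f}")
```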
2/ Use Spot GPU instances if you can pull that off. This gives a huge ~55% saving on GPU cost, but the machines may not always be available. These instances can also be preempted at any time, so you need good model checkpointing and recovery mechanisms.
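The checkpoint/resume loop for spot training can be as simple as this PyTorch sketch. The `model`, `optimizer`, and checkpoint path are hypothetical; the key point is to write checkpoints to durable storage (EFS or S3), not the instance's local disk.

```python
import os
import torch

CHECKPOINT = "/mnt/efs/checkpoints/model_latest.pt"  # durable path (EFS or synced to S3)

def save_checkpoint(model, optimizer, epoch):
    """Persist everything needed to resume after a spot preemption."""
    torch.save(
        {"epoch": epoch,
         "model_state": model.state_dict(),
         "optimizer_state": optimizer.state_dict()},
        CHECKPOINT,
    )

def load_checkpoint(model, optimizer):
    """Resume from the last checkpoint if one exists; return the next epoch to run."""
    if not os.path.exists(CHECKPOINT):
        return 0
    state = torch.load(CHECKPOINT)
    model.load_state_dict(state["model_state"])
    optimizer.load_state_dict(state["optimizer_state"])
    return state["epoch"] + 1

# Training loop: checkpoint every epoch so a preemption costs at most one epoch.
# start_epoch = load_checkpoint(model, optimizer)
# for epoch in range(start_epoch, num_epochs):
#     train_one_epoch(model, optimizer)
#     save_checkpoint(model, optimizer, epoch)
```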
3/ Monitor machines to reduce waste. Often overlooked, but one of the biggest ways to cut cost. Identify machines that have been running longer than expected and notify the team on Slack/email. After a few reminders, the team will remember to turn off their cloud machines when done!
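A minimal sketch of that kind of watchdog, using boto3 and a Slack incoming webhook. The webhook URL and the 12-hour threshold are assumptions; tune them to your team.

```python
from datetime import datetime, timezone

import boto3
import requests

SLACK_WEBHOOK = "https://hooks.slack.com/services/XXX/YYY/ZZZ"  # placeholder
MAX_HOURS = 12  # flag anything running longer than this

ec2 = boto3.client("ec2")
now = datetime.now(timezone.utc)

long_running = []
paginator = ec2.get_paginator("describe_instances")
for page in paginator.paginate(
    Filters=[{"Name": "instance-state-name", "Values": ["running"]}]
):
    for reservation in page["Reservations"]:
        for inst in reservation["Instances"]:
            hours = (now - inst["LaunchTime"]).total_seconds() / 3600
            if hours > MAX_HOURS:
                long_running.append(
                    f"{inst['InstanceId']} ({inst['InstanceType']}): up {hours:.0f}h"
                )

if long_running:
    requests.post(
        SLACK_WEBHOOK,
        json={"text": "Long-running instances:\n" + "\n".join(long_running)},
    )
```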
4/ Intelligently auto-scale infra. Reduce the waiting time of data scientists wherever possible while balancing the cost of running extra machines. Remember the $ per hour of a Data Scientist is ~100x the $ per hour of a K80 machine.
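To make that trade-off concrete, here is a purely illustrative scale-up rule under the assumptions above. The rates and the `pending_jobs`/`idle_instances` inputs are hypothetical placeholders, not an actual autoscaler.

```python
# Illustrative only: scale up whenever an hour of a data scientist waiting
# costs more than an hour of an extra GPU machine.
DS_HOURLY_RATE = 100.0   # rough fully-loaded $/hour of a data scientist (assumption)
K80_HOURLY_RATE = 0.90   # approximate on-demand $/hour of a K80 (p2.xlarge)

def instances_to_add(pending_jobs: int, idle_instances: int) -> int:
    """Return how many machines to launch for currently blocked jobs."""
    blocked = pending_jobs - idle_instances
    if blocked <= 0:
        return 0
    # Each blocked job burns ~1 data-scientist hour per hour of waiting,
    # so as long as DS_HOURLY_RATE > K80_HOURLY_RATE, launching wins.
    return blocked if DS_HOURLY_RATE > K80_HOURLY_RATE else 0

print(instances_to_add(pending_jobs=5, idle_instances=2))  # -> 3
```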
5/ Use all tiers of S3 storage. Use a lifecycle management rule on S3 to shuttle unused data to less expensive and slower storage tiers. Eventually move really old datasets to Glacier, which is ~80% cheaper than standard S3.
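A sketch of such a lifecycle rule with boto3. The bucket name, prefix, and day thresholds are examples: transition to Standard-IA after 30 days and to Glacier after 180.

```python
import boto3

s3 = boto3.client("s3")

s3.put_bucket_lifecycle_configuration(
    Bucket="my-training-data",  # placeholder bucket name
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "tier-down-old-datasets",
                "Status": "Enabled",
                "Filter": {"Prefix": "datasets/"},
                "Transitions": [
                    {"Days": 30, "StorageClass": "STANDARD_IA"},
                    {"Days": 180, "StorageClass": "GLACIER"},
                ],
            }
        ]
    },
)
```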
6/ Use the Infrequent Access option in EFS. Just like S3, use a lifecycle policy to move less frequently used files to Infrequent Access. That tier is about 10x cheaper than standard EFS.
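The EFS equivalent is a one-call lifecycle policy. A sketch with boto3; the file system ID is a placeholder.

```python
import boto3

efs = boto3.client("efs")

# Move files not accessed for 30 days to the Infrequent Access storage class.
efs.put_lifecycle_configuration(
    FileSystemId="fs-0123456789abcdef0",  # placeholder
    LifecyclePolicies=[{"TransitionToIA": "AFTER_30_DAYS"}],
)
```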
7/ Periodically purge unused data. ML training generates a ton of files for each run. Left unchecked, this means continuously growing storage costs. So periodically purge training data that the team no longer needs.
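One way to do this: delete per-run artifacts in S3 older than a cutoff. The bucket, prefix, and 90-day retention below are assumptions; in practice you would also check that a run isn't still referenced before deleting it.

```python
from datetime import datetime, timedelta, timezone

import boto3

s3 = boto3.client("s3")
BUCKET = "my-training-runs"   # placeholder
PREFIX = "runs/"              # where per-run outputs live
cutoff = datetime.now(timezone.utc) - timedelta(days=90)

paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=BUCKET, Prefix=PREFIX):
    stale = [
        {"Key": obj["Key"]}
        for obj in page.get("Contents", [])
        if obj["LastModified"] < cutoff
    ]
    if stale:
        s3.delete_objects(Bucket=BUCKET, Delete={"Objects": stale})
        print(f"Deleted {len(stale)} stale objects")
```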
8/ Use multi-GPU machines only if you really need the training speed-up. The cost of multi-GPU machines scales linearly but performance does not: an 8-GPU machine costs 8x a single-GPU machine but trains neural nets only ~5x faster, i.e. roughly 60% more cost per unit of training. So avoid it unless wall-clock time really matters.
9/ Use AMIs specifically optimized for the hardware you are using. #Tensorflow can be recompiled for the specific CPU architecture you are running on, which can give up to an 8x performance boost on CPU training. Similarly, use a GPU-specific TF build when running on GPU hardware.
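On AWS, the easiest way to get hardware-tuned builds is the official Deep Learning AMIs, which ship TensorFlow builds tuned for Intel CPUs and for CUDA GPUs. A sketch of picking the latest one with boto3; the name filter is an example pattern, adjust it to the AMI family you actually use.

```python
import boto3

ec2 = boto3.client("ec2")

# Find the most recent official Deep Learning AMI for Ubuntu.
resp = ec2.describe_images(
    Owners=["amazon"],
    Filters=[{"Name": "name", "Values": ["Deep Learning AMI (Ubuntu 18.04) Version *"]}],
)
latest = max(resp["Images"], key=lambda img: img["CreationDate"])
print(latest["ImageId"], latest["Name"])
```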
I love geeking out on #ML #infrastructure. I’m happy to brainstorm more ideas based on your AWS setup. Reply to this thread if you think it’ll be useful to you.