Santiago
Apr 17 · 11 tweets · 3 min read
Using AI is fun, but don't stop there.

The real leverage comes from learning how to build these models. Who do you think will come out on top in the next 5 years?

Here is where I'd focus if I were to start from the beginning:
99% of people start talking about Machine Learning and Mathematics way too early.

I'd recommend you start with Python and one of the skills that most people ignore:

Web scraping.
How do you think companies are training their Large Language Models? Where do you think the data come from?

Web scraping allows you to get data from any public website on the internet.

If you know how to get data, you'll have the edge over everyone else.
Here is how you scrape a website:

1. Request the website URL.
2. Identify the location of the data in the HTML code.
3. Parse and extract that data.
4. Convert the data into a structured format.
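
The four steps above can be sketched in a few lines of Python. This is a minimal illustration that uses only the standard library so it runs without extra installs; in practice a parsing library would replace the hand-written parser. The sample HTML, the `product`, `name`, and `price` class names, and all function names are made up for the example.

```python
import csv
import io
from html.parser import HTMLParser
from urllib.request import Request, urlopen


def fetch_html(url):
    """Step 1: request the page and return its HTML as text."""
    request = Request(url, headers={"User-Agent": "example-scraper/0.1"})
    with urlopen(request, timeout=10) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


class ProductParser(HTMLParser):
    """Steps 2 and 3: locate <span class="name"> and <span class="price">
    inside each <li class="product"> and extract their text."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._field = None  # name of the field currently being read, if any

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "li" and "product" in classes:
            self.rows.append({"name": "", "price": ""})
        elif tag == "span" and self.rows:
            for field in ("name", "price"):
                if field in classes:
                    self._field = field

    def handle_endtag(self, tag):
        if tag == "span":
            self._field = None

    def handle_data(self, data):
        if self._field is not None:
            self.rows[-1][self._field] += data.strip()


def parse_products(html):
    parser = ProductParser()
    parser.feed(html)
    return parser.rows


def to_csv(rows):
    """Step 4: convert the extracted rows into a structured format (CSV)."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["name", "price"], lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical page content, inlined so the example runs offline.
# With a real site you would call fetch_html("https://...") instead.
SAMPLE_HTML = """
<ul>
  <li class="product"><span class="name">Widget</span> <span class="price">9.99</span></li>
  <li class="product"><span class="name">Gadget</span> <span class="price">24.50</span></li>
</ul>
"""

rows = parse_products(SAMPLE_HTML)
csv_text = to_csv(rows)
print(csv_text)
```
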
Python developers can focus on these two libraries:

1. Playwright: To automate browser activities.
2. BeautifulSoup: To parse HTML documents.

You will also need to know HTML.
There is one challenge with web scraping:

Websites can block your IP to prevent you from scraping their public data.

There are three ways you can deal with this:

1. Slow down your crawling
2. Use a dynamic IP address
3. Use proxy servers
If you slow down your requests to the site, you may not get blocked.
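
One way to slow down is to enforce a minimum gap between requests and add a little random jitter. This is only a sketch under that assumption; the `Throttle` class and its parameters are illustrative names, not part of any library, and the right delay depends on the site.

```python
import random
import time


class Throttle:
    """Enforce a minimum delay (plus optional random jitter) between calls."""

    def __init__(self, min_delay, jitter=0.0, clock=time.monotonic, sleep=time.sleep):
        self.min_delay = min_delay
        self.jitter = jitter
        self._clock = clock      # injectable so the class can be tested without waiting
        self._sleep = sleep
        self._last_call = None

    def wait(self):
        """Block until enough time has passed since the previous call.

        Returns the number of seconds slept (0.0 on the first call).
        """
        now = self._clock()
        slept = 0.0
        if self._last_call is not None:
            target = self.min_delay + random.uniform(0.0, self.jitter)
            remaining = target - (now - self._last_call)
            if remaining > 0:
                self._sleep(remaining)
                slept = remaining
        self._last_call = self._clock()
        return slept


# Usage (URLs and the download step are placeholders):
# throttle = Throttle(min_delay=2.0, jitter=1.0)
# for url in urls:
#     throttle.wait()
#     download(url)
```
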

If you use a dynamic IP instead of a static one, you may not get blocked.

But the only sure way to avoid getting blocked when collecting a lot of data is to use proxies.
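
Here is a rough sketch of rotating requests across a pool of proxies using Python's standard library. The proxy addresses are placeholders from a documentation-reserved IP range, and `ProxyRotator` is an illustrative name; a real pool would come from your proxy provider.

```python
import itertools
import urllib.request


class ProxyRotator:
    """Cycle through a list of proxy URLs, handing out one per request."""

    def __init__(self, proxies):
        if not proxies:
            raise ValueError("at least one proxy is required")
        self._pool = itertools.cycle(proxies)

    def next_proxy(self):
        return next(self._pool)

    def build_opener(self):
        """Return a urllib opener that routes traffic through the next proxy."""
        proxy = self.next_proxy()
        handler = urllib.request.ProxyHandler({"http": proxy, "https": proxy})
        return urllib.request.build_opener(handler)


# Placeholder addresses from the 203.0.113.0/24 documentation range.
PROXIES = [
    "http://203.0.113.10:8080",
    "http://203.0.113.11:8080",
    "http://203.0.113.12:8080",
]

rotator = ProxyRotator(PROXIES)

# Usage (not run here because the proxies above are not real):
# opener = rotator.build_opener()
# with opener.open("https://example.com", timeout=10) as response:
#     html = response.read()
```

Each call to `build_opener()` moves to the next proxy, so consecutive requests leave from different addresses.
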
The easiest solution I've found is @bright_data's Scraping Browser API.

You can use it to collect data from any website. It's fast and scalable without worrying about proxies, CAPTCHAs, or other blockers.

I wrote an example. [Image]
You can use @bright_data and web scraping for much more than collecting data to build AI models.

Here are a few more use cases:

1. Scrape marketplaces to identify counterfeiters.
2. Analyze the performance of your competitors' social media campaigns.
3. Collect businesses' financial status from public sources to calculate credit rating scores.
4. Collect retailers' prices to ensure they follow a manufacturer's pricing guidelines.
5. Scrape competitors' prices to understand how to price your products.
I wrote this thread in partnership with @bright_data.

Their toolkit has everything you need if you want to do web scraping seriously.

You can get started here: get.brightdata.com/scrapingbrowse…
