In other words, you’re trying to predict whether they’ll like the movie.
The Qs might go:
1. Do you like sci-fi movies?
2. If yes, are you okay with movies with some violence?
3. If yes, do you like Jeff Goldblum?
4. If you don’t like Goldblum, do you like Laura Dern?
And so on.
After a while, you think you have a questionnaire that, by the end, can tell you whether to recommend Jurassic Park to them.
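Here's a rough sketch of what that hand-written questionnaire could look like as code. The answer keys (`likes_scifi`, `ok_with_violence`, and so on) and the final yes/no outcomes are my own assumptions for illustration; the questions above only say what to ask, not what each answer should lead to.

```python
def recommend_jurassic_park(answers):
    """Walk a hand-written questionnaire, one question at a time.

    `answers` is a dict of yes/no answers. The key names are hypothetical.
    """
    # Q1: not a sci-fi fan? Stop here.
    if not answers.get("likes_scifi", False):
        return False
    # Q2: sci-fi fan, but not okay with some violence? Stop here.
    if not answers.get("ok_with_violence", False):
        return False
    # Q3: a Goldblum fan is assumed to be a safe recommendation.
    if answers.get("likes_goldblum", False):
        return True
    # Q4: fallback question for people who don't like Goldblum.
    return answers.get("likes_dern", False)


friend = {
    "likes_scifi": True,
    "ok_with_violence": True,
    "likes_goldblum": False,
    "likes_dern": True,
}
print(recommend_jurassic_park(friend))
```

Every branch here is a guess made by a person. The interesting part is what happens when the guesses come from data instead.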
One class of machine learning algorithm, called decision trees, makes “questionnaires” kind of like this.
Except if a data scientist at a big streaming company decides what the Qs should be, they have access to millions of users’ worth of watching histories, including whether they watched & finished (or started & didn’t finish) Jurassic Park or similar movies (like the sequels).
They write (or use pre-written) code to comb through the data and see which questions are most effective, meaning the resulting recommendation most often matches what users actually chose to watch.
The algorithm can make a MASSIVE number of educated guesses about the right Qs to ask by processing the data in matrix form, and it keeps iterating to minimize a mathematical formulation of the error (the penalty for a wrong result).
This is an example of machine learning.
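To make the "comb through the data" step a bit more concrete, here's a minimal sketch of how one question could be picked. The six users below are made up, and counting wrong guesses is a simplified stand-in: real decision-tree libraries typically score questions with impurity measures such as Gini or entropy, and they repeat this step to build the whole tree.

```python
# Made-up viewing data: each row is one hypothetical user's yes/no answers,
# plus whether they finished Jurassic Park.
USERS = [
    {"likes_scifi": True,  "likes_goldblum": True,  "has_kids": False, "finished": True},
    {"likes_scifi": True,  "likes_goldblum": False, "has_kids": True,  "finished": True},
    {"likes_scifi": True,  "likes_goldblum": True,  "has_kids": True,  "finished": True},
    {"likes_scifi": False, "likes_goldblum": True,  "has_kids": False, "finished": False},
    {"likes_scifi": False, "likes_goldblum": False, "has_kids": True,  "finished": False},
    {"likes_scifi": True,  "likes_goldblum": False, "has_kids": False, "finished": False},
]
QUESTIONS = ["likes_scifi", "likes_goldblum", "has_kids"]


def error_if_split_on(question, users):
    """Count wrong guesses if we split on one question and then predict
    the majority outcome for the 'yes' group and for the 'no' group."""
    errors = 0
    for answer in (True, False):
        group = [u["finished"] for u in users if u[question] == answer]
        if group:
            # Ties are broken in favor of "finished".
            majority = sum(group) * 2 >= len(group)
            errors += sum(1 for finished in group if finished != majority)
    return errors


def best_question(questions, users):
    """Pick the question whose split makes the fewest wrong guesses."""
    return min(questions, key=lambda q: error_if_split_on(q, users))


for q in QUESTIONS:
    print(q, error_if_split_on(q, USERS))
print("ask first:", best_question(QUESTIONS, USERS))
```

Notice that nothing in the code knows what sci-fi is. It only counts which question splits the users into groups with the fewest wrong guesses.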
The result could be a really good set of questions that a human would need pretty good cinema knowledge to match.
And the power is SCALE: the machine can then do this for every single user on the platform, and for any single movie.
And it can do this without any real knowledge about movies!
The algorithm has never watched a single movie; it's just good at "learning" that asking about sci-fi movies and Jeff Goldblum (maybe, I don't actually know) helps when guessing if someone will like Jurassic Park.
This is *one type* of machine learning problem called classification (predicting a categorical outcome, in this case whether a user will “like”, or click on and watch, Jurassic Park).
And the example here involves a tree-based algorithm.
This example also has structured data (viewing logs of users, and labels about the content they viewed), and the algorithm must self-tune to find the right sets of questions to achieve the best results.
There are many approaches to making good movie recommendations (e.g. you could use the similarity of movies or similarity of user behavior), and this is a simplified example.
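As a taste of the "similarity of movies" idea, here's a small sketch that compares movies by who finished them. The watch matrix is invented, and cosine similarity is just one common choice of similarity measure, not necessarily what any particular streaming service uses.

```python
import math

# Rows are movies, columns are four hypothetical users
# (1 = finished the movie, 0 = did not). The data is made up.
WATCHED = {
    "Jurassic Park":  [1, 1, 0, 1],
    "The Lost World": [1, 1, 0, 0],
    "Notting Hill":   [0, 0, 1, 0],
}


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def most_similar(title, watched):
    """Return the other movie whose audience overlaps most with `title`."""
    target = watched[title]
    others = (t for t in watched if t != title)
    return max(others, key=lambda t: cosine_similarity(target, watched[t]))


print(most_similar("Jurassic Park", WATCHED))
```

With this approach, someone who finished the sequel would be nudged toward the original because the two audiences overlap, again without the code knowing anything about dinosaurs.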
There are many other kinds of ML problems and algorithms.
For example,
- finding patterns like clusters of “also watched” movies
- finding topics & themes & sentiment in the script
- identifying on-screen text from handwritten text and signage
- facial recognition of actors
And it's probably occurred to you already that you can use these outside of the movie/streaming industry.
There are algorithms all over the internet (and irl) that are trying to decide what you might be interested in, all based on your behavioral and other data.
Algorithms will try (are trying) to predict
- whether a credit card transaction is fraudulent
- whether you’ll click on an ad and buy
- whether you'll default on a loan
- what your political leanings are
- whether you have (or want) young children
Are you uncomfortable yet?
Most ML algorithms have a few things in common:
- They are trained and executable on massive amounts of data. (More on the training aspect later!)
- The “machine” doesn’t understand the actual subject matter.
…
…
- The "machine" relies on a mathematical abstraction of the objective, and of how well it’s achieving the objective.
- It doesn't need to explain its choices.
- It doesn’t necessarily know about inaccuracies or biases in the data.
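The "mathematical abstraction of the objective" in the list above can be as simple as a function that scores guesses against what actually happened. Here's a sketch of two common scores for yes/no predictions. The function names are mine, and which score a given system actually minimizes is an assumption that depends on the algorithm.

```python
import math


def error_rate(predictions, actual):
    """Fraction of yes/no guesses that were wrong."""
    wrong = sum(1 for p, a in zip(predictions, actual) if p != a)
    return wrong / len(actual)


def log_loss(probabilities, actual):
    """Average penalty for predicted probabilities of 'yes'.

    Confident wrong guesses are penalized much more than hesitant ones.
    """
    eps = 1e-12  # keep probabilities away from 0 and 1 so log() is defined
    total = 0.0
    for p, a in zip(probabilities, actual):
        p = min(max(p, eps), 1 - eps)
        total += -math.log(p) if a else -math.log(1 - p)
    return total / len(actual)


actual = [True, True, True, False]
print(error_rate([True, False, True, True], actual))
print(log_loss([0.9, 0.4, 0.8, 0.7], actual))
```

The machine only ever sees numbers like these. Whether a lower score means better recommendations for real people depends entirely on how well the score and the data reflect reality.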
I hope that was helpful in conceptualizing what ML is, and in thinking about the places where it could be used.
I'll talk later about some practical considerations that ML practitioners think about, like the importance of training data and model testing.
I'll also devote a thread (or a few) to the important topics of bias in ML/AI models and the MANY open ethical questions around the application of this technology, including how to use it responsibly (and if it can be used responsibly), and who bears that responsibility.
Tuesdays are usually my most eventful days of the week, by design.
I have 9 meetings today, the earliest at 11 AM and the latest ending at 10:30 PM.
And you will ask: "Why, Taka, WHY in the name of reason and science would you do this to yourself?"
1. We have a regular call of all the department leads. That includes people in Korea, US ET + PT. Right now, we have it in the evening my time.
2. To minimize the number of evenings I'm in meetings, I put the rest of my meetings with Korea teams back-to-back with the above.
3. I also put cross-team meetings with stakeholders on Tuesdays, during the US day. These only happen every other week… but why also put them on Tuesdays!?
There are THREE reasons why I do this to myself (and my team, what a monstrous boss!).
Recent data: ~80% of technical roles in the biggest tech companies are held by men. wired.com/story/five-yea…
Further: ~92% of Fortune 500 CEOs are men, and I have yet to meet a female CTO.
Let me preface my answer to the question with Angela Davis’ famous quote: “In a racist society, it is not enough to be non-racist, we must be anti-racist.”
The point applies not only to racism, but to systemic inequity of all forms.
Users can read stories for free. For most stories, they'll hit a wall after several chapters. They can wait for the next chapter to unlock in an hour, or they can use in-app currency to read it right away.
Authors can publish their content, and we share the revenue it generates.
We have about 50 employees, about half of whom are in the US and half of whom are in Korea.
The DS team has 5 members, so we're a pretty big fraction of the company.
I’ll share a little bit here about my work day, throughout the day as I find time.
Here’s my “office,” a cramped corner of a bedroom where I’ve been doing all my work for almost a year—including interviewing for and hiring my teammates at the current job.
You will notice that my desk blocks the dresser door. It’s just as well—it’s not like I need blazers, suits, or ties these days!
Also, thank goodness for Zoom backgrounds! (I prefer astro images or Totoro for mine.)
My team starts the week with a 10AM sync on Monday.
How were our weekends? What are we working on this week, and what’s coming up? Anything holding us back? What are we looking forward to? Anyone taking days off?
It’s a lot of work and a lot of responsibility to be accountable for the company’s entire DS practice, and for people’s jobs and professional growth. It’s not for everyone! There are weeks when I don’t get to code at all.
You don’t have to be a manager or dept head to grow your career. You can be principal, lead, or senior data scientist, and you can accept some mentoring responsibilities but fight away the managerial and strategic ones.
(Sometimes it isn’t clear you have this choice in small companies. But in my experience, small companies also offer the most flexibility for crafting your role & growth.)