GenAI @Youtube | Building AI powered video editing | ex : @Google Search & @Microsoft Azure | 3x hackathon winner | Views my own
Dec 4 • 8 tweets • 2 min read
I love discussing: "Design a system for 1 million users."
Most jump to tech: "I'll use Kafka, Kubernetes, blah blah!"
Game over 🥵
I can tell you this is a red flag.
The best engineers start with a question, not a technology.
A question that reveals if they truly understand scale...
The magic keyword is Access Pattern.
A million users reading a news article is a completely different universe than a million users hitting the "like" button.
One is a broadcast problem (one-to-many).
The other is an ingestion problem (many-to-one).
Treating them the same is the #1 reason scalable systems fail.
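The contrast above can be sketched in a few lines of Python. This is a toy illustration (the dicts stand in for a real cache and database, and the function names are hypothetical): reads of one hot article are a broadcast problem you solve with a cache, while a flood of likes is an ingestion problem you solve by batching writes.

```python
from collections import Counter

# Broadcast (one-to-many): a million reads of one article.
# Serve from a cache so the database sees ~1 read, not 1M.
article_cache = {}

def read_article(article_id, db):
    if article_id not in article_cache:          # cache miss: hit the DB once
        article_cache[article_id] = db[article_id]
    return article_cache[article_id]

# Ingestion (many-to-one): a million writes to one counter.
# Buffer likes in memory and flush one aggregated write per batch.
like_buffer = Counter()

def like(post_id):
    like_buffer[post_id] += 1                    # cheap in-memory increment

def flush(db_counts):
    for post_id, n in like_buffer.items():       # one write per post, not per like
        db_counts[post_id] = db_counts.get(post_id, 0) + n
    like_buffer.clear()
```

Same "1 million users", two opposite data flows — which is exactly why the access pattern question comes first.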
Sep 26 • 8 tweets • 2 min read
You are in an MLE interview at Perplexity
and the interviewer asks
"Our RAG system is failing. It's pulling the right documents from our knowledge base but the final answer is still factually wrong. What do you do?"
You say: "The LLM is hallucinating. We need to improve our prompt or use a better model."
Instant red flag.
Here's how you break it down 👇
You see the flaw?
"The LLM is hallucinating" is lazy.
You've ignored half of the system. The problem isn't just the generator, it's the interaction between the retriever and the generator.
Here's the diagnostic framework for fixing broken RAG:
Deconstruct the Prompt:
The final prompt sent to the LLM is (user_query + retrieved_context).
The bug is often here.
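A minimal sketch of that deconstruction step, with a hypothetical `build_prompt` helper: log the exact string the LLM sees and check that the fact you expect actually survived chunking, ranking, and truncation.

```python
def build_prompt(user_query, retrieved_chunks, max_chars=4000):
    """Assemble the exact string the LLM will see (hypothetical format)."""
    context = "\n---\n".join(retrieved_chunks)[:max_chars]  # truncation: a common silent bug
    return (
        "Answer ONLY from the context below. If the answer is not "
        "in the context, say you don't know.\n\n"
        f"Context:\n{context}\n\nQuestion: {user_query}\nAnswer:"
    )

prompt = build_prompt(
    "When was the SLA changed?",
    ["Doc A: SLA changed on 2024-03-01.", "Doc B: pricing tiers..."],
)

# Debugging step: does the retrieved fact actually appear in the final prompt?
# If this fails, retrieval was "right" but the prompt the model saw was wrong.
assert "2024-03-01" in prompt
```

If the fact is in the prompt and the answer is still wrong, *then* you look at the generator — not before.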
Sep 25 • 7 tweets • 1 min read
You're in an ML interview at Meta
The interviewer asks:
"How do you choose a vector database?"
You reply:
"I'd benchmark recall and latency."
That's only 20% of the problem.
Here's how you break it down 🧵:
A vector database isn't just a search algorithm, it's a production data system. You should evaluate it like a database, not just a library.
The wrong choice will crush you with operational costs, not slow queries.
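For the 20% that *is* benchmarking, recall@k is the standard number: compare the approximate index's top-k against exact brute-force search. A sketch (NumPy, with a random subsample standing in for a real ANN index):

```python
import numpy as np

def recall_at_k(approx_ids, exact_ids, k=10):
    """Fraction of the true top-k neighbors the ANN index actually returned."""
    return len(set(approx_ids[:k]) & set(exact_ids[:k])) / k

rng = np.random.default_rng(0)
corpus = rng.normal(size=(1000, 64))
query = rng.normal(size=64)

# Ground truth: exact (brute-force) nearest neighbors by L2 distance.
exact = np.argsort(np.linalg.norm(corpus - query, axis=1))[:10]

# Stand-in for an ANN index: exact search over a random 50% sample of the corpus.
sample = rng.choice(1000, size=500, replace=False)
approx = sample[np.argsort(np.linalg.norm(corpus[sample] - query, axis=1))][:10]

recall = recall_at_k(list(approx), list(exact))
```

The other 80% — backups, reindexing cost, filtering, consistency, ops burden — doesn't fit in a micro-benchmark, which is the point.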
Sep 23 • 9 tweets • 2 min read
A friend of mine recently bombed an MLE interview at NVIDIA. They asked:
"We need to deploy a Llama-3 70B model on hardware with limited VRAM. You propose quantization. When is this a bad idea?"
Here's how you break it down:
Most candidates say: "Quantization is great, it makes models faster and smaller by using lower-precision numbers like INT8 or FP8. It's a win-win."
This answer misses the entire point of the question. Quantization is a trade-off, and if you don't know the risks, you will break production.
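One concrete risk worth naming: outliers. In symmetric INT8 quantization the scale is set by the largest weight, so a single outlier steals precision from every other weight. A toy NumPy sketch (not a real Llama-3 layer — just the mechanism):

```python
import numpy as np

def int8_quantize(w):
    """Symmetric per-tensor INT8: scale by the max |weight|."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

rng = np.random.default_rng(0)
w = rng.normal(0, 0.02, size=10_000)

# Well-behaved weights: quantization error is tiny.
q, s = int8_quantize(w)
err_normal = np.abs(w - q * s).max()

# One outlier blows up the scale, so every OTHER weight loses precision.
w_out = w.copy()
w_out[0] = 5.0
q2, s2 = int8_quantize(w_out)
err_outlier = np.abs(w_out[1:] - q2[1:] * s2).max()
```

Large LLMs are known to have exactly these outlier activations — which is why naive INT8 can quietly wreck accuracy even when the benchmark latency looks great.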
Jul 17 • 12 tweets • 2 min read
If I had to start learning system design from scratch right now from first principles…
I would do this ->
(100% unfiltered advice)
Listen,
You only need curiosity; everything else is noise.
The moment you do it to just crack interviews, you are doomed.
Stop right there.
Jun 23 • 7 tweets • 2 min read
Your user database is breached. The hackers have the users' data.
If you stored passwords in plain text, it's game over.
If you stored them as MD5 or SHA hashes, you might feel safe. You are not.
Why is the "standard" hashing approach a trap? 🧵
Here comes Salted Hashing.
The flaw in simple hashing is that the same password always produces the same hash: "password123" will always hash to the same digest (its SHA-256 begins ef92b778bafe771e89245b89ec5b8c55).
Hackers don't crack these one by one. They use pre-computed "Rainbow Tables" that map billions of common hashes back to their passwords.
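Here's what salted hashing looks like with nothing but Python's standard library — a sketch using PBKDF2 (a deliberately slow hash; the iteration count is a tunable cost knob):

```python
import hashlib
import hmac
import os

def hash_password(password, salt=None):
    """Salted, slow hash: a random per-user salt defeats rainbow tables,
    and PBKDF2's iteration count makes brute force expensive."""
    salt = salt or os.urandom(16)                 # unique salt per user
    digest = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 600_000)
    return salt, digest

def verify(password, salt, stored):
    # Constant-time comparison avoids timing side channels.
    return hmac.compare_digest(hash_password(password, salt)[1], stored)

salt, stored = hash_password("password123")

# Same password, different salt -> completely different hash.
# A precomputed rainbow table is now useless.
_, other = hash_password("password123")
```

In production you'd reach for a dedicated KDF (bcrypt, scrypt, or Argon2), but the principle — random salt plus a slow hash — is exactly this.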
Jun 21 • 10 tweets • 3 min read
You're tasked with monitoring your fleet of 10,000 servers.
"Have every server report its CPU usage every second to a central PostgreSQL database. You'll get a nice table of (timestamp, server_id, cpu_usage)."
This plan will work for about 15 minutes, until your database disk fills up and its write performance grinds to a halt. 🧵
The magic keyword is Time-Series Database (TSDB).
Relational databases are designed for complex relationships and transactional integrity. They are fundamentally the wrong tool for the firehose of data that is system metrics.
Your job isn't to store events. It's to efficiently store and query billions of timestamped numerical measurements.
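The arithmetic makes the problem obvious. Back-of-envelope, assuming a plain (timestamp, server_id, cpu_usage) row and typical per-row overhead:

```python
servers = 10_000
writes_per_sec = servers * 1            # one CPU sample per server per second
rows_per_day = writes_per_sec * 86_400  # seconds in a day

# Naive relational row: timestamp (8 B) + server_id (4 B) + float (8 B),
# plus roughly ~24 B of per-row overhead in PostgreSQL, before any indexes.
bytes_per_row = 8 + 4 + 8 + 24
gb_per_day = rows_per_day * bytes_per_row / 1e9

print(rows_per_day)          # 864,000,000 rows per day
print(round(gb_per_day, 1))  # tens of GB per day, before indexes and WAL
```

A TSDB survives this by exploiting the data's shape — append-only, ordered by time, highly compressible — which a generic row store can't assume.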
Jun 16 • 8 tweets • 2 min read
Your database query is slow.
"Add an index."
Problem solved, right?
Let's talk about the hidden cost of speed…🧵
Most engineers think of an index as a magical cheat sheet for fast reads. They forget that this cheat sheet has to be updated every single time you write to the table.
Your job isn't just to make reads fast. It's to balance read performance with the cost of every INSERT, UPDATE, and DELETE.
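You can watch both sides of the trade in ten lines of SQLite (plan strings are illustrative; exact wording varies by SQLite version):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INT, total REAL)")
con.executemany("INSERT INTO orders (customer_id, total) VALUES (?, ?)",
                [(i % 100, i * 1.5) for i in range(10_000)])

query = "EXPLAIN QUERY PLAN SELECT * FROM orders WHERE customer_id = 7"
plan_before = con.execute(query).fetchone()[-1]   # full table scan

# The index makes this read fast — but every future INSERT, UPDATE, and
# DELETE on orders now pays to maintain this extra B-tree as well.
con.execute("CREATE INDEX idx_customer ON orders(customer_id)")
plan_after = con.execute(query).fetchone()[-1]    # index lookup

print(plan_before, "->", plan_after)
```

Every index you add is a read you sped up and a write you taxed. Measure both.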
Jun 14 • 9 tweets • 2 min read
How would you protect a public API from being overwhelmed?
"I'd count the number of requests from a user's IP address and block them if it goes over a limit."
Simple, right?
No... This simplicity can ruin your user experience...🧵
Flow Control is your friend here.
Most engineers think rate limiting is about punishing bad actors. It's not. It's about ensuring your service remains stable and responsive for all users by managing the flow of traffic.
Your job isn't to block users, it's to manage the pressure they put on your system. Read that again.
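The classic pressure-management tool is a token bucket: it absorbs a legitimate burst while capping the sustained rate, instead of hard-blocking on a raw counter. A minimal sketch:

```python
import time

class TokenBucket:
    """Allows bursts up to `capacity`, caps sustained rate at `rate` req/sec."""
    def __init__(self, rate, capacity):
        self.rate, self.capacity = rate, capacity
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self):
        now = time.monotonic()
        # Refill tokens in proportion to elapsed time, up to capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False          # caller should respond 429, not silently drop

bucket = TokenBucket(rate=5, capacity=10)
results = [bucket.allow() for _ in range(15)]   # rapid burst of 15 requests
```

A burst of 10 sails through; the excess gets a polite 429 with a Retry-After — flow control, not punishment.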
Jun 13 • 8 tweets • 2 min read
Everyone knows "splitting your database" helps scale.
Simple.
But most miss the terrifying truth: your first big "splitting" or "sharding" decision is practically permanent.
Getting it wrong doesn't just create a bug... it can silently doom your entire architecture and ruin your peaceful nights.
That's exactly why I love discussing Database Sharding...🧵
The magic keyword is the Sharding Key.
This isn't just a column you pick to split your data. It is the single most critical decision you will make. It dictates how your system will scale, where your bottlenecks will appear, and whether it can survive its own success.
Your job isn't just to split data. Your job is to choose a key that evenly distributes the future load.
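Here's the difference between a good key and a hot-shard disaster, sketched with a toy hash router (the traffic numbers are made up for illustration):

```python
import hashlib
from collections import Counter

def shard_for(key, n_shards=4):
    """Hash the key so nearby values don't pile onto the same shard."""
    h = int(hashlib.md5(str(key).encode()).hexdigest(), 16)
    return h % n_shards

# Good key (user_id): load spreads evenly across all shards.
by_user = Counter(shard_for(uid) for uid in range(100_000))

# Bad key (country): one dominant value pins 90% of traffic to ONE shard.
requests = ["US"] * 90_000 + ["DE"] * 5_000 + ["JP"] * 5_000
by_country = Counter(shard_for(c) for c in requests)

print(by_user)      # roughly 25,000 per shard
print(by_country)   # one shard eats at least 90,000 requests — a hot shard
```

The even distribution has to hold for *future* load, too — which is why the key choice is where the real thinking happens.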
Jun 13 • 7 tweets • 2 min read
In System Design, I love discussing how someone would scale a web application.
The first answer is always "add more servers and a load balancer." This is where the real fun begins.
It's not about which algorithm you choose (Round Robin, Least Connections). It's about whether you understand that a load balancer's real job is to hide the chaos of a distributed system from the user.
Until it can't...🧵
Your user logs into your app. Their request hits Server A. They browse for a bit.
Then, they click "View Shopping Cart."
The load balancer sends this new request to Server B, which is less busy. Server B has no idea who this user is. It asks them to log in again.
The user is furious. The system is broken.
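The standard fix is to stop keeping session state in server memory at all. A minimal sketch, assuming sessions live in a shared store (a dict stands in for Redis here), so any server can handle any request:

```python
shared_sessions = {}          # stand-in for Redis/Memcached

class Server:
    def __init__(self, name):
        self.name = name

    def handle(self, session_id):
        user = shared_sessions.get(session_id)    # look up centrally, not locally
        return f"{self.name}: hello {user}" if user else f"{self.name}: please log in"

server_a, server_b = Server("A"), Server("B")

# Login lands on Server A, which writes to the SHARED store...
shared_sessions["sess-42"] = "alice"

# ...so when the load balancer routes the next request to Server B,
# the user is still logged in. No sticky sessions required.
print(server_b.handle("sess-42"))
```

Sticky sessions are the other option, but they fight the load balancer's whole purpose; externalized state keeps servers interchangeable.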
Jun 12 • 7 tweets • 2 min read
I love discussing caching strategy in system design.
It’s not about Redis vs. Memcached. It’s about whether folks realise that adding a cache can create more problems than it solves.
Wait…what the…?
Yes, most people see caching as a free performance win. They forget that a cache's primary job is to lie to you by serving old data.
The magic keyword is Cache Invalidation.
This is one of the hardest problems in computer science.
A cache without a clear invalidation strategy is not a performance tool; it's a bug.
Your job isn't just to store data to make reads faster. Your job is to define the precise moment that your stored data becomes a lie, and how you will purge it.
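Here's that "precise moment" made concrete — a cache-aside sketch (the dicts stand in for Redis and the database) where every write purges the entry it just made stale:

```python
cache = {}                            # stand-in for Redis
db = {"user:1": {"name": "Alice"}}    # stand-in for the database

def read(key):
    if key not in cache:
        cache[key] = db[key]          # cache-aside: fill on miss
    return cache[key]

def write(key, value):
    db[key] = value
    cache.pop(key, None)              # invalidate the instant the data becomes a lie

read("user:1")                        # warms the cache
write("user:1", {"name": "Alicia"})   # update + invalidate together
fresh = read("user:1")                # next read refills from the source of truth
```

Delete the `cache.pop` line and the system still "works" — it just serves Alice forever. That's the bug a missing invalidation strategy is.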
Jun 11 • 7 tweets • 2 min read
Let's talk about a classic system design journey.
Your app is getting popular, but the database is getting slow. You correctly identify that heavy read traffic is the problem. You add read replicas.
The database load drops. You celebrate.
A week later, users start complaining about their changes "disappearing."
What happened?
🧵
A user updates their profile picture. The write goes to the main database. Success.
They immediately reload the page. The read request hits a replica.
The replica hasn't received the update yet. The user sees their old picture.
The magic keyword is Replication Lag. It's the silent killer of user trust.
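You can simulate the whole failure, and the classic mitigation (read-your-writes: route a user's reads to the primary right after their own write), in a few lines:

```python
from collections import deque

primary, replica = {}, {}
pending = deque()                 # writes the replica hasn't applied yet

def write(key, value):
    primary[key] = value
    pending.append((key, value))  # replication happens later, not instantly

def replicate_one():
    k, v = pending.popleft()
    replica[k] = v

def read(key, after_own_write=False):
    # Read-your-writes: send this user's reads to the primary, so their
    # own update never "disappears" while the replica catches up.
    source = primary if after_own_write else replica
    return source.get(key)

write("avatar", "new.png")
stale = read("avatar")                        # replica is lagging: None
own = read("avatar", after_own_write=True)    # primary already has it
replicate_one()
caught_up = read("avatar")                    # lag resolved
```

Replicas buy you read throughput at the price of a consistency window — the design question is who is allowed to notice that window.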
Jun 10 • 7 tweets • 2 min read
I love asking candidates about API design for a simple reason.
It’s not about REST vs. GraphQL. It’s about whether they can build systems that survive contact with the real world.
Basically Chaos.
🧵🔽
Your user taps 'Confirm Purchase'.
The network connection dies.
They tap the button again.
Have they just been charged twice?
The magic keyword is Idempotency.
This is the silent killer in distributed systems.
Most engineers think this is a dry, academic term for "you can call the same request twice."
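In practice it's an operational pattern: the client attaches a unique idempotency key per purchase *intent*, and the server stores the result under that key so retries replay the answer instead of the charge. A sketch (the dict stands in for a durable store, and `charge` is a fake payment call):

```python
import uuid

processed = {}        # idempotency_key -> result; stand-in for a durable store

def charge(amount):
    return f"charged ${amount}"          # pretend this hits the payment processor

def confirm_purchase(idempotency_key, amount):
    # Replay of a known key -> return the stored result, never charge twice.
    if idempotency_key in processed:
        return processed[idempotency_key]
    result = charge(amount)
    processed[idempotency_key] = result
    return result

key = str(uuid.uuid4())                  # client generates one key per tap-to-buy intent
first = confirm_purchase(key, 30)
retry = confirm_purchase(key, 30)        # network died, user tapped again
```

Two taps, one key, one charge. In a real system the key-to-result map must survive crashes and be checked atomically — that's where the hard part lives.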
Jun 29, 2023 • 9 tweets • 2 min read
Before joining Google, I appeared for 40+ interviews.
Failed many, cracked a few🔥
Here are my top 7 learnings 🧵
#interviews
1. Ask clarifying questions before diving into solutions.
Asking clarifying questions is crucial: it demonstrates active listening, ensures clear understanding, highlights problem-solving skills, and shows engagement and initiative.