anshuman
Nov 3 · 16 tweets · 4 min read
"Just use OpenAI API"

Until you need:
- Custom fine-tuned models
- <50ms p99 latency
- $0.001/1K tokens (not $1.25/1M input plus $10/1M output)

Then you build your own inference platform.

Here's how to do that:
Most engineers think "build your own" means:
- Rent some GPUs
- Load model with vLLM
- Wrap it in FastAPI
- Ship it

The complexity hits you around week 2.
Remember: You're not building a system to serve one model to one user.

You're building a system that handles HUNDREDS of concurrent requests, across multiple models, with wildly different latency requirements.

That's a fundamentally different problem.
What you actually need:
> A request router that understands model capabilities.
> A dynamic batcher that groups requests without killing latency.
> A KV cache manager that doesn't OOM your GPUs.
> A model instance pool that handles traffic spikes.

And that's just the core components.
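
To make the "dynamic batcher" idea concrete, here is a minimal sketch, assuming requests arrive on an `asyncio.Queue`. The names `collect_batch`, `max_batch_size`, and `max_wait_s` are hypothetical, not from any specific framework. It waits for the first request, then keeps filling the batch until it is full or a short deadline passes, so no request waits longer than `max_wait_s` for company.

```python
import asyncio


async def collect_batch(queue, max_batch_size, max_wait_s):
    """Wait for one request, then fill the batch until full or the deadline passes."""
    batch = [await queue.get()]
    loop = asyncio.get_running_loop()
    deadline = loop.time() + max_wait_s
    while len(batch) < max_batch_size:
        remaining = deadline - loop.time()
        if remaining <= 0:
            break  # deadline hit: ship a partial batch instead of adding latency
        try:
            batch.append(await asyncio.wait_for(queue.get(), timeout=remaining))
        except asyncio.TimeoutError:
            break
    return batch


async def demo(max_batch_size):
    # Three requests are already queued (illustrative payloads only).
    queue = asyncio.Queue()
    for i in range(3):
        queue.put_nowait(f"req-{i}")
    return await collect_batch(queue, max_batch_size=max_batch_size, max_wait_s=0.01)


print(asyncio.run(demo(max_batch_size=8)))  # partial batch after the 10 ms deadline
print(asyncio.run(demo(max_batch_size=2)))  # full batch, returned immediately
```

The deadline is the knob that trades throughput for latency: a longer wait fills bigger batches, a shorter wait keeps queueing delay down.
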
Your <50ms p99 requirement breaks down as:

- Network overhead: 10-15ms (you can't fix this)
- Queueing delay: 5-20ms (if you batch wrong, this explodes)
- First token latency: 20-40ms (model dependent)
- Per-token generation: 10-50ms (grows with context length)

Even at the low end of every range, that adds up to 45ms. You have maybe 5ms of slack. This is why "just throw H100s at it" fails.
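
The budget arithmetic can be checked in a few lines. This is only a sketch that sums the ranges quoted in the breakdown; the dictionary keys are labels chosen for illustration.

```python
# Latency budget check using the ranges from the breakdown (milliseconds).
BUDGET_MS = 50

# (low, high) per component.
components = {
    "network": (10, 15),
    "queueing": (5, 20),
    "first_token": (20, 40),
    "per_token": (10, 50),
}

best_case = sum(low for low, _ in components.values())
worst_case = sum(high for _, high in components.values())

print(f"best case:  {best_case} ms, slack {BUDGET_MS - best_case} ms")
print(f"worst case: {worst_case} ms, over budget by {worst_case - BUDGET_MS} ms")
```
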
btw get this kinda content in your inbox daily - fullstackagents.substack.com

now back to the thread
The first principle of inference platforms:

Continuous batching ≠ Static batching

Static batching waits for 8 requests, then processes them together. Continuous batching processes 8 requests and adds request #9 mid-generation.

vLLM does this. TensorRT-LLM does this. Your FastAPI wrapper doesn't.

This single difference is 3-5x throughput.
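
A toy simulation shows where the gain comes from. This sketch assumes every request is queued at time zero, one decode step emits one token per active sequence, and the output lengths are made up for illustration. It counts decode steps only, so it is not a benchmark and will not reproduce the 3-5x figure.

```python
from collections import deque


def static_batching_steps(lengths, batch_size):
    """Each batch occupies the GPU until its longest sequence finishes."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps


def continuous_batching_steps(lengths, batch_size):
    """A finished sequence frees its slot for the next queued request."""
    queue = deque(lengths)
    active = []  # tokens still to generate for each in-flight sequence
    steps = 0
    while queue or active:
        # Fill any free slots before the next decode step.
        while queue and len(active) < batch_size:
            active.append(queue.popleft())
        steps += 1  # one decode step: one token per active sequence
        active = [r - 1 for r in active if r > 1]
    return steps


# Hypothetical output lengths: two long generations mixed with short ones.
lengths = [100, 10, 10, 10, 100, 10, 10, 10]
static_steps = static_batching_steps(lengths, batch_size=4)
continuous_steps = continuous_batching_steps(lengths, batch_size=4)
print(static_steps, continuous_steps)  # 200 110
```

With static batching the short requests sit idle behind the long one in their batch. With continuous batching their slots are reused as soon as they finish.
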
KV cache memory makes things difficult.

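Llama 70B at 4K context needs about 1.25 GiB of fp16 KV cache per request, so roughly 40 GiB for just 32 concurrent requests (about 320 GiB for a same-size model without grouped-query attention). Your H100 has 80GB total, and the fp16 weights alone need about 140GB.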

PagedAttention (from vLLM) solved this by treating KV cache like virtual memory. Manual implementation? You'll OOM before you understand why.
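
KV cache size follows directly from the model shape, so it is worth computing before buying GPUs. The sketch below assumes Llama 70B's published configuration (80 layers, 8 KV heads under grouped-query attention, head dimension 128) and fp16 storage. The function name is hypothetical.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, batch, bytes_per_elem=2):
    """Bytes of KV cache: keys and values for every layer, head, and token."""
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem  # 2x: K and V
    return per_token * seq_len * batch


GIB = 1024 ** 3

# Assumed config: 80 layers, head_dim 128, 4K context, 32 concurrent requests, fp16.
gqa = kv_cache_bytes(num_layers=80, num_kv_heads=8, head_dim=128, seq_len=4096, batch=32)
# Same shape without grouped-query attention (64 KV heads), for comparison.
mha = kv_cache_bytes(num_layers=80, num_kv_heads=64, head_dim=128, seq_len=4096, batch=32)

print(f"GQA (8 KV heads):  {gqa / GIB:.0f} GiB")
print(f"MHA (64 KV heads): {mha / GIB:.0f} GiB")
```

The cache grows linearly with both context length and concurrency, which is why it has to be managed in blocks rather than preallocated per request.
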
"We have 20 fine-tuned models for different tasks"

Now your platform needs model routing based on user intent.

Dynamic loading and unloading so you don't keep 20 models in memory.

Shared KV cache across similar base models.

LoRA adapter swapping in <100ms.

This is where 90% of DIY inference platforms die.
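
Dynamic loading and unloading usually comes down to a cache policy. Here is a minimal sketch of a least-recently-used adapter pool, assuming a caller-supplied `load_fn`. `AdapterPool` and `load_fn` are hypothetical names, and a real pool would also have to free GPU memory on eviction.

```python
from collections import OrderedDict


class AdapterPool:
    """Keeps at most `capacity` adapters resident and evicts the least recently used."""

    def __init__(self, capacity, load_fn):
        self.capacity = capacity
        self.load_fn = load_fn  # hypothetical loader, e.g. reads adapter weights from disk
        self.resident = OrderedDict()

    def get(self, name):
        if name in self.resident:
            self.resident.move_to_end(name)  # mark as most recently used
            return self.resident[name]
        if len(self.resident) >= self.capacity:
            self.resident.popitem(last=False)  # evict the least recently used adapter
        adapter = self.load_fn(name)
        self.resident[name] = adapter
        return adapter


loads = []


def fake_load(name):
    # Stand-in for a real load; records which adapters were actually loaded.
    loads.append(name)
    return f"weights:{name}"


pool = AdapterPool(capacity=2, load_fn=fake_load)
for task in ["summarize", "classify", "summarize", "translate"]:
    pool.get(task)

print(list(pool.resident))  # ['summarize', 'translate']
print(loads)                # ['summarize', 'classify', 'translate']
```
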
Use the OpenAI API when you're under 100K requests/month, using standard models, can tolerate 500ms+ latency, and can live with a cost per request around 10x higher than raw compute.

Build your own when you have custom models, are doing 500K+ requests/month, need sub-100ms p99, or when cost optimization actually matters.

The break-even is usually around $5-10K/month in API spend.
Let's do the actual math:

OpenAI GPT-5 pricing: $1.25 per 1M input tokens, $10 per 1M output tokens

1M requests × (1K input tokens + 500 output tokens each) = 1B input tokens + 500M output tokens = $1,250 input + $5,000 output = $6,250

Your H100 inference platform at $2/hour: 1M requests at 100 req/sec = 2.8 hours = $5.60 in compute.

But you forgot engineering time ($50K to build), maintenance ($10K/month), and the 6 months to break even.
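
The same numbers as a script, so you can plug in your own volume. Prices, request shape, and the 100 req/sec throughput are taken from the tweets above. Engineering and maintenance costs are deliberately left out of this sketch.

```python
# API pricing quoted above, in dollars per 1M tokens.
API_INPUT_PER_M = 1.25
API_OUTPUT_PER_M = 10.00

requests = 1_000_000
input_tokens = 1_000   # per request
output_tokens = 500    # per request

api_cost = requests * (
    input_tokens * API_INPUT_PER_M + output_tokens * API_OUTPUT_PER_M
) / 1_000_000

# Self-hosted: one H100 at $2/hour, assumed to sustain 100 requests/sec.
gpu_hours = requests / 100 / 3600
gpu_cost = gpu_hours * 2.00

print(f"API:     ${api_cost:,.2f}")
print(f"Compute: ${gpu_cost:,.2f} over {gpu_hours:.1f} GPU-hours")
```

Unrounded, the compute figure comes out near $5.56. The $5.60 above rounds the runtime to 2.8 hours first.
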
Production inference platforms have four layers:

1. Request handling: load balancer, rate limiter, queue.
2. Orchestration: model router, dynamic batcher, priority scheduler.
3. Inference engine: vLLM/TRT-LLM, KV cache manager, multi-GPU coordinator.
4. Observability: per-component latency, GPU utilization, cost per token.

Most engineers build layers 1 and 3, then wonder why production breaks.
The mistakes that kill DIY inference platforms:

> Ignoring queueing theory. Your GPU isn't the bottleneck - your queue is. Requests pile up faster than you can batch them.

> Optimizing throughput over latency. Sure you hit 1000 tokens/sec in aggregate, but user experience is terrible because individual requests wait.

> Not measuring per-token latency. Your p99 looks fine until you realize tokens 50-100 are taking 200ms each.
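
Per-token latency is easy to measure if you record a timestamp for each streamed token. A minimal sketch on a synthetic trace (the timestamps are made up to mimic the slowdown described above, and the helper names are hypothetical):

```python
import math


def inter_token_gaps_ms(timestamps_ms):
    """Gap between consecutive token arrival times."""
    return [b - a for a, b in zip(timestamps_ms, timestamps_ms[1:])]


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Synthetic trace: tokens 1-50 arrive every 20 ms, tokens 51-100 every 200 ms.
timestamps = [0]
for i in range(1, 101):
    timestamps.append(timestamps[-1] + (20 if i <= 50 else 200))

gaps = inter_token_gaps_ms(timestamps)
early, late = gaps[:50], gaps[50:]

print("p50 over all tokens:", percentile(gaps, 50), "ms")
print("p99, tokens 1-50:   ", percentile(early, 99), "ms")
print("p99, tokens 51-100: ", percentile(late, 99), "ms")
```

Splitting the gaps by token position is what exposes the problem: the median over all tokens still looks healthy while the second half of every response crawls.
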
Here's where it gets interesting: speculative decoding, prefix caching, and continuous batching work AGAINST each other.

Speculative decoding wants more compute upfront for faster generation. Prefix caching wants more memory to reuse common contexts. Continuous batching wants shorter sequences for better throughput.

Optimize one, degrade the others. This tradeoff doesn't exist when you're just calling OpenAI's API.
The production checklist for inference platforms:

> Use continuous batching (vLLM or TensorRT-LLM, not raw PyTorch).
> Implement request prioritization from day one.
> Monitor per-component latency, not just end-to-end.
> Auto-scale based on queue depth, not CPU.
> Track both $/token AND tokens/sec.
> Have model hot-swapping ready.
> Plan for 10x traffic spikes.
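
For the queue-depth item, one simple scaling rule is "enough replicas to drain the current queue within a target time". A sketch under that assumption follows. The function name, parameters, and default bounds are hypothetical, and per-replica throughput is something you would measure, not guess.

```python
import math


def desired_replicas(queue_depth, per_replica_rps, target_drain_s,
                     min_replicas=1, max_replicas=16):
    """Replicas needed to drain the current queue within target_drain_s seconds."""
    needed = math.ceil(queue_depth / (per_replica_rps * target_drain_s))
    # Clamp so the pool never scales to zero or past the GPU quota.
    return max(min_replicas, min(max_replicas, needed))


print(desired_replicas(queue_depth=600, per_replica_rps=100, target_drain_s=2.0))     # 3
print(desired_replicas(queue_depth=0, per_replica_rps=100, target_drain_s=2.0))       # 1
print(desired_replicas(queue_depth=50_000, per_replica_rps=100, target_drain_s=2.0))  # 16
```
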
That's it for today.

Building an inference platform is a 6-month engineering project with hidden costs everywhere.

But when you hit scale? It pays for itself in weeks.

The key is knowing when to build vs when to rent.

See ya tomorrow!
