Eugene Yan · Apr 29
Tenets from Duolingo's push to be AI-first

• AI will be everywhere in our product
• Start with AI for every task
• Spend 10% of your time learning
• Share what you learn
• Avoid overbuilding
• Build and experiment carefully
• Technical excellence still matters
Shopify had a similar push earlier this month.
I've adopted similar tenets at work, and folks who are hungry to learn and apply have seen gains of 2-20x.

Similar trends in the past include:
• Proprietary (SAS/SPSS) -> OSS/Python
• Adopting Spark
• Machine learning, then deep learning

AI-first is another shift; don't overlook it.

More from @eugeneyan

Oct 30, 2024
Evaluating LLM output is hard. It's the bottleneck to scaling AI products for many teams.

A key mistake is defining eval criteria without actually LOOKING AT THE DATA. This leads to irrelevant or unrealistic criteria and wasted effort.

Thus, I built AlignEval: AlignEval.com
The key insight: in addition to aligning AI to human context (prompting, RAG), we also need to calibrate human criteria to actual AI output.

By working backwards from the data, AlignEval helps you build better evals.

Screenshots & how it was built here: eugeneyan.com/writing/aligne…
AlignEval simplifies building LLM-evaluators (a rough sketch of the workflow follows the list):

• Upload a CSV with columns for input, output, and optionally, label
• LOOK AT YOUR DATA and label it with pass/fail
• Define eval criteria, run the LLM-evaluator, then eval the evaluator
• Improve your LLM-evaluator with "Optimize Mode"
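For a rough sense of that loop (a sketch of the workflow, not AlignEval's implementation): label a CSV of input/output pairs, ask an LLM to grade each row against your criteria, then score the evaluator against your own labels. The `CRITERIA` string, the `samples.csv` schema, and the `call_llm` stub are all assumptions to be swapped for your own.

```python
# Sketch only: measure how often an LLM-evaluator agrees with your human pass/fail labels.
import csv

CRITERIA = "The output must be factually consistent with the input."  # your eval criteria

def call_llm(prompt: str) -> str:
    # Hypothetical stand-in for whichever LLM client you use.
    raise NotImplementedError("plug in your LLM client here")

def llm_evaluate(row: dict) -> str:
    prompt = (f"Criteria: {CRITERIA}\n\n"
              f"Input: {row['input']}\nOutput: {row['output']}\n\n"
              "Answer 'pass' or 'fail'.")
    return "pass" if "pass" in call_llm(prompt).lower() else "fail"

with open("samples.csv") as f:  # columns: input, output, label (pass/fail)
    rows = list(csv.DictReader(f))

# Eval the evaluator: agreement between the LLM-evaluator and your labels.
agreement = sum(llm_evaluate(r) == r["label"] for r in rows) / len(rows)
print(f"evaluator/human agreement: {agreement:.0%}")
```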
Jun 18, 2023
Our paper club recently revisited some of the earlier language modeling papers. Here's a one-liner for each.

---

Attention: Query, Key, and Value are all you need*

*Also position embeddings, multiple heads, feed-forward layers, skip-connections, etc

arxiv.org/abs/1706.03762
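To make the one-liner concrete, here's a minimal numpy sketch of scaled dot-product attention (single head, no masking or position embeddings); the shapes and random example are illustrative.

```python
import numpy as np

def scaled_dot_product_attention(Q: np.ndarray, K: np.ndarray, V: np.ndarray) -> np.ndarray:
    """Q: (n_q, d_k), K: (n_k, d_k), V: (n_k, d_v) -> (n_q, d_v)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # query-key similarities, scaled by sqrt(d_k)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V                              # weighted sum of values

# Example: 4 query positions attending over 6 key/value positions
rng = np.random.default_rng(0)
out = scaled_dot_product_attention(rng.normal(size=(4, 8)),
                                   rng.normal(size=(6, 8)),
                                   rng.normal(size=(6, 16)))
print(out.shape)  # (4, 16)
```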
GPT: Decoder is all you need.

(Also, pre-training + finetuning 💪)

cdn.openai.com/research-cover…
BERT: Encoder is all you need. Left-to-right language modeling is NOT all you need.

(Also, pre-training + finetuning 📈)

arxiv.org/abs/1810.04805
May 8, 2023
Ran a simple benchmark (Mandelbrot sets) between Mojo & Python. The speedup is impressive, and Mojo still benefits from Python's libraries.

• Python: 1,184ms
• Mojo: 27ms 🤯
• Python (vectorized): 240ms
• Mojo (vectorized): 2ms
Here's a GitHub gist if you want to try it yourself: gist.github.com/eugeneyan/1d2e…

(Couldn't download the notebook for some reason)
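For reference, a rough sketch of the kind of pure-Python baseline being timed. The viewport, grid size, iteration cap, and timing harness here are assumptions, so the absolute numbers won't match the ones above; see the gist for the actual code.

```python
import time

def mandelbrot_kernel(c: complex, max_iter: int = 200) -> int:
    """Return the iteration at which z escapes |z| > 2, or max_iter if it never does."""
    z = 0j
    for i in range(max_iter):
        z = z * z + c
        if abs(z) > 2:
            return i
    return max_iter

def mandelbrot(width: int = 480, height: int = 320, max_iter: int = 200) -> list[list[int]]:
    """Compute escape counts over the classic viewport [-2, 0.6] x [-1.5, 1.5]."""
    xmin, xmax, ymin, ymax = -2.0, 0.6, -1.5, 1.5
    grid = []
    for j in range(height):
        y = ymin + (ymax - ymin) * j / height
        grid.append([mandelbrot_kernel(complex(xmin + (xmax - xmin) * i / width, y), max_iter)
                     for i in range(width)])
    return grid

start = time.perf_counter()
mandelbrot()
print(f"pure Python: {(time.perf_counter() - start) * 1000:.0f} ms")
```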
Also, hear what Jeremy Howard has to say about Mojo.

May 7, 2023
An insider's view on China's scale and tech, the 996 work ethic, and Alibaba's acquisition of Lazada. corecursive.com/software-world…

Years later, I'm still boggled by the scale and how we had to use a completely different tech stack (spoiler alert: it's mostly Ali Java).
Yea, there were one-click deploys, rollbacks, A/B tests—you name it.

Also, there were SQL commands that were both powerful and scary (and borderline questionable 🙈). Any data analyst on the street became a median data scientist.
The work ethic was punishing. Burnout became more common. While most Asians could endure it, folks from cultures that emphasized work-life balance struggled.
Apr 11, 2023
Over the past few weekends, I've experimented with using LLMs to build a simple assistant.

Here's a write-up of what I built, how I built it, and its potential. Also, some shortcomings of embedding retrieval, with solutions from search & recsys.

eugeneyan.com/writing/llm-ex…
Here's my first project dabbling with LLMs for the humble `summarize`

Moving on to using tools, specifically SQL and search

Apr 3, 2023
This weekend, I had a blast building a personal board of advisors using BeautifulSoup, @LangChainAI, and @pinecone.

`/board` provides advice to questions on tech, leadership, and life. It also provides links to sources for further reading!
`/ask-ey` does something similar for my own site, eugeneyan.com. And because I'm more familiar with my own writing, I can better spot shortfalls such as not answering based on a source when expected, or when a source is irrelevant.
A high-level overview (a rough sketch follows the list):
• Scrape content from board of advisors (requests, BeautifulSoup)
• Embed content aka sources (OpenAI text-embedding-ada-002)
• Embed queries & find similar sources (Pinecone)
• Provide sources as context for the LLM to synthesize a response (LangChain)
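Here's a minimal sketch of that pipeline under a few assumptions: it swaps Pinecone for an in-memory cosine-similarity search, calls the OpenAI embeddings REST endpoint directly instead of going through LangChain, and the advisor URLs, query, and prompt are placeholders.

```python
import os
import numpy as np
import requests
from bs4 import BeautifulSoup

EMBED_URL = "https://api.openai.com/v1/embeddings"
HEADERS = {"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"}

def scrape(url: str) -> str:
    """Fetch a page and keep only its visible paragraph text."""
    soup = BeautifulSoup(requests.get(url, timeout=10).text, "html.parser")
    return "\n".join(p.get_text(" ", strip=True) for p in soup.find_all("p"))

def embed(texts: list[str]) -> np.ndarray:
    """Embed a list of texts with text-embedding-ada-002."""
    resp = requests.post(EMBED_URL, headers=HEADERS, timeout=30,
                         json={"model": "text-embedding-ada-002", "input": texts})
    return np.array([d["embedding"] for d in resp.json()["data"]])

# 1) Scrape and embed the sources (hypothetical advisor URLs)
urls = ["https://example.com/essay-1", "https://example.com/essay-2"]
sources = [scrape(u) for u in urls]
source_vecs = embed(sources)

# 2) Embed the query and find the most similar sources (in place of Pinecone)
query = "How should I think about technical leadership?"
q = embed([query])[0]
sims = source_vecs @ q / (np.linalg.norm(source_vecs, axis=1) * np.linalg.norm(q))
top = sims.argsort()[::-1][:2]

# 3) Provide the top sources as context for the LLM to synthesize a response
context = "\n\n---\n\n".join(sources[i] for i in top)
prompt = f"Answer using only the sources below.\n\n{context}\n\nQuestion: {query}"
print(prompt[:500])  # pass this to your LLM of choice
```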
