One of the first things I wanted when I got into DSPy was to combine it with vLLM's offline batch inference.
But the whole DSPy stack is built around single calls, with retries, asynchronicity, and so on.
Still, I wanted an easy way to use DSPy with performant, locally hosted models.
After much fiddling and tinkering, I found the special incantation to make vLLM and DSPy work together. It was a bit too long to share as a snippet, so I wrapped it up into a library. It's a single-file __init__.py of about 500 LOC, so it should not be too hard for me (us :D) to maintain. And it is quite powerful!
In my performance test, raw vLLM ran my task in 65 seconds; going through DSPy, it was 68 seconds.
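For context, "raw vLLM" here means the plain offline batch API: load the model once, then push a whole list of prompts through a single generate() call. A minimal sketch (the model name and prompts are placeholders, not my actual benchmark task):

from vllm import LLM, SamplingParams

# Plain offline batch inference with vLLM: one model load, one batched generate() call.
llm = LLM(model="Qwen/Qwen2.5-7B-Instruct")  # placeholder model
params = SamplingParams(temperature=0.0, max_tokens=256)

prompts = [f"Summarize in one line: {doc}" for doc in ["doc one ...", "doc two ..."]]
outputs = llm.generate(prompts, params)

for out in outputs:
    print(out.outputs[0].text)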
So here it is:
uv venv
source .venv/bin/activate
uv python install 3.12 --default
uv pip install ovllm
PS: vLLM 0.10.0 has a dependency that does not work with Python 3.13 for now, hence pinning to 3.12 above.
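Once installed, the DSPy side is the usual configure-and-predict flow. Here is a sketch of the wiring; note that the import and constructor are my assumption about ovllm's entry point, not its documented API, so check the project README for the real one:

import dspy
from ovllm import OVLLM  # hypothetical name: see ovllm's README for the actual entry point

# Assumption: ovllm hands DSPy an LM backed by vLLM's offline batch engine.
lm = OVLLM("Qwen/Qwen2.5-7B-Instruct")  # placeholder model
dspy.configure(lm=lm)

summarize = dspy.Predict("document -> summary")
print(summarize(document="vLLM pushes whole batches of prompts through the GPU at once.").summary)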
Wouldn't it be nice if I could use a full-size LLM for autocomplete instead of the small, dumb Copilot/Cursor-type models?
The speed + quality of Kimi-K2 on Groq makes it possible!
So, in one hour, just before starting my day at work, I vibe-coded a VS Code extension. Here is how 🧵
PS: it only took about 5 prompts.
1. For the first prompt, I opened a fresh conversation with all four: o3-pro, gemini 2.5-pro, opus 4 (extended thinking), and grok 4-heavy.
The prompt was:
I want to be able to press a keyboarb shortcut in vscode and the whole content of my current code (plus the other opened tab), plus instructions to a open ai end point would be sent to them and the ai would automcomplete and add code for the 'section'. the ai will decide what section its confident to predict how should we do it?
2. For the second prompt, I took all four outputs from the first prompt, put them together in one text file, and copy-pasted all of that into a fresh chat with all four models, with this prompt:
I asked 4 llm: I want to be able to press a keyboarb shortcut in vscode and the whole content of my current code (plus the other opened tab), plus instructions to a open ai end point would be sent to them and the ai would automcomplete and add code for the 'section'. the ai will decide what section its confident to predict how should we do it?

below are the responses, considering plus your own judgement. Help me make that work. I want to use groq this is a snippet they recommend:

from groq import Groq

import { Groq } from 'groq-sdk';

const groq = new Groq();
const chatCompletion = await groq.chat.completions.create({
  "messages": [
    {
      "role": "user",
      "content": ""
    }
  ],
  "model": "moonshotai/kimi-k2-instruct",
  "temperature": 0.6,
  "max_completion_tokens": 4096,
  "top_p": 1,
  "stream": true,
  "stop": null
});
for await (const chunk of chatCompletion) {
  process.stdout.write(chunk.choices[0]?.delta?.content || '');
}

I will be fiddling with the right prompts, just focus on the mechanics. My main goal is to use llms (because they are Now fast enough as automcomplete instead of cursor tab or github copilot type of things).
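For reference, the Python equivalent of that Groq streaming call looks roughly like this (a sketch using the groq Python SDK; the prompt content is just an illustration, and the extension itself uses the TypeScript SDK as in the snippet above):

from groq import Groq

client = Groq()  # reads GROQ_API_KEY from the environment

# Stream a completion from Kimi-K2, mirroring the JS snippet above.
stream = client.chat.completions.create(
    model="moonshotai/kimi-k2-instruct",
    messages=[{"role": "user", "content": "Complete this function:\ndef fib(n):"}],
    temperature=0.6,
    max_tokens=4096,  # the JS snippet uses max_completion_tokens; this is the older equivalent
    stream=True,
)
for chunk in stream:
    print(chunk.choices[0].delta.content or "", end="")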
I needed to understand exactly what DSPy was sending to the LLMs on my behalf before I could trust it.
If you are like me, just run adapter.format yourself (the default adapter is ChatAdapter) and you will see exactly what is being sent. If you do not like the result, implement your own adapter; DSPy is modular and fully supports that.
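A minimal sketch of that inspection, assuming the DSPy 2.x adapter interface where format takes the signature, demos, and inputs (the exact method signature may differ between versions, so check yours):

import dspy

class QA(dspy.Signature):
    """Answer the question briefly."""
    question: str = dspy.InputField()
    answer: str = dspy.OutputField()

# Build the messages exactly as DSPy would send them to the LM.
adapter = dspy.ChatAdapter()
messages = adapter.format(QA, demos=[], inputs={"question": "What is 2 + 2?"})

for message in messages:
    print(f"--- {message['role']} ---")
    print(message["content"])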