Language models like GPT-3 and Codex can generate code, but they can miss your intent, and their code can have bugs. Can we improve that? Perhaps even guarantee the absence of certain errors? Come check out Synchromesh at #ICLR2022 tomorrow!
We start by identifying two broad classes of mistakes that these models can make:
1- Conceptual errors, where the model misses or ignores parts of the specification
2- Implementation errors, where the output fails to parse, type-check, or execute, or violates other desirable constraints
Conceptual errors are highly influenced by which examples we give these models in their prompt. Few-shot examples can bias the model in either the right or wrong direction. It's often possible to get the output we want by just giving better examples.
Given a bank of training examples that doesn't fit entirely in the prompt, how should we select the most relevant ones?
GPT-3 was originally tested with random examples. Subsequent papers have explored using similarity as measured by a natural language paraphrase model.
So given "Plot a histogram of movie running times", an example with input "show me number of cars grouped by their weight" should be very relevant - the code will be identical except for column names.

But... that's not the kind of similarity that vanilla paraphrase models capture.
As a result, they often fill the prompt with irrelevant examples just because those share more nouns with the query, or they fail to recognize that "histogram" and "bar chart with counts" likely mean the same thing.
We propose Target Similarity Tuning: fine-tuning the similarity model to predict how similar the target outputs are, given their natural language descriptions. TST often retrieves much better examples and can drastically improve results on its own! [Image: example of TST; the same query, "which city has the hig…"]
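Here's a minimal sketch of the TST recipe on top of the sentence-transformers library. The token-overlap program similarity is just a stand-in proxy, and the example bank is made up; the real metric and data differ:

```python
# Sketch of Target Similarity Tuning (TST): fine-tune a paraphrase model so
# that similarity of *descriptions* predicts similarity of *target programs*.
from itertools import combinations
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

def program_similarity(p1: str, p2: str) -> float:
    """Crude proxy: Jaccard overlap of code tokens (illustrative only)."""
    t1, t2 = set(p1.split()), set(p2.split())
    return len(t1 & t2) / max(1, len(t1 | t2))

# Hypothetical bank of (natural-language description, target program) pairs.
bank = [
    ("plot a histogram of movie running times", 'hist(movies["Running_Time"])'),
    ("show me number of cars grouped by their weight", 'hist(cars["Weight"])'),
    ("list the names of all employees", 'employees["Name"]'),
]

# Train the text-similarity model to regress the program similarity.
train = [
    InputExample(texts=[u1, u2], label=program_similarity(p1, p2))
    for (u1, p1), (u2, p2) in combinations(bank, 2)
]
model = SentenceTransformer("paraphrase-MiniLM-L6-v2")
loader = DataLoader(train, shuffle=True, batch_size=16)
model.fit(train_objectives=[(loader, losses.CosineSimilarityLoss(model))],
          epochs=1)
# At inference time, embed the new query and retrieve the nearest
# descriptions from the bank to build the few-shot prompt.
```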
Even then, there's no guarantee about the code the model generates. For example, when generating SQL, it might use database columns that don't exist, mess up table aliases (which it defines itself in the generated code), or attempt impossible joins or aggregations...
Or, when generating data visualizations in Vega-Lite, it might facet on a real-valued column with 10k distinct values. That's technically not wrong, but it will make the renderer try to allocate too much memory and crash, giving the user a sad error message instead of their plot...
We can prevent all of that with Constrained Semantic Decoding (CSD), a technique we propose for only allowing the model to generate programs from a set of valid ones.

This is done by combining decoding with a constraint engine that reasons about incomplete programs.
We propose the abstraction of a Completion Engine, which parses the partial program and outputs an arbitrary regular expression for the next tokens. Once the model has produced output that maximally matches that regex, the engine is called again.
We derive syntactic completions for free from the language's grammar. The completion engine can further apply semantic, context-sensitive constraints given the partial AST and the user's context (e.g., the database or dataframe).
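To make the abstraction concrete, here's a toy completion engine for a tiny SQL fragment (the interface names and the language are illustrative, not the paper's code):

```python
# Sketch of the Completion Engine abstraction: given the program prefix
# generated so far, return a regex constraining what may come next.
import re
from abc import ABC, abstractmethod

class CompletionEngine(ABC):
    @abstractmethod
    def complete(self, prefix: str) -> re.Pattern:
        """Parse the partial program; return a regex for the next tokens."""

class ToySQLProjectionEngine(CompletionEngine):
    """Only allows `SELECT <known column> FROM <known table>` queries.
    (A real engine would also signal when the program is complete.)"""
    def __init__(self, columns, tables):
        self.columns, self.tables = columns, tables

    def complete(self, prefix: str) -> re.Pattern:
        if prefix == "":
            return re.compile(r"SELECT ")
        if prefix.endswith("SELECT "):
            # Semantic constraint: the column must exist in the user's DB.
            return re.compile("|".join(map(re.escape, self.columns)))
        if any(prefix.endswith(c) for c in self.columns):
            return re.compile(r" FROM ")
        return re.compile("|".join(map(re.escape, self.tables)))

engine = ToySQLProjectionEngine(columns=["name", "age"], tables=["users"])
print(engine.complete("SELECT ").pattern)  # name|age
```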

Many rich constraints can be encoded in just a few lines of Python. [Image: table with example constraints that can be encoded in CSD]
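For instance, the facet-cardinality issue from above fits in a few lines. This is a sketch: the `MAX_FACETS` threshold and the pandas dataframe are assumptions.

```python
# Example constraint a completion engine could apply while completing the
# `facet` field of a partial Vega-Lite spec: only offer low-cardinality
# columns from the user's dataframe `df`.
MAX_FACETS = 50  # illustrative threshold

def facetable_columns(df):
    """Columns that are safe to offer when completing a facet encoding."""
    return [c for c in df.columns if df[c].nunique() <= MAX_FACETS]

# The engine then emits a regex matching only these column names, so the
# model physically cannot facet on a 10k-distinct-value column.
```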
But enforcing these constraints during sampling is not trivial. The programming language and the language model can have arbitrarily misaligned tokens, and regular lexers/parsers don't like dealing with incomplete tokens (like an open quote without a closing quote). [Image: example of Vega-Lite code, highlighting that Byte-Pair Encoding…]
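You can see the misalignment directly. A small demo, assuming the GPT-2 BPE via the tiktoken library (exact splits depend on the vocabulary):

```python
# BPE pieces straddle the programming language's token boundaries.
import tiktoken

enc = tiktoken.get_encoding("gpt2")
code = '{"mark": "bar"}'
pieces = [enc.decode([t]) for t in enc.encode(code)]
print(pieces)
# Pieces like '{"' span two JSON tokens at once, and a program prefix can
# end mid-token (e.g., inside a string, with the quote still unclosed).
```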
That's the job of our CSD algorithm! Given a completion engine, it samples from the language model while enforcing the constraints, making sure to only call the completion engine at its own defined token boundaries, but letting the model output long BPE tokens as it normally does.
CSD uses a neat trick based on Brzozowski derivatives of regular expressions to do this efficiently.
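Here's a self-contained toy of the derivative trick (Python 3.10+, and not the paper's implementation): the derivative of a regex r with respect to a character c matches exactly the suffixes s such that c followed by s matches r, so feeding a candidate BPE token through character by character tells us whether it stays consistent with the constraint.

```python
# Toy Brzozowski derivatives over a tiny regex AST (illustrative only).
from dataclasses import dataclass

class Rx: pass

@dataclass
class Empty(Rx): pass            # matches nothing
@dataclass
class Eps(Rx): pass              # matches the empty string
@dataclass
class Chr(Rx): c: str            # matches one character
@dataclass
class Cat(Rx): a: Rx; b: Rx      # concatenation
@dataclass
class Alt(Rx): a: Rx; b: Rx      # alternation
@dataclass
class Star(Rx): a: Rx            # Kleene star

def nullable(r: Rx) -> bool:
    """Does r match the empty string?"""
    match r:
        case Eps() | Star(_): return True
        case Cat(a, b): return nullable(a) and nullable(b)
        case Alt(a, b): return nullable(a) or nullable(b)
        case _: return False

def deriv(r: Rx, c: str) -> Rx:
    """Regex matching { s : c + s is matched by r }."""
    match r:
        case Chr(d): return Eps() if c == d else Empty()
        case Cat(a, b):
            left = Cat(deriv(a, c), b)
            return Alt(left, deriv(b, c)) if nullable(a) else left
        case Alt(a, b): return Alt(deriv(a, c), deriv(b, c))
        case Star(a): return Cat(deriv(a, c), Star(a))
        case _: return Empty()

def is_empty(r: Rx) -> bool:
    """Does r match no strings at all?"""
    match r:
        case Empty(): return True
        case Cat(a, b): return is_empty(a) or is_empty(b)
        case Alt(a, b): return is_empty(a) and is_empty(b)
        case _: return False

def token_ok(r: Rx, token: str) -> bool:
    """Can this multi-character BPE token extend a match of r?"""
    for ch in token:
        r = deriv(r, ch)
        if is_empty(r):
            return False
    return True

# r = "ab|ac": the BPE token "ac" is fine, "ad" is not.
r = Alt(Cat(Chr("a"), Chr("b")), Cat(Chr("a"), Chr("c")))
print(token_ok(r, "ac"), token_ok(r, "ad"))  # True False
```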

While one could simply force the model to output certain tokens (e.g., { instead of {"), that makes the model degenerate, since that's not how the training set or the prompt were tokenized.
With Synchromesh (CSD + TST), we get improvements both in accuracy and in the rate of valid programs across three languages: SQL, Vega-Lite, and SMCalFlow. Synchromesh's impact is also larger when generating longer programs. [Image: table with accuracies for GPT-3 and Codex when paired with C…]
This is all done without fine-tuning these large models. We implemented everything on top of the public OpenAI API!
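For the curious, here's one plausible shape for that decoding loop on top of the legacy completions API. This is a hedged sketch: the model name, the top-5 cutoff, and the `allowed` predicate (standing in for the CSD check above) are assumptions, not the exact implementation.

```python
# Sketch of constrained sampling via a completions-style API.
import openai

def constrained_sample(prompt, allowed, max_len=256):
    out = ""
    while len(out) < max_len:
        resp = openai.Completion.create(
            model="code-davinci-002", prompt=prompt + out,
            max_tokens=1, logprobs=5, temperature=0,
        )
        top = resp["choices"][0]["logprobs"]["top_logprobs"][0]
        # Pick the most likely candidate token that passes the constraint.
        for piece, _ in sorted(top.items(), key=lambda kv: -kv[1]):
            if allowed(out, piece):
                out += piece
                break
        else:
            break  # no candidate is valid; in practice, widen the search
    return out
```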
This was done during my internship last year with @ProseMsft and fantastic collaborators at @MSFTResearch: @Skiminok, Vu Le, Ashish Tiwari, @gustavoas, Chris Meek, @SumitGulwani

arxiv.org/abs/2201.11227
