Language models like GPT-3 and Codex can generate code, but they can miss your intent, and their code can have bugs. Can we improve that? Perhaps even guarantee the absence of certain errors? Come check out Synchromesh at #ICLR2022 tomorrow!
We start by identifying two broad classes of mistakes that these models can make:
1- Conceptual errors, where the model misses or ignores parts of the specification
2- Implementation errors, where the output fails to parse, type-check, or execute, or violates other desirable constraints
Conceptual errors are highly influenced by which examples we give these models in their prompt. Few-shot examples can bias the model in either the right or wrong direction. It's often possible to get the output we want by just giving better examples.
Given a bank of training examples that doesn't fit entirely in the prompt, how should we select the most relevant ones?
GPT-3 was originally tested with random examples. Subsequent papers have explored using similarity as measured by a natural language paraphrase model.
So given "Plot a histogram of movie running times", an example with input "show me number of cars grouped by their weight" should be very relevant - the code will be identical except for column names.

But... that's not the kind of similarity that vanilla paraphrase models capture.
As a result, they often fill the prompt with irrelevant examples just because those share more nouns with the query, or they fail to recognize that "histogram" and "bar chart with counts" likely mean the same thing.
We propose Target Similarity Tuning: fine-tuning the similarity model to predict how similar the target outputs are, given their natural language descriptions. TST often retrieves much better examples and can drastically improve results on its own! [Image: example of TST; the same query, "which city has the hig…"]
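Here's a minimal sketch of the TST recipe on top of the sentence-transformers library. The token-overlap program similarity is just a stand-in proxy, and the example bank is made up; the real metric and data differ:

```python
# Sketch of Target Similarity Tuning (TST): fine-tune a paraphrase model so
# that similarity of *descriptions* predicts similarity of *target programs*.
from itertools import combinations
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

def program_similarity(p1: str, p2: str) -> float:
    """Crude proxy: Jaccard overlap of code tokens (illustrative only)."""
    t1, t2 = set(p1.split()), set(p2.split())
    return len(t1 & t2) / max(1, len(t1 | t2))

# Hypothetical bank of (natural-language description, target program) pairs.
bank = [
    ("plot a histogram of movie running times", 'hist(movies["Running_Time"])'),
    ("show me number of cars grouped by their weight", 'hist(cars["Weight"])'),
    ("list the names of all employees", 'employees["Name"]'),
]

# Train the text-similarity model to regress the program similarity.
train = [
    InputExample(texts=[u1, u2], label=program_similarity(p1, p2))
    for (u1, p1), (u2, p2) in combinations(bank, 2)
]
model = SentenceTransformer("paraphrase-MiniLM-L6-v2")
loader = DataLoader(train, shuffle=True, batch_size=16)
model.fit(train_objectives=[(loader, losses.CosineSimilarityLoss(model))],
          epochs=1)
# At inference time, embed the new query and retrieve the nearest
# descriptions from the bank to build the few-shot prompt.
```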
Even then, there's no guarantee about the code the model generates. For example, when generating SQL, it might use database columns that don't exist, mess up table aliases (which it defines itself in the generated code), or attempt impossible joins or aggregations...
Or, when generating data visualizations in Vega-Lite, it might facet on a real-valued column with 10k distinct values. That's technically not wrong, but it will make the renderer try to allocate too much memory and crash, giving the user a sad error message instead of their plot...
We can prevent all of that with Constrained Semantic Decoding (CSD), a technique we propose for only allowing the model to generate programs from a set of valid ones.

This is done by combining decoding with a constraint engine that reasons about incomplete programs.
We propose the abstraction of a Completion Engine, which parses the partial program and outputs an arbitrary regular expression for the next tokens. Once the model has produced output that maximally matches that regex, the engine is called again.
We derive syntactic completions for free from the language's grammar. The completion engine can further apply semantic, context-sensitive constraints given the partial AST and the user's context (e.g., the database or dataframe).
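To make the abstraction concrete, here's a toy completion engine for a tiny SQL fragment (the interface names and the language are illustrative, not the paper's code):

```python
# Sketch of the Completion Engine abstraction: given the program prefix
# generated so far, return a regex constraining what may come next.
import re
from abc import ABC, abstractmethod

class CompletionEngine(ABC):
    @abstractmethod
    def complete(self, prefix: str) -> re.Pattern:
        """Parse the partial program; return a regex for the next tokens."""

class ToySQLProjectionEngine(CompletionEngine):
    """Only allows `SELECT <known column> FROM <known table>` queries.
    (A real engine would also signal when the program is complete.)"""
    def __init__(self, columns, tables):
        self.columns, self.tables = columns, tables

    def complete(self, prefix: str) -> re.Pattern:
        if prefix == "":
            return re.compile(r"SELECT ")
        if prefix.endswith("SELECT "):
            # Semantic constraint: the column must exist in the user's DB.
            return re.compile("|".join(map(re.escape, self.columns)))
        if any(prefix.endswith(c) for c in self.columns):
            return re.compile(r" FROM ")
        return re.compile("|".join(map(re.escape, self.tables)))

engine = ToySQLProjectionEngine(columns=["name", "age"], tables=["users"])
print(engine.complete("SELECT ").pattern)  # name|age
```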

Many rich constraints can be encoded in just a few lines of Python. [Image: table with example constraints that can be encoded in CSD]
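For instance, the facet-cardinality issue from above fits in a few lines. This is a sketch: the `MAX_FACETS` threshold and the pandas dataframe are assumptions.

```python
# Example constraint a completion engine could apply while completing the
# `facet` field of a partial Vega-Lite spec: only offer low-cardinality
# columns from the user's dataframe `df`.
MAX_FACETS = 50  # illustrative threshold

def facetable_columns(df):
    """Columns that are safe to offer when completing a facet encoding."""
    return [c for c in df.columns if df[c].nunique() <= MAX_FACETS]

# The engine then emits a regex matching only these column names, so the
# model physically cannot facet on a 10k-distinct-value column.
```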
But enforcing these constraints during sampling is not trivial. The programming language and the language model can have arbitrarily misaligned tokens, and regular lexers/parsers don't like dealing with incomplete tokens (like an open quote without a closing quote). [Image: example of Vega-Lite code, highlighting that Byte-Pair Encoding…]
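You can see the misalignment directly. A small demo, assuming the GPT-2 BPE via the tiktoken library (exact splits depend on the vocabulary):

```python
# BPE pieces straddle the programming language's token boundaries.
import tiktoken

enc = tiktoken.get_encoding("gpt2")
code = '{"mark": "bar"}'
pieces = [enc.decode([t]) for t in enc.encode(code)]
print(pieces)
# Pieces like '{"' span two JSON tokens at once, and a program prefix can
# end mid-token (e.g., inside a string, with the quote still unclosed).
```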
That's the job of our CSD algorithm! Given a completion engine, it samples from the language model while enforcing the constraints, making sure to only call the completion engine at its own defined token boundaries, but letting the model output long BPE tokens as it normally does.
CSD uses a neat trick based on Brzozowski derivatives of regular expressions to do this efficiently.
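Here's a self-contained toy of the derivative trick (Python 3.10+, and not the paper's implementation): the derivative of a regex r with respect to a character c matches exactly the suffixes s such that c followed by s matches r, so feeding a candidate BPE token through character by character tells us whether it stays consistent with the constraint.

```python
# Toy Brzozowski derivatives over a tiny regex AST (illustrative only).
from dataclasses import dataclass

class Rx: pass

@dataclass
class Empty(Rx): pass            # matches nothing
@dataclass
class Eps(Rx): pass              # matches the empty string
@dataclass
class Chr(Rx): c: str            # matches one character
@dataclass
class Cat(Rx): a: Rx; b: Rx      # concatenation
@dataclass
class Alt(Rx): a: Rx; b: Rx      # alternation
@dataclass
class Star(Rx): a: Rx            # Kleene star

def nullable(r: Rx) -> bool:
    """Does r match the empty string?"""
    match r:
        case Eps() | Star(_): return True
        case Cat(a, b): return nullable(a) and nullable(b)
        case Alt(a, b): return nullable(a) or nullable(b)
        case _: return False

def deriv(r: Rx, c: str) -> Rx:
    """Regex matching { s : c + s is matched by r }."""
    match r:
        case Chr(d): return Eps() if c == d else Empty()
        case Cat(a, b):
            left = Cat(deriv(a, c), b)
            return Alt(left, deriv(b, c)) if nullable(a) else left
        case Alt(a, b): return Alt(deriv(a, c), deriv(b, c))
        case Star(a): return Cat(deriv(a, c), Star(a))
        case _: return Empty()

def is_empty(r: Rx) -> bool:
    """Does r match no strings at all?"""
    match r:
        case Empty(): return True
        case Cat(a, b): return is_empty(a) or is_empty(b)
        case Alt(a, b): return is_empty(a) and is_empty(b)
        case _: return False

def token_ok(r: Rx, token: str) -> bool:
    """Can this multi-character BPE token extend a match of r?"""
    for ch in token:
        r = deriv(r, ch)
        if is_empty(r):
            return False
    return True

# r = "ab|ac": the BPE token "ac" is fine, "ad" is not.
r = Alt(Cat(Chr("a"), Chr("b")), Cat(Chr("a"), Chr("c")))
print(token_ok(r, "ac"), token_ok(r, "ad"))  # True False
```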

While one could simply force the model to output certain tokens (e.g., { instead of {"), that makes the model degenerate, since that's not how the training set or the prompt were tokenized.
With Synchromesh (CSD + TST), we get improvements both in accuracy and in the rate of valid programs across three languages: SQL, Vega-Lite, and SMCalFlow. Synchromesh's impact is also larger when generating longer programs. [Image: table with accuracies for GPT-3 and Codex when paired with C…]
This is all done without fine-tuning these large models. We implemented everything on top of the public OpenAI API!
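For the curious, here's one plausible shape for that decoding loop on top of the legacy completions API. This is a hedged sketch: the model name, the top-5 cutoff, and the `allowed` predicate (standing in for the CSD check above) are assumptions, not the exact implementation.

```python
# Sketch of constrained sampling via a completions-style API.
import openai

def constrained_sample(prompt, allowed, max_len=256):
    out = ""
    while len(out) < max_len:
        resp = openai.Completion.create(
            model="code-davinci-002", prompt=prompt + out,
            max_tokens=1, logprobs=5, temperature=0,
        )
        top = resp["choices"][0]["logprobs"]["top_logprobs"][0]
        # Pick the most likely candidate token that passes the constraint.
        for piece, _ in sorted(top.items(), key=lambda kv: -kv[1]):
            if allowed(out, piece):
                out += piece
                break
        else:
            break  # no candidate is valid; in practice, widen the search
    return out
```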
This was done during my internship last year with @ProseMsft and fantastic collaborators at @MSFTResearch: @Skiminok, Vu Le, Ashish Tiwari, @gustavoas, Chris Meek, @SumitGulwani

arxiv.org/abs/2201.11227
