To build a chatbot you need data for your intent classification.
But what if you have too little training data?
Paraphrasing is one option for augmentation.
But what is a good paraphrase?
Almost all conditioned text generation models are validated on two factors:
1. Whether the generated text conveys the same meaning as the original context (Adequacy)
2. Whether the text is fluent / grammatically correct English (Fluency)
For instance, Neural Machine Translation outputs are tested for Adequacy and Fluency.
But a good paraphrase should be adequate and fluent while being as different as possible in surface lexical form. With respect to this definition, the three key metrics that measure the quality of paraphrases are:
1. Adequacy: Is the meaning preserved adequately?
2. Fluency: Is the paraphrase fluent English?
3. Diversity (lexical / phrasal / syntactic): How much has the paraphrase changed the original sentence?
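Adequacy and fluency are usually scored with learned models, but diversity can be approximated very cheaply. As a minimal illustrative sketch (not any particular framework's metric), Jaccard distance between the token sets of the original and the paraphrase gives a crude lexical-diversity score:

```python
# Crude lexical-diversity proxy: Jaccard distance between token sets.
# Higher = more surface change. Illustrative only; real pipelines pair
# this with learned adequacy/fluency scorers.

def lexical_diversity(original: str, paraphrase: str) -> float:
    a = set(original.lower().split())
    b = set(paraphrase.lower().split())
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)

# Identical sentences -> no diversity
print(lexical_diversity("book a flight to paris",
                        "book a flight to paris"))  # 0.0
# Heavily reworded paraphrase -> high diversity
print(round(lexical_diversity("book a flight to paris",
                              "reserve a plane ticket to paris"), 2))  # 0.62
```

A score of 0 means the paraphrase copied the input verbatim; a score near 1 means almost no word overlap (which may also signal lost adequacy, so the three metrics must be read together).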
What makes a paraphraser a good augmentor?
To train an NLU model we don't just need a lot of utterances; we need utterances annotated with intents and slots/entities.
A typical flow would be:
1. Given an input utterance and its annotations, a good augmentor produces N output paraphrases while preserving the intent and slots.
2. The output paraphrases are converted into annotated data using the input annotations from step 1.
3. The annotated data created from the output paraphrases then forms the training dataset for your NLU model.
But in general, being generative models, paraphrasers don't guarantee that slots/entities are preserved.
So the ability to generate high-quality paraphrases in a constrained fashion, without trading off the intents and slots for lexical dissimilarity, is what makes a paraphraser a good augmentor.
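One simple way to enforce that constraint downstream, sketched here under the assumption that slot values are surface strings (a stand-in for whatever your paraphraser and annotation scheme actually produce), is to keep only candidate paraphrases in which every annotated slot value still appears verbatim, so the input annotations can be transferred:

```python
# Hedged sketch: filter paraphrase candidates so the input slot
# annotations can be re-applied. `candidates` stands in for the raw
# output of any paraphrase model.

def slot_preserving(candidates, slots):
    """slots: dict of slot_name -> surface value from the input annotation.
    Returns only candidates that contain every slot value verbatim."""
    kept = []
    for text in candidates:
        lowered = text.lower()
        if all(value.lower() in lowered for value in slots.values()):
            kept.append(text)
    return kept

slots = {"city": "Paris", "date": "tomorrow"}
candidates = [
    "Get me a flight to Paris tomorrow",       # both slots preserved
    "Book travel to the French capital soon",  # slots lost -> dropped
]
print(slot_preserving(candidates, slots))
# ['Get me a flight to Paris tomorrow']
```

Exact string matching is deliberately strict; a real augmentor might instead constrain decoding so slots can never be rewritten in the first place.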
Parrot
A practical and feature-rich paraphrasing framework to augment human intents in text form to build robust NLU models for conversational engines.
How to get your dream job in Data Science if you are a career changer?
First you have to get past HR and their antiquated methods. This is only possible through contacts or unconventional routes.
But what are good ways?
The middleman
Someone with a connection to the company, or who works there, who can hand over your application.
The direct way, but be careful: this must be done well.
You look for a contact person via LinkedIn, but the pitch has to be right and you really have to have an interesting application. Otherwise it looks like spam and you are out of the game forever.
Does BERT Pretrained on Clinical Notes Reveal Sensitive Data?
Large Transformers pretrained over clinical notes from Electronic Health Records (EHR) have afforded substantial gains in performance on predictive clinical tasks.
The cost of training such models, and the necessity of data access to do so, coupled with their utility, motivates parameter sharing, i.e., the release of pretrained models such as ClinicalBERT.
While most efforts have used deidentified EHR, many researchers have access to large sets of sensitive, non-deidentified EHR with which they might train a BERT model (or similar).
Would it be safe to release the weights of such a model if they did?
How do you create a beautiful interface for your machine learning or data science project?
Handmade from scratch?
Any good tools?
Sure, there are incredible tools:
Beautiful ML & DS interfaces
Gradio
Quickly create customizable UI components around your ML models: drag and drop your own images, paste your own text, record your own voice, and see what the model outputs.
Dash
Dash apps bring Python analytics to everyone with a point-&-click interface to models written in Python, R & Julia, vastly expanding the notion of what's possible in a traditional dashboard.
FELIX is a fast and flexible text-editing system that models large structural changes and achieves a 90x speed-up compared to seq2seq approaches, while delivering impressive results on four monolingual generation tasks.
Compared to traditional seq2seq methods, FELIX has the following three key advantages:
Sample efficiency: Training a high precision text generation model typically requires large amounts of high-quality supervised data.
FELIX uses three techniques to minimize the amount of required data:
(1) fine-tuning pre-trained checkpoints, (2) a tagging model that learns a small number of edit operations, and (3) a text insertion task that is very similar to the pre-training task.
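The tagging idea in (2) can be illustrated with a toy example (this is a sketch of the general tag-then-insert style of text editing, not the actual FELIX implementation): instead of generating the output from scratch, the model predicts a small edit operation per source token, plus insertions that a second model fills in:

```python
# Illustrative tag-and-insert text editing (not the real FELIX code):
# each source token gets a KEEP/DELETE tag, and new tokens can be
# inserted after a given position.

def apply_edits(tokens, tags, insertions):
    """tags: one of "KEEP"/"DELETE" per token.
    insertions: dict of token index -> list of tokens to insert after it."""
    out = []
    for i, (tok, tag) in enumerate(zip(tokens, tags)):
        if tag == "KEEP":
            out.append(tok)
        # "DELETE" simply drops the token
        out.extend(insertions.get(i, []))
    return out

src = ["The", "big", "very", "loud", "cat"]
tags = ["KEEP", "KEEP", "DELETE", "DELETE", "KEEP"]
# Insert "black" after position 1 ("big")
edited = apply_edits(src, tags, {1: ["black"]})
print(" ".join(edited))  # The big black cat
```

Because the edit vocabulary is tiny compared to a full generation vocabulary, far less supervised data is needed to learn it, which is the sample-efficiency point above.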
Learn more about spaCy v3.0 and its new features like: transformer-based pipelines, the new training config and workflow system to help you take projects from prototype to production.
STEP BY STEP
01:54 – State-of-the-art transformer-based pipelines
05:03 – Declarative configuration system
11:06 – Workflows for end-to-end projects
17:03 – Trainable and rule-based components
21:43 – Custom models in any framework
26:20 – Features and summary
They first introduce an optimization scheme that utilizes a CLIP-based loss to modify an input latent vector in response to a user-provided text prompt.
Next, they describe a latent mapper that infers a text-guided latent manipulation step for a given input image, allowing faster and more stable text-based manipulation.
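At its core, the CLIP-based loss in the first scheme measures how far the generated image's CLIP embedding is from the text prompt's CLIP embedding, typically as a cosine distance that the latent vector is optimized to minimize. A minimal sketch of that distance (the embeddings below are made-up stand-ins, not real CLIP outputs):

```python
# Sketch of the cosine-distance core of a CLIP-style loss.
# image_emb / text_emb are toy stand-ins for CLIP embeddings.
import math

def cosine_distance(u, v):
    """1 - cosine similarity; 0 when vectors point the same way."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

image_emb = [0.2, 0.9, 0.1]  # hypothetical embedding of the generated image
text_emb = [0.3, 0.8, 0.2]   # hypothetical embedding of the text prompt
loss = cosine_distance(image_emb, text_emb)
print(round(loss, 4))
```

In the actual optimization scheme this loss is backpropagated through the image generator to update the input latent vector, usually alongside regularizers that keep the edit close to the original image.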