#ACL2019nlp is kicking off with "Latent structure models for NLP" by Andre F. T. Martins, Tsvetomila Mihaylova, @meloncholist, @vnfrombucharest

The tutorial slides can be found here: deep-spin.github.io/tutorial/acl.p…

updates here 👇👇

#ACL2019nlp
@meloncholist @vnfrombucharest Andre is starting with a motivational introduction about some structured prediction tasks (POS tagging, dependency parsing, word alignment)
@meloncholist @vnfrombucharest * #NLProc before (pipelines) and after (end to end)

* end-to-end models learn latent continuous vectors that are useful for downstream tasks, but which might not be as interpretable as structured hidden representations.
This motivates latent structure models, which have a long history in #NLProc: HMMs, CRFs, PCFGs.
Those models are mostly trained using EM, under some strict assumptions.
The tutorial is about:
structured methods motivated by linguistic intuition, trained end-to-end (e.g. with RL methods) in stochastic computation graphs.
Explaining the probability simplex in the unstructured case, and its counterpart for the structured case: the marginal polytope.

Each vertex represents a binary vector which corresponds to a structured representation (e.g. dependency tree)
Points inside the marginal polytope represent probability distributions over those binary vectors. Computing the most likely structure means searching a very high-dimensional space.
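To make the polytope picture concrete, here is a tiny toy sketch of mine (not from the slides): structures as binary indicator vectors (the vertices), and a distribution over them giving a point inside the polytope.

```python
import torch

# three hypothetical "structures" over 4 binary parts (e.g. candidate arcs)
structures = torch.tensor([[1., 0., 1., 0.],
                           [0., 1., 1., 0.],
                           [1., 0., 0., 1.]])   # each row is a vertex of the polytope
p = torch.tensor([0.5, 0.3, 0.2])               # a distribution over the structures

marginals = p @ structures                      # expected indicator vector
print(marginals)                                # a point inside the marginal polytope
```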
There are ways to deal with this large search space :
* Greedily: incremental decisions (shift-reduce, beam search, etc.)
* Globally (e.g. using Viterbi), which guarantees an optimal solution while handling global and local constraints; the only disadvantage is that these methods rely on hard assumptions.
Nice explanation of why argmax breaks the gradient flow: backpropagation is thus not suitable for training networks directly with discrete latent structures.
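A minimal sketch of mine (not the speakers' code) showing the problem: the hard argmax choice passes no gradient back to the scores, while a softmax relaxation does.

```python
import torch
import torch.nn.functional as F

scores = torch.tensor([1.0, 2.0, 0.5], requires_grad=True)
values = torch.tensor([0.1, 0.7, 0.2])          # some downstream quantity

# hard path: argmax is piecewise constant, so the one-hot choice is detached
# from the graph and no gradient can reach the scores through it
z_hard = F.one_hot(scores.argmax(), scores.numel()).float()

# soft path: softmax is a smooth surrogate, so gradients do reach the scores
z_soft = torch.softmax(scores, dim=0)
loss = (z_soft * values).sum()
loss.backward()
print(z_hard)          # discrete, but useless for backprop
print(scores.grad)     # non-zero gradient thanks to the relaxation
```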
There are several ways to deal with the problem of back-propagating through discrete decisions

1. Pre-train external classifier (over cont. latent)
2. Multi-task learning
3. Stochastic latent variables
4. Gradient surrogates
5. Continuous relaxation (Gumbel softmax)
@meloncholist is now presenting reinforcement learning
@meloncholist The use case in this part is a shift-reduce parser, SPINN, trained in an "unsupervised" way where the only training signal is the downstream task (not direct supervision for parsing) and the shift-reduce decisions are modelled as a discrete latent structure "z".
@meloncholist Vanilla SPINN with REINFORCE fails to really learn syntax.
Mainly because of two problems: high variance and coadaptation.
@meloncholist High variance is caused by the large search space, in which only one correct answer exists.
"Control variates" a.k.a baselines are a common way to reduce the variance of reinforcement learning.
@meloncholist Proximal policy optimization is one of the ways to deal with the coadaptation issue.
REINFORCE+SPINN works with all of those previous tricks; however, it still does not work on complex English syntax.
Now Tsvetomila Mihaylova @tsvetomila is talking about gradient surrogates.

The straight-through estimator, the reparametrization trick, the Gumbel softmax trick.
@tsvetomila The straight-through estimator trick, simply:
in the forward pass you pretend "z" is discrete so you can take discrete decisions; in the backward pass you pretend "z" is continuous and propagate error gradients through it.
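A minimal sketch of mine of the straight-through idea in PyTorch, assuming a categorical "z" (not the presenters' code):

```python
import torch

def straight_through(scores):
    """Forward: hard one-hot argmax. Backward: gradients of the softmax."""
    p_soft = torch.softmax(scores, dim=-1)
    index = p_soft.argmax(dim=-1, keepdim=True)
    p_hard = torch.zeros_like(p_soft).scatter_(-1, index, 1.0)
    return p_hard + p_soft - p_soft.detach()   # value = p_hard, gradient = that of p_soft

scores = torch.tensor([0.2, 1.5, -0.3], requires_grad=True)
z = straight_through(scores)                   # one-hot in the forward pass
(z * torch.tensor([0.1, 0.7, 0.2])).sum().backward()
print(z)                                       # discrete decision
print(scores.grad)                             # yet a (biased) gradient reaches the scores
```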
@tsvetomila Now:
* the reparameterization trick [Kingma and Welling, 2014]
* the Gumbel softmax trick

I am going to skip details but here's a nice read about it:
casmls.github.io/general/2017/0…
and the original paper Jang et al. ICLR 2017
arxiv.org/pdf/1611.01144…
@tsvetomila Example of using ST: the Gumbel-softmax trick.
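PyTorch ships a helper for exactly this; a small sketch of mine using it (the toy loss is made up, just to show that gradients flow):

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([0.2, 1.5, -0.3], requires_grad=True)

# hard=True gives the straight-through variant: one-hot sample in the forward
# pass, soft Gumbel-softmax gradients (temperature tau) in the backward pass
z = F.gumbel_softmax(logits, tau=0.5, hard=True)

loss = (z * torch.tensor([0.1, 0.7, 0.2])).sum()   # made-up downstream loss
loss.backward()
print(z)             # one-hot sample
print(logits.grad)   # logits still receive a (low-variance, biased) gradient
```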
@tsvetomila Summary

* RL (REINFORCE): unbiased but high-variance estimators of the loss gradient

* Gradient surrogates (Gumbel softmax): biased but low-variance approximations of the loss gradient
@tsvetomila @vnfrombucharest starts again after the break, on end-to-end differentiable relaxations.
@tsvetomila @vnfrombucharest Smooth relaxations are ways to overcome the non-differentiability of the argmax operations that produce discrete latent variables, e.g. by using the outputs of the softmax function instead of the discrete argmax choices.
@tsvetomila @vnfrombucharest Revisiting softmax: it is the unique solution that maximizes the expected score plus an entropy term pushing the solution towards the middle of the simplex.
@tsvetomila @vnfrombucharest Softmax, Sparsemax and lots of other generalizations in between.
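For intuition, a small sketch of mine comparing softmax with sparsemax (Martins & Astudillo, 2016), implemented as the Euclidean projection onto the simplex:

```python
import torch

def sparsemax(scores):
    """Euclidean projection of a 1-D score vector onto the probability simplex."""
    z, _ = torch.sort(scores, descending=True)
    k = torch.arange(1, scores.numel() + 1, dtype=scores.dtype)
    cumsum = torch.cumsum(z, dim=0)
    support = 1 + k * z > cumsum               # which sorted coordinates stay in the support
    k_max = support.sum()                      # support size
    tau = (cumsum[k_max - 1] - 1) / k_max      # threshold
    return torch.clamp(scores - tau, min=0.0)

scores = torch.tensor([1.5, 1.2, -0.8])
print(torch.softmax(scores, dim=0))            # dense: every entry > 0
print(sparsemax(scores))                       # sparse: low-scoring entries get exactly 0
```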
@tsvetomila @vnfrombucharest Back again to the marginal polytope: a comparison between the structured and unstructured cases.
@tsvetomila @vnfrombucharest Just as softmax approximates argmax, marginals approximate MAP.
Here are some examples from the #nlproc literature for different structured prediction problems.
@tsvetomila @vnfrombucharest got lost a bit but here are some related references:
Structured Attention Networks Kim et al. 2017
arxiv.org/abs/1702.00887
Learning Structured Text Representations
arxiv.org/pdf/1705.09207…
Backpropagating through marginals (see the sketch below):
pros:
* familiar to NLPers
* all computations are exact, so no approximation
cons:
* forward-pass marginals are dense: we cannot expect them to yield a single tree, and wrong paths may still have probability > 0
* backprop through dynamic programming is tricky
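A hedged sketch of mine of the "marginals by backprop" idea for a linear chain: the marginals are the gradient of the log-partition function, which autograd gives us through the forward algorithm. It also shows the "dense marginals" con in action.

```python
import torch

def log_partition(emissions, transitions):
    """Forward algorithm in log space. emissions: (T, K), transitions: (K, K)."""
    alpha = emissions[0]
    for t in range(1, emissions.size(0)):
        # alpha[j] at step t = logsumexp_i(alpha[i] + transitions[i, j]) + emissions[t, j]
        alpha = torch.logsumexp(alpha.unsqueeze(1) + transitions, dim=0) + emissions[t]
    return torch.logsumexp(alpha, dim=0)

T, K = 4, 3                                     # toy sequence length and tag set size
emissions = torch.randn(T, K, requires_grad=True)
transitions = torch.randn(K, K)

log_Z = log_partition(emissions, transitions)
log_Z.backward()
print(emissions.grad)   # unary marginals p(tag k at position t): dense, all > 0
```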
@tsvetomila @vnfrombucharest The sparsemax counterpart in structured prediction is "SparseMAP".
Most of the previous solutions involved sampling, because the sum over all possible discrete choices (trees) is intractable. So what ideas could redefine π (the parsing model) to avoid sampling?
It turns out that SparseMAP is a nice way to solve this, because you only need to compute the expectation of the loss over the few trees with non-zero weight (which is exactly what SparseMAP yields).
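A toy illustration of mine (not the actual SparseMAP solver): once SparseMAP returns a sparse posterior over a handful of trees, the expected downstream loss is an exact, short weighted sum; the tree names and losses below are made-up placeholders.

```python
# hypothetical SparseMAP output: only three trees carry non-zero weight
posterior = {"tree_a": 0.6, "tree_b": 0.3, "tree_c": 0.1}

# placeholder downstream loss per tree (made up for illustration)
tree_loss = {"tree_a": 0.2, "tree_b": 0.9, "tree_c": 0.5}

# exact expectation over the sparse support: a short sum, no sampling needed
expected_loss = sum(p * tree_loss[t] for t, p in posterior.items())
print(expected_loss)
```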
Andre F. T. Martins is back again with conclusions.
Unsupervised training of parsing using signals from end-to-end downstream tasks, trained through any of the previously explained three classes of approaches, does not (yet) converge to correct grammatical structures.
Comment: there is a nice paper by Choshen et al. (2019) on the weaknesses of RL for machine translation, where they show that observed gains may be due to effects unrelated to the training signal. I wonder whether this might also be why we don't yet converge to correct grammatical structures.
The presentation is over; now Q&A.
Thanks for the cool tutorial 👏👏👏
Andre F. T. Martins
@tsvetomila
@meloncholist
@vnfrombucharest

Check also their paper Monday 1:50pm in the poster session.
"Sparse Sequence-to-Sequence Models"
arxiv.org/pdf/1905.05702…