Ian Goodfellow @goodfellow_ian
Thread on how to review papers about generic improvements to GANs
There are a lot of papers about theoretical or empirical studies of how GANs work, papers about how to do new strange and interesting things with GANs (e.g. the first papers on unsupervised translation), new metrics, etc. This thread isn't about those.
There are also a lot of papers about GANs as part of a larger system, like GANs for semi-supervised learning, differential privacy, dataset augmentation, etc. This thread also isn't about those---evaluate them in terms of the larger system's application area.
This thread is about new methods that are meant to generically make GANs train more reliably or produce better samples, etc.
My #1 recommendation is that reviewers of GAN papers should read "Are GANs Created Equal?" ( arxiv.org/pdf/1711.10337… ) for an explanation of why empirical work in this area is hard and how to do it right
Another good paper to read for background is "A note on the evaluation of generative models" ( arxiv.org/abs/1511.01844 ) which explains why it is possible to have models with great samples and bad likelihood or vice versa and other issues with metrics for generative models.
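One of that paper's arguments is easy to see with a little arithmetic: mix a decent model with a mostly-garbage one and the log-likelihood barely moves, even though almost every sample is garbage. A minimal numeric sketch (the per-image log-likelihood below is an assumed, illustrative figure, not a measurement):

```python
# Sketch of the "good likelihood, bad samples" argument from
# "A note on the evaluation of generative models".
import math

good_ll = -3000.0        # assumed avg log-likelihood (nats/image) of a decent model q
noise_fraction = 0.99    # mixture weight on a garbage "noise" component

# For p(x) = (1 - noise_fraction) * q(x) + noise_fraction * n(x),
# p(x) >= (1 - noise_fraction) * q(x), so the log-likelihood drops by at most:
max_penalty = -math.log(1.0 - noise_fraction)   # ~4.6 nats

print(f"worst-case penalty: {max_penalty:.1f} nats on top of {good_ll:.0f}")
# The likelihood barely changes, yet 99% of samples come from the noise model.
```

A few nats is a rounding error next to per-image log-likelihoods in the thousands of nats, so likelihood alone cannot distinguish these two models even though their samples look completely different.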
One difficulty with GAN papers is assessing novelty. There are so many proposed GAN improvements that it's hard to keep track of them all and tell if a new method is really new. Run Google searches for 4-5 ways of rephrasing the idea to see if it has already been proposed.
One good resource for keeping up with many GAN variants is the GAN zoo: github.com/hindupuravinas…
If a proposed method isn't really new, the paper might still be worthwhile, but reviewers should make sure the paper properly acknowledges the previous work.
As far as metrics go, Fréchet Inception Distance (or the intra-class version of it) is probably the best metric available today for just evaluating generic GAN performance. For datasets other than ImageNet it makes sense to use models other than Inception to define the distance.
Some papers that focus on special cases might be able to *include* other metrics (e.g. GANs with a Real NVP generator can actually report exact likelihood), but if a paper *excludes* FID I would expect it to make a good case for why.
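For context on what FID measures: it fits a Gaussian to the feature activations of real and of generated images and computes the Fréchet distance between the two Gaussians. A minimal sketch, assuming the features have already been extracted by whatever network suits the dataset (the array names are placeholders, not any particular library's API):

```python
# Fréchet distance between two sets of feature activations,
# e.g. Inception features for ImageNet-like data or another
# extractor's features otherwise. real_feats and fake_feats are
# hypothetical (N, D) arrays of activations.
import numpy as np
from scipy import linalg

def frechet_distance(real_feats, fake_feats, eps=1e-6):
    mu_r, mu_f = real_feats.mean(axis=0), fake_feats.mean(axis=0)
    cov_r = np.cov(real_feats, rowvar=False)
    cov_f = np.cov(fake_feats, rowvar=False)

    # Matrix square root of the product of covariances; add a small
    # ridge if numerical issues make the result non-finite.
    covmean, _ = linalg.sqrtm(cov_r @ cov_f, disp=False)
    if not np.isfinite(covmean).all():
        offset = np.eye(cov_r.shape[0]) * eps
        covmean, _ = linalg.sqrtm((cov_r + offset) @ (cov_f + offset), disp=False)
    covmean = covmean.real  # discard tiny imaginary parts from sqrtm

    diff = mu_r - mu_f
    return diff @ diff + np.trace(cov_r + cov_f - 2.0 * covmean)
```

Reported FID also depends on the number of samples and on the feature extractor, so papers should state both when quoting scores.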
A lot of papers encourage the reader to form their opinion of the method mostly by looking at the samples. This is usually a bad sign.
The main way I know to use samples to make a case that there is an improvement is to generate samples from a domain that no one has been able to solve with previous techniques.
For example, generating ImageNet samples with a single GAN was very hard, with many papers showing basically failed attempts. SN-GAN succeeded in making recognizable samples from all the classes. From this we know SN-GAN was a major improvement.
(It's still possible that the improvement comes from factors other than the proposed method, like a new, bigger architecture, etc.)
Many papers show samples from datasets like CIFAR-10 or CelebA and ask the reviewer to be impressed by them. For these I'm never really sure what I'm meant to be looking for. The tasks are mostly solved so they've mostly lost signal for me.
I also don't really know how to rank images with one kind of minor defect against other images with a qualitatively different kind of minor defect---is it better to have a touch of wobble or a touch of checkerboarding, etc.?
Because of this, I don't generally regard CelebA, CIFAR-10 samples, etc. as anything more than a sanity check that the method isn't broken.
Reviewers should be very suspicious of anyone who has implemented their own baseline. There are a lot of subtle ways to screw up deep learning algorithms and authors have an incentive not to check their own baseline very carefully.
Usually, at least one of the baselines should be a result published in another paper, where the authors of that other paper had some incentive to get a good result. This way the evaluations are at least incentive-compatible.
Reviewers should check whether other papers have implemented models that perform the same task and check their scores. It's pretty common to cite a paper and then show worse images / scores than the paper actually reported.
Of course, other fields have trouble with sandbagging the baseline too, but I feel like it's particularly bad for GAN papers.
Sometimes, if a paper studies a new task or a rarely evaluated aspect of a previously studied task, it is necessary for the authors to implement their own baseline. In this case, maybe as much as half of the paper should be devoted to demonstrating that the baseline is correct.
It is extremely important to explain where all hyperparameters came from. Often, new methods just seem like improvements because the authors spent more time informally optimizing the hyperparameters for the new method.
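One concrete way papers can address this, in the spirit of the "Are GANs Created Equal?" protocol, is to give every method the same hyperparameter search budget over the same ranges and report every configuration tried. A minimal sketch of such an equal-budget random search (`train_and_score` is a hypothetical function standing in for training a model and computing, say, FID):

```python
# Equal-budget random hyperparameter search: the baseline and the
# proposed method get the same number of trials over the same ranges,
# and every trial is logged so the search can be reported in the paper.
import random

SEARCH_SPACE = {
    "lr":         lambda: 10 ** random.uniform(-5, -3),
    "beta1":      lambda: random.choice([0.0, 0.5, 0.9]),
    "batch_size": lambda: random.choice([32, 64, 128]),
}
N_TRIALS = 50  # identical budget for every method being compared

def search(method_name, train_and_score, seed=0):
    random.seed(seed)  # the same configurations are drawn for every method
    trials = []
    for _ in range(N_TRIALS):
        config = {name: sample() for name, sample in SEARCH_SPACE.items()}
        score = train_and_score(method_name, config)  # e.g. FID, lower is better
        trials.append((score, config))
    trials.sort(key=lambda t: t[0])
    return trials  # report the full list, not just the best entry
```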
Achievement unlocked: max twitter thread length. I will continue in another thread