#ACL2018
We spent so much time in ML talking about generalization bounds, and it turns out that maybe generalization wasn’t such a good idea.
#ACL2018
Traditional intelligent agents: signal processing -> likelihoods over an ontology -> rule-based system -> action.
#ACL2018
Rule-based systems don’t have to be a whole bunch of if-then-else clauses. (EMB: Indeed!!)
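(A minimal sketch of that point, mine rather than the speaker's: a rule-based system as declarative rules plus a generic forward-chaining engine, not a pile of if-then-else. The rules and facts are invented for illustration.)

```python
# Declarative rules: (set of premises, conclusion). Invented for illustration.
RULES = [
    ({"has_fur", "gives_milk"}, "mammal"),
    ({"mammal", "lays_eggs"}, "monotreme"),
]

def forward_chain(facts, rules):
    """Repeatedly fire any rule whose premises are all known facts."""
    facts = set(facts)
    changed = True
    while changed:
        changed = False
        for premises, conclusion in rules:
            if premises <= facts and conclusion not in facts:
                facts.add(conclusion)
                changed = True
    return facts

print(forward_chain({"has_fur", "gives_milk", "lays_eggs"}, RULES))
# -> {'has_fur', 'gives_milk', 'lays_eggs', 'mammal', 'monotreme'}
```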
#ACL2018
Tried and true. Advantage: Interpretable by humans. Disadvantage: Can’t really learn anything.
#ACL2018
Critical question is how info flows around this thing. At the moment it flows one way: from sensing to reasoning.
#ACL2018
Can’t update to tell it afterwards that there are no blue mammals, or that the rules for overtaking a school bus have changed. Or tell the car that that thing in front of it is an elephant.
#ACL2018
Problem of ontology v. phenomenology. In computer vision, a fixed ontology makes a whole lot of sense. ImageNet: 1000 images per concept/object type, with the goal to keep adding and adding.
#ACL2018
But that’s nonsense because the number of classes isn’t fixed, and the relationship between observations and classes isn’t one to one.
#ACL2018
Problem is bridging the gap between deep learning and reasoning / learning on the fly and learning by instruction.
#ACL2018
Graphic with implicit associative <-> explicit symbolic reasoning, with Prolog as an example of the latter (just an example).
#ACL2018
Like learning to drive as humans. We start doing it all very explicitly and then move to doing it much more automatically. (While still doing explicit/conscious processing of intentions of where we’re going.)
#ACL2018
Driving as a model for a lot of AI problems that we face. But CNNs don’t do a good job of it—only the associative part. They’re really good at signal processing (better than humans), but that’s not all that’s required.
#ACL2018
Do we really want to train a multiplier? I hope you find that so obvious (that the answer is no) that you’re wondering why I even bring it up.
#ACL2018
But this is an argument we keep having with the younger generation, whose only solution is to build a bigger network.
#ACL2018
Cite to a paper showing that with a big enough network you can train it to do addition, even if there’s noise in the images … up to 7 digits.
#ACL2018
A form of addition where you can add noise to the images isn’t a particularly useful thing to have.
#ACL2018
The whole point of the paper is that it’s a really bad idea. We already know how to do multiplication. Training a network to do it doesn’t make sense.
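(To make that concrete, a toy sketch of my own, not from the talk: "training a multiplier" versus just multiplying. The learned version is approximate, only valid on the training range, and needs thousands of examples.)

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(10_000, 2))   # operand pairs
y = X[:, 0] * X[:, 1]                      # exact labels, free of charge

# "Train" a one-hidden-layer random-feature network by least squares.
W = rng.normal(size=(2, 256))              # fixed random hidden weights
b = rng.normal(size=256)
w, *_ = np.linalg.lstsq(np.tanh(X @ W + b), y, rcond=None)

test = np.array([[3.0, 4.0]])
approx = float(np.tanh(test @ W + b) @ w)  # ~12, if we're lucky, and only in-range
exact = 3.0 * 4.0                          # 12, always, everywhere, no data needed
```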
#ACL2018
“Just build a bigger network” doesn’t scale, particularly combinatorially. ImageNet is one label per image. But the truth is that images typically have many, many things in them.
#ACL2018
Other counter-argument: CNNs are Turing complete, so they can do anything. True, but not helpful. Just because they’re Turing complete doesn’t make them a good way to do everything.
#ACL2018
VQA is a fantastic challenge, because it brings together natural language and images. It’s where the rubber hits the road.
#ACL2018
Brings together information (in language) and the real world (in images). You can take a picture of almost anything. (EMB aside: DISAGREE on that last point.)
#ACL2018
Training data is images + questions + answers. Can give the trained system questions it’s never been asked before: What is the mustache made of? (Bananas.) Will work, even though no one’s trained a mustache detector.
#ACL2018
Great because: on the path to AI and can still publish at CVPR, and it’s the opposite of chess, Go, Atari games, and it’s a step towards being able to learn, visually.
#ACL2018
If we can do VQA, then maybe we can do dialogue, and then maybe we can learn from the real world. (EMB aside: If VQA then dialogue seems like a huge jump to me.)
#ACL2018
Canonical VQA architecture: CNN for the image, RNN for the text, then an RNN over the resulting vectors from each.
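(For concreteness, a minimal sketch of that canonical pipeline; my reconstruction in PyTorch, not the speaker's code, with sizes and names illustrative.)

```python
import torch
import torch.nn as nn
import torchvision.models as models

class CanonicalVQA(nn.Module):
    def __init__(self, vocab_size, answer_vocab_size, embed=300, hidden=512):
        super().__init__()
        cnn = models.resnet18(weights=None)      # image encoder (CNN)
        cnn.fc = nn.Identity()                   # keep the 512-d features
        self.cnn = cnn
        self.embed = nn.Embedding(vocab_size, embed)
        self.rnn = nn.LSTM(embed, hidden, batch_first=True)  # question encoder
        self.classifier = nn.Linear(512 + hidden, answer_vocab_size)

    def forward(self, image, question_tokens):
        v = self.cnn(image)                                  # (B, 512)
        _, (h, _) = self.rnn(self.embed(question_tokens))    # final hidden state
        fused = torch.cat([v, h[-1]], dim=-1)                # combine the vectors
        return self.classifier(fused)                        # answer logits
```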
#ACL2018
Counting questions — hard and interesting. How many horses are in that image? It’ll say 2, regardless of how many there are. Because that’s the most frequent answer to “how many” questions.
#ACL2018
How many unicorns are in this photo? (Picture of two horses: two. Picture of soccer game: two.)
#ACL2018
[More funny examples in that vein]. VQA systems can’t deal with anything that’s not right in the pixels. NLP does much more complicated stuff routinely.
#ACL2018
Traditional NLP QA methods are complex. (Image of system architecture for Watson.) VQA is relying on the answer being represented in the question.
#ACL2018
In VQA we’re not even covering the training data. Standard practice was to only use the 1000 most popular answers and do a 1000-way classifier on them.
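(That standard practice in miniature; my sketch, with field names assumed.)

```python
from collections import Counter

def build_answer_vocab(train_examples, k=1000):
    """train_examples: iterable of dicts with an 'answer' field (assumed schema)."""
    counts = Counter(ex["answer"] for ex in train_examples)
    top_k = [ans for ans, _ in counts.most_common(k)]
    return {ans: i for i, ans in enumerate(top_k)}

# Any answer outside the top 1000 simply cannot be predicted: the training
# data itself isn't covered. It also means "2" wins most "how many"
# questions just by being the modal answer, as in the horses example above.
```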
#ACL2018
“How many mammals are in this image?” Having a mammal detector is a bad idea. Dog, cat, bird detectors make sense.
#ACL2018
If you want to be able to answer the Q above, need to bring in other information, something which has been taboo in computer vision for a while.
#ACL2018
One thing we did: Bring in DBpedia (processed version of Wikipedia). Feed detections from image into DBpedia, which gives back text, which you can add to the LSTM that processes question & answer. (Complex system diagram.)
#ACL2018
Method generates its own caption, uses that caption to go into DBpedia, primes the LSTM with returned text, and can give better answers.
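(A hedged sketch of that DBpedia lookup step; mine, assuming the SPARQLWrapper package and DBpedia's public endpoint. The actual system's interface may differ.)

```python
from SPARQLWrapper import SPARQLWrapper, JSON

def dbpedia_abstract(concept):
    """Fetch English abstract text for a detected concept, e.g. 'Elephant'."""
    sparql = SPARQLWrapper("https://dbpedia.org/sparql")
    sparql.setQuery(f"""
        PREFIX dbr: <http://dbpedia.org/resource/>
        PREFIX dbo: <http://dbpedia.org/ontology/>
        SELECT ?abstract WHERE {{
          dbr:{concept} dbo:abstract ?abstract .
          FILTER (lang(?abstract) = "en")
        }}""")
    sparql.setReturnFormat(JSON)
    results = sparql.query().convert()
    return [b["abstract"]["value"] for b in results["results"]["bindings"]]

# The returned text can then be fed to the LSTM alongside the question.
```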
#ACL2018
Much better would be explicit reasoning — use a knowledge base: fact-based VQA. KB in the form of RDF tuples.
#ACL2018
KB only has a tiny % of all info you’d want. <Obama, president, USA> but not <everything, gravity, everything>
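(The RDF-triple idea in miniature; illustrative only, using rdflib and a made-up namespace.)

```python
from rdflib import Graph, Namespace

EX = Namespace("http://example.org/")
g = Graph()
g.add((EX.Obama, EX.president, EX.USA))   # <Obama, president, USA>

# Explicit facts are easy to store and query...
for row in g.query("""
    PREFIX ex: <http://example.org/>
    SELECT ?who WHERE { ?who ex:president ex:USA . }"""):
    print(row.who)   # http://example.org/Obama
# ...but there is no triple for <everything, gravity, everything>.
```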
#ACL2018
Neural Turing Machine (from DeepMind) solves some of the problems with SPARQL. Can basically train the machine to do CS 101 problems. An LSTM connected to external memory (not neural-net-type memory).
#ACL2018
We came up with a VQA machine which similarly uses the VQA training set to learn to use other algorithms—things we already know how to do in computer vision (counting, segmentation…)
#ACL2018
Great thing you get out of that process: A reason for the answer, in the form of an attention map for each word.
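(Very roughly, the shape of that idea; my own simplification, and the module and helper names here are hypothetical.)

```python
def count_objects(image, category):     # hypothetical off-the-shelf counter
    ...

def segment(image, category):           # hypothetical segmentation module
    ...

MODULES = {"count": count_objects, "segment": segment}

def vqa_machine(question, image, choose_module, parse_args):
    """choose_module / parse_args would be learned from VQA training data."""
    name = choose_module(question)      # e.g. "how many ..." -> "count"
    args = parse_args(question)         # e.g. "horses"
    answer = MODULES[name](image, args)
    return answer, name, args           # the module + args used is the "reason"
```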
#ACL2018
Thus far all of the methods have had fixed “reasoning” (SPARQL for binary chaining; attention also has limitations), and none can answer questions about anything new.
#ACL2018
Next: Learning to reason. Input is question, image, and a separate set of info it can use to answer the question. The info needed is separated from the VQA process.
#ACL2018
Then at test time, give it the info it needs to answer the question (EMB: and it applies what it’s learned about reasoning)
#ACL2018
Big advantage: Can ask questions about things it didn’t see in training AND it can give answers that it didn’t see in training. There’s no fixed ontology.
#ACL2018
Based on a meta-learning approach, with dynamic prototypes. And the best bit is that it works. (Not as well as just over-fitting to the dataset, though.)
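(A loose sketch of the no-fixed-ontology part; mine, not the talk's actual meta-learning/dynamic-prototype model. The point: score answer candidates supplied at test time instead of using a fixed classifier head.)

```python
import torch
import torch.nn.functional as F

def answer_from_support(qi_embedding, support_answers, embed_answer):
    """qi_embedding: fused question+image vector, shape (d,).
    support_answers: answer strings handed to the model at test time.
    embed_answer: maps an answer string to a (d,) prototype vector."""
    protos = torch.stack([embed_answer(a) for a in support_answers])  # (n, d)
    scores = F.cosine_similarity(protos, qi_embedding.unsqueeze(0))   # (n,)
    return support_answers[int(scores.argmax())]
# New answers just mean new prototypes: nothing is baked into the head.
```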
#ACL2018
VQA is a great excuse to take real steps towards AI, but we all need to move on to more interesting questions. Also need better metrics for VQA, which means solving a problem that somebody cares about.
#ACL2018
Example: Vision and language navigation—give a robot an instruction that relates to the world around it and have it follow those directions.
#ACL2018
Conclusion: It’s a wonderful time to be working on this stuff, and I hope that VQA becomes a way that our two fields can work more closely together.
#ACL2018
Q (Dost?): Old-school AI was focusing on toy tasks & hoping to scale. But for the real world, don’t you need too much data? Maybe small datasets that are more complex in structure would be a good solution.
#ACL2018
Coming up with the right datasets is critical. Problem in computer vision: We’re very focused on datasets that are (v. simple) discriminative tasks, rather than generative ones. Need datasets that ask systems to do higher level things.
#ACL2018
Q (John Patrick): One field where everything you talked about comes together: Surgery and medicine. Want robotics, have documentation, very significant ontology, lots of training data inside institutions. Do you have any news on where that’s going to go?
#ACL2018
A: Scanners — put a human in, run the scan through a system trained on data, and it gives labels.
#ACL2018
What you’d really like is a scanner where you can ask how many people who had this feature lived for the next 5 years and what’s the distinction between the two sets? Fantastic application area, trying to figure out how to create that dataset.
#ACL2018
A to question I missed: I think that unsupervised learning is the only hope, because there’s too much the system needs to know. But unsup. learning isn’t just jamming more images, etc. through a deep NN.
#ACL2018
Need to do more like people were doing in knowledge engineering, to build systems that can learn from the images.
#ACL2018
You can learn about tennis from images w/o that, but not in a way that scales to squash, badminton, ping-pong…
#ACL2018
Q: You say it doesn’t make sense to train to do arithmetic, but have an external symbolic system for that. Do you think it makes sense to do that for … facts in the world?
#ACL2018
A: Real value to giving the system access to that kind of knowledge, but not sure what that source will be. And there’s also a whole lot of knowledge that won’t be in that ever. What it looks like when birds fly, how to catch a ball…
#ACL2018
We all have lots of info that we learned by observation/experience that can never be written down.
#ACL2018
Your whole talk reminded me of Pierce’s (EMB: not sure name is right) work at Bell Labs — advocate of Shannon and Chomsky, chaired the ALPAC report, and really didn’t like AI. Really really didn’t like pattern matching, I think for reasons you point out.
#ACL2018
Q: Is it possible your field is ready to take those ideas up again, and stop working on what’s easy to solve?
#ACL2018
A: I agree with almost all of that. Everything good was in Bell Labs, associative reasoning is very stupid. But: It’s the best thing we’ve got.
#ACL2018
I’m not the only one in computer vision saying that deep learning has its limitations and that we need to look at a different class of problems.
#ACL2018
There’s a movement happening, though largely among those who have been around long enough to remember a time before deep learning.
#ACL2018
Q: Should representation itself be symbolic or distributed? (Cite to Jerry Fodor.) If a system is to become smarter, what kind of representation is going to be useful?
#ACL2018
A: That’s a key question. I think you’ll have to have both implicit and explicit representations. Humans have implicit representations… if you’ve forgotten how to walk, I can’t tell you how.
#ACL2018
But there’s also a whole bunch of stuff that’s represented explicitly. If you want to know that the president of the US is a particular person (EMB: taboo avoidance?), if you want to know the equations for gravity…
#ACL2018