#ACL2018
We spent so much time in ML talking about generalization bounds, and it turns out that maybe generalization wasn’t such a good idea.
#ACL2018
Traditional intelligent agents: signal processing -> likelihoods over an ontology -> rule-based system -> action.
#ACL2018
Rule-based systems don’t have to be a whole bunch of if-then-else clauses. (EMB: Indeed!!)
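(A minimal sketch of that point, mine rather than the speaker's: a rule-based system as declarative rules plus a generic forward-chaining engine, not a pile of if-then-else. The rules and facts are invented for illustration.)

```python
# Declarative rules: (set of premises, conclusion). Invented for illustration.
RULES = [
    ({"has_fur", "gives_milk"}, "mammal"),
    ({"mammal", "lays_eggs"}, "monotreme"),
]

def forward_chain(facts, rules):
    """Repeatedly fire any rule whose premises are all known facts."""
    facts = set(facts)
    changed = True
    while changed:
        changed = False
        for premises, conclusion in rules:
            if premises <= facts and conclusion not in facts:
                facts.add(conclusion)
                changed = True
    return facts

print(forward_chain({"has_fur", "gives_milk", "lays_eggs"}, RULES))
# -> {'has_fur', 'gives_milk', 'lays_eggs', 'mammal', 'monotreme'}
```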
#ACL2018
Tried and true. Advantage: Interpretable by humans. Disadvantage: Can’t really learn anything.
#ACL2018
Critical question is how info flows around this thing. At the moment it flows one way: from sensing to reasoning.
#ACL2018
Can’t update to tell it afterwards that there are no blue mammals, or that the rules for overtaking a school bus have changed. Or tell the car that that thing in front of it is an elephant.
#ACL2018
Problem of ontology v. phenomenology. In computer vision, a fixed ontology makes a whole lot of sense. ImageNet: 1000 images per concept/object type, with the goal to keep adding and adding.
#ACL2018
But that’s nonsense because the number of classes isn’t fixed, and the relationship between observations and classes isn’t one to one.
#ACL2018
Problem is bridging the gap between deep learning and reasoning / learning on the fly and learning by instruction.
#ACL2018
Graphic with implicit associative <-> explicit symbolic reasoning, with Prolog as an example of the latter (just an example).
#ACL2018
Like learning to drive as humans. We start doing it all very explicitly and then move to doing it much more automatically. (While still doing explicit/conscious processing of intentions of where we’re going.)
#ACL2018
Driving as a model for a lot of AI problems that we face. But CNNs don’t do a good job of it—only the associative part. They’re really good at signal processing (better than humans), but that’s not all that’s required.
#ACL2018
Do we really want to train a multiplier? I hope you find that so obvious (that the answer is no) that you’re wondering why I even bring it up.
#ACL2018
But this is an argument we keep having with the younger generation, whose only solution is to build a bigger network.
#ACL2018
Cite to a paper showing that with a big enough network you can train it to do addition, even if there’s noise in the images … up to 7 digits.
#ACL2018
A form of addition where you can add noise to the images isn’t a particularly useful thing to have.
#ACL2018
The whole point of the paper is that it’s a really bad idea. We already know how to do multiplication. Training a network to do it doesn’t make sense.
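(To make that concrete, a toy sketch of my own, not from the talk: "training a multiplier" versus just multiplying. The learned version is approximate, only valid on the training range, and needs thousands of examples.)

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(10_000, 2))   # operand pairs
y = X[:, 0] * X[:, 1]                      # exact labels, free of charge

# "Train" a one-hidden-layer random-feature network by least squares.
W = rng.normal(size=(2, 256))              # fixed random hidden weights
b = rng.normal(size=256)
w, *_ = np.linalg.lstsq(np.tanh(X @ W + b), y, rcond=None)

test = np.array([[3.0, 4.0]])
approx = float(np.tanh(test @ W + b) @ w)  # ~12, if we're lucky, and only in-range
exact = 3.0 * 4.0                          # 12, always, everywhere, no data needed
```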
#ACL2018
“Just build a bigger network” doesn’t scale, particularly combinatorially. ImageNet is one label per image. But the truth is that images typically have many, many things in them.
#ACL2018
Other counter-argument: CNNs are Turing complete, so they can do anything. True, but not helpful. Just because they’re Turing complete doesn’t make them a good way to do everything.
#ACL2018
VQA is a fantastic challenge, because it brings together natural language and images. It’s where the rubber hits the road.
#ACL2018
Brings together information (in language) and the real world (in images). You can take a picture of almost anything. (EMB aside: DISAGREE on that last point.)
#ACL2018
Training data is images + questions + answers. Can give the trained system questions it’s never been asked before: What is the mustache made of? (Bananas.) Will work, even though no one’s trained a mustache detector.
#ACL2018
Great because: on the path to AI and can still publish at CVPR, and it’s the opposite of chess, Go, Atari games, and it’s a step towards being able to learn, visually.
#ACL2018
If we can do VQA, then maybe we can do dialogue, and then maybe we can learn from the real world. (EMB aside: If VQA then dialogue seems like a huge jump to me.)
#ACL2018
Canonical VQA architecture: CNN for the image, RNN for the text, then an RNN over the resulting vectors from each.
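(For concreteness, a minimal sketch of that canonical pipeline; my reconstruction in PyTorch, not the speaker's code, with sizes and names illustrative.)

```python
import torch
import torch.nn as nn
import torchvision.models as models

class CanonicalVQA(nn.Module):
    def __init__(self, vocab_size, answer_vocab_size, embed=300, hidden=512):
        super().__init__()
        cnn = models.resnet18(weights=None)      # image encoder (CNN)
        cnn.fc = nn.Identity()                   # keep the 512-d features
        self.cnn = cnn
        self.embed = nn.Embedding(vocab_size, embed)
        self.rnn = nn.LSTM(embed, hidden, batch_first=True)  # question encoder
        self.classifier = nn.Linear(512 + hidden, answer_vocab_size)

    def forward(self, image, question_tokens):
        v = self.cnn(image)                                  # (B, 512)
        _, (h, _) = self.rnn(self.embed(question_tokens))    # final hidden state
        fused = torch.cat([v, h[-1]], dim=-1)                # combine the vectors
        return self.classifier(fused)                        # answer logits
```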
#ACL2018
Counting questions — hard and interesting. How many horses are in that image? It’ll say 2, regardless of how many there are. Because that’s the most frequent answer to “how many” questions.
#ACL2018
How many unicorns are in this photo? (Picture of two horses: two. Picture of soccer game: two.)
#ACL2018
[More funny examples in that vein]. VQA systems can’t deal with anything that’s not right in the pixels. NLP does much more complicated stuff routinely.
#ACL2018
Traditional NLP QA methods are complex. (Image of system architecture for Watson.) VQA is relying on the answer being represented in the question.
#ACL2018
In VQA we’re not even covering the training data. Standard practice was to only use the 1000 most popular answers and do a 1000-way classifier on them.
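(That standard practice in miniature; my sketch, with field names assumed.)

```python
from collections import Counter

def build_answer_vocab(train_examples, k=1000):
    """train_examples: iterable of dicts with an 'answer' field (assumed schema)."""
    counts = Counter(ex["answer"] for ex in train_examples)
    top_k = [ans for ans, _ in counts.most_common(k)]
    return {ans: i for i, ans in enumerate(top_k)}

# Any answer outside the top 1000 simply cannot be predicted: the training
# data itself isn't covered. It also means "2" wins most "how many"
# questions just by being the modal answer, as in the horses example above.
```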
#ACL2018
“How many mammals are in this image?” Having a mammal detector is a bad idea. Dog, cat, bird detectors make sense.
#ACL2018
If you want to be able to answer the Q above, need to bring in other information, something which has been taboo in computer vision for a while.
#ACL2018
One thing we did: Bring in DBpedia (processed version of Wikipedia). Feed detections from image into DBpedia, which gives back text, which you can add to the LSTM that processes question & answer. (Complex system diagram.)
#ACL2018
Method generates its own caption, uses that caption to go into DBpedia, primes the LSTM with returned text, and can give better answers.
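(A hedged sketch of that DBpedia lookup step; mine, assuming the SPARQLWrapper package and DBpedia's public endpoint. The actual system's interface may differ.)

```python
from SPARQLWrapper import SPARQLWrapper, JSON

def dbpedia_abstract(concept):
    """Fetch English abstract text for a detected concept, e.g. 'Elephant'."""
    sparql = SPARQLWrapper("https://dbpedia.org/sparql")
    sparql.setQuery(f"""
        PREFIX dbr: <http://dbpedia.org/resource/>
        PREFIX dbo: <http://dbpedia.org/ontology/>
        SELECT ?abstract WHERE {{
          dbr:{concept} dbo:abstract ?abstract .
          FILTER (lang(?abstract) = "en")
        }}""")
    sparql.setReturnFormat(JSON)
    results = sparql.query().convert()
    return [b["abstract"]["value"] for b in results["results"]["bindings"]]

# The returned text can then be fed to the LSTM alongside the question.
```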
#ACL2018
Much better would be explicit reasoning — use a knowledge base: fact-based VQA. KB in the form of RDF tuples.
#ACL2018
KB only has a tiny % of all info you’d want. <Obama, president, USA> but not <everything, gravity, everything>
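(The RDF-triple idea in miniature; illustrative only, using rdflib and a made-up namespace.)

```python
from rdflib import Graph, Namespace

EX = Namespace("http://example.org/")
g = Graph()
g.add((EX.Obama, EX.president, EX.USA))   # <Obama, president, USA>

# Explicit facts are easy to store and query...
for row in g.query("""
    PREFIX ex: <http://example.org/>
    SELECT ?who WHERE { ?who ex:president ex:USA . }"""):
    print(row.who)   # http://example.org/Obama
# ...but there is no triple for <everything, gravity, everything>.
```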
#ACL2018
Neural Turing Machine (from DeepMind) solves some of the problems with SPARQL. Can basically train the machine to do CS 101 problems. An LSTM connected to external memory (not neural-net-type memory).
#ACL2018
We came up with a VQA machine which similarly uses the VQA training set to learn to use other algorithms—things we already know how to do in computer vision (counting, segmentation…)
#ACL2018
Great thing you get out of that process: A reason for the answer, in the form of an attention map for each word.
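(Very roughly, the shape of that idea; my own simplification, and the module and helper names here are hypothetical.)

```python
def count_objects(image, category):     # hypothetical off-the-shelf counter
    ...

def segment(image, category):           # hypothetical segmentation module
    ...

MODULES = {"count": count_objects, "segment": segment}

def vqa_machine(question, image, choose_module, parse_args):
    """choose_module / parse_args would be learned from VQA training data."""
    name = choose_module(question)      # e.g. "how many ..." -> "count"
    args = parse_args(question)         # e.g. "horses"
    answer = MODULES[name](image, args)
    return answer, name, args           # the module + args used is the "reason"
```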
#ACL2018
Thus far all of the methods have had fixed “reasoning” (SPARQL for binary chaining; attention also has limitations), and none can answer questions about anything new.
#ACL2018
Next: Learning to reason. Input is question, image, and a separate set of info it can use to answer the question. The info needed is separated from the VQA process.
#ACL2018
Then at test time, give it the info it needs to answer the question (EMB: and it applies what it’s learned about reasoning)
#ACL2018
Big advantage: Can ask questions about things it didn’t see in training AND it can give answers that it didn’t see in training. There’s no fixed ontology.
#ACL2018
Based on a meta-learning approach, with dynamic prototypes. And the best bit is that it works. (Not as well as just over-fitting to the dataset, though.)
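(A loose sketch of the no-fixed-ontology part; mine, not the talk's actual meta-learning/dynamic-prototype model. The point: score answer candidates supplied at test time instead of using a fixed classifier head.)

```python
import torch
import torch.nn.functional as F

def answer_from_support(qi_embedding, support_answers, embed_answer):
    """qi_embedding: fused question+image vector, shape (d,).
    support_answers: answer strings handed to the model at test time.
    embed_answer: maps an answer string to a (d,) prototype vector."""
    protos = torch.stack([embed_answer(a) for a in support_answers])  # (n, d)
    scores = F.cosine_similarity(protos, qi_embedding.unsqueeze(0))   # (n,)
    return support_answers[int(scores.argmax())]
# New answers just mean new prototypes: nothing is baked into the head.
```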
#ACL2018
VQA is a great excuse to take real steps towards AI, but we all need to move on to more interesting questions. Also need better metrics for VQA, which means solving a problem that somebody cares about.
#ACL2018
Example: Vision and language navigation—give a robot an instruction that relates to the world around it and have it follow those directions.
#ACL2018
Conclusion: It’s a wonderful time to be working on this stuff, and I hope that VQA becomes a way that our two fields can work more closely together.
#ACL2018
Q (Dost?): Old-school AI was focusing on toy tasks & hoping to scale. But for the real world, don’t you need too much data? Maybe small datasets that are more complex in structure would be a good solution.
#ACL2018
Coming up with the right datasets is critical. Problem in computer vision: We’re very focused on datasets that are (v. simple) discriminative tasks, rather than generative ones. Need datasets that ask systems to do higher level things.
#ACL2018
Q (John Patrick): One field where everything you talked about comes together: Surgery and medicine. Want robotics, have documentation, very significant ontology, lots of training data inside institutions. Do you have any news on where that’s going to go?
#ACL2018
A: Scanners — put a human in, run the scan through a system trained on data, and it gives labels.
#ACL2018
What you’d really like is a scanner where you can ask how many people who had this feature lived for the next 5 years and what’s the distinction between the two sets? Fantastic application area, trying to figure out how to create that dataset.
#ACL2018
A to question I missed: I think that unsupervised learning is the only hope, because there’s too much the system needs to know. But unsup. learning isn’t just jamming more images, etc. through a deep NN.
#ACL2018
Need to do more like people were doing in knowledge engineering, to build systems that can learn from the images.
#ACL2018
You can learn about tennis from images w/o that, but not in a way that scales to squash, badminton, ping-pong…
#ACL2018
Q: You say it doesn’t make sense to train to do arithmetic, but have an external symbolic system for that. Do you think it makes sense to do that for … facts in the world?
#ACL2018
A: Real value to giving the system access to that kind of knowledge, but not sure what that source will be. And there’s also a whole lot of knowledge that won’t be in that ever. What it looks like when birds fly, how to catch a ball…
#ACL2018
We all have lots of info that we learned by observation/experience that can never be written down.
#ACL2018
Your whole talk reminded me of Pierce’s (EMB: not sure name is right) work at Bell Labs — advocate of Shannon and Chomsky, chaired the ALPAC report, and really didn’t like AI. Really really didn’t like pattern matching, I think for reasons you point out.
#ACL2018
Q: Is it possible your field is ready to take those ideas up again, and stop working on what’s easy to solve?
#ACL2018
A: I agree with almost all of that. Everything good was in Bell Labs, associative reasoning is very stupid. But: It’s the best thing we’ve got.
#ACL2018
I’m not the only one in computer vision saying that deep learning has its limitations and that we need to look at a different class of problems.
#ACL2018
There’s a movement happening, though largely among those who have been around long enough to remember a time before deep learning.
#ACL2018
Q: Should representation itself be symbolic or distributed? (Cite to Jerry Fodor.) If a system is to become smarter, what kind of representation is going to be useful?
#ACL2018
A: That’s a key question. I think you’ll have to have both implicit and explicit representations. Humans have implicit representations… if you’ve forgotten how to walk, I can’t tell you how.
#ACL2018
But there’s also a whole bunch of stuff that’s represented explicitly. If you want to know that the president of the US is a particular person (EMB: taboo avoidance?), if you want to know the equations for gravity…
#ACL2018