From Deep Learning of Disentangled Representations to Higher-level Cognition

November 18, 2019 By Stanley Isaacs


>>Okay. Good afternoon everyone. And welcome to the MSR AI
distinguished lecture series. I’m delighted today
to have Yoshua Bengio, from the University of Montreal, as our second in
a long series of speakers. Yoshua’s immediately
recognizable as one of the key figures in the Deep Learning
revolution that’s taken place in
the last five years. And so for those of you who
think he’s new to the field, just jumped in this century, he’s been at it for
more than 25 years. In fact, I wrote down one of his earliest papers was
something called data driven execution of
multi-layer networks for automatic
speech recognition, which was published in 1988, I believe at AAAI.>>That’s 30 years.>>Oh, you’re right. I’m
not very good at math, sort of soft math. And many of the important
advances that we see today in speech, vision, text, and images, machine translation, are directly attributable to Yoshua’s work and
that of his students. And that work, that recognition
is evident in many ways. If you start with
something like citations, his work has been
cited last year alone, more than 33,000 times, that’s a career for
dozens of people and just a one year sample
of Yoshua’s influence. The other way in which
his work is felt, is in the long stream of algorithmic innovations
that he’s had. Most recently, they come in the form of
unsupervised learning, notably the work on
generative adversarial networks, on attention models, such as gating that’s been
used for machine translation, but really opens up
whole other doors to using a variety of
other data structures. And then perhaps the one
that’s a little more hidden, the form of influence
that’s a little more hidden, is his tremendous work in education and in
supporting the community. It comes from his textbook, which is advertised
there, a lovely volume. But, you note that it also has all of the chapters online. He’s worked tirelessly to promote new conferences
in the area, like ICLR, develop community, develop tools that are
broadly shared and to educate a whole new
generation of leaders, and I think that in
many ways may be as lasting a legacy as
all the technical contributions. And so today, what he’s
going to talk about is how we move on from all of the amazing breakthroughs
that we’ve seen along perceptual dimensions
in speech, vision, and to some extent language, to go to a much
Higher-level Cognition, which has been a pursuit that
he’s been interested in for many years since the early PDP models and
perhaps before that. So please join me
in welcoming Yoshua.>>Thank you, Susan.
You hear me fine? Yeah? Okay. So. Yeah. You see me fine. So here, we have a little bug. We’re just going to. So, thanks Susan
for the kind words. And as you said, I’d like us to move away
from what we’re doing now, which is great and is giving us amazing industrial successes to something closer
to human level AI. And in that respect, I think it’s important
to look at the kinds of mistakes and failures that
our current systems have. And I spent some time
examining that. I’m sure most of
you are aware of the adversarial examples issue
illustrated here from the work of my former student
Ian Goodfellow and his collaborators at Google Brain, showing that if you just
change an image a little bit in a very
purposeful way, you can completely
fool a classifier, but, if we go beyond
this sort of amazing thing, what can we conclude about the failures of
our current systems? And I would say,
the strongest thing I see is, that they are learning in a way that exploits
superficial clues, that help to do the task
they’re asked to do. But often, these are not the kind of clues that humans would consider to
be the most important, and often these
reveal that the models don’t really capture
the underlying explanations, the underlying nature like
the objects for example, as we understand them, by thinking about images
as coming from physics, and in the 3-D world. So, a lot of what I’ll be talking about
is how can we move forward beyond this tendency
of current models to sort of cheat by picking on surface regularities, and
why this is important. So we just put out
a couple of months ago, a paper illustrating one more
time one of these failings. So we take deep
convolutional ResNets and we change something superficial in
the data distribution, which is the spectral
distribution in Fourier space, for example by filtering the images in Fourier space.
We take images like these, and in the Fourier domain
we apply a mask like this, which just smooths things out. So in other words, we get
rid of high frequencies, and you basically
don’t see much change. But unfortunately, a network
trained on these kinds of images does very poorly
on these kinds of images. And it gets even worse if you
do this kind of filtering, where you randomly enhance or reduce some of
the spatial frequencies, and now you get these images which humans would still
recognize properly, but there are some weird colors that show up in some
places and so on. And that completely
upsets the neural nets. So if you train
on images like these, and then you test on
images like these, you get really bad errors. The error rates go, for example, from, say, 6.5 percent to
34 percent or something. I’m not going to
explain all of these, but basically you
train on one of the data sets that has one way or the other of changing the Fourier
characteristics, and then you test on the others. Humans would
still see the objects, which are the things that
matter, very clearly, but those networks
would get fooled. You can, of course, considerably reduce those effects by training the network on all of
these types of data, just like you can reduce
the effect of adversarial examples by training with
adversarial examples. But of course, then there would be
something else that shows up, because the network
probably still didn’t really capture the objectness that
we have in mind. So why is this important? As we are deploying Machine
Learning in the real world, what happens is that, the kind of data on which
those systems will be used, is almost for sure are
going to be statistically different from the kind of
data on which it was trained. And as an example of this consider self
driving cars or vehicles, for which would like
them to behave well in these rare but dangerous states. And unfortunately, with
the kind of models we have now, train which provides learning, if these examples are rare, they’re probably not going
to be learned very well. And [inaudible] to
these special kinds of domains like near
accident situations like in the picture,
might be difficult. So, humans are actually pretty good at dealing with such cases. I only had one accident
in my life, right. And how did they do
before the accident? And then how do
they do after that? We’d only use
single-example learning. We don’t need to die a thousand deaths to know how to prevent dying while we drive. What’s the catch here? My intuition about
this is very simple. We have a mental model that captures
the explanatory factors of our world to some extent. It’s not perfect. And we can generalize to new configurations of
the existing factors. We already know about trucks and cars and the physics
of objects and driving. And we already know about social behavior and all kinds
of general knowledge that allow us to intuitively know quickly what’s
the right thing to do, in many new situations completely different
from those we’ve seen. Because they involve concepts
that we already know, they just combine
in very new ways. Instead, our current
machine learning would just tend to drop dead on these statistically very
different situations. I think we need to make serious progress on the ability of our models to discover and understand
the underlying explanation, the underlying causal
relationships so that they can make valid predictions
in scenarios, in situations that are very, very far from anything
that’s been seen. And that can be important also when you consider
machines that plan. Because one of the reasons
we’re planning is to avoid those situations
like dying in an accident. But if we don’t have
a good way to project ourselves into these
situations that are very, very different from
the ones we typically see, then we won’t be
able to do that. As I’ve been saying for
more than a decade now, it’s all about abstraction
and deep learning as we conceived it
more than a decade ago, really was about learning
multiple levels of abstraction. And one of the ideas that we proposed pretty
early is that it would be nice if we had algorithms that can separate out
these explanatory factors. The phrase we use is “disentangle the factors
of variation.” At the last NIPS there was
a workshop dedicated to this question and it’s still not clear
exactly what it means. But different people have different ways of trying
to formalize this idea, and I think we should
continue to try to do that. This notion of
disentangling is related, but different from
the notion of invariance, which has been a very important notion
in computer vision, speech recognition and so on, where we’d like to
build detectors and features that are invariant to the things we
don’t care about. But sensitive to the things
we do care about. But if we’re trying to
explain the world around us, if we’re trying
to build machines that explain the world around them that understand
their environment, we should be prudent about what aspects of the world we would like our systems
to be invariant about. Maybe we want to
capture everything. And the most important aspect isn’t what we get rid of or not, but rather how we can separate the different factors
from each other. If you’re doing unsupervised
learning say with speech, you’d like to have features that capture the phonemes and you’d like to have features
that capture the speaker. And even if at the end
of the day you get rid of the speaker ID and you only
care about what’s being said, it works if later
you decide that, well, what you care
about is the speaker ID then, that also works. The other thing that
happens with disentangling is it’s going to help us to deal with
the curse of dimensionality. It is a good rule of thumb to understand what
a good representation is. The idea is that when we transform
the data into that space, machine learning becomes easier. So, in particular, the kind of complex dependencies that
we see in, say, the pixel space will
become easy to model, maybe with linear models or factorized models, in that space. And there are
many instances of this. One of the early ideas
which has given rise to a number of unsupervised
learning methods in deep learning is that
we are going to learn a two-way map, or probabilistic transformations,
between the data space, pixel space or whatever and
some representations space. So, we have encoders
and decoders and by now, everybody uses those terms. And so, what do we want from our encoder and what
do we want from the decoder. Okay, so one is sort of
the inverse of the other. But one of the early insights from around 2010 or something, is that what the encoder does is take
the data distribution, which is complicated, here is
represented as a spaghetti. So, the set of points
corresponding say to images is along this curve. So we have this spaghetti
and, of course, we don’t know where
the spaghetti is, but what we’d like to
do is to flatten the spaghetti into something that is going to
be easy to model. Once it’s flat, then it’s like maybe a Gaussian
or something. And predicting things in
that space becomes very easy. Also, if you’re able to do that, so you know this
two-way transformation, generating becomes
easy too because if the manifold of points is
now almost a straight line, then generating points in
a straight line is very easy. And then, I apply the
inverse mapping, the decoder, and I get points in the image space that looks like images. So, that’s sort of
one geometric interpretation. It also comes with this idea that in the high-level space, we have what’s called
marginal independence. In other words, I can sample each
of the factors, each dimension, independently; they don’t depend on each other, so it’s very easy to model
the distribution in that space, whereas modeling directly in
the data space is kind of hard. So, that’s one view
that’s kind of interesting that
came up pretty early. And some of the algorithms
tried to do that explicitly. One of the things we
did also pretty early is think about how that
affects sampling methods, because sampling in the flattened space
is easy. But here, I’m showing
a sort of toy experiment to illustrate what happens when you do this flattening with a very simple algorithm
for representation learning, which in those days was stacks of
denoising autoencoders. So, what we’re going
to be doing is, again, we’re going to take
the spaghetti here, I just represented it
as a curved manifold. And it’s going to
be transformed into the hidden unit space
into something flat. And so, one interesting question is how do we know
that it’s flat? So, here’s a very
simple way that you can check that
a manifold of points is flat. You take two points
on the manifold. So, two images and you
take their average. So, you interpolate
between them. And if the thing in between
looks like an image, then it means that
the manifold is flat. So, if you take
these two images and I take the average
and the thing in between is also on the manifold, then the manifold is flat. If it’s true for
every linear combination, then you know, everything
sits on that convex set. But if the manifold is curved, then the things in between
will not be on the manifold. They will not look like images. So, we can do that experiment. So, we can take
like this image of a nine and image of a three. And if we work in pixel space and we just do averages
linear combinations, we get these images
which obviously, are just pasting
a nine and three together but don’t look
like natural digits. However, if we take the
same nine and we project it in the representation space of some unsupervised learner. And then, we do the averaging
in that H-space. And then, we take
these linear combinations and we project them
back using the decoder. We can look at
the corresponding images. And what we see is that the corresponding images
look like natural digits, all the way along
the interpolation line. And the two lines
here correspond to interpolating at two layers, the first layer
and the second layer. And there’s more than that, which is what happens as we move, interpolating from
this image to this image. So now, I imagine
this curved manifold, just for illustration
just like in the picture. What we see is that
it’s going to be nine, nine, nine, nine,
nine, nine, nine, nine and suddenly there’s like a very quick transition
where it becomes a three. And it’s all threes
up to this three. And at the junction
there is something that’s in between
a nine and a three, but still preserves either
one’s or the other’s identity. So, all of this
gives us some nice intuitions about the geometry behind what we’d like to have in these
higher level spaces. And so, there’s been
a lot of work in applying these ideas to
generative models. As Susan mentioned, one of
the really hot family of algorithms are these GANs that we started off
three years ago. And they are pretty amazing, I’m not going to describe them. But I think we are still far off the mark. So, what’s missing? As I’ve been saying, if you analyze the errors
made by those systems, both for recognition
and for generation, you can see that
they clearly don’t understand the world
in the way that we do. And they are often missing the point,
the crucial abstractions. So, what’s needed?
More abstract representations. One aspect of this I think is that our learning theories
also have to be modified. Current machine
learning theory is anchored in the i.i.d. assumption, assuming that
the test distribution is going to be the same as
the training distribution. And so we can get very
confident about our models, but when we go out of that assumption,
things can break down. Somehow, humans
are able to do well as I said earlier,
so what’s missing? So there’s this idea of
learning Disentangled Representations, and
one thing I tried to do in a 2013 review paper on
representation learning is present the idea that
we’re not going to get very, very good machine learning without introducing some priors, some assumptions
about the world. Deep learning is starting with an assumption that
this composition of different layers is appropriate for the data we want to model. That there is some
compositionality that can be captured with these
families of functions. But there are many other
assumptions that are as broad, as generic that some
of our models use, and I think we should continue
adding more elements to this list of priors that we put in our models in order to improve generalization, but ideally try to do it in
a way that remains fairly general and not
too task specific. So, these include things like exploiting the fact that
there are different scales, both spatially and temporally, and we already have
models that do that, but I think there’s a lot
more to exploit there. I already mentioned Marginal Independence, in
other words the idea that, if you take the data and you transform it into
the right space, then all of the factors
become independent. Something I already mentioned
earlier is this idea that, when we move the data
in the abstract space, the dependencies
between the factors, the variables in that space
become really simple. And later I’ll talk to you
about the consciousness prior which is one way
to implement this. And then, another idea I
mentioned is that of causality. So, I think the whole area of causal machine learning
is still very young and I wish there would be
more people exploring this because I believe this is something that humans
take advantage of, and it’s one of the ingredients
that allows us to generalize to these very
different scenarios. And I’ll mention
some work we started, what I call controllable factors
in that direction. So if we look at how humans learn in
a very autonomous way, like look at all the knowledge that babies have
by two years old, like intuitive
physics, one thing that is interesting to consider is how much they interact with their environment in order
to acquire that knowledge. And I now think that for upcoming progress towards
human level intelligence, machine learning needs to
move more towards this notion that acting in the world to acquire information is a
very, very important tool. Humans do it and I think this is something that we need
to pay more attention to. And of course, the basic tools from that come from
reinforcement learning, but we may have to
reinvent a bit of reinforcement learning to move
further in that direction. So, one particular aspect that we explored is this idea
of controllability. I’m looking for a prop here. Let’s see. So, here’s a prop. I just made up a policy here to control
this sheet of paper. I can move it around in
space, I can fold it. I can do all kinds
of fun things with it and I just made it up. I’ve made up these policies to control different aspects
of this object. So, what this
illustrates is that, our brain can come up
with control policies that influence specific
aspects of the world, like the position of
this object and different, complicated geometrical
attributes here about its folding. And clearly my brain can
represent that information so, I could report it to you,
I can tell you where it is, I can have a mental model of it, I can plan based on it. And so, we have these two things
happening in parallel. We have the representation and the policies and they’re
matched in some way. So, I have policies to affect
one aspect of the world. I can decide to move just this thing and I can represent
that thing in my mind. So I’ll tell you
more about this. But, clearly, in order to discover these aspects of the world that are controllable, which are not all the aspects of the world but I
think they give us very strong clues
about how the world works, you need to be acting. You need to be
trying things out. And I think as you do that, there emerge almost naturally for the learner
very important notions from cognitive science, like the notion of
objects and agents. Well, the notion
of object emerges because the objects are the sort of aggregates of attributes that I’m
controlling together. I’m controlling x, y and z
and other attributes of this thing and they are all sort of spatially
coherent as well. And the sorts of
things that can be controlled are
attributes of objects. And agents are the entities
that can do that. I’m an agent but I can
also see you do some things and imagine how I would be doing it if I were you,
or something like this. And we do that all the time. So, the notion of
agents is very, very convenient to
model the world once we incorporate of course actions and trying to control
things in the world. Now, there’s
something particular which sort of blew
my mind a little bit at the beginning about
the kind of knowledge that an agent acquires by
interacting in the world, which is that it’s not universal knowledge, it’s
subjective knowledge. So, the policy I have
depends on my body, and that’s related to what
people call affordances. There are things that I can do that maybe a baby can’t do. And so we have a different
vision of the world, a different understanding
of what can be done. And so, this is different from the maybe simple minded view
of AI where there is sort of a universal truth
that all of us see and all of us can control. Our bodies, and what
we are able to do, kind of condition
our understanding of the world. And it’s kind of unfortunate but we have to deal with that. So, last year, we
started this project of trying to put in equations and algorithms
and experiments, this idea of
controllable factors. The idea that one way to
discover good representations, or at least some of the factors
in a good representation, is that we have clues about the existence
of these factors by the fact that we can control those factors without changing too many other things
in the world. So, we designed a term that would be added in a
training objective of a learner, in which we say, “Let’s look for a policy that is going to choose some actions
given some state, such that after we apply
the policy, the state changes. And, some feature, K of the world changes
as much as possible.” So, the Kth policy is going
to control the Kth factor, in the simplest way
to think about this. So I’m going to have
policy number K control unit number K in my network, and it
needs to change as much as possible while the
other units are not changing, or changing as
little as possible. So, that’s the
starting point of this. And we can add that criterion to whatever other criterion we have; here, it’s a simple
reconstruction error. So, we’ve been able to make
this kind of idea work on very simple worlds for now, and we’re still facing some
optimization difficulties we don’t completely understand. But in little toy
problems like this, we learn both an encoder and a decoder that map
from pixel space. So this is like going from
image space into, here, a very simple 2D space
corresponding to, I mean, it discovers that
what matters in this world is
the position of the ball, and the reason is
that the only actions that the agent can do here
is move the ball around. And so, it learns that what matters is the position
of the ball. And so, the encoder essentially just learns to take
images and spit out the position of the ball in some funny space and
the decoder can map back. So if I play with the
coordinates here and then decode, I get a new image where the ball
is in a different place. And it turns out
we can interpret the changes here directly
in terms of position, and it has also separated the changes happening in one direction versus
the other direction. So that’s kind of nice. You can also do things like take two images in that world, of course, encode them, take the difference
between their representations and that will tell you basically what actions are needed in
order to go from one to the other because they correspond to how much of
each of the factors you should apply in order to
go from one to the other. So that was like one-step
actions, one-step policies. We’ve been extending this
to multi-step policies and also generalizing from a fixed, enumerable set of factors to a
combinatorial set of factors. Because if the kinds of factors I’m talking
about are things like positions of objects, the problem is that
how many objects are there here? How many objects are
there in the world? Do I need a different neuron for each possible object
that you could ever see? That doesn’t
make sense, right? So instead of enumerating
all the factors, what you’d like to do is
to have a name for factors. And the names is going
to becoming natural, like is going to be a vector, an embedding. And so, yeah. So, we’re playing with
these kinds of ideas right now, and playing the
same kinds of games. But now, the names are vectors, and what we’re showing here are the embeddings of the names of factors that
the system discovers. And here it’s
the same kind of game as before, but now there are just
more positions and it discovers that the important factors are different positions
corresponding to a grid space because the world here happens
to have a grid structure. Anyways Okay. Yeah, I talked about this
already and time is flying. So, let me move on to a second
type of exploration which touches on I think something I’d like to revisit in our
training objectives for unsupervised learning and it’s that they are all in pixel space rather than
something like an abstract space. So if you look, of
course at likelihood, if you look at any kind
of reconstruction error, if you look at even things
like GAN training objectives, they’re all focused on what is happening in what
I call pixel space, but that would be like
acoustic space, or anything, video space, character space, whatever the data space is. So why is that a problem? If we were able
to map the data to this better representation as I’ve been talking about then, as I said, modeling
in that space, planning in that space, reasoning in that space would
be so much more convenient. But actually, it’s
a chicken-and-egg thing. We’re not going to be given like magically
those representations, so we’d like to have an
objective function that really focuses on the abstract space. And to see a little bit what can go wrong with
our current methods, let me share a little bit
of a couple of years of experience with trying to do speech synthesis
with Neural Nets. We made a lot of progress
now to the point where now Google is putting those kinds of Neural Nets
into their speech synthesizer. But, when we started
this research, we were very
ambitious and we said, “Okay, let’s do like a pure, hardcore improvised learning, take hundred hours of speech, and train a huge complicated
recurrent network to model what speech is sequences of 16 thousand samples per
second for multiple seconds.” And what happens with
our current best models, if they’re trained in
a purely improvised way. In other words, you
don’t give them words and phonemes is that they produce speech that sound like someone speaking in other
European language if he’s train on English. But it’s something
like there’s no words, this is like gibberish. So these models capture the texture of speech to
capture how speech sounds like, but they failed completely to capture the longer term structure, the
linguistic structure. Now, there’s
a very easy fix to this, and this is what you have in
speech synthesis systems, in all of the systems
that people use these days, which
is that you use supervised learning. You separately train
a model that goes from phonemes to acoustics, and a model that just captures the statistics of sequences of
phonemes or, in other words, a language model. And that works perfectly well. Now, we can generate
unconditionally by just first sampling a sequence of words or a sequence
of phonemes, and then conditioning
on that generate sounds. So, besides the fact that we have a fix by using
that knowledge, I think what this
teaches us is that our unsupervised learning
mechanisms have not been able to discover something that
should be incredibly salient, which is the presence
of phonemes as the characteristics
of speech. If you do like k-means
clustering on acoustic signals, you discover immediately
phonemes, right? Or, at least broadly speaking, it’s very salient
statistical structure. How is it that these models
haven’t been able to discover them and then see that there’s like this really
powerful part of the signal which is explained by the
dependencies between phonemes? And the reason is, I think, simply that that part
of the signal occupies very few bits in
the total number of bits that is in
the signal, right? So, the raw signal is 16 thousand real
numbers per second. How many phonemes per
second do you get? Well, I don’t know, 10, right? Or maybe 16. So there’s a factor of
a thousand in terms of how many bits of information are carried by the word-level, phoneme-level information versus the acoustic-level information. And the log-likelihood, or any other criterion we use, as I said earlier, they’re
focusing on these details. And so, yeah. As our models get better and
better every year, they approach something at the higher level, but
it’s very painful. Maybe we need to change
how we train these things. So now, let me tell you about a direction of
research that we’ve started which
attempts to fix this, and it’s very much inspired
by cognitive psychology, and very old work actually about attention
and consciousness. So one part of the idea is, we want to design
objective functions, where say, we could
have encoders, but we don’t need decoders. Where the objective function is going to be defined purely in the abstract space.
That’s part of it. But there’s something
else here which is, I think fairly new and
connects with classical work in the eye and knowledge
representation and symbolically. So, think about your thoughts, think about your conscious
thoughts more precisely. What happens is, at
any particular moment, there is something
that comes to your mind and it’s very low
dimensional, that’s my claim. You can convert that
into sounds, a sentence. Not everything is like this. For example, I can do visual
imagery and that’s hard to verbalize but many other things
I can verbalize. But even if it’s visual, it’s very low dimensional. It concerns very few
aspects of the world. So, why do we have
this thing which seems to focus on so few aspects of
reality at a time? I used to think short-term
memory is like crazy, why is it that we can only remember
seven things at a time? Our brain is so big, why
would we have this limitation? It sounds like we are underusing
our computational capacity. Well, so I’m claiming that
this may actually be a prior that we use and that potentially machine
learning could use, in order to constrain
representations. And the prior is that, there are, the assumption
about the world is that, there are many important things
that can be said about the world which can be
expressed in one sentence, which can be expressed by
a low dimensional statement, which refers to
just a few variables. And often they
are discrete, those we express in language;
sometimes they’re not. We can, like, draw or somehow use that to plan. But they are very,
very low dimensional. And it’s not obvious a priori that things about the world could be said that are true and low dimensional. So for example, again,
I’m using a prop. If I try to predict
the future here, there are many
aspects of it that are hard for me to predict. Like, where is it
going to land exactly? It’s very, very hard to predict, right? It’s a game. But, I could predict that
it’s going to be on the floor. It’s one bit of information. And I can predict
that with very, very high certainty
and a lot of what we talk about are
these kinds of statements. Like, if I drop the object, it’s going to end on
the floor, in this context. So, this is the assumption
that we’re trying to encapsulate in
machine learning terms with this consciousness
prior idea. So, one part of it is that
we need a mechanism that’s going to select
a few relevant variables from all of the things that we could have access to
with our consciousness. So, everything that
we can think we see, everything that
we can talk about, are the things that
we have access to from low level perception to
very abstract things, explaining what we’re seeing. We can come to
our consciousness. So, we need an
attention mechanism, in order to just pick a few things that are going
to go to our attention, go to our consciousness and that’s the attention mechanism. So, we’re working
with soft attention, soft cotton based attention, which is precisely the kind of attention mechanism
that we introduced for machine translation
a few years ago as Susan was talking about and has
been amazingly successful, not just from
Machine Translation, but for all kinds
of applications now. And I think we could use
the same kind of mechanisms. And so, I’ve been using
the word consciousness, but the word consciousness
is loaded with all kinds of meanings
and we have to be careful here
about what we mean. So different psychologists or philosophers are using terms
like access consciousness, or Dehaene calls it
global availability, but basically it’s
just the aspect of consciousness that concerns the selection of elements on which we are focusing in
order to make a prediction, in order to act and that we
can usually report verbally. So, how does the content-based attention work in
machine translation? We use that to select
representations at different positions
in a sequence, an input sequence
that needs to be translated, so that when we decide on the next word
in an RNN that predicts the next word in a sequence for
the translated sentence, we can focus on
one or a few words, which are likely to contain the most relevant information for the next word to produce. And we use an
attention mechanism, you could think of it
like a little neural net, that takes two things as input. It takes one candidate location, so that’s where we want to focus, where there are some features
that are being computed and it takes
the current state, which is sort of the context, in which we are going
to take the decision of where to focus. It outputs a score that says, how much we want to focus
our attention at this location. And we’re going to
compute such a score for all the possible locations. And then you can generalize this idea in all kinds
of ways but this is the heart of what we’re doing with
content-based attention. It’s been used to greatly reduce the gap between
classical machine translation, based on n-grams, and human-quality translation, at least according to
human evaluation, a statistical method
to evaluate quality. So going back to
the consciousness prior. Whereas in traditional
machine learning and representation learning, we think of this
top-level representation that captures all of the factors of interest, here we’re going to have
two levels of representation. We’re going to have this very high-dimensional
unconscious state, which contains all of
the abstract representations, but it contains everything and not just the ones that
are coming to our mind. Not just the ones that are
coming to our attention. The ones that we
are focusing on at a particular time
will be somehow stored in this low-dimensional representation,
the conscious state. And just like in
the previous slide, we are going to have
an attention mechanism, which decides what to pick next to go into
the conscious state. And you can think of it like, it kind of chooses something out of
this soup of information. And presumably, they are
all going to be recurrent, because everything
is happening in time. And an important
element of this is, the reason we’re doing all this, is to put pressure on the mapping between
input and representations, unconscious
representation, which is like the encoder I had before. So that the encoder would
learn representations that have the property that if I pick
just a few elements of it, I can make a true statement or very highly probable statement
about the world, maybe a highly
probable prediction. So yeah, the aim, the objective here is the same
as I had at the beginning. So we are only
using all of this to impose this sort
of pressure, constraint on the
representation learner, so that it learns
representations that have this property that we can, this defines a language
in which we can say things compactly
that are true. And each of those things
we’re saying are like just a few
variables at a time. Another interesting
thing that happens here and that connects to classical AI is that we are now going to have
to represent here, not just values of variables
but names of variables. This is something unusual
for neural nets, and you start seeing things like this in models like
the Neural Turing Machine, where we have this memory
with keys and values. So, the reason we need to
have names of things is that, for example, if I
make a prediction about a future variable, then what I’m trying to do, what I have to store here is the name of the variable on
which I’m making a prediction, separate from its value
somehow, because later, I’d like to be able to say, “Oh, I had made a prediction
about this variable and here’s the observed value and here was the
predicted value and now, I’m going to have to
update my parameters.” And so, if I were to mush the names and the values, I
wouldn’t be able to do that. I need to be able to refer
to things indirectly. We can do that with Neural Nets, but we just have to design
them with that in mind. So one little scenario that one can look at is using the conscious state to make a prediction about
a specific variable, using only of course a few variables that are
part of the conscious state. And then we could just use a kind of
log-likelihood, which tells us how
good our prediction is, weighted by which variable
we’re making a prediction on, assuming we only
predict one thing at a time in this very
simple scenario. But just making this weighted prediction
isn’t going to be enough, otherwise, the system would just learn to extract variables
that are easy to predict, but are kind of
meaningless or useless. And there are
many such variables. So, we are currently exploring different training
objectives for this. But one important idea is, we would like the representations
to have high entropy, in other words to
preserve a lot of information about the data. So that’s the idea of maximizing entropy of the representation. There’s also a very old idea which we’re going to be reusing, which is the idea of maximizing mutual information between past and future representations. So Sue Becker, who did her Ph.D. with Geoff Hinton
at about the same time as me, used these kinds of criteria to extract features in
the spatial domain that had the property that there was high mutual
information between the values of the features
at nearby locations. And I think this is something
that’s relevant here, for what we’re trying to do. We’re trying to define a training objective for
this consciousness thing that makes past
conscious states highly predictive of
future conscious states. But not just predictive, but also high
mutual information. So in other words, that they also capture a lot of
information together. Okay. Another
potential source of training objective
which I would like to minimize as much as possible
but maybe we have no choice, is not just a kind
of predictability, but a sort of usefulness
for reinforcement learning. And so, what do we
use our thoughts for? Well, it’s very clear that our thoughts have a very strong
influence on our actions. So our thoughts are used
to condition our actions and also plan our actions. So very often we have mental imagery and right
after we do something, because it helps us figure out whether it’s
a good thing to do or not. Right? So we could use this consciousness
mechanism just as conditioning
information for a policy, and allow this not just to
make a single prediction, but, imagine a future with unfolding of
many time steps in the future. A few more things
I want to close on. I mentioned that we don’t really want to have like a different neuron
for each factor, we want to use this notion of a distributed representation
for each factor. So you can think
of each factor as like a concept that we
can use language for, and so they’re going
to have an embedding. And there’s been actually some earlier work done
by Mike Mozer in the 90s about how one could represent discrete concepts
in a neural net, using something
like a Hopfield net. So if you imagine that you
have a group of neurons that collaborate with each other to move together towards a sort of stable
fixed point, a stable attractor, that sort of clean-up mechanism is a sort of discretization that makes a lot of sense, based on
psychological experiments, to explain our ability to manipulate discrete concepts in our mind, and take decisions. Right. There are of course connections between
this work and classical AI. You can think of something like a statement
about predicting a variable given conditioning variables as just a connectionist way of talking about
a classical AI rule. Right? It doesn’t have to be
a rule, it could be a fact, it could be something about the current scene
that we know is true, or has a high
probability that is also represented in
the conscious thought. I think that this notion
of having to refer to variables by the representation
of the variables itself, the symbol has a name, is something that comes handy if you want to
implement some kind of recursive compositional
computation which is very common in classical AI. And of course, this also
makes connection to language. I’m hoping that it’s
going to help to associate the perceptual world
with natural language, to ground the natural language,
and vice versa. What I’m hoping is that when learners that have this
kind of consciousness prior, are learning with language and perception that
the natural language they’re getting from
humans is going to help the learners by giving them hints about
the high level abstractions. So I think when we
talk to our children, we’re giving them
hints about what are relevant abstractions that are useful in the world around them, and probably accelerate
their learning in this way. Okay. I see that
it’s already 4:00. Let me have a last slide here about something
very different. I think it’s time
the machine learning community, research community
starts thinking beyond developing
the next gadget. Of course those gadgets
are very profitable, but there are other things
that matter in the world. And it’d be nice if grad
students and researchers around the world instead of
working on, say, ImageNet could work on datasets where, if we could have good
solutions for those datasets, we could help millions of
people with medical problems, or education problems,
or environment problems. And I think that organizations like
the Partnership on AI, to which Microsoft belongs
as a founding partner, is a kind of organization
that would fit very well with a mandate of
making such things happen. But more than
that, it’s not just doing machine learning
on these kinds of AI for good applications, it’s also coordinating
this kind of work across many companies and
labs around the world. Because right now,
different companies are doing things more or
less in that direction, independently each
hoping to show off, “Hey look, we are good.” But I think we would
be all very much more productive if we
coordinated those efforts, if we prioritized what
could help people the most, if we talked, if we invested in talking to the people
in those say poor countries who could
benefit most from this, to understand better what
could have the greatest impact. And if we didn’t
just do that science. But also made sure that the grad students in those poor countries also learned the machine learning
that goes with it, and maybe come
for internships in our big labs and go back with the State of
the art knowledge, to their country to bring
that future wealth there, instead of us
trying to save them. So. Yeah. I think
there’s a lot that we can do that’s
not too complicated, that would be not just
the right thing to do, but also would in the long
run help all of us. Thank you.>>We have time for questions.>>What do you think about this innateness debate? So, human brains solved the
problem over billions of years.>>Yes.>>Of evolution and presumably, there are a bunch
of primitive constructs that you can use as the base ingredients in
your unconscious state, right? So the question is, do you envision that we
should try to induce them again from data or
what’s the other way?>>There are two things
we’re combining already. One is, we are using
our ingenuity and math and insights to build
in the kind of priors, that evolution has optimized. And we’re starting to use the same kinds of mechanisms
as evolution has used, like meta-learning
and all that stuff. So actually, my brother and I started doing
this kind of thing in the early 90s and it just
didn’t work because, I mean, it worked on
such a tiny scale that we abandoned it, because it was too
computationally expensive. But now, we’re starting to have the computational power
to do these things. But ultimately, we also want to understand what are
the underlying principles. So I think we can use computing power to do some
of that work of discovery, but to also do it in
a way that we understand the principles that give rise
to good Machine Learning.>>So maybe let me follow up with sort
of a concrete scenario. For example, like biomedicine. So there is a lot
of this sort of intuitive or from experience, I saw some of
this kind of little bit like hand wavy reasoning
in medical decision making. And it is actually a very rich soul and body
of this kind of knowledge, like all the entities,
relations and so forth. So but, in the future obviously, we will also have a lot
of big medical data like, all the sensors and so forth. The question though is, do you envision
like we actually, we learn all those kind
of objective phenotyping, ignore all the prior knowledge. Or is there some kind
of a middle ground to?>>So there’s always
a middle ground.>>Okay.>>But the equilibrium point is shifting towards more data and less human
engineered knowledge, as we’re collecting more data. So there’s always a trade off and it depends on the quality
of the knowledge we have. So some things are stronger and we should always
use them and some things are so-so and maybe data should
be able to override this. If you look at the last 20 years
of NIPS proceedings, it’s all about specific
knowledge that people are putting in their algorithms, right? Yes?>>My question is about, what makes something part
of consciousness prior? Is this just.>>You mean, what things come
into our consciousness or?>>No. Like, there is these facts in the world
or the observations.>>Right. Right.>>Is it just another layer of abstraction you are
thinking about to make the computation easier or will it have some
special properties like, being shared across
agents learning, across different domains
kind of this belonging to common sense, belonging to what
agents share in terms of acting in the world like this cognitive special
properties in your mind, shared cross
applications and agent?>>So I think first of all, there is a lot of
knowledge that we have and that we are
consciously aware of, but that is still
hard to communicate, but a lot of what comes
to our consciousness is precisely the stuff that
we tend to communicate. And it’s interesting to ask what about animals
which don’t have language. Do they have
something equivalent? I think some primitive
versions of that, yes. And we have probably a
stronger type of consciousness, because it’s enhanced through
our learning to communicate with others using that. But a lot of the common
sense knowledge isn’t something we are even
aware of consciously. I think that’s why
the classical AI program failed, because we were
trying to build like the roof of the house and
we didn’t have the scaffold. And the scaffold is perception and the low level
understanding of the world. So this is sort of near
the top of the house, that I’m talking about and
it’s not built yet either. And it’s connected
to reasoning and symbolic stuff
and all that. Yes.>>So I’m interested
in the one example, the linearity invention in
the first half of the talk. Like in the future space of
[inaudible] We should have [inaudible]. So does that mean like somehow maybe imposing linearity towards the end of the neural
network can help us to resolve some of
robustness issues about.>>Yes. So that’s precisely what the consciousness
prior is doing. I mean, I didn’t
talk about linearity, but if you think about, what this equation
means, it says, that we take the data, we bring into this sort
of consciousness level, this representation
level and in that space, I’m going to have a predictor
that’s very sparse, because I just take
a few variables. And it’s also very simple. So the neural net that
predicts this guy and this guy is sitting on top here and hopefully, is
very, very simple. It’s just a simple
linear logic or one MLP. Very, very simple. Doesn’t need to be linear, it can
be whatever you want. But the point is, by constraining
the capacity at this level, we are forcing
the representation to somehow come up with those
factors, those representation, those features that have
this property that we can now do very simple operations
in order to predict stuff from other stuff. Yes.>>So kind of relates
to this question, earlier you had a picture
a true relation to images can you go back to that. So you show that
in the pixel space, interpolating images
doesn’t give you anything that make sense, but>>Well it does make sense,
but it’s not what we want.>>So you’re trying to say we should operate
in the abstract space. But another thing
that will give you the same interpolation
is for example, when you interpolate Wasserstein distance that will also give you something like on the top sort of something on the bottom.
So I’m just saying->>What do you mean by interpolating
the Wasserstein distance.>>There is a well-defined meaning of
very center of facts.>>So you mean define a metric that has the Wasserstein
distance locally.>>So is well defined notion
to interpolate in->>Wait, the
Wasserstein distance is a distance between
distributions. So it doesn’t make sense
what you’re saying.>>But in this case
you can think of the picture as
a histogram of pixels.>>No.>>No, it’s well defined,
it’s well defined. The purpose of this tutorial right or the last leaves
on optimal transport.>>Yes.>>So I guess what I am
saying is that the structure of this- or the subtraction
structure could be, in fact, a different structure from just a abstract space. And with the Euclidean
structure on it, it could be, actually, a more complicated structure.>>Oh, you’re saying
we don’t have to do linear interpolation at the top. We could do something different.>>Yeah.>>Sure.>>It is not
actually worth it to do the linear interpolation.>>Yeah. I don’t know what
the right thing but hopefully, it’s something simple that
can be learned quickly. That is the most
important thing, right? It doesn’t need many parameters. That’s, I think
what, for example, allows us to do
one-shot learning once you have represented
things in this abstract space. Because relations are
simple and sparse, then, a single example or
a few examples are enough to sort of deduce relationships that you
didn’t know before.>>Yes. I agree.>>Yeah. That is the main
characteristic. Patrice?>>I want to challenge a rule
that what you said about->>I was sure you would.>>When you just said about the limitation
of deep learning. But before I do that, I want to challenge
the first thing that you said that humans are very good at learning in
a unsupervised way. So, let us take
the concept of complexity. Did you learn complexity from the physical world,
from your interaction?>>No.>>Where did you learn it?>>Not so many years ago.>>But, you learned
it in school, right?>>Yeah.
>>And I think that’s important because we learn
things gradually.>>Right.>>And I am going to claim
that the problem with->>I agree.>>Deep learning is that
the hypothesis space is fixed. And, the way humans learn is that they grow the
hypothesis space gradually.>>I agree.>>And they grow it with help.>>I don’t see why you’re saying that deep learning
has that limitation. I wrote a paper in 2009
called Curriculum Learning.>>Yes.>>That should be and I think
you should go back to that.>>No. This is all in line with that. There is no contradiction
with what you are saying.>>Well. So the->>So, one of the
early ideas was that we gradually build new concepts. Thanks to the concepts
we have already learned.>>Yes.>>And I do not see- So, here, you think of it like I
am showing a snapshot but in the evolution
of the learner, presumably this space would
get richer and more abstract.>>Right. But I think this gradual growth of the hypothesis space
is where we need to focus and I think this is a very interesting thing
to model but it requires->>But that’s your focus. You are going to
solve that problem for us. Yes, I agree.>>I like to, just on that note, you come to the idea of
theory revision, right? As opposed to gradually
growing things, you talked about children
having causal theories. They often have causal
theories that are wrong.>>Yes.>>And you have
another example that completely changes
the theory, right? It is not like gradual changes
not repeated exposure. How does that fit
into this notion of just massive revisions of what the underlying
representation looks like?>>Well, I liked that question because I
don’t have the answer.>>Yeah, you know that one?>>Right. So, the good news. So, here, I mean knowledge is at
different locations here, but in particular, there’s
the mapping to representations. Then, there’s this very
compact representation of how the things are
related in that space and sort of corresponding to the set of rules
if you want, right? And the good news is that, the set of rules is
very easy to change. That’s where you can
do one-shot learning. This is where you can
completely change your view on something without having
to rewire everything. And that’s connected to classical AI, with
the idea that, “Oh, I can keep all my
rules except I change this one, and now my conclusion is
going to be completely different where that
rule is relevant.” So, by factorizing
representation from facts and rules to
make things a little bit of, you know, grow sketch, I think makes it much easier to do what
you are talking about. Whereas, if it is sort
of hidden in the mush of one big neural net
that does everything, it is kind of difficult now to change your mind about
something specific. And the representation
is never wrong. It’s just- It might be
insufficient to generalize, but it’s never wrong. It’s
just a representation. The actual facts
are represented in this top little set of rules if you want so I
think it facilitates here. Thanks for asking. I
hadn’t thought about it.>>If you want to make a drastic change
with a few example, you need very low capacity.
If you have high capacity->>That’s right. That’s
what I said. The top thing has very low capacity.
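As a rough illustration of why a sparse, low-capacity top-level predictor can be revised from a handful of examples, here is a sketch under stated assumptions: the representation is a 50-dimensional vector, the relevant "rule" is linear and involves only two of its variables, and the indices of those variables are already known (in the talk, choosing them is the job of the attention mechanism). All names and numbers are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def design(H, selected):
    """Keep only the selected variables of the representation, plus a bias column."""
    return np.column_stack([H[:, selected], np.ones(len(H))])

def fit_sparse_linear(H, y, selected):
    """Least-squares fit of y from the few selected variables only."""
    coef, *_ = np.linalg.lstsq(design(H, selected), y, rcond=None)
    return coef

def predict_sparse_linear(H, coef, selected):
    return design(H, selected) @ coef

def true_rule(H):
    """Assumed ground-truth relation: depends on variables 3 and 17 only."""
    return 2.0 * H[:, 3] - 1.0 * H[:, 17] + 0.5

selected = [3, 17]                   # the few variables the rule refers to
H_train = rng.normal(size=(5, 50))   # only five examples in a 50-dim space
coef = fit_sparse_linear(H_train, true_rule(H_train), selected)

H_test = rng.normal(size=(100, 50))
max_err = np.max(np.abs(predict_sparse_linear(H_test, coef, selected) - true_rule(H_test)))
```

With three parameters to estimate, five noiseless examples are enough to recover the relation; a dense predictor over all 50 variables would be underdetermined by the same data.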
It’s sparse. It only uses very few variables and hopefully it is something
trivial like linear.>>So, how is low capacity
tricked with deep learning?>>We all know that the deep part is
the representation, right? I mean everything
is deep learning but the part here that is
traditional deep learning is the mapping from say images or image sequences to
that representation space. And that is where most
of the capacity is, and that is where it’s hard to learn and that’s
where we don’t do the job yet because
I think we’re not putting as much
pressure as we should on the representation
to be abstract and to have these- well, so that we can make predictions in
that space very easily. That is what this is saying. Yeah.>>So, we get that we have a unconscious state and we
put the attention goes to a conscious state too but make up getting more broader concept. So, does the unconscious state encode all the
world information?>>All the what?>>All world information.
Or is it specific to the->>All the world information?>>Yeah.>>Like, for example,
colors. So, I want to paint->>Pretty much everything
that you can name, right? So even low-level stuff. If you ask me like what is
the color of that pixel, I can like pay attention
to it and tell you.>>So the input has to be
from different domains.>>Yeah. It’s very rich,
this unconscious state. But mostly it is interesting not because it has the pixels, but because it has
the higher level things. Yeah.>>That’s it. This notion
that we can only put maybe seven things in our memory,
it’s pretty interesting that you see this as a
good prior and not...>>Yes.>>A bug of our brain.>>That’s right.>>But there’s a bunch
of other things where we’re kind of irrational
or we’re not good at handling, like, vastly
different scales and we act in kind of irrational
ways and I wonder if any of those would also be
interesting priors for dealing with the real world
as opposed to seeing them as kind of
failures of our brain.>>Well, maybe. But I think we have
to try to think of how they could be useful from a machine learning point
of view and then say, “Ah, maybe we could use this. It’s a meaningful thing
to add and not just, “Oh, let’s shoot
ourselves in the legs because humans have
that failure as well.”>>Right. But I think
it can be useful, in that the world is
really complicated and we need to build the- as humans, act in that world and we’re trying to
reproduce that, right?>>And so some of
these things might be necessary sort of priors. Just to be able to
learn quickly enough.>>Right.>>For you to
assume or, you know, act as if you’re
never going to have to deal with things that are at scales 10 orders of magnitude apart at the same time
for instance.>>Yeah. But, we have
also to be careful, not to try to
reproduce everything that we know about
humans and the brain because some of
these things might be side effects of
our particular hardware or, you know, evolution
is imperfect and, you know, we like to
understand what we’re doing.>>Yeah.>>Last question.>>So, I am curious about
what you’re saying about using the low capacity
representation of the world for causal reasoning and reconciling that
with the fact that well, I mean, yes, we often have a very kind of abstract approximate reasoning
about the world. We know birds can fly for
example but then, oftentimes, when you’re trying to really think through
a specific situation, the devil is in
the details, you know.>>Yeah.>>Do you think that
a good model would only use a single layer of low capacity information
or do you think there’d be some range of.>>So, I don’t think that this abstract low
dimensional thing is the only space that
matters for reasoning. Each time we project
ourselves into the future, when we think about something, all the low-level stuff is there, hidden and influencing
what’s going on. So I think that’s one reason why traditional
rule-based systems fail because they are
an incomplete description of what’s really
going on whereas, we are able to use
our intuition along the way, which is hard to do with pure symbolic
rule-based systems. So, by connecting
the low level stuff with the high level stuff
and keeping them connected, I think we can avoid that trap.>>Please join me in
thanking Yoshua.
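For readers who want to see the content-based attention described in the talk in concrete form, here is a minimal sketch, not the speaker's implementation. It assumes an additive scoring network (a small one-hidden-layer net that takes the features at one candidate location together with the current state and outputs a scalar score) and a softmax to turn the scores over all locations into soft weights. The weight matrices `W_f`, `W_s`, the vector `v`, and all sizes are hypothetical.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D array of scores."""
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()

def content_based_attention(features, state, W_f, W_s, v):
    """Soft attention over T candidate locations.

    features: (T, d_f) features computed at each candidate location
    state:    (d_s,)   current state (the context for the decision)
    Returns the attention weights (T,) and the attended summary (d_f,).
    """
    hidden = np.tanh(features @ W_f + state @ W_s)  # (T, d_h): one row per location
    scores = hidden @ v                             # (T,): how much to focus on each location
    weights = softmax(scores)                       # soft selection, sums to 1
    context = weights @ features                    # weighted average of location features
    return weights, context

# Toy usage with random parameters (assumed sizes).
rng = np.random.default_rng(0)
T, d_f, d_s, d_h = 6, 5, 4, 8
features = rng.normal(size=(T, d_f))
state = rng.normal(size=d_s)
W_f = rng.normal(size=(d_f, d_h))
W_s = rng.normal(size=(d_s, d_h))
v = rng.normal(size=d_h)

weights, context = content_based_attention(features, state, W_f, W_s, v)
```

In a trained system these parameters would be learned jointly with the rest of the network; because the selection is soft, gradients flow through the weights to the scoring net.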