Michael I. Jordan: Machine Learning: Dynamical, Stochastic & Economic Perspectives

August 17, 2019 · By Stanley Isaacs


– So, to start with, let me just introduce our Dean, Mung Chiang,
and he is gonna introduce our distinguished lecturer. (audience applauding) – Thank you, Stanley. It is indeed overwhelming
to see so many nerds here at a nerdy talk with a
title Machine Learning: Dynamical Economic and
Stochastic Perspectives. I would never have thought that, with the word stochastic in the title of the lecture, we'd have standing room only, and also be streamed live and archived as well. And people are still streaming in here. It must be because of the outstanding lecturer that I will be introducing in just a minute. But this is one visualization
that Purdue Engineering is the largest among top
10 engineering schools in the United States. Now, as we get towards
the end of the semester in this academic year, we’re
delighted to host one more, and there is one more
at the end of the month, of the Distinguished Lecture
Series from Purdue Engineering. We started this about a year ago, and each year we bring
about eight of the most outstanding speakers and scholars
from around the world in different disciplines. And the hope is to inspire all of us towards the pinnacle
of excellence at scale. In the case of data
science, machine learning, artificial intelligence,
they mean different things, but, let’s say, we bundle
them somewhere close to each other for now, there is a lot that we can
be doing here at Purdue. I’m proud to say that in
applications of data science, areas such as infrastructure
monitoring to imaging, digital agriculture to
advanced manufacturing, Purdue Engineering has
tremendous talents and success, in the application of
data to these domains. In the foundation of data science, including machine learning, we are growing with tremendous speed and
momentum. For example, I'm glad to brag about our work in computation and visualization of data, including the hardware-software systems for brain-inspired computing led by ECE faculty Kaushik Roy and Anand Raghunathan, which won the $40 million, five-year SRC/DARPA national center in brain-inspired computing, C-BRIC. And today's distinguished
lecturer is somebody who has reinvented the field from so many different
perspectives, computational, and stochastic and statistical,
cognitive and biological. I met Michael Jordan, I
think, about a year ago at the 60th birthday party for my former college advisor, Stephen Boyd, and I listened to Dr. Jordan's talk contrasting the philosophical differences between the derivative view and the integration view in the control versus optimization communities, and it was fascinating. And I summoned the courage to ask if Michael would be interested in visiting Purdue the next year. And thank you so much for taking the time. And I'll abbreviate Michael's
outstanding biography to just a few sentences, Dr.
Jordan is a member of both the NAS and the NAE, and a member of the American Academy of Arts and Sciences. He has received numerous awards from diverse intellectual communities, from IEEE and ACM to SIAM and IMS. Notably, just last year, Dr. Jordan was a plenary speaker at the International Congress of Mathematicians, which, I think, was last held in 2018. And we're all so excited that
you are here today, Michael, to talk about your perspectives
in machine learning. Thank you so much. (audience applauding) Thank you, sir. – All right, thank you,
it’s a great crowd, it’s great to be here at Purdue. You have one more announcement? Yes, please. – These gentlemen, gentlemen, can you move to the overflow room, because we’re not allowed
to have more people than this room can hold. – [Michael] Although there are
quite a few seats down here, if some of you wanna quickly grab them. – We do have an overflow room,
can you just walk with Maria, we're gonna bring you to the other room, okay. We have live streaming,
so don’t worry about that. I’m sorry to be so–
– That’s fine, that’s fine, I’ll give them a couple
of seconds here to diffuse, in fact, I may take away my jacket, so this seat is for somebody, next to the dean. (audience chuckling) – [Mung] I don't bite in the afternoons, so. – Hey, you are very brave, all right. (audience chuckling) So this talk is for the
young people in the audience, I’m glad to see a lot of
students, this is for you. You know, what things should
you be thinking about studying, what are the opportunities, and so on. So I'm gonna bounce back and forth a bit between philosophical, conceptual issues, but all along the way, try to show what research problems come from these considerations. I'll also have a little
bit of a blend of academics and industry, what I’ve liked
about being in the Bay Area is all the industry around us, and it's influenced me a lot. I'm also going to China a lot these days and seeing the development of the IT industry there, and it's reflected in my thinking quite a bit, as you're gonna see. Okay, so, let me talk about
this field that’s called machine learning for now, if
you were there this morning at a panel, then you know
that I kind of think of this as statistical decision
making under uncertainty, that’s the real field that
we're all talking about, but that's a buzzword, and of course, it's being called AI these days, and, if you were there this morning, I kind of resent that; I don't think that's what this is, at this point in time at least, we don't know what intelligence is. But anyway, it's not new, like most things, and it's not even new in terms of success stories. So, just very briefly,
I kinda have companies like Amazon.com here
in mind, but, you know, already in the 1990’s they were
taking vast amounts of data and they were using it
for business purposes, and particularly for fraud
detection; if you're building an online commerce system, you can have fraud rates like a 3% credit card fraud rate, so they used machine learning to take in large amounts of data about fraud and non-fraud, and make the distinction; they got really good at that.
is supply chain management. It sounds maybe dull, you know,
it's a business school thing, whatever; it's not dull, and it's broader than business school. It's all about: you've got a billion products, and they've got to be
assembled and brought to the right place at the right time to meet certain customer needs, and different seasons, and all that. And in the old days, if you
have a thousand products and, you know, a million customers,
that was sort of classical, you could build it by
hand, and eventually, no, it had to be all done by data. So companies like that already
in the 1990's were modeling the probability that a shipment gets stuck, so you wouldn't get certain parts, and, you know, doing that across not just a billion products, but all the parts that go into those products. And so they continue to this day to have, you know, hundreds of people working on that, at companies like Amazon,
Alibaba, and so on. Having built those systems to support that, and doing lots of other things (A/B testing was quite important in that era), now they had the computers behind the scenes that could do all this data analysis, and this is how they started to think about other services they could offer for people, not the backend: the data in the backend were about people, so why not just directly influence people, you know, give them new services. So recommendation systems,
you’ve all seen that at places like Amazon, you
go and interact a little bit, you buy some books, they start
to recommend other books. That was critically important
for that industry to arise, and for Amazon to become what it is today. All done with machine learning, all done with just kind of
gradient descent algorithms that we’re talking about
today, and in fact, the algorithms haven’t
really changed that much. Okay, so kind of what's changed is, now in its third generation, there's this focus on pattern recognition problems, more of the human variety. Computer vision is like
trying to imitate human vision or animal vision, and speech, and so on, and all the algorithms
with the new data sets, and computing are able to do really very well on those things. So people got all excited,
I think, probably too much. So there were already kind of billion-dollar-industry implications of the ideas, and people didn't talk about it too much, but, you know, it's very much more important really than bullet three, so far, economically, just no comparison. But bullet three is kind of about human capabilities, and so people are getting both panicky and excited, you know, but it's gonna be decades before we get anything resembling intelligence in computers, and probably it will be more than that. So what it is, it's gradient
descent-based data analysis at large scale, and so it
kind of can find things and do pattern recognition, and that’s all primarily
almost becoming a commodity. Now, is that the end of it? Well, definitely not, so
I think what's emerging, what I'm gonna talk about, is kinda the decision side of machine learning. So if you think about it, there are at least two parts to it: one is find patterns and recognize them, and so on; the other is make decisions, definitely not the same thing. Even think about just one decision, but a consequential one,
not about where you put an ad placement, or something, or if there's a giraffe in the image, but rather like a medical decision. So you go into your doctor's office, and the doctor takes all the data available at that moment in time, maybe from all around the world, it's all flowed in, and they measure all kinds of things about your body, including your genome, you know, your height, your weight, your blood pressure, but many many other things; they put it in some big machine learning system and it outputs that you look like you're about to have a heart attack, you need to have a heart operation tomorrow. And so that's a decision, not just a number coming out of the network; it's like
0.7 is the threshold, it’s 0.72, alright. Are you just gonna accept
that as a decision? Well, no, you’re gonna
wanna have a dialog. You're going to say, well, where are your error bars, first of all, and current systems aren't producing much in the way of error bars. Secondly, you're gonna have some what-if questions: what if I were to exercise more, or what if we get a second opinion, and, even more importantly, where did the data come from that led you to make that 0.7, okay. If you build a system that's just working right now on today's data, which is what people do, they take the image data set and they use that data to train the network, and let it run for four or five years, it's gonna be out of date in all kinds of ways. This is called provenance in the database field, but there they're just kinda interested in accounting aspects of the data; inferentially, it matters when the data was collected, for whom, on what machine, to decide whether it's relevant to the current decision. Okay, that's one piece of a
very much bigger sort of thing, that’s just one decision,
you want a whole dialog about the decision back and forth. But now, it’s rarely
one isolated decision, it’s almost always a decision in context of other decision makers, there’s often big batches of decisions, like a system like Uber
right now in Chicago is making huge numbers of decisions about where to allocate things,
and so on; in commerce, the systems are doing this, and so on. So that's really the level at which we want intelligence; we don't want it at the individual agent level, or replacing a human being, which is really really not gonna happen, we want it at the level of overall sets of decisions, okay. So let me instead dig
into this a little bit from the panel kind of
discussion this morning, but people are now using word
intelligence for, you know, AI for this data analysis
and decision making problem. So do we really have intelligent systems? Well, I’m gonna argue,
no, and I don’t think we’re gonna have them in a
meaningful sense of intelligence. We have things that mimic
intelligent behavior from data, that’s different, alright. So suppose you’re up on Mars, and you’re a Martian computer scientist, and you’re looking down
at the world, the Earth, you have very primitive computers, you’re trying to get
inspiration for how to bring more intelligence into your computer. So, you look down on Earth. What do you see on Earth
that’s intelligent, worth kind of trying to mimic? All right, well most people certainly think of brains and minds. Maybe we can figure out how
those brains are working down there, they seem to be
doing something intelligent, and understand the mind. I would argue, we’re not remotely close to that understanding, even a single neuron, it’s a cell, and it’s got all kinds of
proteins swimming around and all kinds of
structures, and then it does all this electrical stuff,
and it’s got this branching tree with, you know, tens of
thousands of ramifications, and et cetera, I can keep
going forever, you know, even a single synapse is
extremely complicated. So it’s gonna take more
lifetimes than I have to give, it's gonna take hundreds of years, really, to start to understand not just a neuron, but then you put billions of them all together in this complicated way,
you know, we’re not there, we’re not even close, we don’t
have the principles, okay. And even in psychology, we
don’t understand really, how abstraction abilities
of humans really arise, and the ability to talk, to give a talk and speak semantically, and communicate, all those things; it's really the interactions
of us with the world on all kinds of levels of
abstraction in time and space, extremely challenging. So, the poor Martian
computer scientist says, well I can’t do that, I can’t
figure out all that stuff. Or is there something else down on Earth that I could look at and try to make more intelligent, right? And maybe you already get my setup a little bit; most people don't, they
scratch their heads, and they go, maybe animals, you know. No, not so, a squirrel
is a remarkable robot, but it's not super intelligent. All right, so I'd argue that, you know, maybe people are just kinda missing what's obvious: if you take that Econ 101 class, you're told to look at a
city like Chicago, you know, every restaurant needs to
have all the ingredients it needs for every dish
it serves every day, every household has the
food, it’s not perfect, but it works really amazingly
well, it works 365 days a year for 3,000 years at all scales. Something is intelligent
about that, it’s adaptive, it’s robust, it works on
many scales, et cetera, all the kind of things people
want out of machine learning, it has some of those properties,
doesn't have all of them. All right, so to me, we know what some of the principles are there, whereas in neuroscience and psychology we don't know what they are yet. There it's microeconomics; we have some ideas of those principles, it's not that we know everything, and we have to think about new kinds of markets. Now, what kind of markets? Well, markets that bring in
data analysis, all right. So minimally, think about something like a recommendation system. Suppose I have a two-way market, I've got producers and consumers, and they're not just
attached to each other in the usual traditional way,
rather they see each other through a recommendation system. Already that’s an interesting question, and that’s a billion dollar
question to do that well, okay. All right, so I’m gonna be digging into things like that during the talk. All right, a little more philosophy before I really get going. So, I wrote a blog this past
year; if you haven't seen it, I would like to encourage you to read it. I don't usually advertise my
work, but this was on Medium, and I think I’ve had
about 400,000 views of it, in fact a lot of kind of famous people have read it; I usually get four or
five views of my papers. (audience chuckling) Too many uses of the word stochastic, but anyway, this has been read by a lot of people, and I do want more people to
read it, have this dialog. So it tries to say, look, this
buzzword AI, first of all, it’s not one thing, we shouldn’t be lumping everything together,
it’s a mistake to do that. There is the classical,
let's-imitate-the-human idea, there's nothing wrong with that, it's what McCarthy had in mind and others, for better or for worse, it's
what a lot of the people, or self-proclaimed AI
people still have in mind, which is this aspiration
that we have computers now, there’s hardware and there’s software that looks like brain and mind, maybe in our generation,
we’re gonna be able to put intelligence inside
of the computer, okay. So that was McCarthy in the 50’s, Turing was thinking that way. I’d say, in the intervening 50 years, we’ve made very little
progress on that, frankly. And I’ve been in the
neuroscience department and I did psychology,
and so on and so forth, and I’ve been watching
this for a long time. So it's great to continue to aspire to it, but it's just not what's
happened, all right. What has happened is what's sometimes called intelligence augmentation by different people, which means that the computer in itself is not smart, but it organizes information in a way that helps make us smarter. And, certainly, search engines do that, and all kinds of computing
things in our life make us more intelligent,
more creative too. That will continue, all right. Search engines, arguably, not intelligent, I don’t think anyone
would argue that it is, but behind it a lot of
engineering intelligence went into the design of it, and it makes us more intelligent, okay. All right, but anyway,
what I think is emerging is more interesting than just even that, it’s this intelligent infrastructure, some maybe call it Internet
of things, if you will, but the Internet of things was more about the more prosaic problem of getting IP addresses for all kinds of objects, right, that's fine; but, more interestingly, all those objects have data streams associated with them, and those data streams are a little bit incoherent, and they have to be made coherent, so that decisions can be made at scale. So that's kind of the bigger problem. And think about not
doing that just for, like, factories and cars, but think about the Internet of things in the medical domain, you know, all kinds of sensors all around the body, and so on, hospitals. And all that data flows, so that people get better and better treatment over time; the
system supports that, that's what I have in mind. Okay, so one last little slide about this. So I don't think human-imitative AI is really the right goal for a lot of these things, 'cause it really isn't about making a computer smarter and replacing a human, all right; it's about making a system that works. So think about self-driving cars. Should it be an autonomous
car, should we go for autonomy? If you read most people’s
writings about this, that’s what they say, we want autonomy. No, you don’t want
autonomy, you want the cars to communicate with each other. If a boy just ran out into the street, a car sees that and tells all the other cars around it. Every car tells every other
car where it’s trying to go, what it wants to do, it’s more like air traffic
control system, right. And so you don't want autonomy, you want a federated system that trades off things and interacts, and you want the principles to build such a system, critically, and those principles do not emerge by looking at a single car and a driver and trying to replace the human, just focusing on that as the core problem. Okay, we can actually solve the problem without putting in a fake driver; so much of this is, we're gonna have all these sensors and all that sort of thing, alright. It's kind of ridiculous if you think about it, if you think about previous
engineering disciplines, like, in this blog I talked about chemical engineering, civil engineering, which were
super exciting in their day, and it took a few decades to roll out, but people really did great things there. And they developed other
kinds of principles that didn’t exist before,
like from chemistry to chemical engineering,
it’s a big gap, right. And those principles are what we should be thinking about right now. So, can you imagine a
chemical, a chemist saying, we need to create this field
called chemical engineering, where we know how to create factories, and the way we're gonna do that is we're gonna create an artificial intelligence entity, an artificial chemist, who's as smart as a
chemist, who’ll figure out how to build the factory, right. That’s ridiculous,
that’s not what happened and it’s implausible, but if you read like the
literature from, say, DeepMind or something,
that’s what they’re saying. We’re gonna figure out how to solve cancer by creating an artificial
intelligent agent who then looks at all
this data and figures out how to solve cancer, come on. (audience chuckling) All right, so anyway, when people talk about AI, if you're gonna use this term, which, again, I'm gonna push back on as long as I can, is it just data plus autonomous machines that you're talking about, that's all you need? Well, no, you need it in the context
of markets and tradeoffs, and people, and utilities,
and so on and so forth. Okay, and lastly, one
last philosophical slide, which is that the IT
companies that are doing a lot of this machine learning
and building these services, are really not thinking
about a market at all, they're thinking, we're gonna build a search box, or a social network, and it'll
be a service that you like and you use, all right,
it’s all in a virtual world, we’re not trying to connect
producers and consumers and make a market, we’re just
trying to provide the service, and we know you won’t pay for
it, ’cause it’s not that good. So since you won’t pay for it, we have to create an advertising market and we make our money off that, all right. Rather, why don’t you try
to think about producer and consumer relationships
and make that so strong that people will be willing to pay for it, and you take a cut. And this is not a mystery,
Uber does this, okay. So in the world of
transportation, it’s not perfect, it has all of its issues, but
in the world of transportation they’ve created, they don’t
advertise on the Uber platform, right, they don’t need to. And this advertising model,
you know, is really broken; a lot of the roll-out of IT, it's really put us in a bad place. So we need to think of a
new way, and I think markets is a good starting place. Okay, so now what do you really work on when you work on this
kind of set of problems? Well, here are some of the challenges in this, let's call it II, intelligent infrastructure; these are things I've been working on for the last 10 years, and if you sort of scan your eyes there, there's some pattern recognition, but not really, I kind of view that as in pretty good shape, all the gradient descent there, but, you know, real-time,
cloud-edge markets, multiple decisions, and so on. So I've decided, in the rest of the talk, I'm just gonna pick two or three of these, I'm gonna say a little bit
about conceptual issues, especially for
decision-making and markets, and then the last part
I’ll go a little bit through some of the actual
mathematical algorithms and issues that arise,
and how some of them can be actually solved. They need a little bit of new
mathematics, but not always. But I wanna give that flavor as well. And if you’re interested in the latter, what I’m gonna be doing
is giving you snapshots of a talk I’ll be giving
tomorrow, where I’m gonna dig into the actual mathematics more. Okay, so let’s talk about
multiple decisions, okay, so again, classical AI people don't think too much about multiple decisions; they think the network outputs the decision and you're done, or maybe a sequence, and that's called reinforcement learning, all right. But often you have lots
of federated decisions, so let’s think about that. All right, so when
decisions interact because there’s a scarcity of resources, you know, that’s what econ people
talk about, and people in AI haven’t been thinking
about scarcity very much. So in fact, here's, again, one of the big success stories, recommendation systems; you know what they are, they take data from customers and they cluster customers, or clusters of products, and make recommendations between them. All right, so they've been
used for all kinds of things, like movies, was an early one, books. So suppose that I build a
recommendation system that makes a recommendation of a certain movie. Now, is it okay to recommend
the same movie to everyone? So will this happen, first of all? Yes, that's how recommendations
work, it’s a big black box, they take a lot of data, when I come into a site like Amazon, they make a feature vector
out of me, you know, like 40,000-dimensional feature vector, they put it into this black box, and out pops a list of
recommended books or movies. Later, you know, someone else comes in, Mung comes in to the same
site, they featurize him, and maybe it’s a nearby feature vector, but whatever, it’s a
different feature vector, they put him into the same box, and they’ll recommend
some books or movies. So they can easily
recommend the same movie to me and to him, easily,
and probably, if it’s Amazon, they recommend the same movie
to 100,000 people a day; that happens, I'm sure, all the time. Is that a problem? Well, no, there's no scarcity here. What about books? If you recommend the same book to 100,000 people, is that a problem? Well, no, it used to
be there was scarcity, but now you can print on
demand, you can get books, like, into the warehouse within a day, so you don't even have scarcity there. But is scarcity meaningful at all? Of course it is; in the real world there's scarcity all the time. All right, so here, I just
came from a little bit of traveling in China, I
watched people building business models on recommendation systems, 'cause recommendations have kind of become a commodity; you can download software
to do a large-scale, really distributed recommendation system, so people are doing that for things other than books and movies. So here is one, how about restaurants; it would be nice, like, I arrive in Shanghai, I don't speak the language
very well, and I’m by myself, I’d like a recommendation
system to know about me, and give me a recommendation,
and a high-quality one, not just an advertisement, all right. Well, there are companies that
try to do that, all right, so they download some
recommendation software, they take some data,
wherever they can get it, and they put it into it, maybe they do okay, but it's not what I want,
it’s a list of things, maybe some reviews, it’s complicated, I don’t want all that
complexity, all right. Moreover, if it works,
if it starts to work for a few people, that’s fine, but as soon as like half of
Shanghai is using it, you know, five million people,
you can easily recommend the same restaurant to
10,000 people or more, and they all go there,
and there is a big line, you’ve created congestion,
okay, so it’s not unfamiliar to an econ person. But a lot of the CS people
realize this after the fact, it starts to get scaled,
and they start to have new problems, they didn’t think about. Well, come on folks, you should’ve thought of that beforehand. In fact, it’s not that hard
to solve this kind of thing, you create a two-way market. And what that means, is
that I’m in Shanghai, I pull up my cellphone,
it’s 6PM, I’m ready to eat, I’m hungry, I wanna push
a button on my phone, have the phone geolocate me,
and have my feature vector somehow inform a recommendation system, and then have that transmitted, through an app, to all the restaurants around me, and then their app says, here is a possible client for you, his price point is here, he likes Sichuan cuisine, or whatever, and then someone will
then decide to bid on me, then a bidding mechanism will ensue, on my phone I’ll get a bing,
and I’ll see a restaurant, and I’ll see some dishes,
and I’ll see a price, and I’ll see the distance,
and I’ll say, great, I accept. So it’s like Uber, it’s
not that complicated. Once that transaction has happened, that seat in the restaurant is taken, and if he comes in later,
you’re too late, okay. And if I don’t accept,
maybe then they’ll offer me a better discount, it’s a
market, right, and so on. And then other restaurants
around can see when one is full, and they can make other discount offers. That's how it works, all right.
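To make the mechanism concrete, here is a minimal, purely hypothetical sketch of that two-way market loop: a diner broadcasts a featurized request, nearby restaurants bid, the best offer wins, and the seat is taken. All the names, prices, and the bidding rule are made up for illustration; a real system would put learned models behind the bid function.

```python
# Hypothetical sketch of the two-way restaurant market described above.
# Everything here is illustrative; no real system's API is being shown.
from __future__ import annotations
from dataclasses import dataclass
from typing import Optional

@dataclass
class Offer:
    restaurant: str
    dish: str
    price: float
    distance_km: float

@dataclass
class Restaurant:
    name: str
    seats_left: int

    def bid(self, diner_features: dict) -> Optional[Offer]:
        """Return an offer for this diner, or None if full or uninterested."""
        if self.seats_left == 0:
            return None
        # A real system would use the diner's feature vector and a learned
        # model here; this toy rule just discounts when many seats are empty.
        base_price = 80.0
        discount = 0.05 * self.seats_left  # more empty seats, deeper discount
        return Offer(self.name, "Sichuan set menu",
                     base_price * (1 - discount), distance_km=1.2)

def match(diner_features: dict, restaurants: list[Restaurant]) -> Optional[Offer]:
    """Collect bids and accept the cheapest offer; that seat is then taken."""
    offers = [o for r in restaurants if (o := r.bid(diner_features))]
    if not offers:
        return None
    best = min(offers, key=lambda o: o.price)
    next(r for r in restaurants if r.name == best.restaurant).seats_left -= 1
    return best

restaurants = [Restaurant("A", seats_left=5), Restaurant("B", seats_left=1)]
print(match({"cuisine": "sichuan", "price_point": "mid"}, restaurants))
```

Once a seat is matched it is gone, which is exactly the scarcity that a plain broadcast recommender ignores.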
And it's not that hard to do data science in support of that; it's just mostly not being done. The IT people think they're gonna understand
everything about humans, just like advertisers do, and
then give them what they want, right, it's silly. Here's another one. What if you build a recommendation system to recommend people routes to the airport, or wherever, right; if you have very few people
using it, no problem; as soon as half the city is using it, you send everybody on the same street. All right, it's obvious, people know this. But then how do you fix that? And the mindset in Silicon Valley is, well, we fix that by doing
a super fancy AI, you know, this is Zuckerberg, he uses the word AI, even when he doesn’t know
what he’s talking about. (audience laughing) Our AI systems will figure it out. And what does that mean? Well, they’ll understand
enough about humans to know what they really want, okay. It’s, yeah, I hope you see
how silly that is, right. So how do you do this in
the right way, all right? Well, if he and I are being
sent down the same street, or, you know, 10,000 of us are being sent on the same street, well, the system shouldn’t have to
figure out who gets the street, we should have a bidding mechanism. So if I can reveal that
I’m not so in a hurry today to get to the airport,
I’ll take a backstreet, it’ll take probably five more
minutes, and I’ll pay less, and I’ll save the money for a future trip, I’m gonna be happy. And he is in a big rush,
he gets to have the street for a little more money,
he's happy, all right. That's the right way to do it. And so, how do you do this? Well, just, literally, every piece of street bids on the people that pass over the street, and maybe new market mechanisms are needed, but you know, that's kind of the way to
think about the problem. And here is my favorite example; again, it probably comes from my commuting to China. So you know, people now have a little bit of money, so grandmother's got 1,000, you know, or 100,000 RMB; she wants to invest it, she
doesn’t know what that means. Her son says, hey, I can download
an app on your cell phone, it’ll invest it for you. And she says, great, so then
the app will recommend to buy, you know, Alibaba stock. That's fine if it's like a few people, but what if half of China is using it? Well, then Alibaba stock
shoots up artificially, and we've destabilized the market. Okay, so I hope you get the feeling that, you know, data science is needed here; these are data-oriented systems. It's not just classical markets, where I connect producer and consumer and I have a classical link; it's all about data analysis. But the two together, if you put it technically, it's microeconomics meets statistics in a computing framework. That's three fields together, with some power that has not even started to be talked about, really, or realized, as an academic or as a business person. Here's another example. More people are making
music than ever before, because of laptops; you know, my 12-year-old makes pretty amazing little songs on his laptop. You can drive a taxi during the week and put up music on the weekend, and people will actually listen to it, but you're making no money, 'cause there's no market for you, and that's bad, when there's no market for
human creativity, okay. So, how do you fix this? Well, you don’t just stream
that stuff to people, and then, because they’re
not willing to pay for it, create an advertising
mechanism to monetize, that’s Spotify and so
on, or a subscription. No, you create a market, and here it's not all that hard: everybody who's putting music up on SoundCloud, you give them a dashboard; the data's been flowing, let them see their data. So I learn that I was
popular in Peoria last week, you know, 5,000 people listened to me. Now that I know that, I can show that data to the venue owners in
Peoria, and they will say, wow, I see; if you come here, we'll advertise to those people, and it's not even advertisements, it's information flow, that you're coming,
they’re gonna be excited, they’re gonna come, we’ll fill
the venue, you make $10,000. And then if you do that
three times during a year, you’ll start to have a salary, all right. And that can happen not just for a few super star singer types
that the record companies decide to anoint, that
could happen for like a million people in a country, okay. So mechanisms like that,
as simple as they are, they use data together with
markets to create jobs. All right, and you can do this now not just for entertainment,
but for information services. That’s really what YouTube should be, it’s more of an information service, than just the entertainment thing, that someone put up some
entertainment, okay. Okay, that was the economic
side of multiple decisions, let’s now go back to the statistical side, so if you’re a statistician, this kind of cartoon
will be familiar to you. How do you make decisions
in the real world? Well, part of it you have context, and that’s where markets come in. The other part is you have uncertainty, and you better be really clear
about that uncertainty, okay. So here is the typical decision. Jelly beans cause acne: someone has this as a hypothesis, okay, some great new idea I've got. It sounds ridiculous, and it probably is, like most great new hypotheses are. So, the scientist says, well, I'm gonna investigate, and what does that mean? Ideally, it means doing an experiment. So I take 100 people, put 50
in the jelly beans category, 50 no jelly beans, and for
six months these people will eat jelly beans every
day, these people eat none, and after six months I look
at their skin condition, and probably there'll be some differences. But if you're a good statistician, you know something like a permutation test, you know how to get a p-value that says, well, even if there's really no difference, the probability that I'd see differences like this, you know, is high. So I'd say, okay, I get it, it's not real, I'm not gonna make a
discovery in that situation. All right, so that's the classical setup that we've all learned, all right, but that's never the real world. In the real world, people say, oh, I see, my dumb idea wasn't so good, I don't give up, I try some other dumb idea. And I keep trying a whole bunch of them. So whoever has worked in the hedge fund industry, or been around friends in that, that's all they do all day long: they think of clever little ideas, if the price of this goes up, then this, and they try that on historical data and see if it works; it usually doesn't, and
they're smart enough about the uncertainty to know that, and they don't bet on it, and eventually they find one that works, and they bet on it, okay; but a lot of fields do this, all right. So say the person comes back and says, oh, I see, it's not jelly
beans, it’s green jelly beans, or it’s red jelly beans, all
right, and they keep trying. And I think you know what'll happen; we'll just be very clear about it. So for every one of these, the scientists come in, and they get a new one, a fresh batch of 100 people. So we now have a kind of overlap problem, all right, but you know what'll happen: finally, at some point, they'll get 100 people where, by chance alone, they'll take the 50 people who already have a bad skin condition, they'll put them in the
jelly beans category, and the other 50 will go here, after six months these people
have bad skin condition, okay. But you don’t know that’s the reason, and you say, well, I
discovered, I made a discovery. Now the problem is that
everyone will get all excited in the laboratory, we’ve made a discovery, send it to the journal,
the journal’s excited, ’cause it’s an interesting
result, they publish it. And then even worse, the newspapers whose job is to scan the journals and find the interesting results of the
year, say that’s interesting. It’s interesting because
it's probably false. Okay, so this is not new to me, obviously, or to statisticians; statisticians work on this, all right, but not that many people outside statistics think too much in that way. And it really is a multiple
decision making problem. And so, just to give
you a little structure, I’m gonna tell you a little
about the false discovery rate and tell you about an economic perspective on the false discovery rate. So here's kind of the setup: say I'm doing nine hypothesis tests, or I've
got, you know, nine ideas, and suppose that in five
of the cases on the left, the gray ones, there’s
nothing to discover, a typical situation, just nothing there, and if I see a difference,
it's by chance alone, whereas in four cases there are actually discoveries to make. So I run some procedure, or
I run some neural net, or whatever, and say in the four cases at the bottom it makes a discovery, P9, P8, P2, P3; but actually, God knows
this, that only two of them, P2 and P3 are real, the
other two are false, so the fraction of false
discoveries is two out of four, and so half of my discoveries are false, that’s not too good, all right. False discovery rate
is just the expectation of that proportion, okay.
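To pin down the quantity being discussed: with V the number of false discoveries and R the total number of discoveries, the usual definitions are

```latex
\mathrm{FDP} = \frac{V}{\max(R,\,1)}, \qquad
\mathrm{FDR} = \mathbb{E}\left[\mathrm{FDP}\right],
```

so in the example above, FDP = 2/4 = 0.5.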
So now, are there procedures to control the false discovery rate? It's definitely not just the classical thresholding that I'll put on a neural net or something, all right. And to really drive home this difference, let's look at something
a little quantitative. Let’s suppose we’re in some industry, we’re doing 10,000
different A/B tests today, and this is real: Amazon has a website, and you've been there, it's kind of amazing looking; that's not some designer that did that, that's A/B tested. They say, let's try green instead of blue, let's try this instead of this; they just try it every day, half the people get this, half the people get that, roughly, and they do maybe 10,000 a day, all right. So let's suppose that we're in a situation where the industry is a little bit mature, so that for 9,900 of those tests they tried, there's really nothing to discover; it's not better to put
blue instead of green, it's just not true. Mature science is like that; most of the things you think of are actually not real. But 100 of them are real discoveries you can make, and you can make some
real money off of it, okay. Now you apply your fancy
machine learning techniques, and you have a really good
machine learning system: its probability of making a type one error, meaning that when there's nothing to discover, you say it's a discovery, is smaller than 0.05, so of the 9,900, only for 495 of them would you say discovery when you shouldn't. Similarly, your neural net has got very good power, meaning when there is a discovery to be made, you say there is a discovery, and the power is 0.8, so, again, pretty good. So your engineers have
designed a great system. But now just multiply it out: 0.05 times 9,900, that's 495 false discoveries; out of the 100 non-nulls, you made 80 discoveries; if you add them up, your false discovery proportion is 495 out of 575, right. Meaning, you go right to the
boss at the end of the day, and say, okay, I gave you a lot of money to do all these tests, how many discoveries did you make today? You say, oh, I made 575, here they are. Boss then says, how
many of them are false? You say, uhh, 495, all right, that’s bad, you’re gonna now spend a lot more money following them up, trying them out, you’re gonna find out
they don't really work.
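Spelled out, the arithmetic of this example is

```latex
V = 0.05 \times 9{,}900 = 495, \qquad
S = 0.8 \times 100 = 80, \qquad
R = V + S = 575, \qquad
\mathrm{FDP} = \frac{495}{575} \approx 0.86,
```

so about 86% of the reported discoveries are false, even though both error rates looked good in isolation.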
Okay, so are there other mechanisms to control this, at a level that's proportionate, at the level 0.05? Or is it just, you have to live with this? No, there are mechanisms; there's something due to Benjamini and Hochberg, that was the first one, but it's very batch oriented: you take a huge batch of decisions, wait for a few days, and then do it all in one lump, all right. So, we've been working on
an online version of this, more of an economic version of this, where instead of making test after test after test at some fixed level, where that level has to be really really small as you make more and more tests, that's kind of the classical perspective, we let that level change over time. And here is the key: the false discovery proportion is a ratio, so you can make a ratio small in two ways, the numerator can be small, or the denominator can be big. So if you're someone who's
making a lot of discoveries, you're gonna get maybe more and more wealth, okay, because that ratio will be under control for you, all right. And if you're not making many discoveries, your alpha will go down, and
you’ll start to see that. I’m not making any discoveries,
I’m in the wrong field. So what does a human do at that point? Do they just continue
to do the stupid thing over and over again, and then eventually they can't make any more tests, 'cause the results aren't there? Well, no, they move to a different field, where there are new discoveries to be made, and that's good. That's what the statistics should tell us. So, anyway, that's Tijana Zrnic in the middle, who has kind of been leading this in the last round; she has a very nice paper on doing this in a distributed asynchronous setting with dependence, kind of just a really really real-world thing that is industry ready, her latest paper. So we have a way of setting
these time varying alphas, let me just show you a
picture, it’s very economic. In the beginning of your
life I give you a budget of a certain number of alpha points, and every time you do a test
and you don’t make a discovery, you lose some alpha points. And if it keeps happening,
eventually, you dry up and you can't do any more science. But if you ever make a discovery (they go down, down, down, then you make a discovery), suddenly I give you some more wealth.
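To make the budget dynamics concrete, here is a minimal, purely illustrative sketch of an alpha-investing loop in this spirit. Real online-FDR procedures (for example LORD and its asynchronous variants) set the per-test levels by specific formulas with proven guarantees; the spending and payout constants below are made up.

```python
# Minimal sketch of an alpha-investing scheme in the spirit described above.
# The constants are illustrative only, not a procedure with guarantees.

def alpha_investing(p_values, initial_wealth=0.05, payout=0.05, spend_frac=0.1):
    wealth = initial_wealth
    decisions = []
    for p in p_values:
        if wealth <= 0:               # budget exhausted: no more testing allowed
            decisions.append(False)
            continue
        alpha_t = spend_frac * wealth  # spend a fraction of current alpha-wealth
        reject = p <= alpha_t
        if reject:
            wealth += payout           # a discovery replenishes the budget
        else:
            wealth -= alpha_t          # a non-discovery costs alpha points
        decisions.append(reject)
    return decisions

# Example: discoveries early on keep the budget alive for later tests.
print(alpha_investing([0.001, 0.2, 0.04, 0.5, 0.003]))
```
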
And there is a formula for doing this, all right, and the formula gives you the following pretty strong result, which is that you can stop me at any
time during my lifetime, say, how many discoveries
have you made up until now, and I’ll say, you know, 45. What fraction of them are false? Well, less than 0.05, and you can do that any time in my lifetime, or you can even do that over
a group of people, okay. So this kind of way of thinking exists, and it should be everywhere,
and it sort of needs to be. So if you think about the decision part of machine learning, this is really a big part of that story, okay. All right, so, let me
return to that slide. So these were, I talked a little bit here about multiple decisions,
and I talked a little bit about markets; I won't talk about any of these other things, really, but in the last part of my talk, I'm gonna spend a little time getting down into the actual algorithmic
and mathematical challenges that arise when one starts to work on these classes of problems. So, now we’re a little more
technical, with no apologies; this is the kind of thing where, if you're a student, you need to be learning about these kinds of topics. So most of the problems we work on are nonconvex optimization problems or sampling problems. And nonconvex optimization is all about things like dimension, about saddle points,
about dynamics, and so on. We’ve gotta be quantitative
about all these things, it's not enough just to be metaphorical. And in sampling we also have nonconvexity, and we have dynamics, it's complicated, and we wanna link this to optimization, so we have an overall toolbox. And then we gotta bring market perspectives here too, where we're often interested not in avoiding saddle points when we're optimizing, but
going to saddle points, ’cause those are equilibria, where there’s a trade off being realized. All right, so again, I
have a whole bunch of work on all this tomorrow, I’m
gonna go through some of this material more slowly, but let
me give you a few highlights, and someone will tell me, when I start to run
out of time here, okay. So, this was an early paper for us; it really helped set the tone of this, of a bunch of our projects after that. Chi Jin, who was on the job market this year, has led several of these, and he has become a world leader in nonconvex optimization
via this line of work. And in particular, escaping
saddle points efficiently is really really important. So here is a saddle point
in three dimensions. We’re gonna focus on how
do you, you’re coming down, rolling down the hill, and
the saddle point is kinda bad, because it slows you down,
and it may slow you down for a large amount of time. Now, in three dimensions
it doesn't look too bad, but if you're in 100,000 dimensions, there might be only one
or two directions out, and it may take you a long, long time to find those directions and eventually escape. So we need to quantify that: how long, is it exponential in
dimension, is it polynomial, what is the rate of
escape from saddle points. And if you look at real practical learning systems, in light of all that, what you'll see is the error goes down really really fast and then plateaus out, and it
stays there for a good while, and then it dives again,
and then it plateaus out, it keeps doing this, and
those are saddle points. And if you don't wait long enough, you think you're done. All right, and in an online
system trying to make decisions, you might, you know, just
have to make your decision, but you should know,
that no, you’re not done. And we should have a theory that supports that kind of inference. Okay, so, we’re getting a
little bit of math here, but I'm gonna just highlight a few things. So, first, this is a
result here on the left, let me just focus on this line right here. So this is a classical
result due to Yuri Nesterov; this is for the convex case, so we have a bowl shape. And I'm gonna run gradient descent, which takes the steepest
descent direction, okay. And I wanna get to a ball of size epsilon around the optimum; there is a single optimum in that world, and I wanna be in a ball of size epsilon. The question is, how many steps does it take to get into a ball of size epsilon, all right, for a convex problem. And that kind of has a little bit of a complexity-theoretic side to it, and so it's not an old result, 1998, and the number of steps is given right here: it's one over epsilon squared, okay. So if I want a little smaller ball, it takes me more steps, and it goes quadratically, all right. Moreover, there's a Lipschitz constant here too, and the initial distance to the optimum; we're trying to optimize
a function f, all right. That’s a beautiful result,
this is not asymptotic, it’s true for any epsilon,
there’s no hidden constants, and all constants are
nice natural ones, okay. So this is a kind of result
you aspire to in this field, and over time, this kind
of result has been achieved for lots and lots of
areas of optimization, and there are lower bounds,
and these tend to match the lower bounds, okay.
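For reference, a commonly cited way to write this kind of guarantee (up to small absolute constants, for a function f with an ℓ-Lipschitz gradient, finding a point whose gradient is smaller than ε; the exact statement on the slide may differ) is

```latex
x_{t+1} = x_t - \tfrac{1}{\ell}\,\nabla f(x_t), \qquad
\min_{0 \le t \le T}\,\bigl\|\nabla f(x_t)\bigr\| \le \epsilon
\quad \text{once} \quad
T \gtrsim \frac{\ell\,\bigl(f(x_0) - f^\star\bigr)}{\epsilon^{2}},
```

which has the one-over-epsilon-squared dependence, a Lipschitz constant, and an initial-distance term, exactly the ingredients just described.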
So we asked, now, if you run gradient descent on a surface that's got saddle points, what are you gonna arrive at, okay? So we had a paper showing that, asymptotically, you will not arrive at the saddle points; that was known for the continuous flow, but not for the discrete case, so we proved that. Then we proved that gradient descent alone can take exponential time in dimension to get away from all the
saddle points, so that’s bad. All right, then there is
another paper that shows that, if you add some noise, stochastic versions of gradient descent, you
can escape all saddle points in polynomial time, so that’s
a very important result, but it’s just polynomial, it
could be d to the 45th power, or d to the third, or something
not so good, all right. So we studied this for a while, and we came up with a result, here it is right at the top, which is the number of iterations
in a nonconvex landscape, okay, to go past all the saddle points and arrive at a local minimum, is, again, one over epsilon squared, so it’s as if you are in a convex problem with stochastic gradient
descent, okay, pretty amazing. There's a Lipschitz constant, there's the initial distance to the optimum, so, again, it's one of these pure beautiful results, except for that little tilde there; a little tilde is
traditionally used to hide dimension dependence, ’cause
no one was able to analyze it, but here we did do the
analysis of the dimension dependence, that's the whole point of the paper, and it turned out to be not
polynomial, not exponential, but actually logarithmic. Our particular proof technique is based on coupling arguments from probability using Brownian motions, and that's responsible for the fourth power being in there; I don't think it's really a four, it's probably just a log, but that's what we were able to get, okay.
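To make the algorithmic idea concrete, here is a minimal sketch of gradient descent with occasional random perturbation, in the spirit of the perturbed gradient descent from Jin et al.'s "How to Escape Saddle Points Efficiently". All step sizes and thresholds below are illustrative, not the paper's tuned constants, and real versions also stop once a perturbation yields no function decrease.

```python
import numpy as np

def perturbed_gradient_descent(grad, x0, step=0.01, g_thresh=1e-3,
                               radius=0.1, wait=50, iters=10_000, seed=0):
    """Gradient descent plus occasional random perturbation near saddles.

    When the gradient is small (a stationary point: minimum OR saddle),
    jitter the iterate inside a small ball; at a saddle the negative-
    curvature direction is then found with high probability, while at a
    local minimum the jitter is harmless. Constants are illustrative only.
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    last_perturb = -wait
    for t in range(iters):
        g = grad(x)
        if np.linalg.norm(g) < g_thresh and t - last_perturb >= wait:
            x = x + rng.uniform(-radius, radius, size=x.shape)
            last_perturb = t
        else:
            x = x - step * g
    return x

# Example: f(x, y) = x**4/4 - x**2/2 + y**2/2 has a saddle at the origin
# and minima at (+1, 0) and (-1, 0). Plain gradient descent started exactly
# at the origin never moves; the perturbed version escapes to a minimum.
grad = lambda v: np.array([v[0]**3 - v[0], v[1]])
print(perturbed_gradient_descent(grad, [0.0, 0.0]))
```
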
So, all right, that's an early result, using a probability idea, using some nonconvex geometry, and using a simple form of dynamics, to show that you're actually gonna have a very very favorable result. So we all talk a lot about
why did these large scale machine learning things work,
well, stochastic gradient, everyone kind of agrees
is a reasonable thing, and this actually supports that folk wisdom; this is a theoretical result
that shows why it works. All right, the next critical step, and that’s even more critical,
(so 15 minutes, thank you), even more critical, which is to start to understand these things more deeply. You want to know, well, that result we just showed, let me just go back there, actually, I should be using this, that's, you know, pretty much what we've done, that looks like it parallels this case, it looks pretty beautiful; is that the best you could possibly do, all right? And that's a really
important question to ask; this is now real complexity theory, meaning, in some machine model, in some setup, is there a lower bound, is this the best you can do; you know the field is finished when you arrive at the lower bound, okay. And so mature fields tend to
have lots of good lower bounds, and statistics has quite a few, like the Cramér-Rao lower bound, you may have heard of; in information theory there are quite a few. I should say, I don't think
computer science has very many, okay, they have a few,
but they’re very low, there’s a big gap between
them and the actual upper bounds that are known. It’s a hard field, but
it’s also a little newer. Okay, well, optimization
theory has some very good lower bounds too, and it's probably 'cause it's the older field, and a lot of them came from the Russian school of Nemirovskii, Nesterov, et al. Okay, so, here we're gonna
work on lower bounds, and so I forget what I
have on the next slide, let me just see, no, I don’t have it, so let me just say something in English. In the world of gradient-based methods, suppose I build a machine
that can get a gradient, I have access to gradients
and nothing else, okay, function values and gradients, all right. What’s the optimal rate of
convergence for that machine? Okay, that’s a complexity
theoretic question, and Nemirovskii answered that and showed that it goes from one over epsilon squared to one over epsilon, so much faster, okay. Actually, that's not quite right, it goes all the way to one over the square root of epsilon, so even faster. And so there's an algorithm
that achieves that, and that was discovered
afterwards by Nesterov, and it’s an algorithm that
takes not just one gradient, but it takes two gradients,
and does a kind of clever combination of them, and this was a big surprise to people, that this was even possible. And that algorithm goes at this faster rate, and achieves the lower bound, so it's kind of the best algorithm.
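To put symbols on that: for a convex f with ℓ-Lipschitz gradient, the standard guarantees (constants elided) take roughly this shape,

```latex
\text{GD:}\quad f(x_k) - f^\star = O\!\left(\frac{\ell\,\|x_0 - x^\star\|^2}{k}\right)
\;\Rightarrow\; k = O(1/\epsilon),
\qquad
\text{Nesterov:}\quad f(x_k) - f^\star = O\!\left(\frac{\ell\,\|x_0 - x^\star\|^2}{k^{2}}\right)
\;\Rightarrow\; k = O(1/\sqrt{\epsilon}),
```

and the accelerated rate matches the gradient-oracle lower bound of the Nemirovskii school, which is what makes it optimal.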
All right, so we worked on this problem and we said, well, that algorithm is still
very hard to understand, and it’s called an accelerated algorithm. What does it mean to accelerate
in the optimization world? In the optimization world you’re hopping along a set of points. What does it mean to go faster
on that set of points, okay? It's not really clear, and so this is a part of the problem; you know, people haven't really developed a good general theory of acceleration, because, I don't think they have the right topology to support it; you need a continuum, where you can go faster, all right. So you need to embed the
problem in continuous time, and we did that and found that that gave a huge amount of insight. In continuous time, you can turn up a knob, and you can accelerate, accelerate, accelerate until at some point something breaks, so there's a phase transition in continuous time, which doesn't exist in
discrete time, all right. So another kind of meta
message here is that both optimization and
computer science almost always focus on discrete time
algorithms, discrete everything, and you’re missing some
insights by doing that, you gotta go to continuous time. Okay, so we did, and it turns out that the acceleration algorithms,
due to Nesterov, and so on, and a whole bunch of others, all come from a single object we call the Bregman Lagrangian, and this is in tomorrow's talk, I'll dig into it a fair amount. But there's a mathematical
continuous-time object, it's a function of position and velocity, and it has something called a Bregman divergence in it and a few kinds of parameters around it; if you do standard regularization you get out a certain differential equation, and if you specialize these alphas, and betas, and gammas to particular choices, you get specific dynamical systems that are the ones, Nesterov's and a mirror descent one, and a cubic-regularized Newton, and all these continuous-time algorithms that've been studied over the years, all fall out of this one master equation.
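For reference, the master object from that line of work (the Bregman Lagrangian of Wibisono, Wilson, and Jordan) has roughly the form

```latex
\mathcal{L}(X, V, t) \;=\; e^{\alpha_t + \gamma_t}\Bigl(\, D_h\!\bigl(X + e^{-\alpha_t}V,\; X\bigr) \;-\; e^{\beta_t} f(X) \Bigr),
```

where D_h(y, x) = h(y) − h(x) − ⟨∇h(x), y − x⟩ is the Bregman divergence of a convex distance-generating function h, and the time-dependent parameters α_t, β_t, γ_t are the knobs being specialized above.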
Moreover, this master equation shows you that no matter what rate you
choose, you can do it in continuous time, but you will always follow the same path in phase space; so it actually has nothing to do with speed at all, it has to do with the path you follow, it's geometric: acceleration has to do with geometry, and not just with speed, all right. Moreover, if you ask it to go
too fast in a continuous time, that’s okay, you can
go as fast as you want, but that just means you
sort of change your clock, it's not really that important, all right. But if you're trying to go too fast in continuous time, there'll be a breaking point at which you can no longer discretize this differential equation; it's impossible mathematically. And that breaking point is where you've made a discovery: that there is an algorithm when you transition back to discrete time, or that there is something you cannot do in a certain regime. Okay, so that's for tomorrow. At the end of all this work we were able to develop some new algorithms,
because now we have this Lagrangian, we turn
that into a Hamiltonian, we use something called
a symplectic integrator, which is a smart way to
integrate differential equations that’s very stable. And now we just put all
that into the computer, and it’s able to optimize
using a symplectic integrator, 'cause I was talking a little bit about derivatives versus integration. Here is using integration
in the optimization setting, and it’s just as good as Nesterov, but actually it’s even
better than Nesterov, ’cause if you turn up the
step size, if you see, we moved over to the
left, we’re going faster, Nesterov flies off
unstable, whereas this new integrator stays stable, okay. So it's a really good way to get downhill.
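To make "symplectic integrator" concrete, here is the simplest member of that family, the leapfrog (Störmer-Verlet) scheme, applied to the Hamiltonian H(x, p) = |p|²/2 + f(x). This is a generic sketch of the integration idea, not the specific discretization from the paper; for optimization one also adds dissipation to this conservative dynamics, but the integration step is the same idea.

```python
import numpy as np

def leapfrog(grad_f, x, p, step, n_steps):
    """Stoermer-Verlet (leapfrog) integration of the Hamiltonian system
        dx/dt = p,   dp/dt = -grad f(x),
    i.e. H(x, p) = |p|^2 / 2 + f(x). The half-step / full-step / half-step
    splitting is what makes the scheme symplectic and hence very stable.
    """
    x, p = np.asarray(x, float), np.asarray(p, float)
    for _ in range(n_steps):
        p = p - 0.5 * step * grad_f(x)   # half step in momentum
        x = x + step * p                 # full step in position
        p = p - 0.5 * step * grad_f(x)   # half step in momentum
    return x, p

# Example: quadratic bowl f(x) = |x|^2 / 2. The energy stays nearly constant
# over long horizons even at a fairly large step size, which is exactly the
# stability being praised above.
grad_f = lambda x: x
x, p = leapfrog(grad_f, x=[1.0, 0.0], p=[0.0, 1.0], step=0.1, n_steps=1000)
print(x, p)
```
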
Okay, so, again, in tomorrow's talk, I'll talk a little bit more about the
consequences of all this. Now we have a little bit of,
if we're gonna do continuous time, we'll get some insights; we also know a little bit about how to deal with nonconvex
geometry, saddle points. What if we put the two
together, for example? So what if you’re flying down the hill, but it’s not convex, there’s
some saddle points down there. Is it good to be having acceleration? There's been two different intuitions, and this has been an open problem. Some people say, well, when flying down the hill, I hit the saddle point, I'll just roll back up the side and that'll slow me down. Others say, no, acceleration allows you somehow to blow past the saddle point; it's been intuition. So, anyway, we worked on
that with these two tools, we used our continuous time Hamiltonian or symplectic framework,
and we used this, you know, nonconvex geometry plus probabilistic coupling idea, and again, this is Chi Jin who led this. And at the end of the day, we were able to get, again, a very strong result; there is our result at the very top, and I don't wanna get into the details, but the rate went from
one over epsilon squared to one over epsilon to the seven-fourths; that's a better rate, that's faster, okay. So this is a proof that acceleration helps you in a nonconvex setting: you fly past the saddle points more quickly with acceleration. So these tools allow us to get at results like that, okay.
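To make the rates concrete, as I understand the result being described: the goal is an \(\epsilon\)-second-order stationary point, i.e. a point where

\[
\|\nabla f(x)\| \le \epsilon
\qquad \text{and} \qquad
\lambda_{\min}\big(\nabla^2 f(x)\big) \ge -\sqrt{\rho\,\epsilon},
\]

with \(\rho\) the Hessian Lipschitz constant, and the iteration count improves from roughly \(\tilde{O}(1/\epsilon^{2})\) for plain gradient descent to \(\tilde{O}(1/\epsilon^{7/4})\) with acceleration.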
The next step we've been taking is to put this into the domain of stochastics. So here is a question: if we don't do just gradient descent, but stochastic gradient descent, or a diffusion, a Brownian-motion-driven dynamical system, and we're trying to get downhill quickly, is there an optimal way to diffuse, okay? So if you ever learned about
diffusions or Brownian motion, it was probably in physics or in finance, and in both cases it's really used as a model of some phenomenon, of how things move, all right. But as an engineer, we're often interested in the other direction: I want it to move in a certain way, I want it to go fast down the hill, all right. And people in statistics know this from Markov chain Monte Carlo: you design an algorithm which should diffuse and get to an answer, and they'd love it to go fast, but they don't have the mathematical tools for that. They sort of talk about mixing times, but can never really get their hands on that, so it's kind of been an unsatisfying thing. Optimization theory tells
you how to get down fast, and we’ve even found there’s
optimal ways to optimize, that's what this Bregman Lagrangian is telling you, there's an optimal way to optimize. Is there an optimal way to diffuse? So that's a brand new
class of problems, okay, I'm just gonna give you
a couple of results, but for some of the young
students in the audience, this is decades of work,
this is gonna be really interesting and
challenging, and, you know, for a young Kolmogorov
who’s around, you know, he would probably, he
or she would probably start working on this, ’cause it’s really very very pregnant with possibility. So anyway, what you do here
is you study things like Langevin Markov chain Monte Carlo. It's just gradient descent (I'm not using f anymore, I'm using U), but I have a gradient and I add some Brownian motion to give me the stochasticity.
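For reference, a standard way to write down what he's describing: the overdamped Langevin diffusion and its Euler discretization, the unadjusted Langevin algorithm,

\[
dX_t = -\nabla U(X_t)\,dt + \sqrt{2}\,dB_t,
\qquad
x_{k+1} = x_k - \eta\,\nabla U(x_k) + \sqrt{2\eta}\,\xi_k,
\quad \xi_k \sim \mathcal{N}(0, I),
\]

where \(\eta\) is the step size and \(B_t\) is Brownian motion.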
So this is a kind of classical thing to study. It turns out it has a rate,
and that rate has been analyzed; here is the rate, in green. So it's one over epsilon squared, which is kind of surprisingly fast given all the stochasticity, but it has a d in it, and it's not logarithmic in dimension, it's just d, and that's kinda bad; that's the cost of stochasticity in all these dimensions. And that's a recent result,
this is a very important paper by Durmus and Moulines, but they studied this stochastic
differential equation there, it’s just gradient descent
plus noise, all right. The work I've been talking
about has these two gradients, and it has kind of two equations to it, it’s oscillatory, it’s more momentum. What if we put momentum into
the stochastic framework, will that help? And again,
this has kind of been open; people hadn't really known how to do that, or at least how to analyze it. Well, here's how you do it: you just write down two equations, not just one, and you put the gradient and the Brownian motion in the velocity term, and you integrate the velocity to get the position, so it's more of a second-order dynamics. All right, so now, can you analyze this stochastic differential equation?
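Here is a minimal Euler-Maruyama sketch of that second-order dynamics in Python. The analysis being described uses a more careful discretization, so this is illustrative only; the friction parameter gamma and the Gaussian test potential are my assumptions.

```python
import numpy as np

def underdamped_langevin(grad_U, x0, gamma=2.0, step=0.01, n_steps=20_000, seed=0):
    """Crude Euler-Maruyama simulation of
        dV = -gamma * V dt - grad_U(X) dt + sqrt(2 * gamma) dB
        dX = V dt
    i.e., the gradient and the Brownian motion enter the velocity
    equation, and the velocity is integrated to get the position."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    v = np.zeros_like(x)
    samples = np.empty((n_steps, x.size))
    for k in range(n_steps):
        noise = rng.standard_normal(x.shape)
        v += (-gamma * v - grad_U(x)) * step + np.sqrt(2 * gamma * step) * noise
        x += v * step
        samples[k] = x
    return samples

# Example: U(x) = ||x||^2 / 2 has grad_U(x) = x; after burn-in the
# samples should look roughly like draws from a standard Gaussian.
s = underdamped_langevin(lambda x: x, x0=[5.0], n_steps=50_000)
print(s[10_000:].mean(), s[10_000:].std())  # near 0 and 1
```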
And that's, again, kind of the fun mathematics here. Yes, you can use some of the same coupling tools we were talking about, a reflection coupling instead of a classical coupling, and use Itô calculus instead of just regular calculus. But it's nothing particularly hard. And after we did all that analysis, we got a rate, which was not one over epsilon squared, it was actually one over epsilon. So much faster for this thing. And the dimension dependence went from d to square root of d, which is even more impressive. So this algorithm is way better than classical Langevin, which is better than the classical MCMC algorithms used in sampling. Okay, so we're really
starting to get closer to better algorithms for MCMC,
and they are nonreversible, and they are based on second
order accelerated dynamics, and the inspiration came from
optimization theory, okay. I’m gonna skip this, we’ve
got five minutes left, just, again, if you come
tomorrow, you’ll see more about a comparison of optimization sampling, and also a relationship between the notion of how fast can you sample. Let me tell you about
one more little thing, which is go back to this
market design issue, so I hope I convinced you earlier, that thinking about markets
is a nice way to think about lots of emerging IT problems, but, again, what is the
algorithmic, mathematical challenge? Well, market design is a field of its own: mechanism design, market design. To form a market, you have to run some kind of algorithm that moves you in some parameter space, and usually you're finding equilibria, like a Nash equilibrium. That's where, you know, he goes down and I go up, and so both of us are as happy as we can be.
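Concretely, in the two-player zero-sum case \(\min_x \max_y f(x, y)\), a Nash equilibrium is a point \((x^*, y^*)\) with

\[
f(x^*, y) \;\le\; f(x^*, y^*) \;\le\; f(x, y^*) \qquad \text{for all } x, y,
\]

that is, neither player can do better by moving alone.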
So, we know we've become experts on gradient algorithms in high dimensions. What if you run gradient algorithms not to find the bottoms of hills, but to find saddle points? And this is a classical field of study; the classical algorithms
do one step of going down and then one step of trying to go up, and so they try to find a saddle. And that provably doesn't work, you can oscillate; it's a known failure, all right. So we've been working on this now; one problem we've worked on is how you find saddle
points in high dimensions. And we wanna find these Nash equilibria, and this is actually different from just any saddle point. So a Nash equilibrium is a saddle point, but it's axis-aligned: his axis is one axis, my axis is the other axis, I wanna be going down, he wants to be going up. If I take that same saddle point and tilt it, and put it out there somewhere,
it’s still a saddle point, but it’s not a Nash equilibrium, okay. He’s gonna make progress on
his axis, I’ll make progress, we’ll move off, we would
like to move off of that. But our gradient-based algorithms
don’t know the difference, all right, and so classical
algorithms for this in econ aren’t gradient-based
and are way more complicated, they don’t scale. All right, so long story short, with Eric, I should’ve introduced the students here. Eric has been working on this with me, and then my two other
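A tiny Python illustration of that known failure, on the bilinear toy problem f(x, y) = x * y, whose Nash equilibrium is the origin; the example is mine, not from the talk:

```python
import math

# f(x, y) = x * y: player one descends in x, player two ascends in y.
# Simultaneous gradient descent-ascent spirals away from the Nash
# equilibrium at the origin instead of converging to it.
eta = 0.1
x, y = 1.0, 1.0
for _ in range(100):
    gx, gy = y, x                        # grad_x f = y, grad_y f = x
    x, y = x - eta * gx, y + eta * gy    # one step down, one step up

# Each step multiplies the distance from the origin by sqrt(1 + eta**2).
print(math.hypot(x, y))  # about (1 + eta**2) ** 50 ~ 1.6x the start
```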
All right, so long story short, this is work with Eric; I should've introduced the students here. Eric has been working on this with me, and my two other students, Lydia and Horia, have been working on the other problem I wanna briefly mention, which is competitive bandits and two-way markets. So bandits are a beautiful
way to think about decision-making in statistics and machine learning. I've got k options, I don't
know which one of the options is the best one, I try
all of them a little bit, and I start to figure out
which one looks the best, and I start to pull
that option more often, pick that more often, right. But I also have uncertainty,
and so I have to make sure I cover the things I’m uncertain about. And so there’s algorithms like
UCB that do this pretty well. But it hasn't usually been done in an economic context with other decision-makers, okay. So what if both me and
Mung are doing this? We are both trying to find
the best option for us, and the other side, the other
side of the two-way market, there are merchants over there, and they may have preferences among us, we don’t know those preferences. So we start pulling these
arms, and I start to see that I’m liking arm one, but
I realize he’s liking one arm, and I starting to realize
that merchant over there prefers him to me. So I start to hedge my bets
and look at the other arms. So there should be a regret bound that reflects the extra exploration needed in the competitive situation. That's what Lydia and Horia are working on; we don't have a paper yet on that, it's very active at this moment, and we do have a paper with Eric. And let me just show you Eric's result. This picture is my last one, then I'll finish; let me parse it just very quickly. There are three green crosses there; in this particular problem
those are Nash equilibria, you see there’s saddles, and they’re actually Nash equilibria. You’d like an algorithm to go find those. There’s a blue one there
which is a saddle point, but it’s not a Nash equilibrium,
it’s tilted, all right. We ran kind of a bunch
of different algorithms that are gradient based on this, and example was the black one. If it starts at those
red points, it’ll go down and find the Nash equilibria,
but it will also go to that blue point, it doesn’t
know the difference, alright. Our new algorithm which
is kind of gradient plus a little bit more,
is the red curves there. It goes towards the bad equilibrium and then it moves away, it
moves to a Nash equilibrium. All right, so, first of
all, it’s interesting to analyze this, we’re
still not done with that, in particular, what’s the convergence rate of this algorithm, ’cause
you’re paying an extra cost that you went toward something
bad, you had to sense that it was bad and move
away, so it took you longer. But you have to measure
that somehow, okay. Okay, so that’s all I wanna say, let me just have a few concluding remarks, it was kind of a whirlwind tour through a bunch of different ideas. Again, this slide I already had earlier, just sort of say it more slowly, the computers are currently
gathering huge amounts of data for and about humans, to be
fed into learning algorithms. And often the goal has
been to use all this to imitate humans, to try to
make computers smart like us, and again, I've already argued against that goal,
really what’s happening, and I don’t think it’s
the most interesting thing to be doing either, okay. It leads you down the
road where the whole point of the computer is to learn about people, to understand them, and provide services to them. And it's a little bit just
implausible that you can do that even with five people, but
think about 500 million. Are you really gonna
understand 500 million people from their browsing patterns? No, all right. So we wanna provide this
in the context of a market, and so when data flows,
it’s not just to be used for learning algorithms,
it’s used to create value, to create markets. And if you are IT person
or if you are entrepreneur, which I hope some of
you are in the audience, I hope you resonate to
my message which is now, if you think about it this way,
you don’t have to make money off of advertising, which
is where Google and Facebook have all gotten stuck. That's why they're having so much trouble doing the right thing. Whereas if you say, no,
my goal is to create connections between producer and consumer. How can I do that? You’ve created a market,
that probably is gonna be a healthier thing for
humanity, just overall. Okay, so this slide I’ve been
using for about 10 years, but let me just have
it up there at the end. This field is coming of
age, but it’s really not, it’ll be quite a while
until we have really what I would call an
engineering discipline. We have just people
building things out there, sometimes trying to do the right thing and build good services for people, sometimes just trying to
make money, sometimes both. But what we really need is this engineering discipline, where we start to think about: what is the problem, how do we assemble all the pieces, how do we break out of our classical boundaries of, you know, CS versus stat versus EE and all that, how do we see that it's all kind of one problem here, and how do we educate our new workforce to solve problems in this way. Thank you very much. (audience applauding) – So before we go to the Q&A session, let me just make two comments. One, Michael mentioned that
he will be giving a talk tomorrow morning, that’s for
the Data Science Foundation. I am not a person organizing that, but if you’re interested,
just go to Google and type Data Science Foundation workshop Purdue, and you can find the information. I believe it’s free for
undergraduate students and then for grads, I think it’s 10 bucks, just tell your adviser to pay that. (audience chuckling) So, yeah, so just do it.
– It’s a market. That’s pricing
– It’s marketing, okay. So that’s number one. Number two is that I got a
question from our colleagues, and I feel that I have the
burden to ask you this question, since we're at the end, and
we are big basketball lovers. And what do you think about Larry Bird? Larry Bird? – Do I know what that is? – Ah, so Larry is a pretty
big basketball player at that time–
– Oh, Larry Bird, I’m sorry, never mind, I was in–
– Michael Jordan. – I was in Boston at MIT
as a young professor, so I definitely know who Larry Bird was. – [Stanley] What do you think about it? – They’re both great. – [Stanley] Okay, so Q&A time. Any question from the floor? – Yeah? – [Student] Thank you very much. – [Stanley] The mike is over there. – [Student] So, thank you very much for the fantastic presentation. I also agree that we
are very far away from actually smart computers and stuff. My question is more about your criteria on what is missing to achieve that: what is the missing link that could allow us to get to a machine that could learn, or be conscious? Thank you. – Yeah, great question. So, I'm gonna kind
of say I don’t know, but so one way to answer is
that work on problems where it seems that you
really need some new, some more abstraction, you
need some more semantics. Semantics is, you know, the
fact that I’m talking to you is a semantic relationship
among us, right, and there are semantic networks out there, there’s kind of logical expressions, if you work in the area of
natural language processing, you have lots of data and
you try to make predictions, like what word comes next,
and you do things that neural nets can do pretty well, or translate strings to strings, but going down to a
semantic representation of understanding what's being said, and then reasoning about that, they're really not doing at all, okay. But if you're serious about that field, you try to build in that kind of thing, you try to engineer it: a semantic network, or what's often called an ontology in industry. So a lot of industries
now may have, you know, an ontology with 200,000 nodes in it, a graph of, you know, this person is a friend of this person, this person is married to this person, so on and so forth, or product relationships. And you try to bring those together with the machine learning sort of stuff, and it's a big engineering thing, all right. You can start to get
systems that can answer some simple questions or have
some very very simple dialogs. I’d say in 10 years
you’ll have things that do pretty good question and
answer kind of stuff, and even some very simple
dialogs in their own domains, and then they’ll kind of
break as soon as you get into bigger collections of people and all that. And by the end of our lifetimes,
maybe there will be some, like online, you know,
find a flight to Paris, and you can really
interact with the computer and have a real dialog about that, but it’s gonna be very
slow engineering progress, kinda like going to the moon,
you know, that level of big engineering effort that’s gonna be needed. Now, somewhere along that
maybe some magic will happen, and there'll be a deeper understanding of what kind of abstractions we're talking about here: is it like logical forms,
or is there some other way to think about the representations, how come humans are so fluent at this, and so I think working on those problems is probably the best way to discover that, I don’t think looking at the
brain or the mind is, sadly, I mean, it’s just too complicated. But I think trying to build those systems will probably help. I’m not so sure there will be
magic, ’cause in this notion of being able to abstract
and have intelligence, like right now we’re communicating
at a very high level, and our computers would be left in the dust, right. They are still kind of down at the pixel level, or the edges, and we're up at this kind of,
you know, very abstract level. Any word in any language
is very very rich. Think about the word Not in English, think about what Not means, all right: not today, not tomorrow, not you, not this, not that. Every one of those versions of Not has a subtly different semantics, okay, every one of them, and it depends on the context; the semantics can shift. We all know all that, we don't
even think about it, right. A computer has to learn
all that, but not just from looking at strings of data,
it’s gotta learn the context in which that sentence was uttered, so that you understand the
semantics of that, okay. Somehow, I don’t know what the
magic is to get there, okay. Now, so what if we found that
magic, how great would it be? Well, I'm not so sure. It would probably change lives a lot, but, you know, we'd just have a new human there that happens to be artificial. That would be exciting to some; I'm not so sure, we have so many humans, why do we need, you know, another one?
(audience laughing) Really, I want more of these services that make human life better. And I think human life is a bit messed up right now in some ways, and I want them to be better. So I’m emphasizing these markets, ’cause I do see more intelligence
there of a different kind, that’s not about the human, that allows us to build better
things, and better systems, and more believable, trustable
things, and so, anyway. Yes? – [Student] Your online markets example, one question, one concern about this is you talk about what the consumer wants – Yeah. – [Student] What the
restaurant, for example, wants. – Yeah. – [Student] But often the, you
know, what the market maker in this sense wants, these
are, like in the case of Uber and these examples, often kind of natural monopolies; these are things where the more of a monopoly you are, the better it works. Your goal is to accomplish that monopoly. And so if you think, for example, about
restaurant recommendations, my goal might be very well
served by having everybody show up at a restaurant,
not be able to get in, and they think, wow, this is
a great recommendation engine, it sent me to the restaurant
that everybody loves. – I’d push back against
that, I mean, and there’s, certainly, markets do not
solve it, quote unquote, and in fact you need regulated markets, and part of the whole story, what regulations are
appropriate for these markets. But in your example there,
if people are showing up and not getting served,
the utility is to eat well, and if no one’s eating
well, that’s broken, no one’s gonna play in
that market anymore, they're gonna move to another market where they can eat well. All right. And there are some natural monopolies, you know; if you do microeconomics, you learn there's kind of reasons for them, but it's not the typical market phenomenon to be a natural monopoly. And you can kinda break
them by doing things like, you know, loyalty programs. Why do we have so many airlines still? Why isn’t there just one airline? Well, you know, I have
my points on United, I’m not gonna go fly Delta,
you know, it sounds stupid, but really, that’s really important that there’s a little loyalty between a producer and a consumer. And that leads to
breaking apart monopolies. And so, I'm not a microeconomics person, but as usual in my academic life, I like being ignorant about a whole field whose way of thinking feels real; it doesn't solve all the problems, it brings a whole bunch of new ones, and that's cool, I like that. And even in the ad world
which I was bashing a lot, I know when people started to do online ad markets, they couldn't just use classical Vickrey auctions or whatever from market design; that didn't work, they had to develop some new ones. Same thing here, all right. But I'm gonna push back
against people that say, no, we know, markets don’t work, see all that unhappiness in
the world because of markets, that’s not what you’re saying, but there will be people saying that. And no, 3,000 years of human
development from, you know, the sticks: markets are the number one reason why it's happened, right. The ability of people to come in and trade, and economic prosperity follows from that. So there's something very robust and very healthy about
that, suitably regulated, and suitably transparent
with trust mechanisms, it is a path out of our current state that I want us to exploit better. – [Stanley] Okay, so we have limited time. Can we take one more question? – [Student] Thanks for your
lecture, Professor Jordan. I have a question: nowadays there are some new methods, like curiosity schemes or gradient-based methods. – What kind was that? – [Student] Which is curiosity – Curiosity. – [Student] Curiosity schemes, which are related to decision-making; do you think this field can be combined with nonconvex or convex optimization? – You kind of zoomed in on a
particular algorithm there, and let me just say that
there is a lot of innovative thinking going on in the neural network world, where people are trying out stuff. A lot of it is reinvention of things, and a lot of it is people just narrowing down to this one thing. So curiosity, well, what
does that really mean? That’s a kind of a metaphor. For me it probably means,
you have some uncertainty, and you gonna sample in places where you’re little more uncertain, and you gonna favor that. Well, as you may know,
there’s a whole area of optimal experimental design, there’s a whole world of causal analysis, and then there’s the added literature, just what I was talking
about a minute ago. I don’t know which of
the k arms is the best; it's not supervised learning. So what if I pick each one of them 10 times, see which one is the highest, and pick the highest? That's provably a dumb algorithm, all right. A better algorithm is to have error bars around each one of the means that I get, and I pick the one whose upper error bar is highest, okay, 'cause now if it's really high because it's good, then I'm gonna pick it, but if it's really high because I'm uncertain, I'll pick it too. That's curiosity in a very clean mathematical way, and there's a lot of theory there.
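Here's a minimal UCB1-style sketch in Python of exactly that idea, picking the arm whose mean-plus-error-bar is highest; the Bernoulli arms are a toy assumption of mine:

```python
import math, random

def ucb1(pull, k, horizon):
    """Play k arms for `horizon` rounds, choosing the arm with the
    highest upper confidence bound: mean + sqrt(2 ln t / n)."""
    counts = [0] * k
    means = [0.0] * k
    for t in range(1, horizon + 1):
        if t <= k:
            arm = t - 1  # play every arm once to initialize
        else:
            arm = max(range(k),
                      key=lambda a: means[a] + math.sqrt(2 * math.log(t) / counts[a]))
        reward = pull(arm)
        counts[arm] += 1
        means[arm] += (reward - means[arm]) / counts[arm]  # running mean
    return means, counts

# Toy Bernoulli arms with success probabilities 0.3, 0.5, 0.7:
probs = [0.3, 0.5, 0.7]
means, counts = ucb1(lambda a: float(random.random() < probs[a]), k=3, horizon=5000)
print(counts)  # the 0.7 arm should get the bulk of the pulls
```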
diminish people’s, you know, cleverness of thinking of new terminology and stuff like that, but you have, you know thinking of mechanisms like that in a world of neural nets,
you haven’t gone outside of the whole scope of the area that many people been been working on, and especially for the younger people, don’t just focus on neral
nets, you know, again, I love them, it’s been great
progress, it’s been fun to see, bu there’s this whole
broader control theories, statistics, et cetera, optimization, then if you’re a young person, you should be educating yourself in all of that, and then
be creative on top of that. – Thanks.
– Thank you. – Can I just follow up with one question real quickly, because there are a lot
of students in this room, and what would be your
advice to the students, if they want to do machine learning?
what should I advise students. So, one thing I didn’t
talk about today is that we have a data science
program at Berkeley, and we have actually
kind of a new division and a college emerging,
and it’s been a struggle with all the deans fighting it
and everything, just to say. Here you have a dean who’s
not trying to fight it, you’re lucky. But one of the things we’ve done bottom up without any deans helping us
with whatever we needed help with, is we designed a bunch of classes that are for undergrads, and sort of the first class is
called Data 8 at Berkeley. I was on the team that designed it, and I'm now designing the follow-up class, and we're pretty proud of it. It is a class for freshmen, and you assume they know no math beyond arithmetic, and you assume they know maybe no computer programming, all right. So you're gonna teach them Python, but you're gonna teach just enough Python to do something interesting statistically. So, for example, I talked about the A/B-testing permutation test, where I've got two columns of numbers, and if they're really the same,
I can put them together, I can permute them, and I'm still in the same null distribution, and I can get a p-value, blah, blah, blah.
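Here is a minimal sketch of that permutation test in Python, roughly at the Data 8 level; the difference-of-means statistic and the number of shuffles are choices of mine, not something he specifies:

```python
import numpy as np

def permutation_test(a, b, num_shuffles=10_000, seed=0):
    """Two-sample permutation test: under the null the labels don't
    matter, so shuffling the pooled data gives the null distribution
    of the difference of means."""
    rng = np.random.default_rng(seed)
    observed = abs(a.mean() - b.mean())
    pooled = np.concatenate([a, b])
    hits = 0
    for _ in range(num_shuffles):
        rng.shuffle(pooled)
        diff = abs(pooled[:len(a)].mean() - pooled[len(a):].mean())
        hits += diff >= observed
    return (hits + 1) / (num_shuffles + 1)  # p-value, add-one smoothed

a = np.array([2.1, 2.5, 2.8, 3.0])
b = np.array([1.2, 1.9, 2.0, 1.4])
print(permutation_test(a, b))  # small p-value: the columns look different
```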
I can describe that to you, and you would understand it in about two minutes with
no math, no Greek symbols, no nothing, and you get the beauty of it, I think, and students do. Then you could say, how
do you do that in Python, well I need a list of some
kind, you could teach them enough Python to do that,
and then you could ask a really interesting conceptual
computational question, which is how do you do
a random permutation. How do you do that? Right, I get n items in a list, I wanna permute them and get a uniformly random permutation; I'll leave you to think about that. The naive thing you'll think of is kind of swapping all pairs; that gives you a permutation, but it costs n squared, and that's not good in the modern world. Is there a faster algorithm? I can tell you, the
answer is gonna be yes. But we make students think about it, and about half of them kind of figure it out, right.
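For readers who don't want to wait: the standard linear-time answer the students converge on is, I believe, the Fisher–Yates shuffle, sketched here:

```python
import random

def fisher_yates(items):
    """Uniformly random permutation in O(n) time, in place: swap each
    position with a uniformly chosen earlier-or-equal position."""
    for i in range(len(items) - 1, 0, -1):
        j = random.randint(0, i)  # j uniform over {0, ..., i}
        items[i], items[j] = items[j], items[i]
    return items

print(fisher_yates(list(range(10))))
```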
Then the cool thing is that they put it in Python, and they can program, and then
we get some real-world data. So a typical example we use: here is the ethnic composition of juries in Alameda County, and here is the ethnic distribution of the population of Alameda County; those are two columns of numbers, are they the same or different? Of course they're a little bit different, but as a statistician, are they really different? And so students love that, they can use their tools to actually get a p-value for whether the juries are biased in Alameda County. And I can tell you, the juries are biased in Alameda County, and they quantify that, and
they can go onto all kinds of other problems, all right. So hopefully, it inspires you,
so we’re doing no math there, but you can see there’s like symmetries, there’s permutations,
there’s group theory somehow seen behind the scenes,
there’s a probability theory. And so then slowly over
the next three years we introduce a little bit
more math to support that. So what is the math,
well it’s probability, and it’s statistics,
it’s some optimization, it’s sort of some algorithms
and computer science, and some data structures,
but, you know, the kind of modern stuff that's most useful to us, and then I think some econ, but you can kinda craft your own thing. But anyway, it's our job as professors to actually interlace these things in a single class. You look at the classical
way of teaching Python, they’ll teach the same syntax we do, and they get to an example,
it won’t be some statistics of A/B testing problem, it’ll be how do you do Fibonacci series. Well, Fibonacci are fine,
my 12-year-old loves them, but I don’t use Fibonacci
series in my life, I never will, but do I use A/B testing, yeah, or I mean, Jeff Bezos uses
A/B testing all day long. So we need to teach those kinds of things; they're inferential. People talk about computational thinking taking over; well, it's a big part of it, but there's a whole other part about inferential thinking, of using algorithms to decide what's behind the data, not just to process the data but to figure out where the data came from. That's inferential. So you have to, if you are
going to be in this field, get both of those styles of thinking: one you get mostly from classical statistics, and one you get from computer science. But ideally, good universities will actually blend them, and they won't just lump it as, you have to take all these classes plus all these classes; each class will have a bit of a blend. Thanks for asking, yeah, it's great. (audience applauding)