Stanford CS224N: NLP with Deep Learning | Winter 2019 | Lecture 1 – Introduction and Word Vectors

November 29, 2019 · By Stanley Isaacs


Okay. Hello everyone. [LAUGHTER] Okay, we should get started. There actually are still quite a few seats left. If you want to be really bold, there are a couple of seats right in front of me in the front row; if you're less bold, a few over there. On some of the rows there are quite a few middle seats, so if people wanted to be really civic-minded, some people could squeeze towards the edges and make more accessible some of the seats that still exist in the classroom. Okay. So, it's really exciting and great to see so many people here, so a hearty welcome to CS224N, occasionally also known as Ling 284, which is Natural Language Processing with Deep Learning. As a personal anecdote, it still sort of blows my mind that so many people turn up to this class these days. For about the first decade that I taught NLP here, the number of people I got each year was approximately 45. [LAUGHTER] So that's an order of magnitude smaller than it is now, but I guess it says quite a lot about what a revolutionary impact artificial intelligence in general, and machine learning, deep learning, and NLP in particular, are starting to have in modern society. Okay. So this is our plan for today. We're really going to get straight down to business: there'll be a very brief introduction to some of the course logistics, a very brief discussion of human language and word meaning, and then we want to get right into the first thing we're doing, which is coming up with word vectors and looking at the word2vec algorithm, and that will fill up the rest of the class. There are still two seats right in the front row for someone who wants to sit right in front of me, just letting you know. [LAUGHTER] Okay. So here are the course logistics in brief. I'm Christopher Manning, and the person who bravely became the head TA is Abigail See, right there.
And then we have quite a lot of wonderful TAs. Could the people who are our wonderful TAs just stand up for one moment? [LAUGHTER] So now we have some sense of our wonderful TAs. [LAUGHTER] Okay, great. You know when the lecture is, because you made it here, and welcome also to SCPD people. This is also an SCPD class and you can watch it on video, but we'd love for Stanford students to turn up and show their beautiful faces in the classroom. The web page has all the info about the syllabus, et cetera. So what do we hope to teach in this class? One thing we want to teach is an understanding of effective modern methods for deep learning, starting off by reviewing some of the basics and then particularly talking about the kinds of techniques, including recurrent networks and attention, that are widely used for natural language processing models. A second thing we want to teach is a big-picture understanding of human languages and some of the difficulties in understanding and producing them. Of course, if you want to know a lot about human languages, there's a whole linguistics department and you can do a lot of courses there, but I want to give at least some appreciation so you have some clue about the challenges, difficulties, and varieties of human languages. And then this is also a practical class: we actually want to teach you how you can build practical systems that work for some of the major parts of NLP. So if you go and get a job at one of those tech firms and they say, "Hey, could you build us a named entity recognizer?" you can say, "Sure, I can do that." And so for a bunch of problems (obviously we can't do everything) we're going to do word meaning, dependency parsing, and machine translation, and you have the option of doing question answering, actually building systems for those.
If you've been talking to friends who did the class in the last couple of years, here are the differences for this year, just to get things straight. We've updated some of the content of the course, so between me and the guest lecturers there's new content on various topics that are developing areas. One of the problems with this course is that deep learning is still developing really, really quickly, so one-year-old content already seems kind of dated, and we're trying to update things. A big change we're making this year is that we're having five one-week assignments instead of three two-week assignments at the beginning of the course, and I'll say a bit more about that in a minute. This year we're going to use PyTorch instead of TensorFlow, and we'll talk about that more later too. We're having the assignments due before class on either Tuesday or Thursday, so you're not distracted and can come to class. We're trying to give an easier, gentler ramp-up, but on the other hand a fast ramp-up: this first assignment is sort of easy, but it's available right now and is due next Tuesday. And the final thing is that we're not having a midterm this year. Okay. So there are five of these assignments that I just mentioned: six percent for the first one, and twelve percent for each of the others. We're going to use Gradescope for grading; it will really help out the TAs if you could use your SUNet ID as your Gradescope account ID. Then for the second part of the course, people do a final project, and there are two choices for the final project.
You can either do our default final project, which is a good option for many people, or you can do a custom final project, and I'll talk about that more in a bit. Then at the end we have a final poster presentation session, at which your attendance is expected; we're going to be having that on a Wednesday evening, probably not quite five hours, but it will be within that window, and we'll work out the details in a bit. Three percent for participation; see the website for details. Six late days. Collaboration: like always in computer science classes, we want you to do your own work and not borrow stuff from other people's GitHubs, so we really do emphasize that you should read and pay attention to the collaboration policies. Okay. So here's the high-level plan for the problem sets. Homework one, available right now, is hopefully an easy on-ramp. It's an iPython notebook, just to help get everyone up to speed. Homework two is pure Python plus NumPy, and it will start to teach you more about how we do deep learning under the hood. If you're a bit rusty on Python or NumPy, or have never seen them, we're going to have an extra section on Friday, from 1:30 to 2:50 in Skilling Auditorium: a Python review. That's our only planned section at the moment; we're not going to have regular sections. So I encourage you to go to that, and it will also be recorded for SCPD and available on video as well. Then homework three will start us on using PyTorch, and for homeworks four and five we're going to be using PyTorch on GPU, and we're actually going to be using Microsoft Azure, with big thank-yous to the kind Microsoft Azure people who have sponsored our GPU computing for the last three years.
So basically all of modern deep learning has moved to using one or another of the large deep learning libraries, like PyTorch, TensorFlow, Chainer, or MXNet, et cetera, and then doing the computing on GPUs. Since we're in the NVIDIA building, of course we should be using GPUs [LAUGHTER], but in general, the parallelism and scalability of GPUs is what has powered most of modern deep learning. Okay, the final project. For the final project there are two things you can do. We have a default final project, which is essentially a final project in a box: building a question answering system over the SQuAD dataset. What you build and how you improve your performance is completely up to you; it's open-ended, but it has an easier start, a clearly defined objective, and we can have a leaderboard for how well things are working. So if you don't have a clear research objective, that can be a good choice for you. Or you can propose a custom final project, and assuming it's sensible, we will approve it, give you feedback, and assign someone as a mentor. Either way, for the final project only, we allow teams of one, two, or three. For the homeworks, you're expected to do them yourself, though of course you can chat to people in a general way about the problems. Okay. So that is the course, and we're not even behind schedule yet. The next section is human language and word meaning. If I were really going to tell you a lot about human language, that would take a lot of time, which I don't really have here, so I'm just going to tell you two anecdotes about human language. The first is this XKCD cartoon, which I actually really like.
It's not one of the classic ones that you see most often around the place, but I actually think it says a lot about language and is worth thinking about. A lot of the people who come to this class are CS people, EE people, and random others (there are also some linguists and so on around). For a lot of those people, you've spent your life looking at formal languages, and the impression is that human languages are somehow slightly broken formal languages, but there's really a lot more to it than that. Language is this amazing human-created system that is used for all sorts of purposes and is adaptable to all sorts of purposes. You can do everything from describing mathematics in human language to nuzzling up to your best friend and getting them to understand you better. So there's something amazing about human language. Anyway, I'll just read it. The dark-haired person says, "Anyway, I could care less." And her friend says, "I think you mean you couldn't care less. Saying you could care less implies you care at least some amount." And the dark-haired person says, "I don't know. We're these unbelievably complicated brains drifting through a void, trying in vain to connect with one another by blindly flinging words out into the darkness. Every choice of phrasing and spelling and tone and timing carries countless signals and contexts and subtexts and more, and every listener interprets those signals in their own way. Language isn't a formal system; language is glorious chaos. You can never know for sure what any words will mean to anyone. All you can do is try to get better at guessing how your words affect people, so you can have a chance of finding the ones that will make them feel something like what you want them to feel. Everything else is pointless.
I assume you're giving me tips on how you interpret words because you want me to feel less alone. If so, thank you. That means a lot. But if you're just running my sentences past some mental checklist so you can show off how well you know it, then I could care less." [NOISE] So I think this has some nice messages about how language is this uncertain, evolved system of communication, but somehow we have enough agreed meaning that we can pretty much communicate. We're doing some kind of probabilistic inference, guessing what people mean, and we're using language not just for its informational functions but for its social functions, et cetera. Okay. And then here's my one other thought about language. Essentially, if we want to have artificial intelligence that's intelligent, we need to somehow get to the point of having computers that have the knowledge of human beings, right? Because human beings have knowledge that gives them intelligence. And if you think about how we convey knowledge around our human world, mainly the way we do it is through human language. Some kinds of knowledge you can work out for yourself by doing physical stuff: I can hold this and drop that, and I've learnt something. But most of the knowledge in your heads, and the reason you're sitting in this classroom, has come from people communicating to you in human language. One of the most famous deep learning people, Yann LeCun, likes to say that there's really not much difference between the intelligence of a human being and an orangutan, and I actually think he's really wrong about that. The sense in which he means it is that an orangutan has a really good vision system.
Orangutans have very good control of their arms, just like human beings, for picking things up. Orangutans can use tools, and orangutans can make plans, so that if you put the food somewhere where they have to move a plank to get to the island with the food, they can make a plan like that. So yes, in a sense they've got a fair bit of intelligence, but orangutans just aren't like human beings. And why aren't they like human beings? I'd like to suggest that the reason is what human beings have achieved: we don't just each have one computer, like a dusty old IBM PC in your mother's garage; what we have is a human computer network, and the way we've achieved that network is that we use human languages as our networking language. When you think about it, on any kind of evolutionary scale, language is super, super recent. Creatures have had vision for, people don't quite know, maybe 75 million years or maybe longer: a huge length of time. How long have human beings had language? People don't know that either, because it turns out that when you have fossils, you can't knock the skull on the side and ask whether it had language. But most people estimate that language is a very recent development, from before modern humans moved out of Africa, so many people think we've only had language for something like 100,000 years. That's a blink of an eye on the evolutionary timescale. But it was the development of language that made human beings invincible. [NOISE] It wasn't that human beings developed poison fangs, or the ability to run faster than any other creature, or a big horn on their heads.
Humans are basically pretty puny, but they had this unbeatable advantage: they could communicate with each other and therefore work much more effectively in teams, and that basically made human beings invincible. But even then, humans were kind of limited. That got you to about the Stone Age, where you could bang on your stones and, with the right kind of stone, make something sharp to cut with. What got humans beyond that was that they invented writing. Writing was an ability by which you could take knowledge and communicate it not only mouth to mouth to people you saw; you could put it down on your piece of papyrus or your clay tablet, or whatever it was at first, and that knowledge could then be sent places. It could be sent spatially around the world, and it could be sent temporally through time. And how old is writing? We basically know about how old writing is: about 5,000 years. That's incredibly recent on the scale of evolution, but writing was so powerful as a way of sharing knowledge that in those 5,000 years it enabled human beings to go from a Stone Age sharp piece of flint to having iPhones and all of these incredibly sophisticated devices. So language is a pretty special thing, I'd like to suggest. Going back to my analogy, it has allowed humans to construct a networked computer that is way, way more powerful than just having individual creatures that are intelligent like an orangutan. But compared to our computer networks, it's a really funny kind of network. These days we have networks with large bandwidth, right?
We might be frustrated sometimes with our Netflix downloads, but by and large we can download hundreds of megabytes really easily and quickly, and we don't think that's fast enough, so we're going to be rolling out 5G networks that are an order of magnitude faster again. By comparison, human language is a pathetically slow network: the amount of information you can convey through it is very small. Whatever it is, I speak at about 15 words a second, and you can start doing your information theory if you know some, but you don't actually get much bandwidth at all. So how does it work, then? Humans have come up with an incredibly impressive system which is essentially a form of compression, a very adaptive form of compression. When we're talking to people, we assume that they have an enormous amount of knowledge in their heads, which isn't the same as mine but is broadly similar to it: you know what English words mean, and you know a lot about how the world works. And therefore I can say a short message, communicating only a relatively short bit string, and you can actually understand a lot. I can say, "Imagine a busy shopping mall, and there are two guys standing in front of a makeup counter," and I've only said about 200 bits of information, but that's enabled you to construct a whole visual scene that would take megabytes to represent as an image. So that's why language is good. From that more lofty level, I'll now move back to the concrete stuff. What we want to do in this class is not solve the whole of language, but to represent the meaning of words.
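The "information theory" aside can be made concrete with a back-of-envelope calculation. This little sketch assumes, purely for illustration, a 50,000-word vocabulary with every word equally likely (real word frequencies are far from uniform, so this overestimates the true rate), and uses the speaking rate quoted in the lecture:

```python
import math

# Crude assumption: each word drawn uniformly from a 50,000-word vocabulary,
# so each word carries log2(50000) ~= 15.6 bits of information.
bits_per_word = math.log2(50_000)
words_per_second = 15  # the rate quoted in the lecture
rate = bits_per_word * words_per_second
print(f"~{rate:.0f} bits/second")  # a few hundred bits/s, vs. megabits/s for networks
```

Even with these generous assumptions, speech comes out at a few hundred bits per second, orders of magnitude below a network link, which is the point of the compression analogy.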
A lot of language is bound up in words and their meanings, and words can have really rich meanings. As soon as you say a word like "teacher", that carries quite a lot of rich meaning, and you can have actions with rich meaning too: if I say a word like "prognosticate" or "total", these words have rich meanings and a lot of nuance to them. So we want to represent meaning, and the question is: what is meaning? Dictionaries are meant to tell you about meanings, so you can look it up, and Webster tries to relate meaning to "idea": the idea that is represented by a word or a phrase; the idea that a person wants to express by words, signs, et cetera. You could think these definitions are kind of a cop-out, because it seems like they're rewriting "meaning" in terms of the word "idea", and has that really gotten you anywhere? How do linguists think about meaning? The most common way linguists have thought about meaning is an idea called denotational semantics, which is also used in programming languages. The idea is that we think of meaning as what things represent. So if I say the word "chair", the denotation of the word chair includes this one here, and that one, and that one: the word chair represents the set of all things that are chairs. You can then think of something like "running" as well: there's a set of actions that people can partake in, and that is its denotation. That's what you most commonly see in philosophy or linguistics as denotation, but it's kind of a hard thing to get your hands on computationally. So what people most commonly did, or I guess I should now say used to do, for working out the meaning of words on a computer was to turn to something that was a bit like a dictionary.
In particular, the favorite online resource was an online thesaurus called WordNet, which tells you about word meanings and the relationships between word meanings. This gives you just the slightest sense of what's in WordNet. This is an actual bit of Python code which you can type into your computer and run for yourself. It uses a package called NLTK. NLTK is sort of like the Swiss Army knife of NLP, meaning that it's not terribly good for anything, but it has a lot of basic tools; if you want to do something like just get some stuff out of WordNet and show it, it's the perfect thing to use. So from NLTK I'm importing WordNet, and then I can say, "For the word good, tell me about the synonym sets that good participates in." And there's good/goodness as a noun; there's an adjective good; there's one for estimable, good, honorable, respectable. This looks really complex and hard to understand, but the idea is that WordNet makes very fine-grained distinctions between the senses of a word. So for good, there are some senses where it's a noun, as in "I bought some goods for my trip". Then there are adjective senses, and it's trying to distinguish a basic adjective sense of good, and then extended senses of good in different directions: good in the sense of beneficial, or a person who is respectable, as in "He's a good man". But part of what makes WordNet very problematic to use in practice is that it tries to make all these very fine-grained distinctions between senses whose differences a human being can barely understand or relate to.
You can do other things with WordNet. With this bit of code you can walk up what is a kind of "is-a" hierarchy, kind of like a traditional database. So if I start with a panda and walk up, pandas are procyonids, which takes you to carnivores, placentals, mammals, blah, blah, blah. Okay, so that's the kind of stuff you can get out of WordNet. In practice, everyone used to use WordNet because it gave you some sense of the meaning of a word, but it's also well known that it never worked that well. The synonym sets miss a lot of nuance: one of the synonym sets for good has "proficient" in it, and good is sort of like proficient, but doesn't proficient have some extra connotations and nuance? I think it does. WordNet, like most hand-built resources, is very incomplete: as soon as you come to new meanings of words, or new words and slang words, it gives you nothing. It's built with human labor, in ways that make it hard to create and adapt. And in particular, for what we want to focus on, which seems like a basic thing you'd like to do with words, namely to understand similarities and relations between the meanings of words, it turns out that WordNet doesn't do that very well, because it just has these fixed, discrete synonym sets. If two words are near-synonyms but not exactly the same in meaning, they're not in the same synonym set, and you can't really measure the partial resemblance of meaning between them. So if good and marvelous aren't in the same synonym set, there's still something that they share in common that you'd like to represent. Okay.
So that leads us into wanting to do something different and better for word meaning. Before getting there, I want to build a little from traditional NLP. Traditional NLP, in the context of this course, means natural language processing up until approximately 2012. There were some earlier antecedents, but it was basically in 2013 that things really began to change, with people starting to use neural-net-style representations for natural language processing. Up until 2012, standardly, words were just words: we had hotel, conference, motel; they were words, and we'd have lexicons and put words into our model. In neural networks land this is referred to as a localist representation (I'll come back to those terms again next time), meaning that for any concept there's one particular place, which is the word hotel or the word motel. A way of thinking about that is to think about what happens when you build a machine learning model. If you have a categorical variable, like the choice of word, and you want to stick it into some kind of classifier in a machine learning model, you somehow have to encode that categorical variable, and the standard way is to code it by the different levels of the variable. That means you have a vector with one position for each word: this position is the word house, this is the word cat, this is the word dog, this is the word chairs, this is the word agreeable, this is the word hotel, and so on, and you put a one at the corresponding position. In neural-net land we call these one-hot vectors, and so these might be one-hot vectors for hotel and motel. There are a couple of things that are bad here. The one that's a practical nuisance is that languages have a lot of words.
One of those dictionaries you might have had in school probably has about 250,000 words in it, but if you start getting into more technical and scientific English, it's easy to get to a million words. Actually, the number of words in a language like English is in effect infinite, because we have processes called derivational morphology, where you can make more words by adding endings onto existing words. You can start with something like paternal, meaning fatherly, and then you can say paternalist, or paternalistic, or paternalism, or "I did it paternalistically". There are all these ways you can make bigger words by adding more stuff to them, and so you really end up with an infinite space of words. So that's a minor problem: we have very big vectors if we want to represent a sensible-sized vocabulary. But there's a much bigger problem, which is that precisely what we want to do all the time is understand relationships between the meanings of words. An obvious example of this is web search: if I do a search for Seattle motel, it would be useful if it also showed me results that had Seattle hotel on the page, and vice versa, because hotels and motels are pretty much the same thing. But these one-hot vectors have no similarity relationship between them: in math terms, the two vectors are orthogonal, so you kind of get nowhere. Now, there are things you could do. I just showed you WordNet; WordNet shows you some synonyms and stuff, so that might help a bit. There are other things you could do.
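The orthogonality point is easy to see in a few lines of NumPy. This is a toy sketch with a made-up seven-word vocabulary (a real one would have hundreds of thousands of entries), not anything from the lecture's code:

```python
import numpy as np

# A toy vocabulary, made up for illustration; real vocabularies are huge.
vocab = ["house", "cat", "dog", "chair", "agreeable", "hotel", "motel"]

def one_hot(word):
    """Encode a word as all zeros with a single 1 at the word's index."""
    v = np.zeros(len(vocab))
    v[vocab.index(word)] = 1.0
    return v

hotel, motel = one_hot("hotel"), one_hot("motel")
# Distinct one-hot vectors are orthogonal: their dot product is 0, so this
# representation encodes no similarity at all between hotel and motel.
print(hotel @ motel)  # 0.0
print(hotel @ hotel)  # 1.0
```

Any two distinct words get a dot product of exactly zero, which is precisely the "no similarity relationship" problem the lecture is pointing at.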
You could say, wait, why don't we just build up a big table of word similarities and work with that? People used to try to do that; that's sort of what Google did around 2005: it had word similarity tables. The problem is that we were just talking about how we might want 500,000 words, and if you build a word similarity table over all pairs of words from one-hot representations, the size of that table is 500,000 squared, or 2.5 x 10^11 cells in your similarity matrix. So that's almost impossible to do. What we're going to do instead is explore a method in which we represent words as vectors, in a way I'll show you in just a minute, such that the representation of a word itself gives you similarity, with no further work. And that leads into these different ideas. I mentioned denotational semantics before; here's another idea for representing the meaning of words, called distributional semantics. The idea of distributional semantics is that we represent the meaning of a word by looking at the contexts in which it appears. This is a picture of J.R. Firth, a British linguist, who is famous for the saying, "You shall know a word by the company it keeps." Another person who is very famous for developing this notion of meaning is the philosopher Ludwig Wittgenstein, in his later writings, which he referred to as a use theory of meaning. Actually, he used some big German word that I don't know, but we'll call it a use theory of meaning.
Essentially the point was: if you can explain in which contexts it's correct to use a certain word, versus in which contexts it would be the wrong word to use (this maybe gives you bad memories of doing English in high school, when people said, "that's the wrong word to use there"), well, then you understand the meaning of the word. That's the idea of distributional semantics, and it has been one of the most successful ideas in modern statistical NLP, because it gives you a great way to learn about word meaning. So what we're going to do is say: I want to know what the word banking means, so I'm going to grab a lot of text, which is easy to do now that we have the World Wide Web, and find lots of sentences where the word banking is used: "government debt problems turning into banking crises, as happened in 2009", and so on. And I'm just going to say that all of this stuff is the meaning of the word banking: those are the contexts in which the word banking is used. That seems like a very simple, and perhaps even not quite right, idea, but it turns out to be a very usable idea that does a great job of capturing meaning. So rather than our old localist representation, we're now going to represent words in what we call a distributed representation. For the distributed representation, we're still going to [NOISE] represent the meaning of a word as a numeric vector, but now we're going to say that the meaning of each word is a smallish vector, a dense vector where all of the numbers are non-zero. So the meaning of banking is going to be distributed over the dimensions of this vector. Now, my vector here is of dimension nine because I want to keep the slide nice; life isn't quite that good in practice.
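The payoff of dense vectors is that similarity falls out of the representation itself, typically as a cosine between vectors. Here is a minimal sketch with made-up nine-dimensional vectors (the numbers are invented for illustration, in the spirit of the slide's example; real vectors are learned from text):

```python
import numpy as np

# Made-up 9-dimensional dense vectors, purely illustrative; a real system
# learns these from large amounts of text.
banking  = np.array([0.286, 0.792, -0.177, -0.107, 0.109, -0.542, 0.349, 0.271, 0.487])
monetary = np.array([0.413, 0.582, -0.007,  0.247, 0.216, -0.718, 0.147, 0.051, 0.572])
dentist  = np.array([-0.570, 0.117, 0.312, -0.866, 0.022, 0.281, -0.430, 0.119, -0.244])

def cosine(u, v):
    """Cosine similarity: dot product of the vectors, normalized by length."""
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

# Unlike one-hot vectors, related words get a meaningfully higher score.
print(cosine(banking, monetary))  # high
print(cosine(banking, dentist))   # low
```

With one-hot vectors every distinct pair scored zero; here the related pair scores clearly higher than the unrelated one, with no similarity table needed.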
When we do this, we use a larger dimensionality. Sort of the minimum that people use is 50. A typical number that you might use on your laptop is 300; if you want to really max out performance, maybe 1,000, 2,000, 4,000. But nevertheless, that's orders of magnitude smaller than a length-500,000 vector. Okay. So we have words with their vector representations, and since each word has a vector representation, we then have a vector space in which we can place all of the words. That's completely unreadable, and if you zoom into the vector space it's still completely unreadable. But if you zoom in a bit further, you can find different parts of this space. So here's the part where countries tend to exist: Japanese, German, French, Russian, British, Australian, American, France, Britain, Germany, et cetera. And you can shift over to a different part of the space. Here's a part of the space where various verbs are: has, have, had, been, be, always, was, were. You can even see that some morphological forms are grouping together, and things that go together, like say, think, expect, things that take those kinds of complements (he said or thought something), group together. Now, what am I actually showing you here? Really, this was built from 100-dimensional word vectors, and there's this problem that it's really hard to visualize 100-dimensional word vectors. So what is actually happening here is that these 100-dimensional word vectors are being projected down into two dimensions, and you're seeing the two-dimensional view, which I'll get back to later.
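The 2D pictures come from exactly this kind of projection. As a minimal sketch, here's PCA done via NumPy's SVD on random stand-in vectors (real visualizations often use fancier methods like t-SNE, and these random vectors are just placeholders):

```python
import numpy as np

rng = np.random.default_rng(0)
vectors = rng.normal(size=(1000, 100))  # 1,000 fake 100-d "word vectors"

# PCA by hand: center the data, then project onto the top-2 singular directions.
centered = vectors - vectors.mean(axis=0)
_, _, vt = np.linalg.svd(centered, full_matrices=False)
projected = centered @ vt[:2].T  # each word is now a 2-d point you can plot

print(projected.shape)  # (1000, 2)
```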
So on the one hand, whenever you see these pictures, you should hold on to your wallet, because there's a huge amount of detail in the original vector space that got completely killed and went away in the 2D projection, and indeed some of what pushes things together in the 2D projection may really, really misrepresent what's in the original space. But even looking at these 2D representations, the overall feeling is, my gosh, this actually sort of works, doesn't it? We can sort of see similarities between words. Okay. So that was the idea of what we want to do. The next part is then, how do we actually go about doing it? I'll pause for breath for half a minute. Has anyone got a question they're dying to ask? Yeah? "For the vectors, does each dimension have a particular meaning, say the first element of the vector, the second element: are those standard across the field, or do people choose them themselves?" They're not standard across NLP, and they're not chosen at all. What we're going to present is a learning algorithm, where we just shovel in lots of text and miraculously these word vectors come out, so the learning algorithm itself decides the dimensions. But that actually reminds me of something I meant to say, which is: since this is a vector space, in some sense the dimensions are arbitrary, right, because you can have your basis vectors in any different directions, and you could re-represent the words in the vector space with a different set of basis vectors, and it'd be exactly the same vector space, just rotated around to your new vectors. So you shouldn't read too much into the individual elements. Though it actually turns out that, because of the way a lot of deep learning operations work, some things are done element-wise.
So the dimensions do actually tend to take on some meaning, it turns out. But what I really wanted to say was: one thing we can just think of is how close things are in the vector space, and that's a notion of meaning similarity that we're going to exploit. But you might hope to get more than that, and you might actually think that there's meaning in different dimensions and directions of the word vector space. The answer is that there is, and I'll come back to that a bit later. Okay. So in some sense, the thing that had the biggest impact in turning the world of NLP in a neural networks direction was this algorithm that Tomas Mikolov came up with in 2013, called the word2vec algorithm. It wasn't the first work on having distributed representations of words; there was older work from Yoshua Bengio that went back to about the turn of the millennium, but it somehow hadn't really hit the world over the head and had a huge impact. It was really Tomas Mikolov who showed this very simple, very scalable way of learning vector representations of words, and that really opened the floodgates. So that's the algorithm I'm going to show now. Okay. The idea of this algorithm is that you start with a big pile of text. Wherever you can find it, you know, web pages or newspaper articles or something: a lot of continuous text, actual sentences, because we want to learn word meaning in context. NLP people call a large pile of text a corpus, and that's just the Latin word for body, right? It's a body of text. An important thing to note, if you want to seem really educated, is that in Latin this is a fourth declension noun, so the plural of corpus is corpora, whereas if you say "corpi" everyone will know that you didn't study Latin in high school. [LAUGHTER] Okay.
So we then want to say that every word in a fixed vocabulary, which will just be the vocabulary of the corpus, is represented by a vector, and we just start those vectors off as random vectors. Then what we're going to do is this big iterative algorithm where we go through each position in the text. We say: here's a word in the text; let's look at the words around it. What we want to say is that the meaning of a word is its contexts of use, so we want the representation of the word in the middle to be able to predict the words that are around it, and we're going to achieve that by moving the position of the word vector. And we just repeat that a billion times, and somehow a miracle occurs, and out comes, at the end, a word vector space that looks like the picture I showed, one that has a good representation of word meaning. Slightly more graphically, here's the situation. We've got part of our corpus: "problems turning into banking crises." We want to know the meaning of the word "into," and so we're going to hope that its representation can be used, in a way that we'll make precise, to predict what words appear in the context of "into," because that's the meaning of "into." So we're going to try and make those predictions, see how well we can predict, and then change the vector representations of words in a way that lets us do that prediction better. And once we've dealt with "into," we just go on to the next word and say, okay, let's take "banking" as the word. The meaning of banking is predicting the contexts in which banking occurs. Here's one context; let's try and predict these words that occur around banking, see how we do, and then move on again from there. Okay. Sounds easy so far. Now we go on and do a bit more. Okay. So overall, we have a big long corpus of capital-T words.
If we have a whole lot of documents, we just concatenate them all together and say, okay, here's a billion words: a big long list of words. So for the first product, we're going to go through all the words, and then for the second product, we're going to choose some fixed-size window, you know, it might be five words on each side or something, and we're going to try and predict the 10 words that are around that center word. We predict in the sense of trying to predict each context word given the center word: that's our probability model. If we multiply all those things together, that's our model likelihood: how good a job it does at predicting the words around every word. That model likelihood is going to depend on the parameters of our model, which we write as theta. And in this particular model, the only parameters are actually going to be the vector representations we give the words. The model has absolutely no other parameters. So we're just going to say: we're representing a word with a vector in a vector space, that representation of it is its meaning, and we're then going to be able to use it to predict what other words occur, in a way I'm about to show you. Okay. So that's our likelihood, and what we do in all of these models is define an objective function, and then we want to come up with vector representations of words in such a way as to minimize that objective function. The objective function is basically the same as what's on the top half of the slide, but we change a couple of things. We stick a minus sign in front of it so we can do minimization rather than maximization; that's completely arbitrary and makes no difference. We stick a one-on-T in front of it, so that we're working out the average goodness of predicting for each choice of center word.
Again, that makes no real difference, but it keeps the scale of things independent of the size of the corpus. The bit that's actually important is that we stick a log in front of the function that was up there, because everything always gets nicer when you stick logs in front of products, when you're doing things like optimization. So when we do that, we've got a log of all these products, which lets us turn things into sums of the log of this probability, and we'll go through that again in just a minute. Okay. And so if we can change our vector representations of these words so as to minimize this J of theta, that means we'll be good at predicting words in the context of another word. Now, that all sounded good, but it was all dependent on having this probability function, where you want to predict the probability of a word in the context given the center word, and the question is: how can you possibly do that? Well, remember what I said: our model is just going to have vector representations of words, and those are the only parameters of the model. Now, that's almost true, but not quite. We actually cheat slightly, since we actually propose two vector representations for each word, and this makes it simpler to do this. You don't have to do it this way; there are ways to get around it, but this is the simplest way. So we have one vector for a word when it's the center word that's predicting other words, and a second vector for each word when it's a context word, one of the words in the context. So for each word type, we have these two vectors: as center word, and as context word. Then we're going to work out the probability of a word in the context, given the center word, purely in terms of these vectors, and the way we do it is with this equation right here, which I'll explain more in just a moment.
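Written out, the likelihood, the objective, and the probability formula about to be explained are (with window size $m$, corpus length $T$, and vocabulary size $V$):

```latex
L(\theta) = \prod_{t=1}^{T} \prod_{\substack{-m \le j \le m \\ j \ne 0}} P(w_{t+j} \mid w_t; \theta)
\qquad
J(\theta) = -\frac{1}{T} \sum_{t=1}^{T} \sum_{\substack{-m \le j \le m \\ j \ne 0}} \log P(w_{t+j} \mid w_t; \theta)

P(o \mid c) = \frac{\exp(u_o^\top v_c)}{\sum_{w=1}^{V} \exp(u_w^\top v_c)}
```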
So we're still in exactly the same situation, right? We want to work out probabilities of words occurring in the context of our center word. The center word is c and the context word is represented with o; that's the slide notation. We're basically saying there's one kind of vector for center words and a different kind of vector for context words, and we're going to work out this probabilistic prediction in terms of these word vectors. Okay, so how can we do that? Well, the way we do it is with this formula here, which is the sort of shape that you see over and over again in deep learning with categorical stuff. For the very center bit of it, the bit in orange (and the same thing occurs in the denominator), what we're doing is calculating a dot product. We're going to go through the components of our two vectors and multiply them together, and that means that if two words have big components of the same sign, plus or minus, in the same positions, the dot product will be big, and if they have different signs, or one is big and one is small, the dot product will be a lot smaller. So that orange part directly calculates a sort of similarity between words, where the similarity is the vectors looking the same, right? And that's the heart of it: words that have similar vectors, i.e., are close together in the vector space, have similar meaning. For the rest of it, the next thing we do is take that number and put an exp around it. The exponential has this nice property that no matter what number you stick into it (because the dot product might be positive or negative), it comes out as a positive number, and if we eventually want to get a probability, that's really good: we want positive numbers, not negative numbers.
Then the third part, the bit in blue, is that we want probabilities, and probabilities are meant to add up to one, so we do that in the standard, dumbest possible way: we sum up this quantity over every different word in our vocabulary, and we divide through by it, and that normalizes things and turns them into a probability distribution. So in practice there are two parts. There's the orange part, which is this idea of using dot product in a vector space as our similarity measure between words, and then the second part is all the rest of it, where we feed it through what we'll refer to all the time as a softmax distribution. The two parts, exponentiating and normalizing, give you a softmax distribution. Softmax functions will map any numbers into a probability distribution, always, for the two reasons that I gave. It's referred to as a softmax because it works like a soft max, right? If you have some numbers, you could just ask, what's the max of these numbers? If you map your original numbers so that the max stays and everything else goes to zero, that's a hard max. This is a soft max because, if you ignore the problem of negative numbers for a moment and got rid of the exp, you'd also come out with a probability distribution, but by and large it'd be fairly flat and wouldn't particularly pick out the max of the different x_i numbers, whereas when you exponentiate them, that makes big numbers way bigger, so this softmax mainly puts mass where the max, or the couple of biggest values, are. That's the max part, and the soft part is that it isn't a hard decision: it still spreads a little bit of probability mass everywhere else. Okay, so now we have a loss function.
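As a sketch, here's that dot-product-plus-softmax probability in NumPy. The vocabulary size and dimensionality are tiny toy values and the vectors are random, purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, dim = 10, 4
U = rng.normal(size=(vocab_size, dim))  # "outside"/context vectors u_w
V = rng.normal(size=(vocab_size, dim))  # center vectors v_c

def prob_outside_given_center(o, c):
    """P(o | c) = exp(u_o . v_c) / sum_w exp(u_w . v_c): the softmax."""
    scores = U @ V[c]                  # dot product of v_c with every u_w
    e = np.exp(scores - scores.max())  # shift by the max for numerical stability
    return e[o] / e.sum()

# The normalization really does make it a probability distribution:
dist = [prob_outside_given_center(o, c=2) for o in range(vocab_size)]
print(round(sum(dist), 6))  # 1.0
```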
We have a loss function with a probability model inside it that we can build, and what we want to do is move our vector representations of words around so that they're good at predicting what words occur in the context of other words. So at this point, what we're going to do is optimization. We have vector components for different words; we have a very high-dimensional space again, but here I've just got two dimensions for the picture. We want to ask how we can minimize this function, and we're going to jiggle the numbers that are used in the word representations in such a way that we're walking down the slope of this space, i.e., walking down the gradient, and when we've minimized the function, we've found good representations for words. To do this, we make one very big vector, in a very high-dimensional space, of all the parameters of our model, and the only parameters this model has are literally the vector space representations of words. So if we have 100-dimensional word representations, there are 100 parameters for aardvark as a context word, 100 parameters for the word "a" as a context word, et cetera, and then 100 parameters for aardvark as a center word, et cetera, et cetera. That gives us a big vector of parameters to optimize, and we're going to run this optimization and move them downhill. Yeah, so that's essentially what you do. I wanted to go through the details of this just so we've gone through things concretely, to make sure everyone is on the same page.
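In symbols: with $d$-dimensional vectors and $V$ vocabulary words, that big parameter vector stacks every word's two vectors, so:

```latex
\theta =
\begin{bmatrix}
v_{\text{aardvark}} \\ v_{a} \\ \vdots \\ v_{\text{zebra}} \\
u_{\text{aardvark}} \\ u_{a} \\ \vdots \\ u_{\text{zebra}}
\end{bmatrix}
\in \mathbb{R}^{2dV}
```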
I suspect that if I do this concretely, there are a lot of people whom this will bore, some very badly, so I apologize to you, but I'm hoping and thinking that there are probably some people who haven't done this stuff recently, and it might actually be good to do it concretely and get everyone up to speed right at the beginning. Yeah? [inaudible] how do we calculate [inaudible] specifically? Well, the way we calculate the u and v vectors is that we literally start with a random vector for each word, and then we iteratively change those vectors a little bit as we learn. The way we work out how to change them is we say, "I want to do optimization," and that's implemented as: we have the current vectors for each word; let me do some calculus to work out how I could change the word vectors so that they would calculate a higher probability for the words that actually occur in the context of this center word. And we do that, and we do it again and again and again, and eventually we end up with good word vectors. Thank you for that question, because that's a concept you're meant to understand; that is how this works, and maybe I didn't explain that high-level recipe well enough. Okay, so let's just go through it. So we've seen it, right? We had this formula that we wanted to maximize: our original function, which was the product over t equals 1 to T, and then the product over the positions j, with minus m less than or equal to j, less than or equal to m, j not equal to zero, of the probability of w at t plus j given w at t, according to the parameters of our model.
Okay, and then we'd already seen that we convert that into the function we're actually going to use, J of theta, where we have the minus one-on-T of the sum over t equals 1 to T of the sum over minus m less than or equal to j, less than or equal to m, j not equal to zero, of the log of the probability of w at t plus j given w at t. And then we'd had this formula: the probability of the outside word given the center word is the formula we just went through, exp of u_o transpose v_c over the sum over w equals 1 to the vocabulary size of exp of u_w transpose v_c. So that's our model. We want to minimize this, and we want to minimize it by changing these parameters, and these parameters are the contents of these vectors. So what we want to do now is calculus. We want to say: in terms of these parameters, the u and v vectors, for the current values of the parameters (which we initialized randomly), what's the slope of the space? Where is downhill? Because if we can work out where downhill is, we just have to walk downhill and our model gets better. So we're going to take derivatives, work out which direction is downhill, and then walk that way. Yeah? So why do we want to maximize that probability, going through every word, [inaudible] given the [inaudible]? So, what I want to achieve, for my distributional notion of meaning, is to have, for each word, a vector, and that vector knows what words occur in the context of the word itself. Knowing what words occur in its context means it can accurately give a high probability estimate to the words that occur in the context, and it will give low probability estimates to words that don't typically occur in the context.
You know, if the word is "bank," I'm hoping that words like "branch," and "open," and "withdrawal" will be given high probability, because they tend to occur with the word bank, and I'm hoping that some other words, like "neural network" or something, have a lower probability, because they don't tend to occur with the word bank. Okay, does that make sense? Yeah. And the other thing I'd forgotten and meant to comment on: obviously, we're not going to be able to do this super well. It's just not going to be the case that we can say each word in the context is going to be this word with probability 0.97, right? Because we're using this one simple probability distribution to predict all the words in our context. In particular, we're using it to predict 10 different words, generally. So at best we can be giving around five percent chance to each of them; we can't possibly be guessing right every time. And, well, there are going to be different contexts with different words in them, so it's going to be a very loose model. But nevertheless, we want to capture the fact that, you know, "withdrawal" is much more likely to occur near the word "bank" than something like "football." That's basically what our goal is. Okay, so we want to maximize this, by minimizing this, which means we then want to do some calculus to work it out. So we're going to say: these parameters are our word vectors, and we want to move these word vectors so as to walk downhill. The case that I'm going to do now looks at the parameters of the center word, v_c, and works out the slope with respect to it. Now, that's not the only thing you want to do: you also want to work out the slope with respect to the u_o vector. But I'm not going to do that, because time in class is going to run out.
So it'd be really good if you did that one at home, and then you'd feel much more competent. Right, so what I want to do is work out the partial derivative, with respect to my v_c vector representation, of the quantity we were just looking at, where we're taking the log of that quantity: the log of exp of u_o transpose v_c over the sum over w equals 1 to V of exp of u_w transpose v_c. So now we have a log of a division, and that's easy to rewrite: we have the partial derivative of the log of the numerator, minus (I can distribute the partial derivative) the partial derivative of the log of the denominator. Okay, so this is what was the numerator, and this is what was the denominator. The part that was the numerator is really easy; in fact, maybe I can fit it in here. Log and exp are just inverses of each other, so they cancel out, and we've got the partial derivative of u_o transpose v_c. At this point I should just remind people that this v_c here is still a vector, right, because we had a 100-dimensional representation of a word, so this is multivariate calculus. So if you remember any of this stuff, you can say, "Ha, this is trivial; the answer is u_o," and you're done, and that's great. But if you're feeling not so good on all of this stuff, and you want to cheat a little on the side and try to work out what it is, you can say, "Well, let me work out the partial derivative with respect to one element of this vector, like the first element."
Well, what I've actually got here for this dot product is u_o1 times v_c1, plus u_o2 times v_c2, plus dot dot dot, plus u_o100 times v_c100, right? And I'm finding the partial derivative of this with respect to v_c1, and hopefully you remember that much calculus from high school: none of the other terms involves v_c1, so the only thing that's left is u_o1, and that's what I've got for this dimension, this particular parameter. But I don't only want to do the first component of the v_c vector, I also want to do the second component, et cetera, which means each component ends up turning up in precisely one of these derivatives, and so the end result is that I get the vector u_o. If you're getting confused and your brain is falling apart, I think it can be kind of useful to reduce things to single-variable calculus and actually play out what's happening. Anyway, this part was easy: for the numerator, we get u_o. Things aren't quite so nice when we do the denominator. We now want the partial derivative with respect to v_c of the log of the sum over w equals 1 to V of exp of u_w transpose v_c. At this point things are not quite so pretty: we've got this log-sum-exp combination that you see a lot, and here you have to remember that there is the chain rule. So what we can say is: here's our function (the log), and here's the body of the function, and what we want to do is do it in two stages, since at the end of the day this is ultimately a function of v_c. So we're going to use the chain rule: we first take the derivative of the outside thing, plugging in the body, and we remember that the derivative of log is one over x.
So we have one over the sum over w equals 1 to V of exp of u_w transpose v_c, and then we need to multiply that by the derivative of the inside part, with the important reminder that you need to do a change of variables and, for the inside part, use a different variable that you're summing over. So now we're trying to find the derivative of a sum. The first thing we can do is very easy: we can move the derivative inside the sum. So we can rewrite that as the sum over x equals 1 to V of the partial derivative with respect to v_c of exp of u_x transpose v_c. That's a little bit of progress, and at that point we have to do the chain rule again: here's our function, exp, and here's the thing inside it, which is again some function of v_c. The derivative of exp is exp, so we have the sum over x equals 1 to V of exp of u_x transpose v_c, and we multiply that by the partial derivative with respect to v_c of the inside, u_x transpose v_c. Well, we saw that one before: the derivative of that is u_x, because we're now doing it with a different variable, x. So this comes out as the sum over x equals 1 to V of exp of u_x transpose v_c, times u_x. Okay, so by doing the chain rule twice, we've got that. Now, if we put it together: the derivative with respect to v_c of the whole thing, the log of the probability of o given c, is, from the numerator part, just u_o, and then we're subtracting a fraction. In the numerator of what we're subtracting, we have the sum over x equals 1 to V of exp of u_x transpose v_c times u_x, and in the denominator, we have the sum over w equals 1 to V of exp of u_w transpose v_c.
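Putting the two chain-rule steps together in cleaner notation, the derivation above is:

```latex
\frac{\partial}{\partial v_c} \log \frac{\exp(u_o^\top v_c)}{\sum_{w=1}^{V} \exp(u_w^\top v_c)}
\;=\; u_o \;-\; \frac{\sum_{x=1}^{V} \exp(u_x^\top v_c)\, u_x}{\sum_{w=1}^{V} \exp(u_w^\top v_c)}
```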
Okay, so we get that, and then we can rearrange it a little. We can bring the sum right out front, making it one big sum over x equals 1 to V, and take the u_x out to the end. And if we do that, an interesting thing has happened, because look right here: we've rediscovered exactly the same form that we used as our probability distribution for predicting the probability of words. So this is now simply the probability of x given c, according to our model. We can therefore rewrite this and say that what we're getting is u_o minus the sum over x equals 1 to V of the probability of x given c, times u_x. This has kind of an interesting meaning if you think about it. This is giving us our slope in this multi-dimensional space, and the way we're getting that slope is that we're taking the observed representation of the context word and subtracting from it what our model thinks the context should look like. What does the model think the context should look like? This part here is formally an expectation: you're finding a weighted average of the representations of each word, weighted by the probability of that word under the current model. So this is the expected context word according to our current model, and we're taking the difference between the expected context word and the actual context word that showed up, and that difference turns out to exactly give us the slope: the direction in which we should walk, changing the word's representation, in order to improve our model's ability to predict. Okay. So, assignment two, yeah.
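The whole recipe (random initialization, the softmax prediction, and that gradient) fits in a few lines of NumPy. This is only a naive sketch on a toy seven-word corpus, with made-up sizes and learning rate, but it shows the objective actually going down:

```python
import numpy as np

# A naive end-to-end sketch of the skip-gram training loop just described.
corpus = "government debt problems turning into banking crises".split()
vocab = sorted(set(corpus))
index = {w: i for i, w in enumerate(vocab)}
n_vocab, dim, window, lr = len(vocab), 8, 2, 0.1

rng = np.random.default_rng(0)
V = rng.normal(scale=0.1, size=(n_vocab, dim))  # center-word vectors v
U = rng.normal(scale=0.1, size=(n_vocab, dim))  # context-word vectors u

def softmax_probs(c):
    """P(. | c) over the whole vocabulary."""
    scores = U @ V[c]
    e = np.exp(scores - scores.max())
    return e / e.sum()

def pairs():
    """All (outside word, center word) index pairs within the window."""
    for t, w in enumerate(corpus):
        for j in range(-window, window + 1):
            if j != 0 and 0 <= t + j < len(corpus):
                yield index[corpus[t + j]], index[w]

def loss():
    return sum(-np.log(softmax_probs(c)[o]) for o, c in pairs())

def sgd_step(o, c):
    p = softmax_probs(c)
    grad_vc = p @ U - U[o]      # -(u_o - sum_x P(x|c) u_x), as derived
    grad_U = np.outer(p, V[c])  # gradient for every context vector
    grad_U[o] -= V[c]
    V[c] = V[c] - lr * grad_vc
    U[:] = U - lr * grad_U

before = loss()
for _ in range(50):             # "repeat a billion times," in miniature
    for o, c in pairs():
        sgd_step(o, c)
print(loss() < before)  # True: jiggling the vectors reduced the objective
```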
It'll be a great exercise for you to try and do that for the context words as well (I did the center words) and show that you can do the same kind of math and have it work out. Since I've just got a few minutes left at the end, what I wanted to do, if I can get all of this to work right, [inaudible] okay, find my... Okay. I just wanted to show you a quick example. For the first assignment, again, it's an IPython Notebook, so if you're all set up, you can run Jupyter Notebook, and you have some notebook. Here's my little notebook I'm going to show you, and the trick will be to make this big enough that people can see it. Is that readable? [LAUGHTER] Okay. So NumPy is the sort of do-math package in Python; you'll want to know about it if you don't already. Matplotlib is one of the most basic graphing packages; if you don't know about it, you're going to want to know about it. This is an IPython/Jupyter special that lets you have interactive matplotlib inside the notebook, and if you want to get fancy you can play with your graphic styles. Scikit-learn is kind of a general machine learning package. Gensim isn't a deep learning package; Gensim is kind of a word similarity package, which started off with methods like Latent Dirichlet Allocation, if you know about that, for modeling word similarities, and it has grown into a good package for doing word vectors as well. So it's quite often used for word vectors and word similarities, and it's efficient for doing things at large scale. Now, I haven't yet told you about (we will next time) our own homegrown form of word vectors, the GloVe word vectors. I'm using them not because it really matters for what I'm showing, but because these vectors are conveniently small.
It turns out that the vectors that Facebook and Google distribute have an extremely large vocabulary and are extremely high dimensional, so it would take just too long to load them in the last five minutes of this class, whereas conveniently, in our Stanford vectors, we have 100-dimensional vectors and 50-dimensional vectors, which are kind of good for doing small things on a laptop, frankly. So, what I'm doing here is — Gensim doesn't natively support GloVe vectors, but they actually provide a utility that converts the GloVe file format to the word2vec file format, so I've done that, and then I've loaded a pre-trained model of word vectors. And so this is what they call a keyed vector. The keyed vector is nothing fancy: you just have words like "potato", and there's a vector that hangs off each one. So it's really just sort of a big dictionary with a vector for each thing. But this is a trained model, where we just used the kind of algorithm we looked at and fiddled our word vectors billions of times. And once we have one, we can then ask questions like: what is the most similar word to some other word? So we could ask, what are the most similar words to "Obama", let's say? And we get back Barack, Bush, Clinton, McCain, Gore, Hillary, Dole, Martin, Henry. That seems actually kind of interesting. These vectors are from a few years ago, so we don't have post-Obama stuff. If you put in another word — something like "banana" — we get coconut, mango, bananas, potato, pineapple. We get kind of tropical food. You can also ask for being dissimilar to words, though by itself dissimilar isn't very useful.
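Under the hood, "most similar" is just cosine similarity of the query word's vector against every other vector in the store. A minimal NumPy sketch of the idea — the 2-dimensional toy vectors here are invented for illustration; real GloVe vectors are 50 to 300 dimensional and trained on billions of tokens, which is why their neighbor lists look as rich as the ones in the demo:

```python
import numpy as np

# A keyed-vector store is essentially a dictionary from word to vector.
# These 2-dimensional toy vectors are invented for illustration only.
vectors = {
    "king":   np.array([ 1.0,  1.0]),
    "queen":  np.array([-1.0,  1.0]),
    "man":    np.array([ 1.0,  0.0]),
    "woman":  np.array([-1.0,  0.0]),
    "banana": np.array([ 0.0, -1.0]),
}

def cosine(u, v):
    """Cosine similarity, the standard word-vector similarity measure."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def most_similar(word, topn=3):
    """Rank every other word in the store by cosine similarity to `word`."""
    query = vectors[word]
    scores = [(w, cosine(query, v)) for w, v in vectors.items() if w != word]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

print(most_similar("king"))  # "man" ranks first in this toy space
```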
So if I ask most_similar and I say negative equals banana — I'm not sure what your concept of what's most dissimilar to banana is, but actually, by itself, you don't get anything useful out of this, because you just get these weird, really rare words [LAUGHTER] which definitely weren't the ones you were thinking of. But it turns out you can do something really useful with this negative idea, which was one of the highly celebrated results of word vectors when they first started off. And that was the idea that there are actually dimensions of meaning in this space. So this was the most celebrated example: look, what we could do is start with the word king, subtract from it the meaning of man, and then add to it the meaning of woman, and then ask which word in our vector space is most similar in meaning to that vector. And that would be a way of doing analogies — we'd be able to do the analogy, man is to king as woman is to what? So the way we're gonna do that is to say we want to be similar to king and woman, because they're both positive ones, and far away from man. We could do that manually; here it is said manually: most_similar, positive woman and king, negative man. We can run this, and lo and behold, it produces queen. To make that a little bit easier, I defined this analogy predicate so I can run other ones. So I can run another one, like: analogy Japan Japanese, Austria is to Austrian. And, you know, I think it's fair to say that when people first saw that you could have this simple piece of math, run it, and learn meanings of words — it actually just sort of blew people's minds how effective this was. There's no smoke and mirrors here, right?
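The analogy trick described above — similar to the positive words, far from the negative one — can be sketched in a few lines of NumPy. The 2-dimensional toy vectors are made up so the arithmetic works out exactly (one axis roughly encodes "royalty", the other "gender"); as in Gensim's most_similar, the input words themselves are excluded from the candidates:

```python
import numpy as np

# Toy 2-dimensional vectors, invented so the arithmetic is easy to check:
# one axis roughly encodes "royalty", the other "gender".
vectors = {
    "king":   np.array([ 1.0,  1.0]),
    "queen":  np.array([-1.0,  1.0]),
    "man":    np.array([ 1.0,  0.0]),
    "woman":  np.array([-1.0,  0.0]),
    "banana": np.array([ 0.0, -1.0]),
}

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def analogy(a, b, c):
    """a is to b as c is to ?  i.e. most similar to positive [b, c], negative [a]."""
    target = vectors[b] - vectors[a] + vectors[c]
    # Exclude the three input words from the candidates, as Gensim does.
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(target, vectors[w]))

print(analogy("man", "king", "woman"))  # king - man + woman lands on "queen"
```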
You know, it's not that I have a special sort of list in my Python where there's a dictionary I'm looking up for Austria–Austrian and things like that. But somehow these vector representations are such that they're actually encoding these semantic relationships. So you can try different ones — it's not that only this one works. I can put in France, it says French. I can put in Germany, it says German. I can put in Australia, not Austria, and it says Australian. Somehow, if you use these vector representations of words for ideas like understanding the relationships between words, you're just doing vector space manipulation on these 100-dimensional numbers, and it actually knows about them — not only the similarities of word meanings, but actually different semantic relationships between words, like country names and their peoples. And yeah, that's actually pretty amazing. It's sort of surprising that running such a dumb algorithm on vectors of numbers could capture so well the meaning of words. And so that became the foundation of a lot of modern distributed neural representations of words. Okay, I'll stop there. Thanks a lot, guys, and see you on Thursday. [NOISE]