[MUSIC] Let's now turn to the more formal description of the k-Nearest Neighbor algorithm, where instead of just returning the nearest neighbor, we're going to return a set of nearest neighbors. Okay, so the setup here is just like in 1-nearest neighbor search, where we have our query article xq and we have the same corpus of documents, x1 to xN. But now, instead of outputting just a single nearest document, we're going to output a list of k similar articles. So formally, we're going to define a set of articles: x nearest neighbor 1, all the way to x nearest neighbor k. The definition of what falls into this set is that for all articles xi not in our nearest neighbor set, the distance between xi and xq, our query article, is greater than or equal to the distance to the furthest nearest neighbor in our set, that is, the maximum over j = 1 to k of the distance between x nearest neighbor j and our query article. So this is very intuitive: it's just saying that any article not in your k-nearest neighbor set is at least as far away as the furthest document within those k nearest neighbors. That's where the max comes in here. Okay, so just like for 1-nearest neighbor, we can walk through our k-nearest neighbor algorithm in terms of the pseudocode here. Just like in 1-nearest neighbor, we're going to maintain something that records the distance to our nearest neighbors found so far. But in this case, it's going to be a sorted list of the distances to the k closest articles found up to a given point in the algorithm.
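The definition above can be sketched directly as a brute-force search. This is a minimal Python illustration, assuming documents are already represented as numeric vectors and using Euclidean distance; the function and variable names are my own, not from the lecture:

```python
import numpy as np

def k_nearest_neighbors(x_query, corpus, k):
    """Brute-force k-NN sketch: return indices of the k closest documents.

    corpus: (N, d) array of document vectors; x_query: (d,) query vector.
    """
    # Distance from the query to every document in the corpus.
    distances = np.linalg.norm(corpus - x_query, axis=1)
    # Indices of the k smallest distances, ordered closest to furthest.
    return np.argsort(distances)[:k]

# By construction, every document outside the returned set is at least as
# far from the query as the furthest document inside it -- exactly the
# max condition in the definition above.
```

The streaming algorithm described next computes the same answer without sorting the whole corpus.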
And so to begin with, we're just going to assume that the first k documents in the corpus are our k nearest neighbors, and then we're going to improve upon those. So we're going to compute the distance between our query article and each of the first k articles in the corpus, and we're going to sort those articles by their distances to create a sorted list. That's what's going to be held in this distance-to-k-nearest-neighbors list. And this little bookshelf image here represents the list of the sorted document indices. Then, for all the remaining documents in the corpus, documents k+1 to N, we're going to compute the distance between that article and our query article. If that distance is less than the distance to our kth nearest neighbor, the furthest nearest neighbor in our set, then we're going to search for the index where that distance falls in this sorted distance list. So we're going to say, okay, we know it's closer than our kth nearest neighbor, but is it closer than our third nearest neighbor, or second, or first? And whatever index that is, we're going to shift the queue that holds our nearest neighbors found so far, and insert this article that we've just found into this priority queue. In particular, we're going to take this index j, which indexes where the new article falls in the priority queue, remove the current kth nearest neighbor, which is no longer in our set of k nearest neighbors now that we're inserting this new article, and then shift the indices. So what used to be at positions j to k-1 is now going to be at positions j+1 to k. To be clear, what used to be our (k-1)th nearest neighbor now becomes our kth nearest neighbor, and so on all the way down to index j+1.
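The scan just described can be sketched like this, a minimal Python version assuming Euclidean distance on vector representations; the names are illustrative, and the shift-and-insert step is handled with Python's `bisect` module rather than an explicit index loop:

```python
import bisect
import numpy as np

def knn_scan(x_query, corpus, k):
    """Single pass over the corpus, maintaining the k nearest so far."""
    # Initialize with the first k documents, sorted by distance to the query.
    first = sorted((float(np.linalg.norm(corpus[i] - x_query)), i)
                   for i in range(k))
    knn_dist = [d for d, _ in first]   # sorted distances, closest first
    knn_idx = [i for _, i in first]    # matching document indices ("bookshelf")

    # Scan the remaining documents, k+1 through N.
    for i in range(k, len(corpus)):
        delta = float(np.linalg.norm(corpus[i] - x_query))
        if delta < knn_dist[-1]:                     # closer than the kth neighbor?
            j = bisect.bisect_left(knn_dist, delta)  # where delta falls in the list
            knn_dist.insert(j, delta)                # insert, shifting j..k-1 down
            knn_idx.insert(j, i)
            knn_dist.pop()                           # drop the old kth neighbor
            knn_idx.pop()
    return knn_idx, knn_dist
```

Here `insert` followed by `pop` plays the role of the shift described in the lecture: everything from position j onward moves down one slot, and the old kth nearest neighbor falls off the end.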
And then likewise, we shift our distances to these nearest neighbors, and we simply insert this new article here. So the distance to the jth nearest neighbor is now delta, this new distance we just computed, and we record the index of that article in the jth element of this list. Then at the end, after we've searched through all the articles in this corpus, we're going to return this little bookshelf list, the list of the k nearest articles that we found in the corpus. So it's a pretty straightforward extension of the 1-nearest neighbor algorithm. Walking through these algorithms, what we've seen is that there are really two critical choices that we need to make. One is, how are we going to represent our documents? So we have this thing, it's an article, it has a bunch of text, and somehow we need to translate that into a quantitative representation xq. And the other thing that's really, really critical to the performance of the method is, how are we going to compute the distance between two articles? And so both of these things, both the document representation as well as how we compute distances between two documents, are choices that we have to make, and they're really critical to what gets returned from this nearest neighbor search. [MUSIC]
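To make these two choices concrete, here is one simple possibility: a bag-of-words count representation paired with Euclidean distance. This is purely an illustration of where the two choices plug in; the vocabulary and function names are assumptions, and many other representations and distances are possible:

```python
from collections import Counter
import math

# Choice 1: document representation -- here, raw word counts over a
# fixed vocabulary (a bag-of-words vector).
def represent(text, vocabulary):
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocabulary]

# Choice 2: distance between two represented documents -- here, Euclidean.
def distance(a, b):
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

vocab = ["soccer", "goal", "election", "vote"]
sports = represent("goal goal soccer", vocab)        # [1, 2, 0, 0]
politics = represent("vote in the election", vocab)  # [0, 0, 1, 1]
```

Swapping in a different `represent` (say, TF-IDF weights) or a different `distance` changes which articles come back from the exact same nearest neighbor search.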