A few weeks ago, I mentioned that we were working on algorithms that could identify handwritten numbers with 97 percent accuracy. This is a classic example of a “supervised” learning problem: We’re telling the computer ahead of time what it’s looking for—one of 10 symbols—and helping it train by feeding it data in which the correct answer has already been provided.
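To make "the correct answer has already been provided" concrete, here is a minimal sketch of supervised learning in Python. It is not the course's algorithm: it assumes a simple nearest-neighbor rule, and the 3x3 bitmaps, the names `TRAINING_DATA` and `SLOPPY_SEVEN`, and the two-symbol setup are all invented for illustration.

```python
# Toy supervised learning: every training example carries its correct label.
# The 3x3 "images" are made up for illustration; they are not real handwriting data.

def distance(a, b):
    """Count the pixels where two equally sized bitmaps differ."""
    return sum(x != y for x, y in zip(a, b))


def classify(image, training_data):
    """Return the label of the training image that is closest to `image`."""
    _closest_pixels, closest_label = min(
        training_data, key=lambda pair: distance(image, pair[0])
    )
    return closest_label


# Labeled examples: (pixels, correct answer). 1 = ink, 0 = blank.
TRAINING_DATA = [
    ((0, 1, 0,
      0, 1, 0,
      0, 1, 0), "1"),
    ((1, 1, 0,
      0, 1, 0,
      0, 1, 0), "1"),
    ((1, 1, 1,
      0, 0, 1,
      0, 0, 1), "7"),
    ((0, 1, 1,
      0, 0, 1,
      0, 0, 1), "7"),
]

# A new, unlabeled symbol the program has never seen.
SLOPPY_SEVEN = (1, 1, 1,
                0, 0, 1,
                0, 1, 0)

print(classify(SLOPPY_SEVEN, TRAINING_DATA))  # prints 7
```

The program only ever chooses among labels a human supplied in advance, which is exactly what makes the problem "supervised."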
Now let’s imagine another scenario, one in which we give the computer no training and don’t even tell it that all of these unfamiliar handwritten symbols are numerals. This is “unsupervised” machine learning—the computer has to figure out what it’s looking at on its own. If the computer does its job, it will be able to figure out that it’s dealing with 10 distinct symbols, each of which might have a bunch of slight variations. Even so, it should hopefully be able to put the symbols into the appropriate categories—everything that looks like a “1” goes into one bucket, all of the “2”s go into another, and so forth.
In this example, humans would do just as well as computers, if not better. We would notice that there were 10 symbols (even if we didn’t know what they meant), and we could sort them ourselves, assuming we had enough time and patience. Unsupervised learning gets more interesting when the machines find patterns we could never identify ourselves. Most of the top contenders for the Netflix Prize, for example, didn’t build their recommendation engines using preconceived ideas of genre and taste in film. They just trained a computer to look for whatever patterns showed up, no matter how unexpected or obscure.
Most of the rest of Stanford’s machine-learning course is devoted to learning how to write these sorts of algorithms. I don’t yet have the programming chops to write my own unsupervised code without the direct supervision of the instructors, but I’ve started thinking about different subject areas where such programs would be illuminating.
Let’s say you were asked to take your favorite sport and divide the teams into two categories based on their style of play, ignoring structural groupings like leagues and divisions. As a baseball fan, my instinct would be to divide the major leagues into teams that emphasize pitching and those that focus on hitting. A football fan might divide the NFL into running teams and passing teams, and a basketball watcher could group the NBA into teams that run the floor and those that play at a slow pace.
Once we make it three categories, those obvious dichotomies are no longer as useful. Perhaps you could divide Major League Baseball into young teams, middle-aged teams, and ballclubs full of grizzled veterans. OK, now let’s ratchet up the assignment to five categories, or 10, or 20. Now would be a good time to stock up on graph paper.
To figure out how to divide baseball teams into five categories, I would feed the computer a huge mess of data (say, 100 different statistical categories for each team) and let it group them the same way it would group handwritten numbers, searching for similarities in the data. If there is a natural way to split teams into five groups, the machine stands a good chance of finding it.
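The grouping step described above can be sketched in a few lines of Python. This is a minimal illustration rather than the course's code: it assumes k-means clustering, one common unsupervised method, with a simple "farthest-first" way of picking starting points. The team names and numbers are made up, and it uses two statistics and three groups instead of 100 and five so the result is easy to check by eye.

```python
# Toy unsupervised learning: no labels, just numbers to be grouped by similarity.

def squared_distance(p, q):
    """Squared straight-line distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean_point(points):
    """Coordinate-wise average of a non-empty list of points."""
    return tuple(sum(column) / len(points) for column in zip(*points))


def k_means(points, k, max_rounds=100):
    """Assign each point a group number from 0 to k-1 using k-means."""
    # Seeding (an assumption of this sketch): start from the first point, then
    # keep adding the point farthest from every center chosen so far.
    centers = [points[0]]
    while len(centers) < k:
        farthest = max(
            points, key=lambda p: min(squared_distance(p, c) for c in centers)
        )
        centers.append(farthest)

    labels = None
    for _ in range(max_rounds):
        # Step 1: give every point the number of its nearest center.
        new_labels = [
            min(range(k), key=lambda i: squared_distance(p, centers[i]))
            for p in points
        ]
        if new_labels == labels:  # nothing moved, so the groups are stable
            break
        labels = new_labels
        # Step 2: move each center to the average of the points assigned to it.
        for i in range(k):
            members = [p for p, label in zip(points, labels) if label == i]
            if members:
                centers[i] = mean_point(members)
    return labels


# Hypothetical teams: (runs scored per game, runs allowed per game).
TEAMS = {
    "Team A": (5.6, 5.1),
    "Team B": (5.4, 4.9),
    "Team C": (3.9, 3.2),
    "Team D": (4.1, 3.4),
    "Team E": (5.5, 3.3),
    "Team F": (5.3, 3.5),
}

names = list(TEAMS)
labels = k_means([TEAMS[name] for name in names], k=3)

groups = {}
for name, label in zip(names, labels):
    groups.setdefault(label, []).append(name)
for members in groups.values():
    print(members)
```

With these invented numbers the teams fall into three pairs: high-scoring teams that also give up a lot of runs, low-scoring teams with stingy pitching, and teams that do both things well. Nobody told the program those categories existed. Calling `k_means` with `k=5` on real data works the same way, though statistics measured on very different scales would need to be rescaled first, and the algorithm will return five groups whether or not the split is meaningful.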
Until this week, this class had dealt primarily with cases where we wanted the computer to help us guess some predetermined piece of information. This was all interesting, but the goal was a little too practical for me. I wanted to take this course to develop a better understanding of how machines learn. This week helped satisfy that curiosity immensely, to the point that I think a lesson on unsupervised learning should come earlier in future semesters. For whatever reason, it’s innately human to want to categorize things. Learning how machines can help us do that, and without any of our biases and blind spots, is tremendously exciting.