What is Machine Learning? What is Cluster Analysis and Dimensionality Reduction?

865 views

I understand quite a bit about technology but I never understood how machine learning works.

In: Technology

5 Answers

Anonymous 0 Comments

Machine Learning is about generalizing from a set of examples. For example, you have some superhero figures and some toy cars, and when you see a new toy, you can decide if it looks more like a superhero or a car. Machine learning is when a computer can do this.

Cluster analysis – grouping all your toys into boxes, where each box will contain similar toys. Like when you decide to put superheroes into box 1, cars into box 2 and Lego pieces into box 3, without your parents giving you specific orders about what goes where.

Dimensionality Reduction – selecting or constructing a small number of features that can reasonably describe objects from a diverse set. Like when you have a large heap of toys and you notice that a toy’s size, shape and prevailing color determine what games can be played with it, with other properties being less relevant than those three.

Anonymous 0 Comments

Machine learning is where the machine runs many programs to try to reach a goal. Each of these programs has slight variations, and whichever gets closest to the goal moves on to the next generation, where all of the other programs inherit the variations that the best-performing one had. This repeats over many generations, and the machine gradually learns.

Anonymous 0 Comments

So, simplistically, machine learning is when you create a program that “learns” from information that you provide. You can make it learn simple things, like deciding whether a toy is a car or a doll, to use another comment’s example. The way it works is that you give it lots of pictures that you already know are cars or dolls. You tell it, “This is a car” or “This is a doll” for each one. Then it figures out what things are similar among all the car pictures and among all the doll pictures, and it can use that information to decide whether the next picture is a car or a doll.

Cluster analysis is just taking large amounts of information and putting it into groups based on how similar the data points are. Then you can choose just one of the data points in a cluster, and it’ll be a good representative of the rest of that cluster. Dimensionality reduction basically means that if you have many different pieces of information (or dimensions) about something, like its position, speed, etc., you can decide which dimensions are more important based on, for example, how much they vary. A large variation usually means that dimension carries more information.
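
As a rough sketch of that “how much they vary” idea, here is a small Haskell example that scores each dimension by its variance and keeps only the most variable ones. The rows of numbers and the choice to keep two dimensions are made up purely for illustration; real dimensionality reduction methods (like PCA) are more sophisticated than this.

-- Rough sketch of variance-based dimensionality reduction: measure how much
-- each dimension (column) varies across the data, then keep only the most
-- variable ones. The data and the "keep two" choice are invented.
import Data.List (sortBy, transpose)
import Data.Ord (Down (..), comparing)

variance :: [Double] -> Double
variance xs = sum [(x - mean) ^ 2 | x <- xs] / fromIntegral (length xs)
  where mean = sum xs / fromIntegral (length xs)

-- Rank the dimensions by variance and keep the indices of the top k.
topDimensions :: Int -> [[Double]] -> [Int]
topDimensions k rows =
  take k (map fst (sortBy (comparing (Down . snd)) ranked))
  where ranked = zip [0 ..] (map variance (transpose rows))

main :: IO ()
main = do
  -- Each row is one object; the columns could be position, speed, etc.
  let rows = [ [1.0, 10.0, 0.5]
             , [1.1, 40.0, 0.5]
             , [0.9, 25.0, 0.5] ]
  print (topDimensions 2 rows)  -- [1,0]: column 1 varies most, column 2 not at all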

Anonymous 0 Comments

ML is a branch of AI that aims to develop computer systems that learn and improve from experience without being explicitly programmed for every task. During this process, machines are provided with huge amounts of data, which they analyze for patterns and then learn from as examples. Over time, the systems are able to make their own decisions and adjust their actions accordingly.

Anonymous 0 Comments

Machine Learning is, in some sense, simply a branch of statistics. It is a general term for the process of determining a function based on analysis of *training* data, which can then be applied to *hidden* data (i.e. data which hasn’t yet been seen, possibly because it hasn’t been created yet). In normal computer programming, you (the programmer) tell the computer exactly what to do by writing functions which, when applied to inputs, yield the desired output. Sometimes though, those functions are very hard to write (like recognizing the difference between a cat and a dog in a picture), and this is where machine learning can help. Conceptually, rather than writing these hard-to-write functions by ourselves, we can give the computer a bunch of data and ask it to learn about the commonalities in that data.

Let’s imagine a very, very simple example. Imagine we’re trying to write a program which determines whether or not a light is on in a room. We have access to an ambient light sensor which gives as its output a number: higher numbers indicate the room is brighter, and lower numbers indicate the room is dimmer. This would be a very easy function for us to write by hand:

-- Treat the light as "on" whenever the sensor reading clears a fixed threshold.
lightOn :: Double -> Bool
lightOn brightness = brightness > 0.5

Or something like that. If the brightness is more than some threshold, then we say the light is on. Otherwise, it’s off.

We can view this problem in machine learning terms. We have one data point to work with, the brightness, which means that the *feature vector* of our problem has order one (i.e. one component). This also means that the *dimensionality* of our problem space is one. Geometrically, we’re looking at a line. Data points further to the left on the line have lower brightness, data points to the right on the line have higher brightness. What we need to do is figure out where on the line to draw our *discriminator*, which is to say, the threshold above which the light is “on”, and below which the light is “off”.

In the implementation above, it was 0.5. I just made that number up. If we were doing this with machine learning, we would have a ton of sensor readouts from different rooms, paired with whether or not the light was on in each room. Then, we would look at that whole corpus of data and determine, statistically, where the “cutoff” is between “on” and “off”. This would be our separator.
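
To make that concrete, here is a minimal sketch in the same Haskell style of one way the threshold could be “learned”. The training pairs and the halfway rule are made up for illustration; real methods are more statistical than this, but the point is the same: the 0.5 is no longer hard-coded by the programmer, it falls out of the data.

-- A sketch of "determine the cutoff statistically" from labeled examples.
-- The readings below are invented; a real corpus would contain many more.

-- (sensor reading, was the light actually on?)
trainingData :: [(Double, Bool)]
trainingData =
  [ (0.10, False), (0.22, False), (0.35, False)
  , (0.61, True),  (0.74, True),  (0.90, True) ]

-- One very simple rule: put the threshold halfway between the brightest
-- "off" example and the dimmest "on" example.
learnThreshold :: [(Double, Bool)] -> Double
learnThreshold examples = (brightestOff + dimmestOn) / 2
  where
    brightestOff = maximum [b | (b, on) <- examples, not on]
    dimmestOn    = minimum [b | (b, on) <- examples, on]

-- The learned version of lightOn from above.
lightOnLearned :: Double -> Bool
lightOnLearned brightness = brightness > learnThreshold trainingData

main :: IO ()
main = do
  print (learnThreshold trainingData)  -- 0.48 for the made-up readings above
  print (lightOnLearned 0.7)           -- True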

Now this is a very, very simple example. In more complex examples, the feature vectors may have dozens or even *hundreds* of components. Deep Learning models often have *thousands* of components in their feature vectors. The more components you have, the more dimensions in your space. In all cases, though, you’re still trying to draw a “line” to separate the “yes” answers from the “no” answers (when writing a true/false discriminator function).

Of course, as your dimensionality goes up, so too does the complexity of your dividing “line”. If your feature vector has one component (as ours does above), then your dividing “line” is just a point. If you have two components, then it’s a line (and it might not even be a straight one!). If you have three, then it’s a plane. At higher dimensions, we stop giving them names and just say “hyperplane”. Either way, hopefully it’s clear that the data you’re training on (and ultimately applying the function to!) is mapped onto points in higher dimensional space, and the discriminator function reduces the problem of figuring out yes/no answers to the geometrical problem of determining whether a point is above or below the separating hyperplane.

Note that you can generalize this slightly by also producing the *confidence* that a certain input is a yes (or a no) result, where that confidence is calculated based on how far the data point represented by the feature vector in question lies from the hyperplane.
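
As a hedged sketch of that geometric picture, here is a small Haskell example with invented weights: a feature vector is classified by which side of a hyperplane it falls on, and its signed distance from the hyperplane doubles as a crude confidence score. In a real system the weights and bias would themselves be learned from training data, just like the threshold above.

-- Linear discriminator in three dimensions (weights and bias are invented;
-- in practice they would be learned from labeled training data).
dot :: [Double] -> [Double] -> Double
dot xs ys = sum (zipWith (*) xs ys)

weights :: [Double]
weights = [0.8, -0.3, 1.5]

bias :: Double
bias = -0.2

-- The hyperplane is the set of points x where dot weights x + bias == 0.
-- A point is a "yes" if it lies on the positive side.
classify :: [Double] -> Bool
classify features = dot weights features + bias > 0

-- Signed distance from the hyperplane: larger magnitude means the point
-- sits further from the boundary, so we can treat it as a confidence.
confidence :: [Double] -> Double
confidence features = (dot weights features + bias) / norm
  where norm = sqrt (sum (map (^ 2) weights))

main :: IO ()
main = do
  let x = [0.5, 0.1, 0.9]
  print (classify x)    -- True for these made-up numbers
  print (confidence x)  -- roughly 0.88: comfortably on the "yes" side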

Of course, not all problems can be phrased in terms of yes/no. Speech recognition is a decent example. You can’t possibly have hundreds of thousands of yes/no discriminators, one for each word in the English language. So instead, you will use cluster analysis. Instead of producing a function which says “yes/no” in response to an input, you’ll produce a function which returns some result based on a finite set of possible answers.

This is *still* a very geometric process! Again, think of data points in space, where each data point is some input represented by a feature vector. Some of those data points will be close to each other, and far away from all the others, forming their own little island where the density of data points in space is higher. There could be *many* such islands. In terms of speech recognition, the distance between the islands probably corresponds to how much the words sound alike when spoken in a “normal” accent. For example, “dough” and “throw” are probably very close together, but “fight” and “war” are very very far apart (because they sound so different).

*Clustering* is the process of, again statistically, figuring out the size, shape, and position of all of these little islands. The goal is that you should be able to take some input, compute its feature vector, and figure out which island it is *closest* to. That island becomes the result of your function.
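
To ground the “closest island” idea, here is a small hedged Haskell sketch. It assumes the clustering step has already produced one center per island; the centers, the word labels, and the query vector are all made up. Assigning a new feature vector then reduces to finding the nearest center.

-- Sketch of the "closest island" step: assume some clustering procedure has
-- already found the center of each island; a new feature vector is then
-- assigned to whichever center it lies nearest. All numbers are invented.
import Data.List (minimumBy)
import Data.Ord (comparing)

distance :: [Double] -> [Double] -> Double
distance xs ys = sqrt (sum (zipWith (\x y -> (x - y) ^ 2) xs ys))

-- Each island is a label plus the center of its cluster of data points.
islands :: [(String, [Double])]
islands =
  [ ("dough", [0.2, 0.9])
  , ("throw", [0.3, 0.8])
  , ("war",   [0.9, 0.1]) ]

-- The result of the function is the island closest to the input.
nearestIsland :: [Double] -> String
nearestIsland point =
  fst (minimumBy (comparing (distance point . snd)) islands)

main :: IO ()
main = print (nearestIsland [0.22, 0.9])  -- "dough" for these made-up points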

BTW, in modern speech recognition techniques, this process is actually considerably more layered. What is generally done is you have a preliminary clustering model which first classifies the *audio* into strings of *phonemes* (since most languages only have about 30-40 different phonemes). These models are tricky because they need to take into account *time*. If you hear the term *convolutional*, you’ll know that you’re probably about to hear something about temporal or spatial invariance. Once you have a stream of phonemes (note: each phoneme will have a confidence score, and for some phonemes, there may be multiple possible answers!), you can feed *that* stream into the next stage in the pipeline. This could be a hard-and-fast grammar model (pioneered by Noam Chomsky), or it could be another classifier which attempts to chunk up those phonemes and classify them into words. Note that this model *also* must be convolutional, since it may be uncertain about what a particular word is until it hears the *next* word, which clarifies things. For example, distinguishing “dough” and “throw” is difficult until you hear “the ball”, at which point you know with absolute certainty what the first word was. Computer speech recognition functions work the same way.

Anyway, hopefully that gives you a general idea of where these terms come from and what they pertain to. When in doubt, always think geometrically. Almost everything in machine learning comes down to computing hyperplanes in hyperspatial problem domains with sparse data points represented by feature vectors.