Wayne Wang Stay Curious

Notes on machine learning theory

April 30, 2015 - Shanghai

The purposes of writing this article are

  • To understand the key points of machine learning theory.
  • More importantly, to record the process of learning something new, so that it can be recalled or referenced later to reveal how human beings gradually master new knowledge, compared with how a machine could potentially follow a similar learning procedure.

I am going to start with my overall feeling about machine learning and AI, then my understanding of machine learning theory from Andrew’s lecture, and finally end with my own thoughts.

My feeling right now is that "machine learning" and "artificial intelligence" are terms that tend to be overstated; current methods and theory all target very specific problems, which have the following characteristics,

So, it is actually far from what people usually expect this discipline to achieve,

Rome wasn’t built in a day. Real "intelligence" needs interdisciplinary study involving computer science, neuroscience, linguistics, philosophy, math, etc. Machine learning is just the subfield that explores the construction and study of algorithms that can "learn" from sample data and make predictions on new data. Though various machine learning methods can be used to solve lots of real-world problems smartly, they don’t attempt to simulate how human beings think; instead, they are just well designed. So, as a first step toward AI, let’s see the theory behind building a "smart" machine on a given data set.

I think the process of coming up with a good learning strategy for a machine is similar to how human beings discover a natural law, both of which aim to solve some problem. The process can be considered roughly as,

Machine learning theory, then, is the theory that aims to choose the best learning strategy for a given problem based on the sample data set. There are a few terminologies,

Quote from Andrew’s lecture, “there is a tradeoff between bias and variance. If our model is too “simple” and has very few parameters, then it may have large bias (but small variance); if it is too “complex” and has very many parameters, then it may suffer from large variance (but have smaller bias)”.
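
To make the quote concrete for myself, here is a small Python sketch of my own (the polynomial-fitting setup is just an illustration, not from the lecture) where the polynomial degree plays the role of model complexity.

```python
# A rough sketch (my own, not from the lecture): fit polynomials of increasing
# degree to 20 noisy samples of a sine curve and compare training error with
# error on fresh test data drawn from the same distribution.
import numpy as np

rng = np.random.default_rng(0)

def sample_data(n):
    x = rng.uniform(0, 1, n)
    y = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, n)  # true signal + noise
    return x, y

x_train, y_train = sample_data(20)
x_test, y_test = sample_data(1000)

for degree in (1, 3, 9):
    coeffs = np.polyfit(x_train, y_train, degree)       # degree = model "complexity"
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree={degree}  train MSE={train_mse:.3f}  test MSE={test_mse:.3f}")
# Degree 1 underfits (high bias, both errors large); degree 9 overfits
# (tiny training error, larger test error, i.e. high variance).
```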

One thing to notice is that probability theory plays a very important role in machine learning. Just think of real-world cases: the input data for a solvable problem almost always contains noise or too many random factors, and typically follows some distribution. There are two distributions considered by the theory,

This then gives guarantees on the generalization error of a given hypothesis in two aspects,

At this point, I find it amazing that even without knowing the definition of a particular hypothesis, I can already tell how well it will do at prediction, as long as I have a large enough training set to "try" it on. Lemma 2 below explains the theory behind this, and combined with Lemma 1, it shows how to pick the best hypothesis out of a group. (The following two lemmas are quoted from Andrew’s lecture.)

Lemma 1, the union bound. Let A1, A2, . . . Ak be k different events (that may not be independent). Then,

P(A1 ∪···∪Ak) ≤ P(A1)+...+P(Ak)

In probability theory, the union bound is usually stated as an axiom (and thus we won’t try to prove it), but it also makes intuitive sense: The probability of any one of k events happening is at most the sums of the probabilities of the k different events.
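
As a quick numerical sanity check of my own (the correlated-Gaussian events are just an arbitrary example), the snippet below estimates both sides of the union bound for events that are deliberately not independent.

```python
# Estimate P(A1 ∪ ... ∪ Ak) and P(A1) + ... + P(Ak) by simulation, where
# event Ai is "the i-th coordinate of a correlated Gaussian vector exceeds 1".
import numpy as np

rng = np.random.default_rng(1)
k, trials = 5, 100_000
shared = rng.normal(size=(trials, 1))          # shared component -> events are correlated
noise = rng.normal(size=(trials, k))
z = 0.8 * shared + 0.6 * noise                 # each column is marginally N(0, 1)

events = z > 1.0                               # indicator of each Ai per trial
p_union = events.any(axis=1).mean()            # left-hand side
sum_p = events.mean(axis=0).sum()              # right-hand side
print(f"P(union) ≈ {p_union:.3f}  <=  sum of P(Ai) ≈ {sum_p:.3f}")
```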

Lemma 2, Hoeffding inequality. Let Z1, . . . , Zm be m independent and identically distributed (IID) random variables drawn from a Bernoulli(φ) distribution. I.e., P(Zi =1)=φ, and P(Zi =0)=1−φ. Let φˆ=(1/m) Sum(Zi) (i from 1 to m) be the mean of these random variables, and let any γ > 0 be fixed. Then,

P(|φ − φˆ| > γ) ≤ 2 exp(−2γ²m)

This lemma (which in learning theory is also called the Chernoff bound) says that if we take φˆ—the average of m Bernoulli (φ) random variables—to be our estimate of φ, then the probability of our being far from the true value is small, so long as m is large. Another way of saying this is that if you have a biased coin whose chance of landing on heads is φ, then if you toss it m times and calculate the fraction of times that it came up heads, that will be a good estimate of φ with high probability (if m is large).
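
Here is a small simulation of my own to see the bound in action: flip a coin with bias φ, m times per experiment, over many experiments, and compare the empirical frequency of a deviation larger than γ with 2 exp(−2γ²m).

```python
# Monte Carlo check of the Hoeffding bound for a biased coin.
import numpy as np

rng = np.random.default_rng(2)
phi, gamma, trials = 0.3, 0.05, 10_000

for m in (50, 200, 1000):
    flips = rng.random((trials, m)) < phi          # trials independent experiments of m flips
    phi_hat = flips.mean(axis=1)                   # estimate of phi in each experiment
    empirical = np.mean(np.abs(phi_hat - phi) > gamma)
    bound = 2 * np.exp(-2 * gamma**2 * m)
    print(f"m={m:4d}  P(|phi_hat - phi| > gamma) ≈ {empirical:.4f}  bound = {bound:.4f}")
# The observed deviation probability always sits below the bound, and both
# shrink quickly as m grows.
```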

With these two lemmas, the mysterious veil over machine learning theory can be lifted. Let me restate the problem the theory wants to address: "given a learning problem and a group of pre-defined hypotheses, how can we make sure the picked hypothesis does the best job?" There are two cases (I only state the results and some of my understanding, and leave the details out). For the case of a finite hypothesis set, the theory is captured by the formula,

P(∀h ∈ H, |ε(h) − εˆ(h)| ≤ γ) ≥ 1 − 2k exp(−2γ²m)

There are three quantities of interest here: m, γ and P, which are the size of the training set, the gap between training error and generalization error, and the probability that this gap stays small for every hypothesis, respectively.

There are three corresponding statements, obtained by holding any two of these quantities fixed and solving the inequality for the third: for a desired γ and confidence, how large m must be (the sample complexity); for a given m and confidence, how small a gap γ can be guaranteed; and for given m and γ, with what probability the guarantee holds.
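
To get a feel for the tradeoff, here is a small helper of my own (the numbers plugged in are hypothetical) that sets 2k exp(−2γ²m) equal to a failure probability δ and solves for m or for γ.

```python
# Solve the finite-hypothesis uniform convergence bound for m or for gamma.
import math

def sample_complexity(k, gamma, delta):
    """Smallest m so that |ε(h) − εˆ(h)| ≤ γ for all k hypotheses with probability ≥ 1 − δ."""
    return math.ceil(math.log(2 * k / delta) / (2 * gamma ** 2))

def error_gap(k, m, delta):
    """Smallest γ guaranteed uniformly over k hypotheses with m samples and confidence 1 − δ."""
    return math.sqrt(math.log(2 * k / delta) / (2 * m))

print(sample_complexity(k=10_000, gamma=0.05, delta=0.01))  # m grows only with log(k)
print(round(error_gap(k=10_000, m=5_000, delta=0.01), 3))
```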

For the case of an infinite hypothesis set, the result is very similar to the finite one. Many hypothesis classes contain an infinite number of candidates; for example, a class parameterized by real numbers contains infinitely many functions, one for each setting of its parameters. The interesting thing is that we use computers to represent real numbers, with 64 bits for a double-precision floating-point number. So a class with d real-valued parameters is actually not infinite: it consists of at most k = 2^(64d) different hypotheses. Substituting this into the previous inequality gives

m ≥ O((1/γ²) log(2^(64d)/δ)) = O((d/γ²) log(1/δ)) = O_{γ,δ}(d)

Thus, the number of training examples needed is at most linear in the number of parameters of the model.
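
A rough back-of-the-envelope sketch of my own that plugs k = 2^(64d) into the same formula and shows the roughly linear growth in the number of parameters d:

```python
# m ≥ (1/(2γ²)) log(2 · 2^(64d) / δ) = (1/(2γ²)) (log(2/δ) + 64·d·log 2)
import math

def m_needed(d, gamma=0.05, delta=0.01):
    return math.ceil((math.log(2 / delta) + 64 * d * math.log(2)) / (2 * gamma ** 2))

for d in (1, 10, 100):
    print(d, m_needed(d))   # grows roughly in proportion to d
```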

Besides, one thing worth noting is that all the above results are proved for the learning strategy of empirical risk minimization (ERM). This is the most "basic" and very intuitive learning strategy: the hypothesis with the minimum training error is picked. (There are non-ERM learning strategies, but good theory for them is still an area of active research; I may dig into this later.)
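
As a toy illustration of ERM (my own sketch), take a finite class of threshold classifiers h_t(x) = 1{x > t}; the learner simply picks the threshold with the smallest training error.

```python
# ERM over a finite hypothesis class of 101 threshold classifiers.
import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(0, 1, 200)
y = (x > 0.6).astype(int)                        # true labels: threshold at 0.6
flip = rng.random(200) < 0.1
y = np.where(flip, 1 - y, y)                     # add ~10% label noise

thresholds = np.linspace(0, 1, 101)              # the finite hypothesis class, k = 101
train_errors = [np.mean((x > t).astype(int) != y) for t in thresholds]
best = thresholds[int(np.argmin(train_errors))]
print(f"ERM picks threshold {best:.2f} with training error {min(train_errors):.3f}")
```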

Finally, let me wrap all of this up a bit along three aspects: the issue, the theory, and my thoughts.

The issues are often those that need human intelligence, such as recognizing objects or making a decision, not how a planet orbits a star; in other words, they are not the kind of issues that can be solved by fixed functions or algorithms. Usually, there are many random factors that can’t be described by an exact mathematical model. So, probability theory plays an important role in machine learning.

The theory is just what I talked about above, and it can guide the following steps in solving a given learning issue,

This is very high level, and I know there are many different hypothesis classes, such as linear classifiers, neural networks, SVMs, or perceptrons, that are worth understanding in detail so as to know which class is better for which issue. Besides, picking the best hypothesis is not just checking candidates one by one; especially if the set is infinite, it is not possible to check each of them. Instead, there are many mathematical optimization techniques that can be used to "pick" one quickly, such as least squares or gradient descent. Overall, theory is good because it makes things clear, but a real understanding has to be gained by working on actual machine learning projects.
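
For instance, here is a minimal sketch of my own showing how a linear hypothesis can be "picked" by batch gradient descent on the least-squares objective, instead of enumerating candidates one by one.

```python
# Batch gradient descent on the least-squares objective for h_theta(x) = theta0 + theta1 * x.
import numpy as np

rng = np.random.default_rng(4)
x = rng.uniform(0, 1, 100)
y = 2.0 + 3.0 * x + rng.normal(0, 0.1, 100)      # data from a noisy line

X = np.column_stack([np.ones_like(x), x])        # add an intercept column
theta = np.zeros(2)
lr = 0.5
for _ in range(2000):
    grad = X.T @ (X @ theta - y) / len(y)        # gradient of (1/2m) * ||X theta - y||^2
    theta -= lr * grad
print(theta)                                     # should be close to [2.0, 3.0]
```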

My expectation of strong artificial intelligence is that it doesn’t have to be physically like a human being, but it has to pass the Turing test: when I talk to "it", I can hardly tell whether I am talking to a machine or a human. I think this is the long-term goal, and we are very far from it (maybe it will never be realized). The short-term goal of making machines "intelligent" enough to assist human beings and increase productivity is, I think, more feasible and of more practical significance.

As far as I can see, there have been no big breakthroughs toward the long-term goal since AI research began, because there are four key characteristics that human beings have but a machine doesn’t,

  1. Human beings are always curious, and can ask questions about the unknown.
  2. Human beings can do experiments, and discover natural laws.
  3. Human beings can induce and deduce from basic laws, and then solve a whole class of similar issues.
  4. Human beings can evolve, and good traits are encoded into their DNA.

Closing words: making machines more clever is an inevitable trend, because all kinds of machines are already part of our lives.


Update on 5/31/2015: Large Scale Deep Learning, a good summary from Google of the latest progress in machine learning, which showed that deep neural networks are an important tool for building intelligent systems. It is really exciting to see such solid and encouraging progress. When I have time, it is definitely worth learning all the details about DNNs, but before that let me just quote the conclusion from the article,

Deep neural networks are very effective for a wide range of tasks,

  • By using parallelism, we can quickly train very large and effective deep neural models on very large datasets.
  • Automatically build high-level representations to solve desired tasks.
  • By using embeddings, can work with sparse data.
  • Effective in many domains: speech, vision, language modelling, user prediction, language understanding, translation, advertising, ...