The following is an answer I wrote on Quora.
Quora: How can I avoid overfitting?
First of all, let’s have a look at an overfitted case.
Figure from Christopher Bishop’s book, Pattern Recognition and Machine Learning.
The blue dots are observed data generated by a noise model: $y = f(x) + \epsilon$, where $\epsilon$ could be zero-mean Gaussian noise.
If you fit the data with a very complex function, here a degree-9 polynomial, the red line is what you get. This is what we call “overfitting”. The polynomial tries its best to pass through the observed data, but in doing so it ends up fitting the noise rather than the underlying signal.
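To make this concrete, here is a minimal sketch of the same experiment. The ground-truth function sin(2πx), the noise level 0.3, and the 10 equally spaced points are my assumptions for illustration; they are not stated above.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)

# Assumed ground truth (not given in the text): f(x) = sin(2*pi*x).
def f(x):
    return np.sin(2 * np.pi * x)

# 10 noisy observations y = f(x) + eps, with eps ~ N(0, 0.3^2).
x_train = np.linspace(0.0, 1.0, 10)
y_train = f(x_train) + rng.normal(0.0, 0.3, size=x_train.shape)

# Dense, noise-free grid used to measure how well a fit generalizes.
x_test = np.linspace(0.0, 1.0, 200)

def rmse(a, b):
    return float(np.sqrt(np.mean((a - b) ** 2)))

results = {}
for degree in (3, 9):
    poly = Polynomial.fit(x_train, y_train, degree)
    train_err = rmse(poly(x_train), y_train)
    test_err = rmse(poly(x_test), f(x_test))
    results[degree] = (train_err, test_err)
    print(f"degree {degree}: train RMSE = {train_err:.3g}, test RMSE = {test_err:.3g}")
```

With 10 points, a degree-9 polynomial has enough freedom to pass through every observation, so its training error is essentially zero. That is exactly why the training error alone tells you nothing about how good the fit is.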
Now the same case, but with more observed data:
As you might expect, the fitted function gets smoother, even with the same degree-9 polynomial. With more data, the noise is averaged out and no longer dominates the fit. So the first countermeasure against overfitting is
Sol. (1): TRY TO GET MORE TRAINING DATA!
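A rough way to check this yourself is sketched below: the same degree-9 fit is repeated on small and large training sets, and the error against the noise-free function is averaged over several noisy datasets. The function sin(2πx), the noise level, and the sample sizes are assumed values chosen only for illustration.

```python
import numpy as np
from numpy.polynomial import Polynomial

def f(x):
    return np.sin(2 * np.pi * x)  # assumed ground truth

x_test = np.linspace(0.0, 1.0, 200)

def mean_test_rmse(n_train, degree=9, noise_std=0.3, n_trials=50, seed=0):
    """Average test RMSE of a polynomial fit over several noisy datasets."""
    rng = np.random.default_rng(seed)
    errors = []
    for _ in range(n_trials):
        x = np.linspace(0.0, 1.0, n_train)
        y = f(x) + rng.normal(0.0, noise_std, size=n_train)
        poly = Polynomial.fit(x, y, degree)
        errors.append(np.sqrt(np.mean((poly(x_test) - f(x_test)) ** 2)))
    return float(np.mean(errors))

rmse_small = mean_test_rmse(n_train=10)
rmse_large = mean_test_rmse(n_train=200)
print(f"10 points:  mean test RMSE = {rmse_small:.3g}")
print(f"200 points: mean test RMSE = {rmse_large:.3g}")
```

The model complexity is identical in both runs; only the amount of data changes.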
What if getting more data is not possible? (Unfortunately, this is the case in many real situations.) Then we need to control the complexity of the estimated function $f$, which can usually be done in the following ways.
Sol. (2): CROSS VALIDATION or MODEL SELECTION
This is rather straightforward. The objective is to find the best model setting, e.g., the polynomial degree, the weight of the penalty term, etc. To do so, you evaluate the model several times under various settings, using a proper evaluation criterion such as the error rate or the F1-score measured on held-out data, and choose the setting with the best score.
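As an illustration, here is a small hand-rolled k-fold cross-validation that selects the polynomial degree. The toy data (noisy samples of sin(2πx)), the helper name cv_mse, and the choice of 5 folds are assumptions of this sketch, not part of the original answer.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)

# Assumed toy data: y = sin(2*pi*x) + Gaussian noise.
n = 60
x = rng.uniform(0.0, 1.0, size=n)
y = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.3, size=n)

def cv_mse(x, y, degree, k=5):
    """Mean squared validation error of a polynomial fit, averaged over k folds."""
    # x was drawn at random, so consecutive index blocks are already random folds.
    folds = np.array_split(np.arange(len(x)), k)
    fold_errors = []
    for val_idx in folds:
        train_mask = np.ones(len(x), dtype=bool)
        train_mask[val_idx] = False
        # Fix the domain so the fit does not depend on each fold's x-range.
        poly = Polynomial.fit(x[train_mask], y[train_mask], degree, domain=[0.0, 1.0])
        fold_errors.append(np.mean((poly(x[val_idx]) - y[val_idx]) ** 2))
    return float(np.mean(fold_errors))

# Evaluate each candidate setting and keep the one with the lowest error.
cv_errors = {d: cv_mse(x, y, d) for d in range(1, 10)}
best_degree = min(cv_errors, key=cv_errors.get)
print("selected degree:", best_degree)
```

The important point is that each model is scored on data it has not seen during fitting, so a model that merely memorizes the noise is not rewarded.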
Sol. (3): REGULARIZATION
This technique is used in many models. Even when a model is not formulated with an explicit regularization term, it can often be shown to be equivalent to a regularized one. The basic idea is rather simple:
$\min_f \sum_{i=1}^{n} L(y_i, f(x_i)) + \lambda \, \Omega(f)$
Here $L$ is a loss function measuring the empirical error on the observations, and $\Omega(f)$ is the penalty term, which expresses that the complexity of $f$ should also be kept as small as possible. The coefficient $\lambda$ controls the trade-off between empirical loss and complexity. Note that if this objective is strictly convex, the solution is unique.
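One common concrete instance is ridge regression, where the loss is the squared error and the penalty is the squared norm of the weights. The sketch below applies it to a degree-9 polynomial; the toy data, the helper name ridge_polyfit, and the three penalty values are assumptions for illustration. Note that this simple version penalizes the intercept as well.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed toy data: 10 noisy samples of sin(2*pi*x).
x = np.linspace(0.0, 1.0, 10)
y = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.3, size=x.shape)

def ridge_polyfit(x, y, degree, lam):
    """Minimize sum_i (y_i - w . phi(x_i))^2 + lam * ||w||^2 in closed form."""
    Phi = np.vander(x, degree + 1, increasing=True)  # columns 1, x, ..., x^degree
    A = Phi.T @ Phi + lam * np.eye(degree + 1)
    return np.linalg.solve(A, Phi.T @ y)

# A larger penalty weight shrinks the coefficients, i.e. gives a simpler function.
lams = [1e-8, 1e-3, 1.0]
weight_norms = [float(np.linalg.norm(ridge_polyfit(x, y, 9, lam))) for lam in lams]
for lam, norm in zip(lams, weight_norms):
    print(f"lambda = {lam:g}: ||w|| = {norm:.3g}")
```

The polynomial degree stays at 9 throughout; only the penalty weight changes, which is what makes regularization a convenient complexity knob. In practice the penalty weight itself would be chosen by cross-validation.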
One more thing…
It’s also worthwhile to mention the bias-variance decomposition. When you have only a few data points and try to fit them with a very complex function, you are very likely to overfit. If you repeat this process many times on different samples, you will also notice that the fitted function differs from one run to the next. This is what we call variance: your model is quite unstable. This is easy to understand from the above, because the model is dominated by random noise, even though its training error (loosely, the bias) is minimal. There is also a hard trade-off between bias and variance, which plays a central role in machine learning.
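The effect can be estimated numerically by refitting the same model on many freshly sampled noisy datasets and looking at how the predictions spread. The sketch below does this for a simple (degree-1) and a complex (degree-9) polynomial; the ground truth sin(2πx), the noise level, and all sample sizes are assumed values.

```python
import numpy as np
from numpy.polynomial import Polynomial

def f(x):
    return np.sin(2 * np.pi * x)  # assumed ground truth

def bias_variance(degree, n_train=15, noise_std=0.3, n_repeats=200, seed=0):
    """Estimate squared bias and variance of a polynomial fit by resampling noise."""
    rng = np.random.default_rng(seed)
    x_train = np.linspace(0.0, 1.0, n_train)
    x_test = np.linspace(0.0, 1.0, 100)
    preds = np.empty((n_repeats, x_test.size))
    for r in range(n_repeats):
        y_train = f(x_train) + rng.normal(0.0, noise_std, size=n_train)
        preds[r] = Polynomial.fit(x_train, y_train, degree)(x_test)
    mean_pred = preds.mean(axis=0)
    bias_sq = float(np.mean((mean_pred - f(x_test)) ** 2))  # error of the average fit
    variance = float(np.mean(preds.var(axis=0)))            # spread around the average fit
    return bias_sq, variance

bias_lo, var_lo = bias_variance(degree=1)
bias_hi, var_hi = bias_variance(degree=9)
print(f"degree 1: bias^2 = {bias_lo:.3g}, variance = {var_lo:.3g}")
print(f"degree 9: bias^2 = {bias_hi:.3g}, variance = {var_hi:.3g}")
```

The simple model is stable but systematically wrong, while the complex model is right on average but changes a lot from one dataset to the next.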