Machine learning is currently one of the most in-demand skill sets in the IT industry. Machine learning problems can
generally be broken up into two categories:
-
Supervised ML problems - use features and labels associated with the
training data. Features are the measurable properties of each training example, while labels are the known
outputs (the answers) that the model learns to predict from those features.
-
Unsupervised ML problems - work with unlabelled data, so the algorithm has to find structure on its own, for example by clustering.
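The difference between the two categories can be sketched in a few lines. This is only a minimal illustration: the data values are made up, and the choice of a nearest-neighbour classifier and k-means is mine, not something the categories require.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neighbors import KNeighborsClassifier

# Made-up toy data: each row is one sample, each column is one feature.
X = np.array([[1.0, 1.1], [0.9, 1.0], [1.2, 0.8],
              [5.0, 5.2], [5.1, 4.9], [4.8, 5.0]])

# Supervised: the labels are supplied together with the features.
y = np.array([0, 0, 0, 1, 1, 1])
clf = KNeighborsClassifier(n_neighbors=3).fit(X, y)
predicted = clf.predict([[1.0, 0.9], [5.0, 5.0]])

# Unsupervised: only the features are given, and the algorithm
# has to discover the two groups by itself.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
cluster_ids = km.labels_

print(predicted)
print(cluster_ids)
```

Note that the cluster ids from k-means are arbitrary numbers: the algorithm finds the groups, but it has no idea what they mean.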
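scikit-learn (imported as sklearn) is a very useful open-source Python package for ML. It began as a Google Summer of
Code project and is now maintained by a community of contributors. It provides tools for six broad kinds of
ML task: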
-
Classification - Identifying which category an item belongs to, using a model trained on labelled examples.
-
Regression - Predicting a continuous value. It can learn the trend in the data and use it to predict the
next likely value. The accuracy of regression depends on the quantity and quality of the data and on
the choice of relevant features.
-
Clustering - Automatic grouping of similar items into sets.
-
Dimensionality reduction - Reduces the number of variables to consider, which helps in problems where
there are many features.
-
Model selection - Helps compare models and their parameters, to choose the best one for a set of data.
-
Preprocessing - Getting data in better shape for machine learning.
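Several of these tools are usually combined. Here is a rough sketch of how four of them might fit together in one pipeline. The wine dataset bundled with scikit-learn, logistic regression as the classifier, and the settings (2 components, 5 folds) are all my own choices for illustration.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Small dataset bundled with scikit-learn: 13 numeric features, 3 classes.
X, y = load_wine(return_X_y=True)

model = make_pipeline(
    StandardScaler(),                   # preprocessing: put features on the same scale
    PCA(n_components=2),                # dimensionality reduction: 13 features down to 2
    LogisticRegression(max_iter=1000),  # classification
)

# Model selection: 5-fold cross-validation gives an accuracy estimate
# that can be compared against other candidate models.
scores = cross_val_score(model, X, y, cv=5)
mean_accuracy = scores.mean()
print(mean_accuracy)
```

Swapping the last step for a different classifier and comparing the cross-validated scores is one simple way of choosing between models.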
You can find many useful datasets to play with on the internet. A popular source is the UCI Machine Learning Repository. It contains the 'Iris'
dataset, which is often the first dataset used to test ML algorithms. It is also one of the oldest datasets still in common use.
Its simplicity makes it a good starting point.
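You do not have to download Iris by hand to try it out, because scikit-learn ships a copy. A short sketch of loading and inspecting it (the variable names are mine):

```python
from sklearn.datasets import load_iris

iris = load_iris()
X, y = iris.data, iris.target

print(X.shape)             # (150, 4): 150 flowers, 4 features each
print(iris.feature_names)  # sepal and petal length and width, in cm
print(iris.target_names)   # the three species used as labels
```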
A large amount of the work in using ML algorithms goes into getting the data ready to be processed. For example,
scikit-learn is designed and optimised to work with numbers rather than strings, which is part of what lets it run quickly.
Often, some time is needed to clean and manipulate the data so that
it is in a format ready to be processed.
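As a small illustration of turning strings into numbers, the sketch below uses two encoders from sklearn.preprocessing. The animal and colour values are made up, and which encoder is appropriate depends on the data, so treat this as one possible approach rather than the only one.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Made-up string labels (the thing we want to predict).
animals = ["cat", "dog", "cat", "bird"]

# LabelEncoder maps each distinct label to an integer,
# numbering the classes in sorted order: bird=0, cat=1, dog=2.
label_encoder = LabelEncoder()
encoded_labels = label_encoder.fit_transform(animals)

# Made-up string feature. One-hot encoding creates one 0/1 column per
# category, so no artificial ordering is implied between the colours.
colours = np.array(["red", "green", "blue", "green"]).reshape(-1, 1)
one_hot = OneHotEncoder().fit_transform(colours).toarray()

print(encoded_labels)  # [1 2 1 0]
print(one_hot)         # columns are blue, green, red
```

If you need the original strings back, label_encoder.inverse_transform(encoded_labels) reverses the mapping.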
Some good books recommended to me for ML include the following:
-
Pattern Recognition and Machine Learning, by Christopher Bishop. It uses slightly more difficult
terminology.
-
Data Mining: Concepts and Techniques, by Han and Kamber. Easier to read.
-
Mining of Massive Datasets, by Leskovec, Rajaraman and Ullman. Chapter 7 has a great explanation of
clustering.