Getting Started with Machine Learning in Python

This chapter will introduce us to the vocabulary of machine learning and the common tasks that machine learning can be used for.

Afterward, we will learn how we can prepare our data for use in machine learning models. We have discussed data cleaning already, but only for human consumption; machine learning models require different preprocessing (cleaning) techniques. There are quite a few nuances here, so we will take our time with this topic. Since our models will only be as good as the data they are trained on, we will also discuss how we can use scikit-learn to build preprocessing pipelines that streamline this procedure.
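To make the idea of a preprocessing pipeline concrete before we get to the details, here is a minimal sketch. It assumes synthetic data generated with NumPy (not one of this chapter's datasets) and picks StandardScaler and LinearRegression purely for illustration; the pipelines we build later may use different steps.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (an assumption for illustration): two features on very
# different scales and a target that is an exact linear function of them.
rng = np.random.default_rng(0)
X = np.column_stack([rng.normal(0, 1, 100), rng.normal(0, 1000, 100)])
y = 3 * X[:, 0] + 0.002 * X[:, 1] + 1

# The pipeline standardizes the features and then fits the model, so the
# same preprocessing is applied automatically whenever we fit or predict.
pipeline = make_pipeline(StandardScaler(), LinearRegression())
pipeline.fit(X, y)

# R^2 on the training data; for this noise-free target it should be ~1.
r2 = pipeline.score(X, y)
print(f'R^2: {r2:.3f}')
```

Bundling the scaler and the model into one object means we can't accidentally forget to preprocess new data before making predictions.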

Next, we will walk through how we can use scikit-learn to build a model and evaluate its performance. Scikit-learn has a very user-friendly API, so once we know how to build one model, we can build any number of them. We won't be going into any of the mathematics behind the models; there are entire books on this, and the goal of this chapter is to serve as an introduction to the topic. By the end of this chapter, we will be able to identify what type of problem we are looking to solve and some algorithms that can help us, as well as how to implement them.
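As a rough preview of what that consistent API looks like, the following sketch fits two different classifiers with exactly the same calls. The data is generated with make_classification() and the two models are arbitrary choices for illustration; neither is necessarily what we will use on the wine or planet data.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic binary classification data (an assumption for illustration)
X, y = make_classification(n_samples=200, n_features=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Every estimator exposes fit() and score(), so swapping models only
# requires changing the object we instantiate.
scores = {}
for model in (LogisticRegression(), KNeighborsClassifier()):
    model.fit(X_train, y_train)
    scores[type(model).__name__] = model.score(X_test, y_test)

print(scores)  # accuracy on the held-out test set for each model
```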

The following topics will be covered in this chapter:

  • Overview of the machine learning landscape
  • Performing exploratory data analysis using skills learned in previous chapters
  • Preprocessing data for use in a machine learning model
  • Clustering to help understand unlabeled data
  • Learning when regression is appropriate and how to implement it with scikit-learn
  • Understanding classification tasks and learning how to use logistic regression

Chapter materials

In this chapter, we will be working with three datasets. The first two come from data on wine quality that was donated to the UCI Machine Learning Data Repository (http://archive.ics.uci.edu/ml/index.php) by P. Cortez, A. Cerdeira, F. Almeida, T. Matos, and J. Reis. The data contains information on the chemical properties of various wine samples, along with a quality rating from a blind tasting by a panel of wine experts. These files can be found in the data/ folder inside this chapter's folder in the GitHub repository (https://github.com/stefmolin/Hands-On-Data-Analysis-with-Pandas-2nd-edition/tree/master/ch_09) as winequality-red.csv and winequality-white.csv for red and white wine, respectively.

Our third dataset was collected using the Open Exoplanet Catalogue database, which can be found at https://github.com/OpenExoplanetCatalogue/open_exoplanet_catalogue/. This database provides data in eXtensible Markup Language (XML) format, which is similar to HTML. The planet_data_collection.ipynb notebook on GitHub contains the code that was used to parse this information into the CSV files we will use in this chapter; while we won't be going over this explicitly, I encourage you to take a look at it. The data files can be found in the data/ folder, as well. We will use planets.csv for this chapter; however, the parsed data for the other hierarchies is provided for exercises and further exploration. These are binaries.csv, stars.csv, and systems.csv, which contain data on binaries (stars or binaries forming a group of two), data on a single star, and data on planetary systems, respectively.
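For a sense of what parsing hierarchical XML into flat rows involves, here is a small sketch using only the standard library. The XML snippet, its tag names, and its values are simplified stand-ins made up for illustration; this is not the code from planet_data_collection.ipynb, and the real catalogue files contain many more fields.

```python
import xml.etree.ElementTree as ET

# A made-up, simplified system (hypothetical names and values)
sample = """
<system>
  <name>Example System</name>
  <star>
    <name>Example Star</name>
    <planet>
      <name>Example b</name>
      <period>3.5</period>
    </planet>
    <planet>
      <name>Example c</name>
      <period>12.25</period>
    </planet>
  </star>
</system>
""".strip()

root = ET.fromstring(sample)

# Flatten the nested structure into one dictionary per planet, which is
# the shape we would need before writing the data out to a CSV file.
rows = [
    {
        'system': root.findtext('name'),
        'name': planet.findtext('name'),
        'period': float(planet.findtext('period')),
    }
    for planet in root.iter('planet')
]

for row in rows:
    print(row)
```

A list of dictionaries like this can be passed directly to pd.DataFrame() and then saved with to_csv().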

We will be using the red_wine.ipynb notebook to predict red wine quality, the wine.ipynb notebook to classify wines as red or white based on their chemical properties, and the planets_ml.ipynb notebook to build a regression model to predict the year length of planets and perform clustering to find similar planet groups. We will use the preprocessing.ipynb notebook for the section on preprocessing.

Back in Chapter 1, Introduction to Data Analysis, when we set up our environment, we installed a package from GitHub called ml_utils. This package contains utility functions and classes that we will use for our three chapters on machine learning. Unlike the last two chapters, we won't be discussing how to make this package; however, those interested can look through the code at https://github.com/stefmolin/ml-utils/tree/2nd_edition and follow the instructions from Chapter 7, Financial Analysis -- Bitcoin and the Stock Market, to install it in editable mode.

The following are the reference links for the data sources:

  • Open Exoplanet Catalogue database, available at https://github.com/OpenExoplanetCatalogue/open_exoplanet_catalogue/#data-structure.
  • P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553, 2009. Available online at http://archive.ics.uci.edu/ml/datasets/Wine+Quality.
  • Dua, D. and Karra Taniskidou, E. (2017). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml/index.php]. Irvine, CA: University of California, School of Information and Computer Science.