Visualizing Data with Pandas and Matplotlib

By the end of this chapter, we will be well-equipped to quickly create a variety of visualizations in Python using pandas and matplotlib.

So far, we have been working with data strictly in a tabular format. However, the human brain excels at picking out visual patterns; hence, our natural next step is learning how to visualize our data. Visualizations make it much easier to spot aberrations in our data and explain our findings to others. That said, we should not reserve data visualizations exclusively for those we present our conclusions to: they are also crucial for understanding our data quickly and more completely during exploratory data analysis.

There are numerous types of visualizations that go well beyond what we may have seen in the past. In this chapter, we will cover the most common plot types, such as line plots, histograms, scatter plots, and bar plots, along with several other plot types that build upon these. We won't be covering pie charts, since they are notorious for being difficult to read properly, and there are better ways to get our point across.

Python has many libraries for creating visualizations, but the main one for data analysis (and other purposes) is matplotlib. The matplotlib library can be a little tricky to learn at first, but thankfully, pandas has its own wrappers around some of the matplotlib functionality, allowing us to create many different types of visualizations without needing to write a single line with matplotlib (or, at least, very few). For more complicated plot types that aren't built into pandas or matplotlib, we have the seaborn library, which we will discuss in the next chapter. With these three at our disposal, we should be able to create most (if not all) of the visualizations we desire. Animations and interactive plots are beyond the scope of this book, but you can check out the Further reading section for more information.
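To make the idea of pandas wrapping matplotlib concrete, here is a minimal sketch. The data is hypothetical and is not one of the chapter's datasets; the point is only that the plot() method on a DataFrame hands back a regular matplotlib Axes object, so matplotlib methods can still be used on the result when needed.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs without a display; omit in a notebook
import pandas as pd

# Hypothetical data for illustration only (not one of the chapter's datasets)
df = pd.DataFrame(
    {"value": [1, 3, 2, 5, 4]},
    index=pd.date_range("2018-01-01", periods=5, freq="D"),
)

# pandas draws the plot through matplotlib and returns the matplotlib Axes
ax = df.plot(kind="line", title="Hypothetical values")

# Since ax is a matplotlib Axes object, matplotlib methods still apply
ax.set_ylabel("value")
fig = ax.get_figure()
```

Here, a single call to plot() produces the line plot, and matplotlib is touched directly only to label the y-axis.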

In this chapter, we will cover the following topics:

  • An introduction to matplotlib
  • Plotting with pandas
  • The pandas.plotting module

Chapter materials

The materials for this chapter can be found on GitHub at https://github.com/stefmolin/Hands-On-Data-Analysis-with-Pandas-2nd-edition/tree/master/ch_05. We will be working with three datasets, all of which can be found in the data/ directory. In the fb_stock_prices_2018.csv file, we have the daily opening, high, low, and closing prices of Facebook stock from January through December 2018, along with the volume traded. This was obtained using the stock_analysis package, which we will build in Chapter 7, Financial Analysis -- Bitcoin and the Stock Market. The stock market is closed on the weekends, so we only have data for the trading days.
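As a rough sketch of how such a file might be read, the following uses a tiny in-memory sample in place of the actual CSV file. The column names (date, open, high, low, close, and volume) and all of the values are assumptions made for illustration. The check at the end shows what "only trading days" looks like in practice: no weekend rows, and a gap of three days between a Friday and the following Monday.

```python
import io
import pandas as pd

# Tiny hypothetical sample standing in for data/fb_stock_prices_2018.csv;
# the column names and the values are assumptions for illustration.
sample_csv = io.StringIO(
    "date,open,high,low,close,volume\n"
    "2018-01-05,10.0,11.0,9.5,10.5,1000\n"   # Friday
    "2018-01-08,10.5,11.5,10.0,11.0,1200\n"  # Monday
    "2018-01-09,11.0,12.0,10.5,11.5,900\n"   # Tuesday
)

# Parse the dates and use them as the index, which suits time series plots
fb = pd.read_csv(sample_csv, index_col="date", parse_dates=True)

# dayofweek is 0 for Monday through 6 for Sunday; trading days fall in 0-4
has_weekend_rows = bool((fb.index.dayofweek >= 5).any())

# The jump from Friday to Monday shows up as a 3-day gap between rows
gaps = fb.index.to_series().diff().dt.days
```

With the real file, the same read_csv() call would take the file path instead of the in-memory sample, provided the date column has the name assumed here.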

The earthquakes.csv file contains earthquake data collected from the United States Geological Survey (USGS) API (https://earthquake.usgs.gov/fdsnws/event/1/) for September 18, 2018 through October 13, 2018. For each earthquake, we have the value of the magnitude (the mag column), the scale it was measured on (the magType column), when (the time column) and where (the place column) it occurred, and the parsed_place column for the state or country where the earthquake occurred (we added this column back in Chapter 2, Working with Pandas DataFrames). Other unnecessary columns have been removed.

In the covid19_cases.csv file, we have an export from the daily number of new reported cases of COVID-19 by country worldwide dataset provided by the European Centre for Disease Prevention and Control (ECDC), which can be found at https://www.ecdc.europa.eu/en/publications-data/download-todays-data-geographic-distribution-covid-19-cases-worldwide. For scripted or automated collection of this data, the ECDC makes the current day's CSV file available via https://opendata.ecdc.europa.eu/covid19/casedistribution/csv. The snapshot we will be using was collected on September 19, 2020 and contains the number of new COVID-19 cases per country from December 31, 2019 through September 18, 2020, with partial data for September 19, 2020. For this chapter, we will look at the 8-month span from January 18, 2020 through September 18, 2020.
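Restricting a snapshot like this to the 8-month span can be sketched with label-based slicing on a DatetimeIndex. The DataFrame below is a hypothetical stand-in with one row per day and a made-up cases column; it does not reflect the real file's column names or values, which would first need their dates parsed into the index.

```python
import pandas as pd

# Hypothetical stand-in for the snapshot: one row per day with a made-up count
dates = pd.date_range("2019-12-31", "2020-09-19", freq="D")
covid = pd.DataFrame({"cases": range(len(dates))}, index=dates)

# Label-based slicing on a sorted DatetimeIndex includes both endpoints
subset = covid.loc["2020-01-18":"2020-09-18"]
```

Because both endpoints are included, the partial data for September 19, 2020 is excluded while September 18, 2020 is kept.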

Throughout this chapter, we will be working through three notebooks. These are numbered in the order they will be used, with one for each of the main sections of this chapter. We will begin our discussion of plotting in Python with an introduction to matplotlib in the 1-introducing_matplotlib.ipynb notebook. Then, we will learn how to create visualizations using pandas in the 2-plotting_with_pandas.ipynb notebook. Finally, we will explore some additional plotting options that pandas provides in the 3-pandas_plotting_module.ipynb notebook. You will be prompted when it is time to switch between the notebooks.