Visualizing statistical data
Statistics is a field of study that uses mathematical methods to analyze data.
Presenting statistical data is common across industries, including finance, government, and research institutions. Organizations often use statistical data for testing, predictive analysis, and more.
For statistical data, a boxplot chart is frequently used to show overall statistical information on the distribution of the data. It is also used to detect outliers in the data.
Let's start with an example where we have the height and gender of 20 individuals. Using boxplots, we will look at the distribution of Height for males and females:
# Defining a DataFrame
data_frame = pd.DataFrame({
'Height':[175,208,159,130,178,179,168,100,155,165,195,250,190,157,153,194,177,184,170,210],
'Gender':['F','M','F','F','M','M','F','M','F','F','M','M','F','M','M','M','F','M','M','F']})
# Display DataFrame values
data_frame
The output will be as follows:
Figure 8.40 -- A DataFrame containing the height and gender of 20 individuals
Now, we can plot a boxplot chart (sometimes referred to as a box-and-whisker plot) grouped by Gender:
data_frame.boxplot(by="Gender", column="Height");
The output should be as follows:
Figure 8.41 -- Plotting the distribution of height for each gender
Let's have a deeper look at this boxplot to understand each component of it:
- The bottom horizontal line represents the minimum value, excluding any outliers. For the female group, the lowest height (excluding outliers) was 155 cm.
- The bottom edge of the rectangle represents the first quartile, or 25th percentile. For the female group, about 25% of the heights fall below 159 cm.
- The middle line inside the rectangle represents the second quartile, also called the median. For the female group, about half of the heights fall below 168 cm.
- The top edge of the rectangle represents the third quartile, or 75th percentile. For the female group, about 75% of the heights fall below 177 cm.
- The top horizontal line represents the maximum value, excluding any outliers. For the female group, the highest height (excluding outliers) was 190 cm.
- The circles at the top and bottom represent the outliers. By default, these are points lying more than 1.5 times the height of the box (the interquartile range) beyond its edges. For the female group, they are 130 cm and 210 cm.
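Values read off a chart are approximate, so it can help to compute them directly. The following is a minimal sketch that uses only Python's standard library to work out the box, whiskers, and outliers for the female heights listed in the DataFrame above. It assumes the default whisker rule of 1.5 times the interquartile range, and the variable names are chosen for illustration only.

```python
from statistics import quantiles

# Female heights copied from the DataFrame defined earlier in this section
female_heights = sorted([175, 159, 130, 168, 155, 165, 190, 177, 210])

# method="inclusive" interpolates linearly between data points, which
# should match the quartiles drawn by the plotting library by default
q1, median, q3 = quantiles(female_heights, n=4, method="inclusive")
iqr = q3 - q1

# Assumed default rule: whiskers reach the most extreme points that lie
# within 1.5 * IQR of the box; anything beyond is drawn as an outlier
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
inside = [h for h in female_heights if lower_fence <= h <= upper_fence]
outliers = [h for h in female_heights if h < lower_fence or h > upper_fence]

print("Q1, median, Q3:", q1, median, q3)
print("Whiskers:", min(inside), max(inside))
print("Outliers:", outliers)
```

For this sample, the sketch reports quartiles of 159, 168, and 177 cm, whiskers at 155 and 190 cm, and outliers at 130 and 210 cm.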
In this section, you learned about handling statistical data. The preceding examples cover only a small part of statistics; it is worth learning other statistical measures, such as variance, standard deviation, and correlation. Data visualization with statistical data is a vast topic and beyond the scope of this chapter.
Exercise 8.02 -- Boxplots for the Titanic dataset
In this exercise, we will load the Titanic dataset, handle the missing data, and build a few boxplots to explore how different factors relate to the chances of survival.
The following steps will help you complete the exercise:
- Open a new Jupyter notebook file.
- Import the pandas, numpy, and matplotlib packages:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
- Next, load the CSV file as a DataFrame:
file_url = 'titanic.csv'
data_frame = pd.read_csv(file_url)
- Use the head() method to display the first five rows of the DataFrame and check that the data was properly loaded:
data_frame.head()
The output will be as follows:
Figure 8.42 -- A DataFrame displaying the first five rows of the dataset
- Remove the rows with missing data, and then display the DataFrame:
data_frame = data_frame.dropna()
data_frame
This will result in the following output:
Figure 8.43 -- A DataFrame displaying the dataset without missing data
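Called without arguments, dropna() removes a row if any of its columns is missing, which can discard more data than intended. The following is a minimal sketch of how to check what is missing and restrict the check to selected columns. The small DataFrame and its values are made up for illustration, and the cabin column is a hypothetical name that may not exist in titanic.csv.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the Titanic data; all values are made up
sample = pd.DataFrame({
    "survived": [0, 1, 1, 0],
    "age": [22.0, np.nan, 35.0, 54.0],
    "passenger_fare": [7.25, 71.28, np.nan, 51.86],
    "cabin": [np.nan, "C85", "C123", "E46"],  # hypothetical column
})

# Count the missing values in each column before dropping anything
missing_per_column = sample.isna().sum()
print(missing_per_column)

# With no arguments, a row is dropped if ANY column is missing
all_columns = sample.dropna()

# subset= limits the check to the columns that will be plotted
plot_columns = sample.dropna(subset=["age", "passenger_fare"])

print(len(sample), len(all_columns), len(plot_columns))
```

In this made-up sample, dropping on every column keeps one row, while checking only the two plotted columns keeps two.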
- Plot a boxplot of the 'age' column, grouped by 'survived':
data_frame.boxplot(by='survived', column='age');
This should result in the following output:
Figure 8.44 -- Plotting the distribution of age for each outcome
We can see that, in general, younger passengers had a higher chance of surviving: 75% of the survivors were under 36, compared with 39 in the other group. Moreover, the oldest passengers are classed as outliers regardless of the outcome. This might be due to the very small number of elderly passengers.
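Quartile values such as these can also be computed rather than read from the chart. The following is a minimal sketch that uses groupby() and quantile() on a small made-up DataFrame. The column names follow the exercise, but the ages and outcomes are hypothetical and are not taken from the Titanic dataset.

```python
import pandas as pd

# Hypothetical ages and outcomes; not taken from the Titanic dataset
sample = pd.DataFrame({
    "survived": [0, 0, 0, 0, 0, 1, 1, 1, 1, 1],
    "age": [20, 30, 40, 50, 60, 10, 20, 30, 40, 50],
})

# One row per outcome, one column per requested quantile
quartiles = (
    sample.groupby("survived")["age"]
    .quantile([0.25, 0.5, 0.75])
    .unstack()
)
print(quartiles)
```

The 0.75 column of the result corresponds to the top edge of each box, so the same call on the cleaned DataFrame should give the third-quartile ages for each outcome.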
- Plot a boxplot of the 'passenger_fare' column, grouped by 'survived':
data_frame.boxplot(by='survived', column='passenger_fare');
This should result in the following output:
Figure 8.45 -- Plotting the distribution of passenger fares for each outcome
It seems that the higher the fare a passenger paid, the higher their chance of survival. This can be seen from the position of the box for the survivor group, which sits higher than the box for the other group.
- Plot a boxplot of the 'ticket_class' column, grouped by 'survived', as follows:
data_frame.boxplot(by='survived', column='ticket_class');
You should see output as follows:
Figure 8.46 -- Plotting the distribution of the ticket class for each outcome
We can see that, in general, passengers from first class and second class had a higher chance of surviving, with at least half of the survivors traveling in first or second class.
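Because the ticket class takes only a few distinct values, a table of proportions is one alternative way to check the same reading. The following is a minimal sketch using pd.crosstab(), which is a different technique from the boxplot used in the exercise. The DataFrame is hypothetical, and its values are made up for illustration.

```python
import pandas as pd

# Hypothetical passengers; the values are made up for illustration
sample = pd.DataFrame({
    "ticket_class": [1, 1, 1, 2, 2, 3, 3, 3, 3, 3],
    "survived":     [1, 1, 0, 1, 0, 0, 0, 0, 1, 0],
})

# normalize="index" turns each row into proportions that sum to 1,
# so each cell is the share of that class with the given outcome
survival_by_class = pd.crosstab(
    sample["ticket_class"], sample["survived"], normalize="index"
)
print(survival_by_class)
```

Each row of the result shows, for one ticket class, the share of passengers who did not survive (column 0) and who survived (column 1).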
Now that we have seen how we can use a specific chart type (boxplot) to derive quick insights, we can move on to our next topic about visualizing multiple data plots.