Transforming back to real units
What if we wanted to get our X data back into the original units? To do this, we simply need to invert the equation for the transform and apply it using the scaling parameters.
We've already seen the general equation for the min/max transform. Here is the equation for the standardization transform:

X_scaled = (X - µ) / s

Here, µ is the mean of the data, and s is the standard deviation; inverting the equation gives X = X_scaled * s + µ. However, if we use the sklearn methods, we don't have to do this ourselves, as illustrated by the following snippet. Here, we use the scaler's .inverse_transform() method to restore the car data we transformed in the previous section. As with .transform() or .fit_transform(), the result is a numpy array, so we have to convert it back to a DataFrame and restore the column names:
X = scaler.inverse_transform(X_scaled)
X = pd.DataFrame(X)
X.columns = my_data.columns[1:-1]
X.head()
You'll see the following output upon running this code:
Figure 9.28 -- The X data transformed back to original units
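If you'd like to verify the algebra yourself, here is a minimal sketch (an illustration, not from the original text) that applies X = X_scaled * s + µ manually. It assumes numpy is imported as np and that scaler is the fitted StandardScaler from the previous section, whose learned parameters are exposed as mean_ and scale_:

# manual inverse of the standardization transform: X = X_scaled * s + µ
# scaler.mean_ and scaler.scale_ hold the fitted µ and s for each column
X_manual = X_scaled * scaler.scale_ + scaler.mean_
# should match .inverse_transform() up to floating-point error
print(np.allclose(X_manual, scaler.inverse_transform(X_scaled)))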
This concludes our introduction to scaling data for modeling. We've seen along the way how to construct simple linear regression models. Now, let's look into tools in Pandas, as well as some additional sklearn methods, that are useful for data modeling.
Exercise 9.02 -- Scaling and normalizing data
The Pandas DataFrame structure makes it easy to apply functions to subsets of columns of data. In this exercise, you will use that functionality to scale data. We scale the data so that it is on a common scale regardless of the model we eventually choose. Here, you will work again with the Austin weather dataset. You need to prepare the data before considering models to predict the events. You will load the data, address some issues with the data types, and then apply a scaler to transform the data:
- For this exercise, all you will need is the pandas library, numpy, two modules from sklearn, and matplotlib. Load them in the first cell of the notebook:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
import numpy as np
You are going to use the sklearn StandardScaler to scale the data in preparation for modeling:
- It's good practice to look at data before scaling, so you want to implement the utility function seen earlier in the chapter to plot a grid of histograms. The following code loops over the variables you pass in, checks whether there are too many bins (the number of slices in the histogram) and adjusts accordingly, uses the Pandas .hist() method (which uses matplotlib) to plot each histogram in its grid location, and adds a per-chart title showing the variable. You call the function by passing in a DataFrame, the variables you wish to plot, the rows and columns for the grid, and the number of bins. The Pandas slice notation ([:-1]) is used to pass all but the last column as your data (it doesn't make sense to plot the Events strings). Note that for some variables, there may be only a few unique values, which is why the function reduces the bins in those cases:
def plot_histogram_grid(df, variables, n_rows, n_cols, bins):
    fig = plt.figure(figsize = (11, 11))
    for i, var_name in enumerate(variables):
        ax = fig.add_subplot(n_rows, n_cols, i + 1)
        # if a variable has fewer unique values than bins, use that count instead
        if len(np.unique(df[var_name])) <= bins:
            use_bins = len(np.unique(df[var_name]))
        else:
            use_bins = bins
        df[var_name].hist(bins = use_bins, ax = ax)
        ax.set_title(var_name)
    fig.tight_layout()
    plt.show()
- Now, load the austin_weather.csv file into a DataFrame called weather_data, change Events as we did before, and inspect the result:
weather_data = pd.read_csv('Datasets\\austin_weather.csv')
weather_data.drop(columns = ['Date'], inplace = True)
# replace blank entries in Events with the string 'None'
weather_data['Events'] = ['None'
                          if weather_data['Events'][i] == ' '
                          else weather_data['Events'][i]
                          for i in range(weather_data.shape[0])]
weather_data.describe().T
The result should be as follows:
Figure 9.29 -- Using the .describe() method on the data
- From the preceding output, you can see that most of the columns were not read in as numeric, since only the TempHighF, TempAvgF, and TempLowF columns appear in the describe result. If this were a new, unknown dataset, you'd have to do more EDA to investigate what is in the data and how to address it. In this case, the issue is caused by the use of '-' to represent missing data and the T value in the precipitation columns to represent 'trace'. Use the Pandas .replace() method to replace '-' with np.nan and T with 0. After the replacement, print a list of rows with missing data:
weather_data.iloc[:, :-1] = \
    weather_data.iloc[:, :-1].replace(['-', 'T'],
                                      [np.nan, 0]).astype(float)
print(weather_data.loc[weather_data.isna().any(axis = 1), :].index)
Running this code will result in the following output:
Int64Index([174, 175, 176, 177, 596, 597, 598, 638, 639, 741, 742, 953,
1001, 1107],
dtype='int64')
Here, the data columns are sliced using :-1 for the columns in .iloc[], which skips the Events column, and then .replace() is used to change the values. The Pandas .replace() method can take lists of the things to replace and the replacement values, which means both '-' and T are handled at the same time. The row-selection code uses the Pandas .isna() method, which creates a DataFrame the same shape as what is passed, with True or False in it, and then the .any(axis = 1) method selects any row where a value is True (passing axis = 1 means looking across each row for any True values). Finally, we extract the index values with .index and print the result. You can see that there aren't very many rows with missing values now, so dropping those rows is a good approach.
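To see these mechanics in isolation, here is a small illustration on a toy DataFrame (the names demo, a, and b are made up for this sketch and do not appear in the exercise):

# toy frame with one missing value
demo = pd.DataFrame({'a': [1.0, np.nan], 'b': [2.0, 3.0]})
print(demo.isna())                                # element-wise True/False mask
print(demo.isna().any(axis = 1))                  # True for any row containing a NaN
print(demo.loc[demo.isna().any(axis = 1)].index)  # labels of the offending rows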
- Drop the rows with missing values, verify the result using .describe().T, and then plot histograms of all the variables using the utility function. Use the Pandas .dropna() method with axis = 0, telling the method to drop rows with missing values. Before printing, change the Pandas float format to 2 digits to make the output easier to read:
weather_data.dropna(axis = 0, inplace = True)
pd.set_option('display.float_format', lambda x: '%.2f' % x)
print(weather_data.describe().T)
The output should be similar to the following:
Figure 9.30 -- The cleaned data before scaling
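As an optional sanity check (not part of the original steps), you can confirm that no missing values remain:

# every row with a NaN should now be gone
print(weather_data.isna().any(axis = 1).sum())  # expect 0
print(weather_data.shape)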
- Visualize the variable distributions using the utility function to generate a grid of histograms:
plot_histogram_grid(df = weather_data.iloc[:, :-1],
                    variables = weather_data.iloc[:, :-1].columns,
                    n_rows = 5,
                    n_cols = 5,
                    bins = 25)
This produces the following:
Figure 9.31 -- The weather data variables before scaling
You can see some interesting features of the data that might affect modeling; for instance, PrecipitationSumInches is mostly 0, and WindHighMPH has an odd gap near 10 MPH. Several of the variables are skewed. As a first step, we'll proceed with scaling the data.
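If you want to quantify these observations, a quick check (a sketch, not part of the original text) is to compute the fraction of zero-precipitation days and the per-column skewness:

# fraction of days with zero recorded precipitation
print((weather_data['PrecipitationSumInches'] == 0).mean())
# skewness of each numeric column, largest first
print(weather_data.iloc[:, :-1].skew().sort_values(ascending = False))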
- Recall from the information leakage discussion that when splitting data into, say, train and validation sets, it's important to split first and then scale; otherwise, the scaler is fit with information from the validation data, which then leaks into training. Split the data 70/30 using train_test_split(). Remember to split the y values (Events) as well:
train_X, val_X, train_y, val_y = \
    train_test_split(weather_data.drop(columns = 'Events'),
                     weather_data['Events'],
                     train_size = 0.7,
                     test_size = 0.3,
                     random_state = 42)
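To confirm the proportions (an optional check, not in the original steps), print the shapes of the splits:

# row counts should be roughly 70% and 30% of the cleaned data
print(train_X.shape, val_X.shape)
print(train_y.shape, val_y.shape)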
- Now, scale all the numeric data using StandardScaler, and display the first five rows of the result:
scaler = StandardScaler()
# fit on the train split only, then apply the same transform to both splits
scaler = scaler.fit(train_X)
scaled_train = pd.DataFrame(scaler.transform(train_X))
scaled_train.columns = weather_data.columns[:-1]
scaled_val = pd.DataFrame(scaler.transform(val_X))
scaled_val.columns = weather_data.columns[:-1]
scaled_train.head()
The result should be as follows:
Figure 9.32 -- The scaled train split of the weather data
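As a quick sanity check (an addition to the original steps), the scaled train split should have per-column means near 0 and standard deviations near 1; the validation split will be close but not exact, since the scaler was fit on the train data only:

print(scaled_train.mean().round(2))  # approximately 0 for every column
print(scaled_train.std().round(2))   # approximately 1 for every column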
At this point, the data is in a form you can use for initial modeling -- you have train and validation splits, and the data is scaled. You should be comfortable with the key concepts: addressing missing, incorrectly formatted, or incorrectly typed data; making two or three splits of the data; fitting a scaler to the train data; and then applying the fitted scaler to the validation (and test) splits.
Activity 9.01 -- Data splitting, scaling, and modeling
You are charged with analyzing the performance of a combined cycle power plant and are given data on the full-load electrical power production along with environmental variables (such as temperature or humidity). In the first part of the activity, you will split the data manually and with sklearn; then you will scale the data, construct a simple linear model, and output the results:
- For this activity, all you will need is the Pandas library, the modules from sklearn, and numpy. Load them in the first cell of the notebook.
- Use the power_plant.csv dataset -- 'Datasets\\power_plant.csv'. Read the data into a Pandas DataFrame, print out the shape, and list the first five rows.
The independent variables are as follows:
- AT -- ambient temperature
- V -- exhaust vacuum level
- AP -- ambient pressure
- RH -- relative humidity
The dependent variable is EP -- electrical power produced.
- Split the data into train, val, and test sets with fractions of 0.8, 0.1, and 0.1, respectively, using Python and Pandas but not sklearn methods. You will use 0.8 for the train split because there is a large number of rows, so the validation and test splits will still have enough rows (a sketch of one possible approach follows this list).
- Repeat the split in step 3 but use train_test_split. Call it once to split off the train data, and then call it again to split what remains into val and test. Ensure that the row counts are correct in all cases.
- Fit StandardScaler() to the train data from step 3, and then transform the train, validation, and test X data. Do not transform the EP column, as it is the target.
- Fit a LinearRegression() model to the scaled train data, using the X variables to predict y (the EP column).
- Print the R2 score and the RMSE of the model on the train, validation, and test datasets.
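The following is a minimal sketch of one way to do the manual 80/10/10 split in the third step (one possible approach, not the book's solution; the name power_data is assumed for the loaded DataFrame):

# shuffle the row positions, then slice into 80/10/10
rng = np.random.default_rng(42)  # assumed seed, for reproducibility
idx = rng.permutation(power_data.shape[0])
n_train = int(0.8 * power_data.shape[0])
n_val = int(0.1 * power_data.shape[0])
train_df = power_data.iloc[idx[:n_train]]
val_df = power_data.iloc[idx[n_train:n_train + n_val]]
test_df = power_data.iloc[idx[n_train + n_val:]]
print(train_df.shape, val_df.shape, test_df.shape)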