Transforming back to real units

What if we wanted to get our X data back into the original units? To do this, we simply need to invert the equation for the transform and apply it using the scaling parameters.

We've already seen the general equation for the min/max transform. Here is the equation for the standardization transform:

x_scaled = (x − µ) / s

Here, µ is the mean of the data, and s is the standard deviation, so inverting the transform is simply x = x_scaled × s + µ. However, if we use the sklearn methods, we don't have to do this ourselves, as illustrated by the following snippet. Here, we use the scaler's .inverse_transform() method to restore the car data we transformed in the previous section. As with .transform() or .fit_transform(), the result is a numpy array, so we have to convert it back to a DataFrame and restore the column names:

X = scaler.inverse_transform(X_scaled)
X = pd.DataFrame(X)
X.columns = my_data.columns[1:-1]
X.head()

You'll see the following output upon running this code:
Figure 9.28 – The X data transformed back to original units
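For reference, you could also undo the standardization by hand. A fitted StandardScaler stores the per-column means in its .mean_ attribute and the per-column standard deviations in .scale_, so the following minimal sketch (assuming scaler is the fitted StandardScaler and X_scaled is the transformed array from the previous section) should reproduce the .inverse_transform() result:

# invert x_scaled = (x - mu) / s by computing x = x_scaled * s + mu
X_manual = X_scaled * scaler.scale_ + scaler.mean_
X_manual = pd.DataFrame(X_manual, columns = my_data.columns[1:-1])
X_manual.head()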

This concludes our introduction to scaling data for modeling. We've seen along the way how to construct simple linear regression models. Now, let's look into tools in Pandas as well as some additional sklearn methods that are useful for data modeling.

Exercise 9.02 -- Scaling and normalizing data

The Pandas DataFrame structure makes it easy to apply functions to subsets of columns of data. In this exercise, you will use that functionality to scale data. Scaling is worthwhile here because it gives you a common version of the dataset regardless of which model you later choose. You will work again with the Austin weather dataset. You need to prepare the data before considering models to predict the events, so you will load the data, address some issues with the data types, and then apply a scaler to transform the data:

  1. For this exercise, all you will need is the pandas library, numpy, two modules from sklearn, and matplotlib. Load them in the first cell of the notebook:
   import pandas as pd
   from sklearn.model_selection import train_test_split
   from sklearn.preprocessing import StandardScaler
   import matplotlib.pyplot as plt
   import numpy as np

You are going to use the sklearn StandardScaler class to scale the data in preparation for modeling:

  2. It's a good practice to look at data before scaling, so you want to implement the utility function seen earlier in the chapter to plot a grid of histograms. The following code loops over the variables you pass in, checks whether there are too many bins (the number of slices in the histogram) for a given variable and adjusts accordingly, uses the Pandas .hist() method (which uses matplotlib) to plot each histogram in its grid location, and adds a per-chart title showing the variable. You call the function by passing in a DataFrame, the variables you wish to plot, the rows and columns of the grid, and the number of bins. The Pandas slice notation ([:-1]) is used to pass all but the last column as your data (it doesn't make sense to "plot" the Events strings). Note that for some variables, there may be only a few unique values, which is why the function modifies the bins in those cases:
   def plot_histogram_grid(df, variables, n_rows, n_cols, bins):
       fig = plt.figure(figsize = (11, 11))
       for i, var_name in enumerate(variables):
           ax = fig.add_subplot(n_rows, n_cols, i + 1)
           # cap the bin count at the number of unique values
           if len(np.unique(df[var_name])) <= bins:
               use_bins = len(np.unique(df[var_name]))
           else:
               use_bins = bins
           df[var_name].hist(bins = use_bins, ax = ax)
           ax.set_title(var_name)
       fig.tight_layout()
       plt.show()
  3. Now, load the austin_weather.csv file into a DataFrame called weather_data, change Events as we did before, and inspect the result:
   weather_data = pd.read_csv('Datasets\\austin_weather.csv')
   weather_data.drop(columns = ['Date'], inplace = True)
   weather_data['Events'] = ['None'
                             if weather_data['Events'][i] == ' '
                             else weather_data['Events'][i]
                             for i in range(weather_data.shape[0])]
   weather_data.describe().T

The result should be as follows:
Figure 9.29 – Using the .describe() method on the data
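As an aside, the list comprehension that recodes Events can be written more compactly with the Pandas .replace() method; this one-liner (a sketch, equivalent under the assumption that the placeholder is a single space) does the same thing:

   weather_data['Events'] = weather_data['Events'].replace(' ', 'None')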

  4. From the preceding output, you can see that most of the columns were not read in numerically, as only the TempHighF, TempAvgF, and TempLowF columns are present in the describe result. If this were a new, unknown dataset, you'd have to do more EDA to investigate what is in the data and how to address it. In this case, the issue is caused by the use of '-' to represent missing data and the T value in precipitation columns to represent 'trace'. Use the Pandas .replace() method to replace '-' with np.nan and T with 0. After the replacement, print a list of rows with missing data:
   weather_data.iloc[:, :-1] = \
       weather_data.iloc[:, :-1].replace(['-', 'T'], 
                                         [np.nan, 0]).astype(float)
   print(weather_data.loc[weather_data.isna().any(axis = 1), :].index) 

Running this code will result in the following output:

Int64Index([174, 175, 176, 177, 596, 597, 598, 638, 639, 741, 742, 953,
            1001, 1107],
           dtype='int64')

Here, the data columns are sliced using :-1 in .iloc[], which skips the Events column, and then .replace() is used to change the values. The Pandas .replace() method can take a list of things to replace and a list of replacement values, which means '-' and T are handled in a single call. The missing-row lookup uses the Pandas .isna() method, which returns a DataFrame of the same shape filled with True or False, and then .any(axis = 1) collapses it to one value per row (looking across each row for any True values). Finally, we extract the index values with .index and print the result. You can see that only a handful of rows have missing values now, so dropping those rows is a good approach.
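To see the row selection in isolation, here is a minimal sketch on a toy DataFrame (the values are invented purely for illustration):

   demo = pd.DataFrame({'a': [1.0, np.nan, 3.0],
                        'b': [4.0, 5.0, np.nan]})
   # .isna() marks each cell; .any(axis = 1) flags rows with any NaN
   print(demo.loc[demo.isna().any(axis = 1)].index)  # rows 1 and 2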

  5. Drop the rows with missing values, verify the result using .describe().T, and then plot histograms of all the variables using the utility function. Use the Pandas .dropna() method with axis = 0, telling the method to drop rows with missing values. Before printing, change the Pandas float format to 2 digits to make the output easier to read:
   weather_data.dropna(axis = 0, inplace = True)
   pd.set_option('display.float_format', lambda x: '%.2f' % x)
   print(weather_data.describe().T)

The output should be similar to the following:
Figure 9.30 – The cleaned data before scaling

  6. Visualize the variable distributions using the utility function to generate a grid of histograms:
   plot_histogram_grid(df = weather_data.iloc[:, :-1],
                       variables = weather_data.iloc[:, :-1].columns,
                       n_rows = 5,
                       n_cols = 5,
                       bins = 25)

This produces the following:
Figure 9.31 – The weather data variables before scaling

You can see some interesting features of the data that might affect modeling; for instance, PrecipitationSumInches is mostly 0, WindHigh has an odd gap near 10 MPH, and several of the variables are skewed. As a first step, we'll proceed with scaling the data.
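If you want to quantify the skew rather than eyeball it, the Pandas .skew() method gives a per-column skewness value (values far from 0 indicate skewed variables); a quick check might look like this:

   print(weather_data.iloc[:, :-1].skew().sort_values(ascending = False))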

  7. Recall from the information leakage discussion that when splitting data into, say, train and validation sets, it's important to split first and then scale; otherwise, the scaler leaks information about the train data into the validation data. Split the data 70/30 using train_test_split(). Remember to split the y values (Events) as well:
   train_X, val_X, train_y, val_y = \
       train_test_split(weather_data.drop(columns = 'Events'),
                        weather_data['Events'],
                        train_size = 0.7,
                        test_size = 0.3,
                        random_state = 42)
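It's worth verifying the split before moving on; a quick sanity check of the resulting row counts might look like this:

   # train should hold roughly 70% of the rows, val roughly 30%
   print(train_X.shape, val_X.shape)
   print(train_X.shape[0] / weather_data.shape[0])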
  8. Now, scale all the numeric data using the StandardScaler method, and display the first five rows of the result:
   scaler = StandardScaler()
   # fit the scaler on the train split only, to avoid information leakage
   scaler = scaler.fit(train_X)
   scaled_train = pd.DataFrame(scaler.transform(train_X))
   scaled_train.columns = weather_data.columns[:-1]
   # apply the same fitted scaler to the validation split
   scaled_val = pd.DataFrame(scaler.transform(val_X))
   scaled_val.columns = weather_data.columns[:-1]
   scaled_train.head()

The result should be as follows:
Figure 9.32 – The scaled train split of the weather data
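As a quick sanity check, each scaled train column should have a mean near 0 and a standard deviation near 1; the validation columns will be close, but not exact, because the scaler was fitted on the train split only:

   print(scaled_train.mean().round(2))
   print(scaled_train.std().round(2))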

At this point, the data is in a form you can use for initial modeling -- you have train and validation splits, and the data is scaled. You should be comfortable with the key concepts: addressing missing, incorrectly formatted, or incorrectly typed data; making two or three splits of the data; fitting a scaler to the train split; and then applying the fitted scaler to the validation (and test) splits.
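As an aside, sklearn's Pipeline can bundle the scaler and a model so that the fit-on-train, transform-on-validation discipline is enforced automatically; a minimal sketch (not part of this exercise, with the final estimator chosen here only as an example of a classifier that suits the Events target) could look like this:

   from sklearn.pipeline import make_pipeline
   from sklearn.linear_model import LogisticRegression

   # .fit() fits the scaler (and model) on the train split only;
   # .predict()/.score() on other splits reuse those fitted parameters
   pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter = 1000))
   pipe.fit(train_X, train_y)
   pipe.score(val_X, val_y)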

Activity 9.01 -- Data splitting, scaling, and modeling

You are charged with analyzing the performance of a combined cycle power plant and are given data on the full-load electrical power production along with environmental variables (such as temperature or humidity). In the first part of the activity, you will split the data manually and with sklearn, then you will scale the data, construct a simple linear model, and output the results:

  1. For this activity, all you will need is the Pandas library, the modules from sklearn, and numpy. Load them in the first cell of the notebook.
  2. Use the power_plant.csv dataset -- 'Datasets\\power_plant.csv'. Read the data into a Pandas DataFrame, print out the shape, and list the first five rows.

The independent variables are as follows:

  • AT -- ambient temperature
  • V -- exhaust vacuum level
  • AP -- ambient pressure
  • RH -- relative humidity

The dependent variable is EP -- electrical power produced.

  3. Split the data into train, val, and test sets with fractions of 0.8, 0.1, and 0.1, respectively, using Python and Pandas but not sklearn methods (one possible approach is sketched after this list). You will use 0.8 for the train split because there is a large number of rows, so the validation and test splits will still have enough rows.

  4. Repeat the split in step 3, but use train_test_split. Call it once to split off the train data, and then call it again to split what remains into val and test.

  5. Ensure that the row counts are correct in all cases.

  6. Fit StandardScaler() to the train data from step 3, and then transform the train, validation, and test X. Do not transform the EP column, as it is the target.

  7. Fit a LinearRegression() model to the scaled train data, using the X variables to predict y (the EP column).

  8. Print the R2 score and the RMSE of the model on the train, validation, and test datasets.
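For step 3, one possible shape of the manual split (a sketch, not the full activity solution; df stands in for whatever name you give the power plant DataFrame) is to shuffle the row positions with numpy and slice them:

   # shuffle row positions, then cut into 80/10/10 partitions
   rng = np.random.default_rng(42)
   idx = rng.permutation(df.shape[0])
   n_train = int(0.8 * len(idx))
   n_val = int(0.1 * len(idx))
   train_df = df.iloc[idx[:n_train]]
   val_df = df.iloc[idx[n_train:n_train + n_val]]
   test_df = df.iloc[idx[n_train + n_val:]]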

    Note