Dealing with messy data
Messy data can occur for a wide variety of reasons.
For example, there are various forms of missing data, such as N/A, NA, None, Null, or any arbitrary number (in other words, -1, 999, 10,000, and more). It is important for analysts to understand the business meaning of the dataset they are handling during the data preparation process. By knowing the nature of missing values, the way that missing values are shown, and the data collection procedures that have triggered the occurrence of missing values, they can choose the best way to interpret this type of data.
Working on data without column headers
Often, the column headers in your data hold the preliminary information and business meaning. However, there is a chance that the column headers will be absent. This results in no specific information that can be derived to help understand the relationship between the headers and the content of the data.
Let's start with the example that we previously saw in Chapter 4, pandas Data Types:
# Importing pandas
import pandas as pd
# Defining lists
row1 = list([1001.0, 'Pandas Banking', 235000, 248000, 5.5, 2013,3,10, 0])
row2 = list([1002.0, 'Pandas Grocery', 196000, 205000, 4.5, 2016,4,30, 0])
row3 = list([1003.0, 'Pandas Telecom', 167000, 193000, 15.5, 2010,11,24, 0])
row4 = list([1004.0, 'Pandas Transport', 79000, 90000, 13.9, 2018,1,15, 1])
row5 = list([1005.0, 'Pandas Insurance', 241000, 264000, 9.5, 2009,6,1, 0])
# Defining a DataFrame
data_frame = pd.DataFrame(data=[row1, row2, row3, row4, row5])
# Display DataFrame values
data_frame
The output will be as follows:
Figure 7.1 -- A DataFrame without column headers
As you can see, the DataFrame doesn't show any column headers when it is generated. Luckily, pandas will generate unique numerical column headers in order for you to use the DataFrame.
Use the info()
method to get more details about the DataFrame:
# Display DataFrame info
data_frame.info()
The output will be as follows:
Figure 7.2 -- DataFrame information
Notice that pandas is using the column index number as the column header.
As demonstrated in Chapter 4, pandas Data Types, where you learned about dtypes, column 0 is currently a float. It needs to be converted into an int.
Convert column 0 into an int, as follows:
# Convert column 0 into a int
data_frame[0] = data_frame[0].astype('int')
data_frame[0]
The output will be as follows:
Figure 7.3 -- The conversion of the column into an int
Now, try summing columns 2 and 3 together (the result will not be stored inside the DataFrame):
# Summing column 2 and column 3 together
data_frame[2] + data_frame[3]
The output will be as follows:
Figure 7.4 -- A summation of columns 2 and 3
Here, you can see that every operation can still be performed on the columns by using their generated headers instead of their real names. Of course, working with data without column headers is not recommended as it leads to confusion and error.
Next, you will perform data manipulation to fix the missing headers:
- First, define a list of column headers that will replace the missing column headers:
# Create list of column headers
column_names = ["Customer ID", "Customer Name", "2018 Revenue", "2019 Revenue", "Growth", "Start Year", "Start Month", "Start Day", "New Customer"]
column_names
The output will be as follows:
Figure 7.5 -- Defining the column headers
- Now, use the
.columns
attribute of the DataFrame to replace the missing column headers:
# Replace missing column headers
data_frame.columns = column_names
data_frame
The output will be as follows:
Figure 7.6 -- The DataFrame after replacing the column headers
The following snippet is another example of how you can import a CSV file without column headers:
# Importing pandas
import pandas as pd
#Define the csv URL
file_url = ' https://raw.githubusercontent.com/PacktWorkshops/The-Pandas-Workshop/master/Chapter07/Data/retail_purchase_missing_headers.csv '
# Import the csv as a DataFrame
data_frame = pd.read_csv(file_url)
#Display the DataFrame
data_frame
The output will be as follows:
Figure 7.7 -- An alternative DataFrame example
As you can see, you have an issue with the column headers, as pandas is interpreting the first row of our CSV file as the column header. To avoid this situation, you can add the header=None
parameter to specify to pandas that you don't have any column headers when you import the CSV file:
# Import the csv as a DataFrame
data_frame = pd.read_csv(file_url, header=None)
#Display the DataFrame
data_frame
The output will be as follows:
Figure 7.8 -- Replacing the column headers with numbers
Here, pandas has imported the CSV file and generated the column headers. Now, you can replace the column headers:
# Create list of column headers
column_names = ["Receipt Id", "Date of Purchase", "Product Name", "Product Weight", "Total Price", "Retail shop name"]
# Replace missing column headers
data_frame.columns = column_names
data_frame
The output will be as follows:
Figure 7.9 -- Adding the correct headers
Now that you have learned how to create and assign column headers in your data, it is strongly recommended that you always have meaningful column headers, as this will help you to better understand your dataset.