Using Time in pandas

In this chapter, you will focus entirely on one kind of data that is, on one hand, quite common in data analysis but, on the other hand, requires a number of special considerations -- time series.

pandas provides many methods specifically for working with time series. Here, you will learn about time-aware data types (such as datetime and Timedelta). In Chapter 13, Exploring Time Series, you will learn how to use these in the index to enable advanced capabilities such as resampling to different time intervals, interpolating, and modeling as a function of time.

You will cover the following topics as you work through this chapter:

  • What are datetimes?
  • Activity 12.01 -- understanding power usage
  • Datetime math operations

Introduction to time series

Time series data is nearly ubiquitous but can be a pain point in many analyses. For example, suppose you are asked to forecast sales for a retail store and are given daily sales figures for the last 6 months. When you review the data, you realize the store is usually open 5 days a week but sometimes has sales on Saturdays and even some Sundays. As a result, most weekend days have missing values, and the interval between observations is inconsistent. Also, when you consider producing a monthly forecast, you realize that months are of different lengths and have varying numbers of sales days. Simple and obvious as these problems are, they complicate analyzing and modeling the data over time.
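To make the retail example concrete, here is a minimal sketch using made-up data. The date range, the single open Saturday, and the flat sales value of 100.0 are all assumptions chosen for illustration; only the pandas calls themselves are real.

```python
import pandas as pd

# Hypothetical store calendar: every weekday from January 2 to
# February 28, 2023, plus one Saturday on which the store opened.
weekdays = pd.bdate_range("2023-01-02", "2023-02-28")
open_saturday = pd.DatetimeIndex(["2023-01-14"])
dates = weekdays.append(open_saturday).sort_values()

# Placeholder sales figures (the values do not matter here).
sales = pd.Series(100.0, index=dates, name="sales")

# Reindexing onto every calendar day exposes the gaps as NaN.
all_days = pd.date_range("2023-01-02", "2023-02-28", freq="D")
full = sales.reindex(all_days)
missing_days = int(full.isna().sum())

# Months contain different numbers of sales days.
days_per_month = sales.groupby(sales.index.to_period("M")).size()

print(missing_days)
print(days_per_month)
```

With this made-up calendar, the reindexed series has a missing value for every closed weekend day, and the two months contribute different numbers of sales days, which is exactly the irregularity described above.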

The machine learning literature and popular articles are heavily biased toward classification problems, with little mention of time series. Yet much of the data we deal with is time series or at least starts out that way. Time series is a general term used to refer to data that is naturally ordered by time. For example, tweets arrive as a stream of timestamped data. Similarly, store transactions or online credit card transactions are time series. The log streams from data centers are time series.

It's important to note that, unlike tabular data in classification problems, time series data is ordered. With tabular data, samples are often shuffled randomly before being used in a model. In time series, the order matters and we generally want to preserve it. The temporal relationship of events is critical; we can only recognize unusual server traffic if we analyze the sequence of data compared to normal use periods. The time sequence of store transactions can be compared day to day and over longer periods to anticipate high-demand periods for inventory and staff planning. The examples are endless.

pandas has a wide range of features to work with time series data. The pandas documentation notes that pandas time series objects are based on the NumPy datetime64 and timedelta64 data types. pandas has consolidated many useful features from libraries such as scikits.timeseries and adds a great deal of additional functionality for working with time series data. In this chapter, we'll introduce some of the more important capabilities and review how to deal with timestamps in data. The key to understanding how time series differ from other pandas data is that pandas provides several additional object types, namely Timestamp, Timedelta, and Period; we will review these beginning in the next section, What are datetimes?. Also, recall the importance of the index in pandas; for many operations with time series, we will build the index from one of these object types instead of working exclusively with integers or labels, as we have in previous chapters. Making the index timestamp-based, for example, enables new functionality that simplifies manipulating time series. Let's get started by understanding datetimes.
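As a brief preview, the three object types and a timestamp-based index can be sketched as follows. The dates and readings are invented for illustration, and the variable names are hypothetical; this is only a first look, not a full treatment.

```python
import pandas as pd

# Timestamp: a single point in time.
ts = pd.Timestamp("2023-03-01 09:30")

# Timedelta: a duration. Adding it to a Timestamp gives a Timestamp.
delta = pd.Timedelta(days=1, hours=2)
later = ts + delta

# Period: a span of time, here the whole month of March 2023.
month = pd.Period("2023-03", freq="M")

# A Series indexed by timestamps rather than integers or labels.
idx = pd.date_range("2023-03-01", periods=5, freq="D")
readings = pd.Series([10, 12, 11, 15, 14], index=idx)

# With a DatetimeIndex, date strings can be used to slice;
# both endpoints are included.
first_three = readings.loc["2023-03-01":"2023-03-03"]

print(later)
print(month.days_in_month)
print(first_three)
```

Selecting rows by date strings, as in the last step, is one small example of the functionality that becomes available once the index holds timestamps.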