
How to plot kernel density plot of dates in Pandas?

I have a pandas dataframe where each observation has a date (stored in a column of dtype datetime64). These dates are spread over a period of about 5 years. I would like to plot a kernel-density plot of the dates of all the observations, with the years labelled on the x-axis.

I have figured out how to create a time-delta relative to some reference date and then create a density plot of the number of hours/days/years between each observation and the reference date:

df['relativeDate'].astype('timedelta64[D]').plot(kind='kde')

But this isn't exactly what I want: If I convert to year-deltas, then the x-axis is right but I lose the within-year variation. But if I take a smaller unit of time like hour or day, the x-axis labels are much harder to interpret.

What's the simplest way to make this work in Pandas?

Inspired by @JohnE's answer, an alternative way to convert the dates to numeric values is to use .toordinal().

import datetime

import pandas as pd
import numpy as np

# simulate some artificial data
# ===============================
np.random.seed(0)
dates = pd.date_range('2010-01-01', periods=31, freq='D')
df = pd.DataFrame(np.random.choice(dates,100), columns=['dates'])
# use toordinal() to get datenum
df['ordinal'] = [x.toordinal() for x in df.dates]

print(df)

        dates  ordinal
0  2010-01-13   733785
1  2010-01-16   733788
2  2010-01-22   733794
3  2010-01-01   733773
4  2010-01-04   733776
5  2010-01-28   733800
6  2010-01-04   733776
7  2010-01-08   733780
8  2010-01-10   733782
9  2010-01-20   733792
..        ...      ...
90 2010-01-19   733791
91 2010-01-28   733800
92 2010-01-01   733773
93 2010-01-15   733787
94 2010-01-04   733776
95 2010-01-22   733794
96 2010-01-13   733785
97 2010-01-26   733798
98 2010-01-11   733783
99 2010-01-21   733793

[100 rows x 2 columns]    

# plot non-parametric kde on numeric datenum
ax = df['ordinal'].plot(kind='kde')
# rename the xticks with labels
x_ticks = ax.get_xticks()
ax.set_xticks(x_ticks[::2])
xlabels = [datetime.datetime.fromordinal(int(x)).strftime('%Y-%m-%d') for x in x_ticks[::2]]
ax.set_xticklabels(xlabels)
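
If relabelling the ticks by hand feels fragile, one possible variation is to evaluate the density yourself and store it in a Series with a DatetimeIndex, so that pandas labels the x-axis with dates when it is plotted. The sketch below is not part of the original answer: the helper name date_kde, the num_points parameter and the 10% padding are all assumptions made for illustration. It uses scipy.stats.gaussian_kde, which is the estimator pandas relies on for kind='kde'.

```python
import numpy as np
import pandas as pd
from scipy.stats import gaussian_kde


def date_kde(dates, num_points=200):
    """Hypothetical helper: KDE of a datetime column, indexed by date."""
    dates = pd.to_datetime(pd.Series(dates)).dropna()
    # Express each date as float days since the Unix epoch.
    days = (dates - pd.Timestamp(0)) / pd.Timedelta(days=1)
    kde = gaussian_kde(days.to_numpy(dtype=float))
    # Evaluate on an evenly spaced grid, padded by 10% of the data range
    # (an arbitrary choice) so the tails of the curve are visible.
    pad = 0.1 * (days.max() - days.min())
    grid = np.linspace(days.min() - pad, days.max() + pad, num_points)
    # Map the numeric grid back to timestamps for the index.
    index = pd.Timestamp(0) + pd.to_timedelta(grid, unit='D')
    return pd.Series(kde(grid), index=index, name='density')
```

Calling date_kde(df['dates']).plot() should then draw the curve with date labels on the x-axis. Note that the density values are per day, so they look small when the data span several years.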


I imagine there is some better and automatic way to do this, but if not then this ought to be a decent workaround. First, let's set up some sample data:

np.random.seed(479)
start_date = '2011-1-1'
df = pd.DataFrame({ 'date':np.random.choice( 
                    pd.date_range(start_date, periods=365*5, freq='D'), 50) })

# whole days elapsed since start_date
# (.dt.days is used because recent pandas versions reject astype('timedelta64[D]'))
df['rel'] = (df['date'] - pd.to_datetime(start_date)).dt.days

        date   rel
0 2014-06-06  1252
1 2011-10-26   298
2 2013-08-24   966
3 2014-09-25  1363
4 2011-12-23   356

As you can see, 'rel' is just the number of days since the starting day. It's essentially an integer, so all you really need to do is normalize it with respect to the starting date.

df['year_as_float'] = pd.to_datetime(start_date).year + df.rel / 365.

        date   rel  year_as_float
0 2014-06-06  1252    2014.430137
1 2011-10-26   298    2011.816438
2 2013-08-24   966    2013.646575
3 2014-09-25  1363    2014.734247
4 2011-12-23   356    2011.975342

You'd need to adjust that slightly for a start date other than Jan 1. It also ignores leap years, which isn't a practical issue if you're just producing a KDE plot over 5 years, but it could matter depending on what else you might want to do.
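
If those two caveats do matter, a possible refinement is to compute the fractional year from each date's own calendar year, so no reference date is needed and leap years are handled. This is only a sketch, not something from the original answer; the function name year_as_float is an assumption chosen to match the column name above.

```python
import pandas as pd


def year_as_float(dates):
    """Hypothetical helper: convert datetimes to fractional years."""
    dates = pd.to_datetime(pd.Series(dates))
    years = dates.dt.year
    # Midnight on Jan 1 of each date's own year.
    year_start = pd.to_datetime(
        pd.DataFrame({'year': years, 'month': 1, 'day': 1}))
    # 366 days in leap years, 365 otherwise.
    days_in_year = 365 + dates.dt.is_leap_year.astype(int)
    # Days (including any time-of-day fraction) elapsed within the year.
    elapsed = (dates - year_start) / pd.Timedelta(days=1)
    return years + elapsed / days_in_year
```

With this, df['year_as_float'] = year_as_float(df['date']) could replace the division by 365 above, and the rest of the approach stays the same.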

Here's the plot

df['year_as_float'].plot(kind='kde')


