How to plot Pandas datetime series in Seaborn distplot?

I have a pandas dataframe with a datetime column. I would like to plot the distribution of the rows according to that date column, but I'm currently getting an unhelpful error. I have:

df['Date'] = pd.to_datetime(df['Date'], errors='raise')
s = sns.distplot(df['Date'])

which throws the error:

TypeError: ufunc add cannot use operands with types dtype('<M8[ns]') and dtype('<M8[ns]')

If I change the column I'm plotting to numeric data then it all works fine. How can I get the datetime column to behave nicely? I can't find much about this in the docs. Any and all help appreciated.

Below is the result of df.head(2); I have removed some columns for security reasons etc:

               Date                 
2812         2016-03-05
2813         2016-03-05

Apparently the column (when taken as a series) has the properties

Name: Date, dtype: datetime64[ns]

I came across this question while having the same problem myself. As mentioned in the comments, it seems that seaborn's distplot doesn't support dates. Unfortunately, I could not find anything in the official documentation to support this claim.
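The error message itself can be reproduced without seaborn. The sketch below only shows that NumPy refuses to add two datetime64 values; that distplot performs such an addition somewhere internally is an assumption based on the traceback text, not something I have traced.

```python
import numpy as np

a = np.datetime64("2016-03-05")
b = np.datetime64("2016-02-05")

# Subtracting two dates is well defined and yields a timedelta.
diff = a - b
print(diff)

# Adding two dates has no meaning, so NumPy raises a TypeError
# with the same "ufunc add cannot use operands" wording.
try:
    a + b
    add_error = None
except TypeError as exc:
    add_error = exc
    print(type(exc).__name__, exc)
```

This is why turning the dates into plain numbers first, as in Option 1 below, sidesteps the problem.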

I found two ways to deal with this problem. Neither is perfect, but they are the best I found.

Option 1: Convert dates to numbers

Convert the dates to some numeric metric and work with that. distplot works with numbers, so if each date is represented by a number we will be okay. The mapping between dates and numbers is a bit like using a MinMax scaler. For example, we can set "2017-01-01" as 0 and "2020-06-06" as 1, and map all dates between them to values in the range [0,1].

Which unit to use depends on the range of your data: it could be days, months, years, etc.

I'll demonstrate this approach with this toy example.

import pandas as pd
import datetime as dt

original_dates = ["2016-03-05", "2016-03-05", "2016-02-05", "2016-02-05", "2016-02-05", "2014-03-05"]
dates_list = [dt.datetime.strptime(date, '%Y-%m-%d').date() for date in original_dates]

df = pd.DataFrame({"Date":dates_list})

Now the dataframe is as follows:

         Date
0  2016-03-05
1  2016-03-05
2  2016-02-05
3  2016-02-05
4  2016-02-05
5  2014-03-05

(Not the best way to enter dates into a dataframe of course, but how they get there doesn't matter.)

Now I create a new column which holds the difference in days between each date and the minimum date:

df["NewDate"] = df["Date"] - dt.date(2014,3,5)
df["NewDate"] = df["NewDate"].apply(lambda x: x.days)

Result:

         Date  NewDate
0  2016-03-05      731
1  2016-03-05      731
2  2016-02-05      702
3  2016-02-05      702
4  2016-02-05      702
5  2014-03-05        0

Notice that I hard-coded the minimum date. You can compute the minimum from the data instead of hard-coding it. I just wanted to get through this part as fast as possible.
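A minimal sketch of the same step without the hard-coded date, assuming the column is stored as datetime64 (via pd.to_datetime) rather than as Python date objects as in the toy example above:

```python
import pandas as pd

original_dates = ["2016-03-05", "2016-03-05", "2016-02-05",
                  "2016-02-05", "2016-02-05", "2014-03-05"]
df = pd.DataFrame({"Date": pd.to_datetime(original_dates)})

# Subtract the earliest date found in the data, then take whole days.
df["NewDate"] = (df["Date"] - df["Date"].min()).dt.days
print(df)
```

With a datetime64 column the .dt.days accessor replaces the apply(lambda x: x.days) call, and the values match the table above.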

Now we can use distplot on our new column:

import seaborn as sns
sns.set()
ax = sns.distplot(df['NewDate'])

Output:

[Image: Seaborn distplot with dates]

As you can see, it shows days instead of dates. For my own problem it was okay to show it that way. If you want to show dates, an extra step is needed: Show xticks which are function of x-axis, not directly the data it self. Example with dates (pandas, matplotlib)
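One way to get date labels, sketched here under the assumption that the x-axis holds day offsets from the minimum date as in this example, is to convert chosen tick positions back into date strings. The tick positions below are hypothetical values picked for illustration.

```python
import pandas as pd

min_date = pd.Timestamp("2014-03-05")  # the date that was mapped to 0
tick_days = [0, 365, 731]              # hypothetical tick positions, in days

# Turn each day offset back into the calendar date it stands for.
tick_labels = [
    (min_date + pd.Timedelta(days=d)).strftime("%Y-%m-%d")
    for d in tick_days
]
print(tick_labels)  # ['2014-03-05', '2015-03-05', '2016-03-05']
```

On the matplotlib Axes returned by the plotting call, ax.set_xticks(tick_days) followed by ax.set_xticklabels(tick_labels, rotation=45) would then place those labels on the axis.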

As I said earlier, I scaled by the difference in days, but you can do the same with months or years. It depends on the data.
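The [0,1] mapping mentioned at the start of this option can be sketched in the same spirit. The end points are the ones from the example in the text; the middle date is made up for illustration.

```python
import pandas as pd

dates = pd.Series(pd.to_datetime(["2017-01-01", "2018-09-19", "2020-06-06"]))

start = pd.Timestamp("2017-01-01")  # mapped to 0
end = pd.Timestamp("2020-06-06")    # mapped to 1

# Dividing one timedelta by another gives a plain float.
scaled = (dates - start) / (end - start)
print(scaled)
```

The resulting float column can be passed to the plotting function just like the day counts.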

Option 2: Use a histogram directly, without seaborn's distplot

In this question: Can Pandas plot a histogram of dates? there is an answer showing how to plot a histogram of dates using pandas's groupby.

It's not the same as distplot, but it can be a close-enough solution (as distplot is ultimately based on matplotlib's hist).
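A rough sketch of that groupby idea on the toy data, counting rows per calendar month. Grouping by month is my choice here for illustration; the linked answer may group differently.

```python
import pandas as pd

original_dates = ["2016-03-05", "2016-03-05", "2016-02-05",
                  "2016-02-05", "2016-02-05", "2014-03-05"]
df = pd.DataFrame({"Date": pd.to_datetime(original_dates)})

# Count how many rows fall into each year-month bucket.
counts = df.groupby(df["Date"].dt.to_period("M")).size()
print(counts)
```

Calling counts.plot(kind="bar") then draws the bars with matplotlib. Months with no rows do not appear in the result, so the bars are not spaced on a continuous time axis.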

You could convert the dates to Categorical type and plot the resulting codes (which are integers). Then label the x-ticks with the dates (as categories).

import pandas as pd
import seaborn as sns

original_dates = [
    "2016-03-05", "2016-03-05", "2016-02-05",
    "2016-02-05", "2016-02-05", "2014-03-05"]
dates_list = pd.to_datetime(original_dates)

df = pd.DataFrame({"Date": dates_list})
df['date-as-cat'] = df['Date'].astype('category')  # new 
df['codes'] = df['date-as-cat'].cat.codes          # new 

print(df)
print(df.dtypes)

        Date date-as-cat  codes
0 2016-03-05  2016-03-05      2
1 2016-03-05  2016-03-05      2
2 2016-02-05  2016-02-05      1
3 2016-02-05  2016-02-05      1
4 2016-02-05  2016-02-05      1
5 2014-03-05  2014-03-05      0

Date           datetime64[ns]
date-as-cat          category
codes                    int8
dtype: object 

The date-as-code and date-as-category info is obtained like this:

x = df[['codes', 'date-as-cat']].drop_duplicates().sort_values('codes')
print(x)

   codes date-as-cat
5      0  2014-03-05
2      1  2016-02-05
0      2  2016-03-05
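The labelling step is not shown above, so here is a minimal sketch of how the tick positions and labels could be built from the categories. The variable names tick_positions and tick_labels are mine, not from the answer.

```python
import pandas as pd

original_dates = ["2016-03-05", "2016-03-05", "2016-02-05",
                  "2016-02-05", "2016-02-05", "2014-03-05"]
df = pd.DataFrame({"Date": pd.to_datetime(original_dates)})
df["date-as-cat"] = df["Date"].astype("category")
df["codes"] = df["date-as-cat"].cat.codes

# Categories are sorted, and code i refers to categories[i].
categories = df["date-as-cat"].cat.categories
tick_positions = list(range(len(categories)))
tick_labels = list(categories.strftime("%Y-%m-%d"))
print(dict(zip(tick_positions, tick_labels)))
```

After plotting df['codes'], calling ax.set_xticks(tick_positions) and ax.set_xticklabels(tick_labels) on the returned Axes puts the dates on the x-axis. Note that the codes are evenly spaced, so the real gaps between dates are not preserved.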
