
Is there a function to get the difference between two values on a pandas dataframe timeseries?

I am messing around with the NYT COVID dataset, which has total COVID cases for each county, per day.

I would like to find the difference in cases between consecutive days, so that I get the number of new cases per day instead of total cases. Taking a rolling mean, or resampling every 2 days using a mean/sum/etc., all work just fine. It's just subtracting that is giving me such a headache.

Tried methods:

  • df.resample('2d').diff()
    • 'DatetimeIndexResampler' object has no attribute 'diff'

  • df.resample('1d').agg(np.subtract)
    • ufunc() missing 1 of 2 required positional argument(s)

  • df.rolling(2).diff()
    • 'Rolling' object has no attribute 'diff'

  • df.rolling('2').agg(np.subtract)
    • ufunc() missing 1 of 2 required positional argument(s)

Sample data:

import datetime as dt
import pandas as pd

pd.DataFrame(data={'state':['Alabama','Alabama','Alabama','Alabama','Alabama'],
               'date':[dt.date(2020,3,13),dt.date(2020,3,14),dt.date(2020,3,15),dt.date(2020,3,16),dt.date(2020,3,17)],
               'covid_cases':[1.2,2.0,2.9,3.6,3.9]
              })


Desired sample output:

pd.DataFrame(data={'state':['Alabama','Alabama','Alabama','Alabama','Alabama'],
               'date':[dt.date(2020,3,13),dt.date(2020,3,14),dt.date(2020,3,15),dt.date(2020,3,16),dt.date(2020,3,17)],
               'new_covid_cases':[np.nan,0.8,0.9,0.7,0.3]
              })


Recreate sample data from the original NYT dataset:

df = pd.read_csv('https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-counties.csv',parse_dates=['date'])
df.groupby(['state','date'])[['cases']].mean().reset_index()

Any help would be greatly appreciated. I would like to learn how to do this manually or via a function rather than finding a "new cases" dataset, as I will be working with timeseries a lot in the very near future.

Let's try this bit of complete code:

import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-counties.csv')

df['date'] = pd.to_datetime(df['date'])

df_daily_state = df.groupby(['date','state'])['cases'].sum().unstack()

daily_new_cases_AL = df_daily_state.diff()['Alabama']

ax = daily_new_cases_AL.iloc[-30:].plot.bar(title='Last 30 days Alabama New Cases')

Output: (bar chart titled 'Last 30 days Alabama New Cases'; image not included)

Details:

  • Download the historical case records from the NYTimes GitHub repository using the raw URL
  • Convert the dtype of the 'date' column to datetime dtype
  • Group by the 'date' and 'state' columns, sum 'cases', and unstack the state level of the index to get dates as rows and states as columns
  • Take the difference down each column and select only the Alabama column
  • Plot the last 30 days
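If you would rather keep the long format of the question's sample data (one row per state and date) instead of unstacking, a grouped diff should give the same per-day differences. This is only a minimal sketch: the column names follow the sample data in the question, and the Alaska rows are made up purely to show that each state is differenced separately.

```python
import pandas as pd

# Toy long-format data. The Alabama values come from the question;
# the Alaska rows are invented for illustration only.
df = pd.DataFrame({
    'state': ['Alabama'] * 5 + ['Alaska'] * 3,
    'date': pd.to_datetime([
        '2020-03-13', '2020-03-14', '2020-03-15', '2020-03-16', '2020-03-17',
        '2020-03-13', '2020-03-14', '2020-03-15',
    ]),
    'covid_cases': [1.2, 2.0, 2.9, 3.6, 3.9, 1.0, 1.0, 3.0],
})

# diff() subtracts the previous row, so rows must be in date order
# within each state before differencing.
df = df.sort_values(['state', 'date'])

# Difference within each state; the first row of every state has no
# previous day and therefore becomes NaN.
df['new_covid_cases'] = df.groupby('state')['covid_cases'].diff()

print(df)
```

Grouping before calling diff keeps the first day of one state from being subtracted from the last day of the previous state.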

The diff function is the right tool, but look at the error message from your first attempt:

'DatetimeIndexResampler' object has no attribute 'diff'

This happens because diff is a method available on DataFrames, not on Resamplers, so turn the Resampler back into a DataFrame by specifying how you want to aggregate each resampled bin.

If you have the total number of COVID cases for each day and want to resample it to 2 days, you probably only want to keep the latest update out of the two days, in which case something like df.resample('2d').last().diff() should work.
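As a small sketch of that resample-then-diff idea, the snippet below uses made-up cumulative totals on a daily DatetimeIndex (the values and the variable names are hypothetical, not taken from the NYT data). It writes the frequency as '2D', which is assumed here to behave the same as the '2d' used above.

```python
import pandas as pd

# Hypothetical cumulative case totals for five consecutive days.
totals = pd.Series(
    [1, 2, 4, 7, 11],
    index=pd.date_range('2020-03-13', periods=5, freq='D'),
    name='cases',
)

# Aggregate first: keep the last cumulative total seen in each 2-day bin.
# This turns the Resampler back into an ordinary Series.
per_two_days = totals.resample('2D').last()

# Now diff() is available: new cases per 2-day bin.
new_per_two_days = per_two_days.diff()

print(per_two_days)
print(new_per_two_days)
```

The first bin has nothing before it, so its difference is NaN, just as with a plain daily diff.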
