Fastest way to perform a calculation over DataFrame columns

Question

I have a pandas problem I need help with.

On the one hand, I have a DataFrame that looks like this:

   contributor_id     timestamp     edits    upper_month   lower_month
0      8             2018-01-01       1      2018-04-01    2018-02-01
1      26424341      2018-01-01       11     2018-04-01    2018-02-01
10     26870381      2018-01-01       465    2018-04-01    2018-02-01
22     28109145      2018-03-01       17     2018-06-01    2018-04-01
23     32769624      2018-01-01       84     2018-04-01    2018-02-01
25     32794352      2018-01-01       4      2018-04-01    2018-02-01

On the other hand, I have (available in another DataFrame) a given index of dates:

2018-01-01, 2018-02-01, 2018-03-01, 2018-04-01, 2018-05-01, 2018-06-01, 2018-07-01, 2018-08-01, 2018-09-01, 2018-10-01, 2018-11-01, 2018-12-01.

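I need to create a pd.Series that uses the dates shown above as its index. For each date in the index, the value of the pd.Series must be computed as follows:

If date >= lower_month and date <= upper_month, then I add 1.

The goal is to count, for each date, how many rows of the DataFrame above have that date between their lower_month and upper_month values (inclusive).

A sample output pd.Series for this case would be: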

2018-01-01    0
2018-02-01    5
2018-03-01    5
2018-04-01    6
2018-05-01    1
2018-06-01    1
2018-07-01    0
2018-08-01    0
2018-09-01    0
2018-10-01    0
2018-11-01    0
2018-12-01    0

Is there a fast way to do this calculation that avoids iterating over the first DataFrame many times?

Thank you all.

Answer 1

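Zip the lower_month and upper_month columns into a list of tuples, test each date in the range for membership in every interval, and sum the results in a generator expression: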

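import pandas as pd

# month-start dates to count over
rng = pd.date_range('2018-01-01', freq='MS', periods=12)
# (lower, upper) pairs; assumes both columns are already datetime64
vals = list(zip(df['lower_month'], df['upper_month']))
# for each date, count the intervals that contain it
s = pd.Series({y: sum(y >= x1 and y <= x2 for x1, x2 in vals) for y in rng})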

EDIT:

For better performance, use the list count method (thanks, @Stef):

s = pd.Series({y: [y >= x1 and y <= x2 for x1, x2 in vals].count(True) for y in rng})

print(s)
2018-01-01    0
2018-02-01    5
2018-03-01    5
2018-04-01    6
2018-05-01    1
2018-06-01    1
2018-07-01    0
2018-08-01    0
2018-09-01    0
2018-10-01    0
2018-11-01    0
2018-12-01    0
dtype: int64
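
The same inclusive test can also be written without a Python-level loop by using NumPy broadcasting. The following is only a sketch, not part of the original answer: it rebuilds the question's sample data by hand, and the names lower, upper, dates and mask are illustrative. It builds a (dates x rows) boolean matrix, so memory grows with the product of the two sizes.

```python
import numpy as np
import pandas as pd

# Sample intervals from the question (five rows share the same bounds).
df = pd.DataFrame({
    "lower_month": pd.to_datetime(["2018-02-01"] * 5 + ["2018-04-01"]),
    "upper_month": pd.to_datetime(["2018-04-01"] * 5 + ["2018-06-01"]),
})
rng = pd.date_range("2018-01-01", freq="MS", periods=12)

lower = df["lower_month"].to_numpy()   # shape (n_rows,)
upper = df["upper_month"].to_numpy()   # shape (n_rows,)
dates = rng.to_numpy()[:, None]        # shape (n_dates, 1)

# Broadcasting yields an (n_dates, n_rows) boolean matrix:
# mask[i, j] is True when date i lies inside interval j (inclusive).
mask = (dates >= lower) & (dates <= upper)

# Count the intervals containing each date.
s = pd.Series(mask.sum(axis=1), index=rng)
print(s)
```

On the sample data this gives the same counts as the comprehension above.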

Performance:

import numpy as np
import pandas as pd

np.random.seed(123)

def random_dates(start, end, n=10000):

    start_u = start.value//10**9
    end_u = end.value//10**9

    return pd.to_datetime(np.random.randint(start_u, end_u, n), unit='s').floor('d')


d1 = random_dates(pd.to_datetime('2015-01-01'), pd.to_datetime('2018-01-01')) + pd.offsets.MonthBegin(0)
d2 = random_dates(pd.to_datetime('2018-01-01'), pd.to_datetime('2020-01-01')) + pd.offsets.MonthBegin(0)

df = pd.DataFrame({'lower_month':d1, 'upper_month':d2})
rng = pd.date_range('2015-01-01', freq='MS', periods=6 * 12)
vals = list(zip(df['lower_month'], df['upper_month']))

In [238]: %timeit pd.Series({y: [y >= x1 and y <= x2 for x1, x2 in vals].count(True) for y in rng})
158 ms ± 2.55 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [239]: %timeit pd.Series({y: sum(y >= x1 and y <= x2 for x1, x2 in vals) for y in rng})
221 ms ± 17 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

# the first solution is slow
In [240]: %timeit pd.DataFrame([(y, y >= x1 and y <= x2) for x1, x2 in vals for y in rng], columns=['d','test']).groupby('d')['test'].sum().astype(int)
4.52 s ± 396 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
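
If the pairwise comparison itself becomes the bottleneck, one possible alternative (not from the original answer, and not benchmarked here) is to avoid comparing every date with every row. Assuming lower_month <= upper_month in every row, the number of intervals containing a date equals the number of intervals that have started (lower_month <= date) minus the number that have already ended (upper_month < date). Both counts can be read off sorted arrays with np.searchsorted. The names started and ended are illustrative.

```python
import numpy as np
import pandas as pd

# Sample intervals from the question.
df = pd.DataFrame({
    "lower_month": pd.to_datetime(["2018-02-01"] * 5 + ["2018-04-01"]),
    "upper_month": pd.to_datetime(["2018-04-01"] * 5 + ["2018-06-01"]),
})
rng = pd.date_range("2018-01-01", freq="MS", periods=12)

# Assumption: lower_month <= upper_month holds for every row.
lower_sorted = np.sort(df["lower_month"].to_numpy())
upper_sorted = np.sort(df["upper_month"].to_numpy())
dates = rng.to_numpy()

# side="right": number of lower bounds <= date (intervals already started).
started = np.searchsorted(lower_sorted, dates, side="right")
# side="left": number of upper bounds < date (intervals already ended).
ended = np.searchsorted(upper_sorted, dates, side="left")

s = pd.Series(started - ended, index=rng)
print(s)
```

This sorts each column once and then does a binary search per date, so the work grows roughly as (n + m) log n for n rows and m dates rather than n * m.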

Answer 2

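I use itertools to repeat upper_month and lower_month for every index_date.

Then I compare each index_date with lower_month and upper_month and set a temporary column

check = 1

Then I sum check after grouping by index_date.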

import pandas as pd
from io import StringIO  # recent pandas versions no longer provide pandas.compat.StringIO
import itertools

#sample data
data = ('contributor_id,timestamp,edits,upper_month,lower_month\n'
'8,2018-01-01,1,2018-04-01,2018-02-01\n'
'26424341,2018-01-01,11,2018-04-01,2018-02-01\n'
'26870381,2018-02-01,465,2018-04-01,2018-02-01\n'
'28109145,2018-03-01,17,2018-06-01,2018-04-01\n')

orig_df = pd.read_csv(StringIO(data))

# sample index_dates
index_df = list(pd.Series(["2018-01-01", "2018-02-01"]))

# repeat upper_month and lower_month using itertools.product
abc = list(orig_df[['upper_month','lower_month']].values)
combine_list = [index_df,abc]
res = list(itertools.product(*combine_list))
df = pd.DataFrame(res,columns=["timestamp","range"])

#separate lower_month and upper_month from  range 
df['lower_month'] = df['range'].apply(lambda x : x[1])
df['upper_month'] = df['range'].apply(lambda x : x[0])
df.drop(['range'],axis=1,inplace=True)

# normalize the date columns of df to ISO strings so the string comparison below is consistent
df['timestamp'] = pd.to_datetime(df['timestamp']).dt.date.astype(str)
df['upper_month'] = pd.to_datetime(df['upper_month']).dt.date.astype(str)
df['lower_month'] = pd.to_datetime(df['lower_month']).dt.date.astype(str)

#apply condition to set check 1
df.loc[(df["timestamp"]>=df['lower_month']) & (df["timestamp"]<=df['upper_month']),"check"] = 1

#simply groupby to count check
res = df.groupby(['timestamp'])['check'].sum()

print(res)

timestamp
2018-01-01    0.0
2018-02-01    3.0
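
The same cross-product idea can be sketched with pandas alone. This is an assumption-laden variant, not the answer's own code: it assumes pandas 1.2 or later, where merge accepts how="cross", and the names index_dates and pairs are illustrative. Parsing the columns as datetimes avoids relying on string comparison.

```python
from io import StringIO

import pandas as pd

# Same sample data as in the answer above.
data = ('contributor_id,timestamp,edits,upper_month,lower_month\n'
        '8,2018-01-01,1,2018-04-01,2018-02-01\n'
        '26424341,2018-01-01,11,2018-04-01,2018-02-01\n'
        '26870381,2018-02-01,465,2018-04-01,2018-02-01\n'
        '28109145,2018-03-01,17,2018-06-01,2018-04-01\n')
orig_df = pd.read_csv(StringIO(data), parse_dates=['upper_month', 'lower_month'])

# Dates to count over, as a one-column frame so it can be cross-joined.
index_dates = pd.DataFrame(
    {'index_date': pd.to_datetime(['2018-01-01', '2018-02-01'])})

# Cross join: every index_date paired with every (lower, upper) row.
pairs = index_dates.merge(orig_df[['lower_month', 'upper_month']], how='cross')

# 1 where the date lies inside the interval (inclusive), else 0.
pairs['check'] = ((pairs['index_date'] >= pairs['lower_month'])
                  & (pairs['index_date'] <= pairs['upper_month'])).astype(int)

res = pairs.groupby('index_date')['check'].sum()
print(res)
```

Because check is stored as integers rather than left as NaN for non-matching rows, the grouped sums come out as integers (0 and 3) instead of floats.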

Answer 1
3 2019-07-17 10:40:13

Answer 2
0 2019-07-17 11:15:14