Split a pandas dataframe by a list of values from another data frame
I'm pretty sure there's a really simple solution for this and I'm just not realising it. However...
I have a data frame of high-frequency data; call it A. I also have a separate list of far lower-frequency demarcation points; call it B. I would like to append a column to A that shows 1 if A's timestamp is between B[0] and B[1], 2 if it is between B[1] and B[2], and so on.
As I said, it's probably incredibly trivial, and I'm just not seeing it at this late hour.
Use searchsorted:
A['group'] = B['timestamp'].searchsorted(A['timestamp'])
For each value in A['timestamp'], an index value is returned. That index indicates where, among the sorted values in B['timestamp'], that value from A would have to be inserted into B in order to maintain sorted order. (This requires B['timestamp'] to be sorted in ascending order already.)
For example,
import numpy as np
import pandas as pd
np.random.seed(2016)
N = 10
A = pd.DataFrame({'timestamp':np.random.uniform(0, 1, size=N).cumsum()})
B = pd.DataFrame({'timestamp':np.random.uniform(0, 3, size=N).cumsum()})
# B looks like this:
# timestamp
# 0 1.739869
# 1 2.467790
# 2 2.863659
# 3 3.295505
# 4 5.106419
# 5 6.872791
# 6 7.080834
# 7 9.909320
# 8 11.027117
# 9 12.383085
A['group'] = B['timestamp'].searchsorted(A['timestamp'])
print(A)
yields
timestamp group
0 0.896705 0
1 1.626945 0
2 2.410220 1
3 3.151872 3
4 3.613962 4
5 4.256528 4
6 4.481392 4
7 5.189938 5
8 5.937064 5
9 6.562172 5
Thus, the timestamp 0.896705 is in group 0 because it comes before B['timestamp'][0] (i.e. 1.739869). The timestamp 2.410220 is in group 1 because it is larger than B['timestamp'][0] (i.e. 1.739869) but smaller than B['timestamp'][1] (i.e. 2.467790).
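Once the group column exists, the actual split can be done with groupby. The following is only a minimal sketch: the input values are made up for illustration, and the name pieces is hypothetical rather than part of the answer above.

```python
import pandas as pd

# Small hand-written inputs (illustrative values, not taken from the answer).
A = pd.DataFrame({'timestamp': [0.5, 1.5, 2.5, 3.5, 4.5]})
B = pd.DataFrame({'timestamp': [1.0, 3.0]})

# Label each row of A with the position it would take among B's cutoffs.
A['group'] = B['timestamp'].searchsorted(A['timestamp'])

# Split A into one sub-frame per label ("pieces" is a hypothetical name).
pieces = {label: chunk for label, chunk in A.groupby('group')}

for label, chunk in pieces.items():
    print(label, chunk['timestamp'].tolist())
# 0 [0.5]
# 1 [1.5, 2.5]
# 2 [3.5, 4.5]
```

Rows that fall before the first cutoff end up in group 0 and rows after the last cutoff in group len(B), so drop those two groups if only the intervals between cutoffs matter.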
You should also decide what to do if a value in A['timestamp'] is exactly equal to one of the cutoff values in B['timestamp']. Use
B['timestamp'].searchsorted(A['timestamp'], side='left')
if you want searchsorted to return i when a value in A['timestamp'] is exactly equal to B['timestamp'][i]. Use
B['timestamp'].searchsorted(A['timestamp'], side='right')
if you want searchsorted to return i+1 in that situation. If you don't specify side, then side='left' is used by default.
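To see the difference between the two settings on ties, here is a small sketch. The names cutoffs and values are hypothetical stand-ins for B['timestamp'] and A['timestamp'], and the numbers are chosen so that some values land exactly on a cutoff.

```python
import pandas as pd

# Hypothetical stand-ins for B['timestamp'] and A['timestamp'].
cutoffs = pd.Series([1.0, 2.0, 3.0])
values = pd.Series([0.5, 1.0, 2.0, 2.5, 3.0])

# side='left': a value equal to cutoffs[i] gets index i.
left = cutoffs.searchsorted(values, side='left')

# side='right': a value equal to cutoffs[i] gets index i + 1.
right = cutoffs.searchsorted(values, side='right')

print(left.tolist())   # [0, 0, 1, 2, 2]
print(right.tolist())  # [0, 1, 2, 2, 3]
```

Only the values that coincide with a cutoff (1.0, 2.0 and 3.0 here) change group; everything strictly between cutoffs is labelled the same either way.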
Here is a quick and dirty approach using a list comprehension.
>>> df = pd.DataFrame({'A': np.arange(1, 3, 0.2)})
>>> A = df.A.values.tolist()
A: [1.0, 1.2, 1.4, 1.6, 1.8, 2.0, 2.2, 2.4, 2.6, 2.8]
>>> B = np.arange(0, 3, 1).tolist()
B: [0, 1, 2]
>>> BA = [k for k in range(0, len(B)-1) for a in A if (B[k]<=a) & (B[k+1]>a) or (a>max(B))]
BA: [0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
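Note that the comprehension above loops over the bins in its outer loop, so BA is ordered by bin rather than by position in A, and it has 14 entries for the 10 values in A. If one label per element of A is wanted, a possible variant uses bisect from the standard library. This is a sketch under my own assumptions (A written out as literals, values at or above the last cutoff assigned to the last index), not the original poster's method.

```python
from bisect import bisect_right

# Same inputs as above, written out as literals to avoid float round-off.
A = [1.0, 1.2, 1.4, 1.6, 1.8, 2.0, 2.2, 2.4, 2.6, 2.8]
B = [0, 1, 2]

# For each a, find k such that B[k] <= a < B[k+1].
# Values at or above B[-1] are assigned len(B) - 1 (an assumption).
BA = [bisect_right(B, a) - 1 for a in A]

print(BA)  # [1, 1, 1, 1, 1, 2, 2, 2, 2, 2]
```

The result has the same length and order as A, so it can be assigned directly as a new column.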