Split a pandas dataframe by a list of values from another data frame

I'm pretty sure there's a really simple solution for this and I'm just not realising it. However...

I have a data frame of high-frequency data; call it A. I also have a separate list of far lower-frequency demarcation points; call it B. I would like to append a column to A that displays 1 if A's timestamp is between B[0] and B[1], 2 if it is between B[1] and B[2], and so on.

As I said, it's probably incredibly trivial, and I'm just not realising it at this late hour.

Use searchsorted:

A['group'] = B['timestamp'].searchsorted(A['timestamp'])

For each value in A['timestamp'], an index value is returned. That index indicates where, among the sorted values in B['timestamp'], that value from A would have to be inserted in order to maintain sorted order.

For example,

import numpy as np
import pandas as pd
np.random.seed(2016)

N = 10
A = pd.DataFrame({'timestamp':np.random.uniform(0, 1, size=N).cumsum()})
B = pd.DataFrame({'timestamp':np.random.uniform(0, 3, size=N).cumsum()})
#    timestamp
# 0   1.739869
# 1   2.467790
# 2   2.863659
# 3   3.295505
# 4   5.106419
# 5   6.872791
# 6   7.080834
# 7   9.909320
# 8  11.027117
# 9  12.383085

A['group'] = B['timestamp'].searchsorted(A['timestamp'])
print(A)

yields

   timestamp  group
0   0.896705      0
1   1.626945      0
2   2.410220      1
3   3.151872      3
4   3.613962      4
5   4.256528      4
6   4.481392      4
7   5.189938      5
8   5.937064      5
9   6.562172      5

Thus, the timestamp 0.896705 is in group 0 because it comes before B['timestamp'][0] (i.e. 1.739869). The timestamp 2.410220 is in group 1 because it is larger than B['timestamp'][0] (i.e. 1.739869) but smaller than B['timestamp'][1] (i.e. 2.467790).
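Since the question is about splitting A, here is a minimal sketch of how the group column could then be used to get one sub-frame per interval. The values are hand-picked for illustration (not the random data above), and the name pieces is just a placeholder:

```python
import pandas as pd

# Hand-picked illustrative values, not the random data from the example above.
A = pd.DataFrame({'timestamp': [0.5, 1.5, 1.7, 2.5, 3.5]})
B = pd.DataFrame({'timestamp': [1.0, 2.0, 3.0]})  # cutoffs must already be sorted

# Label each row of A with the position it would take among the cutoffs.
A['group'] = B['timestamp'].searchsorted(A['timestamp'])

# Build one sub-frame per group label, keyed by that label.
pieces = {label: chunk for label, chunk in A.groupby('group')}
for label, chunk in pieces.items():
    print(label, chunk['timestamp'].tolist())
```

Rows that fall before the first cutoff end up under label 0, and rows after the last cutoff under label len(B).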


You should also decide what to do if a value in A['timestamp'] is exactly equal to one of the cutoff values in B['timestamp']. Use

B['timestamp'].searchsorted(A['timestamp'], side='left')

if you want searchsorted to return i when a value a from A['timestamp'] is exactly equal to B['timestamp'][i] (more generally, whenever B['timestamp'][i-1] < a <= B['timestamp'][i]). Use

B['timestamp'].searchsorted(A['timestamp'], side='right')

if you want searchsorted to return i+1 in that situation. If you don't specify side, then side='left' is used by default.
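To see the difference between the two settings, here is a small sketch with made-up values, some of which sit exactly on a cutoff:

```python
import pandas as pd

B = pd.Series([1.0, 2.0, 3.0])                 # sorted cutoffs (illustrative values)
a = pd.Series([0.5, 1.0, 1.5, 2.0, 3.0, 3.5])  # 1.0, 2.0 and 3.0 hit a cutoff exactly

left = B.searchsorted(a, side='left')    # a value equal to B[i] gets i
right = B.searchsorted(a, side='right')  # a value equal to B[i] gets i+1

print(left.tolist())   # [0, 0, 1, 1, 2, 3]
print(right.tolist())  # [0, 1, 1, 2, 3, 3]
```

Only the values that coincide with a cutoff change group; everything strictly between two cutoffs gets the same label either way.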

Here is a quick and dirty approach using a list comprehension.

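>>> # round(1) removes the floating-point noise np.arange leaves in the values
>>> df = pd.DataFrame({'A': np.arange(1, 3, 0.2).round(1)})
>>> A = df['A'].tolist()
>>> A
[1.0, 1.2, 1.4, 1.6, 1.8, 2.0, 2.2, 2.4, 2.6, 2.8]

>>> B = np.arange(0, 3, 1).tolist()
>>> B
[0, 1, 2]

>>> # For each a, take the k with B[k] <= a < B[k+1]; the last k also catches a >= B[-1]
>>> BA = [k for a in A for k in range(len(B)) if B[k] <= a and (k == len(B) - 1 or a < B[k + 1])]
>>> BA
[1, 1, 1, 1, 1, 2, 2, 2, 2, 2]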
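For comparison, the index k of the interval with B[k] <= a < B[k+1] can also be computed without an explicit loop, using NumPy's searchsorted with side='right' and subtracting one. This is only a sketch; the values are written out literally to sidestep floating-point surprises from np.arange:

```python
import numpy as np

A = [1.0, 1.2, 1.4, 1.6, 1.8, 2.0, 2.2, 2.4, 2.6, 2.8]
B = [0, 1, 2]  # sorted cutoffs

# side='right' counts the cutoffs that are <= a; subtracting 1 gives the
# index k of the interval [B[k], B[k+1]) containing a.
# Values below B[0] would get -1.
BA = (np.searchsorted(B, A, side='right') - 1).tolist()
print(BA)  # [1, 1, 1, 1, 1, 2, 2, 2, 2, 2]
```

This yields exactly one label per element of A, in the same order as A.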
