Split a pandas DataFrame by a list of values from another DataFrame

Question

I'm pretty sure there's a really simple solution for this and I'm just not seeing it. However...

I have a data frame of high-frequency data. Call this data frame A. I also have a separate list of far lower-frequency demarcation points; call this B. I would like to append a column to A that displays 1 if A's timestamp column is between B[0] and B[1], 2 if it is between B[1] and B[2], and so on.

As I said, it's probably incredibly trivial, and I'm just not realising it at this late an hour.

Answer 1

Use searchsorted:

A['group'] = B['timestamp'].searchsorted(A['timestamp'])

For each value in A['timestamp'], an index value is returned. That index indicates where, among the sorted values in B['timestamp'], that value from A would have to be inserted to maintain sorted order.

For example,

import numpy as np
import pandas as pd
np.random.seed(2016)

N = 10
A = pd.DataFrame({'timestamp':np.random.uniform(0, 1, size=N).cumsum()})
B = pd.DataFrame({'timestamp':np.random.uniform(0, 3, size=N).cumsum()})
#    timestamp
# 0   1.739869
# 1   2.467790
# 2   2.863659
# 3   3.295505
# 4   5.106419
# 5   6.872791
# 6   7.080834
# 7   9.909320
# 8  11.027117
# 9  12.383085

A['group'] = B['timestamp'].searchsorted(A['timestamp'])
print(A)

yields

   timestamp  group
0   0.896705      0
1   1.626945      0
2   2.410220      1
3   3.151872      3
4   3.613962      4
5   4.256528      4
6   4.481392      4
7   5.189938      5
8   5.937064      5
9   6.562172      5

Thus, the timestamp 0.896705 is in group 0 because it comes before B['timestamp'][0] (i.e. 1.739869). The timestamp 2.410220 is in group 1 because it is larger than B['timestamp'][0] (i.e. 1.739869) but smaller than B['timestamp'][1] (i.e. 2.467790).
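The same call works when the timestamp columns hold real datetimes, and the resulting group column can then be used to split A into pieces. The following is only a minimal sketch: the dates, the 10-second spacing, and the names pieces and groups are made up for illustration and are not part of the original answer.

```python
import pandas as pd

# Hypothetical data: A is sampled every 10 seconds, B holds sparse cutoffs.
A = pd.DataFrame({'timestamp': pd.date_range('2016-11-04 00:00:00',
                                             periods=6, freq='10s')})
B = pd.DataFrame({'timestamp': pd.to_datetime(['2016-11-04 00:00:05',
                                               '2016-11-04 00:00:25',
                                               '2016-11-04 00:00:45'])})

# Rows before B[0] get 0, rows between B[0] and B[1] get 1, and so on.
A['group'] = B['timestamp'].searchsorted(A['timestamp'])

# Split A into one sub-frame per group label.
pieces = {label: frame for label, frame in A.groupby('group')}

groups = A['group'].tolist()
print(groups)  # [0, 1, 1, 2, 2, 3]
```

Rows that fall after the last cutoff get the label len(B), so they end up in a group of their own (3 in this sketch).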

You should also decide what to do if a value in A['timestamp'] is exactly equal to one of the cutoff values in B['timestamp']. Use

B['timestamp'].searchsorted(A['timestamp'], side='left')

if you want searchsorted to return i when a value in A['timestamp'] is exactly equal to B['timestamp'][i], so that the cutoff itself belongs to the group below it (B['timestamp'][i-1] < value <= B['timestamp'][i]). Use

B['timestamp'].searchsorted(A['timestamp'], side='right')

if you want searchsorted to return i+1 in that situation, so that the cutoff belongs to the group above it (B['timestamp'][i] <= value < B['timestamp'][i+1]). If you don't specify side, then side='left' is used by default.
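To make the tie-breaking concrete, here is a small sketch with made-up numbers (the names cutoffs and values are illustrative only). The value 2.0 coincides exactly with the cutoff at index 1, so it is the only entry whose label changes between the two settings.

```python
import pandas as pd

# Hypothetical cutoffs and values; 2.0 is exactly equal to cutoffs[1].
cutoffs = pd.Series([1.0, 2.0, 3.0])
values = pd.Series([0.5, 2.0, 2.5])

left = cutoffs.searchsorted(values, side='left').tolist()
right = cutoffs.searchsorted(values, side='right').tolist()

print(left)   # [0, 1, 2]
print(right)  # [0, 2, 2]
```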

Answer 2

Here is a quick and dirty approach using a list comprehension.

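>>> import numpy as np
>>> import pandas as pd
>>> df = pd.DataFrame({'A': np.round(np.arange(1, 3, 0.2), 1)})

>>> A = df.A.values.tolist()
>>> A
[1.0, 1.2, 1.4, 1.6, 1.8, 2.0, 2.2, 2.4, 2.6, 2.8]

>>> B = np.arange(0, 3, 1).tolist()
>>> B
[0, 1, 2]

>>> # One label per value of A: the index k of the interval [B[k], B[k+1]) that
>>> # contains it, or len(B) - 1 when no interval matches (here, a >= B[-1]).
>>> BA = [next((k for k in range(len(B) - 1) if B[k] <= a < B[k + 1]), len(B) - 1) for a in A]
>>> BA
[1, 1, 1, 1, 1, 2, 2, 2, 2, 2]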
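For comparison, the interval index this approach is aiming for, i.e. the k with B[k] <= a < B[k+1], can also be computed in one vectorized call. This is a sketch that swaps in np.searchsorted with side='right' and is not part of the original answer; the lists are written out by hand and the name interval is illustrative.

```python
import numpy as np

# Values to label and sorted cutoffs, written out explicitly.
A = [1.0, 1.2, 1.4, 1.6, 1.8, 2.0, 2.2, 2.4, 2.6, 2.8]
B = [0, 1, 2]

# side='right' counts the cutoffs that are <= a; subtracting 1 gives the
# index k of the last cutoff at or below a, one label per value of A.
interval = (np.searchsorted(B, A, side='right') - 1).tolist()

print(interval)  # [1, 1, 1, 1, 1, 2, 2, 2, 2, 2]
```

Values at or beyond the last cutoff get the label len(B) - 1, and values below B[0] would get -1, so out-of-range data may need separate handling.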

2 answers

Answer 1
2 2016-11-04 02:10:25

Answer 2
2 Accepted
