Can pandas be used to filter hundreds of millions of rows of data?

Question

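Recently I have been working with a large dataset of nearly 100 million rows.
Fully loaded into memory, the file takes over 15 GB. Loading all the data into memory is not a problem, since I have a server with 96 GB of RAM.
Here is the output of info():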

<class 'modin.pandas.dataframe.DataFrame'>
Int64Index: 97915924 entries, 0 to 117814626
Data columns (total 20 columns):
 #   Column             Non-Null Count     Dtype  
---  -----------------  -----------------  -----  
 0   50p_width          97915924 non-null  float64
 1   80p_width          97915924 non-null  float64
 2   area               97915924 non-null  float64
 3   area_fraction_top  97915924 non-null  float64
 4   center_time        97915924 non-null  float64
 5   event_number       97915924 non-null  int64
 6   event_start_time   97915924 non-null  int64
 7   goodness_of_fit    8205122 non-null   float64
 8   left               97915924 non-null  int64
 9   max_PMT            97915924 non-null  int64
 10  max_PMT_area       97915924 non-null  float64
 11  max_hit_PMT        97915924 non-null  int64
 12  max_hit_area       97915924 non-null  float64
 13  n_PMTS             97915924 non-null  int64
 14  n_hits             97915924 non-null  int64
 15  right              97915924 non-null  int64
 16  run_number         97915924 non-null  int64
 17  type               97915924 non-null  int64
 18  x                  8205122 non-null   float64
 19  y                  8205122 non-null   float64
dtypes: float64(10), int64(10)
memory usage: 15.3 GB

data['exact_time'] = data['center_time'] + data['event_start_time']

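My goal is not to analyze the data directly but to filter it so that I can do further research. type can only be 0, 1, or 2. data['exact_time'] is a Unix timestamp in ns. I want to find all type 0/type 1 events that occurred within 1 ms (1e6 ns) before a type 2 event, and, for every type 2 event, find the one among them with the largest area.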
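To make the goal concrete, here is a minimal sketch on a hypothetical five-row frame. The values are made up for illustration and are not taken from the real dataset, and the window is assumed to include both endpoints, as in the loops below.

```python
import pandas as pd

# Hypothetical toy data: timestamps in ns, one type 2 event at t = 5_000_000.
toy = pd.DataFrame({
    "exact_time": [1_000_000, 4_200_000, 4_800_000, 5_000_000, 5_500_000],
    "type":       [0,         1,         0,         2,         1],
    "area":       [9.0,       3.0,       7.0,       50.0,      4.0],
})

# Timestamp of the single type 2 event.
t2 = toy.loc[toy["type"] == 2, "exact_time"].iloc[0]

# Type 0/1 events inside the 1 ms (1e6 ns) window that ends at the type 2 event.
in_window = toy[(toy["type"] != 2)
                & (toy["exact_time"] >= t2 - 1_000_000)
                & (toy["exact_time"] <= t2)]

# The event with the largest area inside that window.
best = in_window.loc[in_window["area"].idxmax()]
print(in_window)
print(best)
```

The first row has the largest area overall but is outside the window, so it is not selected.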

I came up with 2 possible approaches, but both require iterating over every row.
Method 1: simply filter out all events within the time range.

s2_data = data[data['type'] == 2]
s2_time_list = s2_data['exact_time'].tolist()
new_data = pd.DataFrame()
# DataFrame.append was removed in pandas 2.0; this loop needs an older pandas.
for t in tqdm(s2_time_list):
    new_data = new_data.append(data.loc[(data['exact_time'] >= t-1e6) & (data['exact_time'] <= t)])

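Method 2: this iterates over the whole DataFrame once and gets the type 0/type 1 event with the largest area along with the corresponding type 2 event.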

eventlist = pd.DataFrame().reindex(columns=data.columns)
new_data = pd.DataFrame()
s2_time = 0
for i, row in tqdm(data.iterrows()):
    if row['type'] == 2:
        s2_time = row['exact_time']
        # keep only the buffered type 0/1 events from the last 1 ms
        eventlist = eventlist.loc[eventlist['exact_time'] >= s2_time - 1e6]
        ind = eventlist["area"].idxmax()
        max_row = eventlist.loc[ind, :]  # idxmax returns an index label, so use .loc
        new_data = new_data.append(max_row)
        new_data = new_data.append(row)
    else:
        eventlist = eventlist.append(row)

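I have already created an index on exact_time and used the modin library to process the data in parallel where possible, but both approaches are still very slow and take a long time to finish. I don't think apply() or pd/np vectorization will work, because this requires data from multiple rows, so I wonder whether there is a better way to do this.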

Answer 1

Rework method 1 to build a list of DataFrames and then concatenate them. From what I can see, you are appending to a DataFrame on every iteration.

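I put this example together to illustrate. On a PC with 12 GB of RAM, I created a DataFrame with 500,000 rows (for comparison). This cut the time from 46 seconds to 25 seconds (so you can at least get there in about half the time).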

import time

import pandas as pd

# Build n increasing timestamps, spaced 99999 apart.
tt = 1605067706567342
n = 500000
exact_time = []
for i in range(n):
    tt = tt + 99999
    exact_time.append(tt)

# Pattern of event types, repeated 10,000 times (renamed so it does not shadow the built-in type).
event_type = [0,0,0,1,1,1,1,1,1,2,1,1,1,1,1,1,1,1,1,0,1,1,1,1,0,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,1,1,1,1,1,0,1,1,1,1]*10000

data = pd.DataFrame({'exact_time': exact_time, 'type': event_type}, columns=['exact_time', 'type'])
print(data)
s2_data = data[data['type'] == 2]
s2_time_list = s2_data['exact_time'].tolist()
print(len(s2_time_list))

#### Your method 1: 500,000 rows; 20,000 type 2 --- 46.63247585296631 seconds
#### (DataFrame.append was removed in pandas 2.0, so this timing needs pandas < 2.0)
starttime = time.time()
new_data = pd.DataFrame()
for t in s2_time_list:
    # print(t)
    # print(data.loc[(data['exact_time'] >= t-1e6) & (data['exact_time'] <= t)])
    new_data = new_data.append(data.loc[(data['exact_time'] >= t-1e6) & (data['exact_time'] <= t)])
print(time.time()-starttime)
 
#### Reworked to create a list of DataFrames, then concatenate
#### 500,000 rows; 20,000 type 2 --- 25.308656930923462 seconds
starttime = time.time()
new_data_list = []
for t in s2_time_list:
    # print(t)
    # print(data.loc[(data['exact_time'] >= t-1e6) & (data['exact_time'] <= t)])
    new_data_list.append(data.loc[(data['exact_time'] >= t-1e6) & (data['exact_time'] <= t)])
new_df = pd.concat(new_data_list, axis=0)
print(time.time()-starttime)
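Both loops above still scan the whole DataFrame once per type 2 event. As a hedged sketch of a different technique (not the method timed above), the windows can instead be located with binary search on the sorted timestamps using numpy.searchsorted. The function name rows_before_type2 and the window_ns parameter are hypothetical, and the column names exact_time and type are assumed to match the question. Unlike the loops, each row is returned once even when windows overlap.

```python
import numpy as np
import pandas as pd


def rows_before_type2(data, window_ns=1_000_000):
    """Return rows whose exact_time lies in [t - window_ns, t] for any type 2 time t."""
    # Binary search needs the timestamps in ascending order.
    data = data.sort_values("exact_time", kind="stable")
    times = data["exact_time"].to_numpy()
    s2_times = times[data["type"].to_numpy() == 2]

    # First and one-past-last row position of every window.
    lo = np.searchsorted(times, s2_times - window_ns, side="left")
    hi = np.searchsorted(times, s2_times, side="right")

    # Difference array: +1 where a window opens, -1 where it closes.
    delta = np.zeros(len(times) + 1, dtype=np.int64)
    np.add.at(delta, lo, 1)
    np.add.at(delta, hi, -1)

    # A row is kept if at least one window covers its position.
    covered = np.cumsum(delta[:-1]) > 0
    return data[covered]
```

Calling rows_before_type2(data) on the example frame above should select the same rows as the loops, minus any duplicates; I have not timed it on the full dataset.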

Solution 1
0 2020-11-11 05:55:51
