How can I search a pandas DataFrame for the first row that satisfies a set of conditions without reading the rest?

Question

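I have a huge DataFrame (about 4 million rows), and I need to search it roughly a million times for the row with a particular column value. Given the constraints of my problem, each query has exactly one true answer (one row), so once the search finds the first match there is no need to keep going. However, df.loc[df['column']==value] has to read all of the data every time, even when the very first row satisfies the condition. Do the other 4 million rows really have to be read and evaluated? This adds a huge overhead to the search. Is there a way to get the first row that satisfies the condition without reading and evaluating the rest of the rows?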
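
To make the cost described above concrete, here is a minimal sketch. The frame `df`, the column names `column` and `payload`, and the size are hypothetical stand-ins, not the asker's data. It shows that the comparison produces a boolean mask with one entry per row, even when the match is the very first row.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the large frame: unique values in "column".
n = 1_000
df = pd.DataFrame({"column": np.arange(n), "payload": np.arange(n) * 2})

value = 0  # the match sits in the very first row

# The comparison is vectorised: it evaluates every row and returns a
# boolean Series as long as the frame, regardless of where the match is.
mask = df["column"] == value
hit = df.loc[mask]

mask_len = len(mask)  # one mask entry per row of df
hit_rows = len(hit)   # exactly one matching row
```

Repeating this a million times means a million full passes over the column, which is the overhead the question is about.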

Answer 1

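First, set that column as the index (as you said, it has no duplicate values). Then convert the DataFrame to a dictionary and look your values up in it. A dictionary lookup is hash-based, so its cost does not grow with the number of rows, as the timings below show.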

In [1]: import numpy as np, pandas as pd
   ...: 
   ...: # Small example: 25 rows; column 3 becomes a string index
   ...: np.random.seed(4)
   ...: h = 100
   ...: small_df = pd.DataFrame(np.random.randint(1,1000000,h).reshape(h//4,4))
   ...: small_df = small_df.set_index(3)
   ...: small_df.index = small_df.index.astype(str)
   ...: small_df = small_df.loc[small_df.index.drop_duplicates()]
   ...: # Transpose so that each index label becomes a dictionary key
   ...: small_df = small_df.T.to_dict()
   ...: 
   ...: 
   ...: # Large example: 2,500,000 rows built the same way
   ...: np.random.seed(4)
   ...: h = h*100000
   ...: big_df = pd.DataFrame(np.random.randint(1,1000000000,h).reshape(h//4,4))
   ...: big_df = big_df.set_index(3)
   ...: big_df.index = big_df.index.astype(str)
   ...: # Repeated index labels collapse to one key here (see the warning below)
   ...: big_df = big_df.T.to_dict()
/home/amir/.local/bin/ipython3:17: UserWarning: DataFrame columns are not unique, some columns will be omitted.
len(small_df)
In [2]: len(small_df)
Out[2]: 25

In [3]: len(big_df)
Out[3]: 2496856

In [6]: %time small_df['890932']
CPU times: user 4 µs, sys: 0 ns, total: 4 µs
Wall time: 7.15 µs
Out[6]: {0: 962341, 1: 751580, 2: 181308}

In [7]: %time big_df['115865608']
CPU times: user 5 µs, sys: 0 ns, total: 5 µs
Wall time: 7.39 µs
Out[7]: {0: 448609773, 1: 372731489, 2: 452798904}
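
The same idea can be written more compactly. The following is a sketch under stated assumptions, not the exact code from the answer: the column name `key`, the helper `build_lookup`, and the sample values are all hypothetical. It builds the dictionary once with `to_dict(orient="index")` and then answers each query with a hash lookup. Repeated keys are dropped explicitly, so no rows are omitted silently as in the warning above.

```python
import pandas as pd

def build_lookup(df, key):
    """Return {key value: {other column: value}}, keeping the first row per key.

    Hypothetical helper for illustration only.
    """
    # Drop repeated keys explicitly; keep="first" keeps the first matching row.
    unique = df.drop_duplicates(subset=key, keep="first")
    return unique.set_index(key).to_dict(orient="index")

# Hypothetical sample data with one repeated key ("a").
df = pd.DataFrame(
    {"key": ["a", "b", "a", "c"], "x": [1, 2, 3, 4], "y": [10, 20, 30, 40]}
)

lookup = build_lookup(df, "key")  # built once, reused for every query

row_a = lookup["a"]          # first row whose key is "a"
missing = lookup.get("zzz")  # None when no row has that key

# Alternative that keeps a DataFrame: a unique index also gives
# label-based lookup through .loc without building a boolean mask.
indexed = df.drop_duplicates(subset="key", keep="first").set_index("key")
row_b = indexed.loc["b"]
```

Building the dictionary is a one-time pass over the data; it pays off when the number of queries is large, as in the question. Using `.get` avoids a `KeyError` for values that are not present.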

Solution 1
0 2020-12-02 00:51:55
