Quickly sampling a large number of rows from a large dataframe in Python

Question

I have a very large dataframe (about 1.1M rows) and I am trying to sample it.

I have a list of indexes (about 70,000 of them) that I want to select from the entire dataframe.

This is what I've tried so far, but all of these methods take far too much time:

Method 1 - Using pandas:

sample = pandas.read_csv("data.csv", index_col = 0).reset_index()
sample = sample[sample['Id'].isin(sample_index_array)]

Method 2:

I tried to write all the sampled lines to another CSV file.

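f = open("data.csv", 'r')

out = open("sampled_date.csv", 'w')
out.write(f.readline())  # copy the header row

total = 0  # number of lines read so far
while 1:
    total += 1
    line = f.readline().strip()

    if line == '':
        break
    arr = line.split(",")

    # keep the row if its first field is one of the sampled indexes
    if int(arr[0]) in sample_index_array:
        out.write(",".join(arr) + "\n")

f.close()
out.close()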

Can anyone suggest a better method, or how I can modify this to make it faster?

Thanks

Answer 1

We don't have your data, so here is an example with two options:

after reading: use a pandas Index object to select a subset via the .iloc selection method
while reading: use a predicate with the skiprows parameter

Given

A collection of indices and a (large) sample DataFrame written to test.csv:

import pandas as pd
import numpy as np


indices = [1, 2, 3, 10, 20, 30, 67, 78, 900, 2176, 78776]

df = pd.DataFrame(np.random.randint(0, 100, size=(1000000, 4)), columns=list("ABCD"))
df.to_csv("test.csv", header=False)
df.info()

Output

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000000 entries, 0 to 999999
Data columns (total 4 columns):
A    1000000 non-null int32
B    1000000 non-null int32
C    1000000 non-null int32
D    1000000 non-null int32
dtypes: int32(4)
memory usage: 15.3 MB

Code

Option 1 - after reading

Convert the sample list of indices to an Index object and slice the loaded DataFrame:

idxs = pd.Index(indices)   
subset = df.iloc[idxs, :]
print(subset)

The .iat and .at accessors are even faster, but they require scalar indices.
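To illustrate the difference between a positional subset and scalar access, here is a minimal sketch on a small stand-in frame (the 10-row frame and its values are assumptions made for illustration, not the data above):

```python
import numpy as np
import pandas as pd

# Small stand-in for the large frame; the values are arbitrary.
df = pd.DataFrame(np.arange(40).reshape(10, 4), columns=list("ABCD"))

# .iloc accepts a list (or Index) of integer positions and returns a DataFrame.
rows = df.iloc[[1, 2, 3], :]

# .iat takes exactly one integer row position and one integer column position.
by_position = df.iat[2, 1]

# .at takes exactly one row label and one column label.
by_label = df.at[2, "B"]
```

Because .iat and .at return a single value, they suit lookups of individual cells rather than selecting thousands of rows at once.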

Option 2 - while reading (Recommended)

We can write a predicate that keeps only the selected indices as the file is being read (more efficient):

pred = lambda x: x not in indices
data = pd.read_csv("test.csv", skiprows=pred, index_col=0, names=list("ABCD"))
print(data)
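With tens of thousands of indices, testing membership against a list is linear in the list length for every row. A possible variation, sketched here under the assumption that the wanted indices are the zero-based row numbers in the file, converts them to a set first. The temporary file, the 100-row frame, and the column name "idx" are illustrative choices, not part of the original example:

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Hypothetical small stand-in for test.csv: index in the first column, no header row.
df = pd.DataFrame(np.arange(400).reshape(100, 4), columns=list("ABCD"))
path = os.path.join(tempfile.mkdtemp(), "test.csv")
df.to_csv(path, header=False)

indices = [1, 2, 3, 10, 20, 30, 67, 78]
keep = set(indices)  # average O(1) membership test instead of O(n) for a list


def skip_row(row_number):
    """Return True for file rows that should be skipped while reading."""
    return row_number not in keep


data = pd.read_csv(
    path,
    header=None,
    names=["idx", "A", "B", "C", "D"],  # "idx" is an illustrative name for the index column
    index_col="idx",
    skiprows=skip_row,
)
print(data)
```

Note that a callable passed to skiprows is evaluated against row numbers in the file, not against the values in the first column, so this only selects by Id when the two coincide.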

See also the issue that led to extending skiprows.

Results

Both options produce the same output:

        A   B   C   D
1      74  95  28   4
2      87   3  49  94
3      53  54  34  97
10     58  41  48  15
20     86  20  92  11
30     36  59  22   5
67     49  23  86  63
78     98  63  60  75
900    26  11  71  85
2176   12  73  58  91
78776  42  30  97  96

Answer 1: score 2, accepted, 2016-09-24 15:06:56
