Quickly sampling a large number of rows from large dataframes in Python
I have a very large dataframe (about 1.1M rows) and I am trying to sample it.
I have a list of indexes (about 70,000 indexes) that I want to select from the entire dataframe.
This is what I've tried so far, but all of these methods take far too long:
Method 1 - Using pandas:
sample = pandas.read_csv("data.csv", index_col = 0).reset_index()
sample = sample[sample['Id'].isin(sample_index_array)]
Method 2:
I tried to write all the sampled lines to another csv.
f = open("data.csv",'r')
out = open("sampled_date.csv", 'w')
out.write(f.readline())
while 1:
    total += 1
    line = f.readline().strip()
    if line =='':
        break
    arr = line.split(",")
    if (int(arr[0]) in sample_index_array):
        out.write(",".join(e for e in (line)))
Can anyone please suggest a better method? Or how can I modify this to make it faster?
Thanks
We don't have your data, so here is an example with two options:
After reading: use a pandas Index object to select a subset via the .iloc selection method
While reading: a predicate with the skiprows parameter
Given
A collection of indices and a (large) sample DataFrame written to test.csv:
import pandas as pd
import numpy as np
indices = [1, 2, 3, 10, 20, 30, 67, 78, 900, 2176, 78776]
df = pd.DataFrame(np.random.randint(0, 100, size=(1000000, 4)), columns=list("ABCD"))
df.to_csv("test.csv", header=False)
df.info()
Output
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000000 entries, 0 to 999999
Data columns (total 4 columns):
A 1000000 non-null int32
B 1000000 non-null int32
C 1000000 non-null int32
D 1000000 non-null int32
dtypes: int32(4)
memory usage: 15.3 MB
Code
Option 1 - after reading
Convert a sample list of indices to an Index object and slice the loaded DataFrame:
idxs = pd.Index(indices)
subset = df.iloc[idxs, :]
print(subset)
The .iat and .at methods are even faster, but require scalar indices.
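To make the difference between the accessors concrete, here is a minimal sketch on a small stand-in frame. The name `small` and its shape are assumptions chosen for illustration; they are not part of the example above.

```python
import numpy as np
import pandas as pd

# Small stand-in for the large frame (name and shape are illustrative assumptions).
small = pd.DataFrame(np.arange(40).reshape(10, 4), columns=list("ABCD"))

# .iloc selects several rows at once by integer position.
rows = small.iloc[pd.Index([1, 3, 7]), :]

# .iat and .at return a single value: .iat by position, .at by label.
by_position = small.iat[3, 1]   # row position 3, column position 1
by_label = small.at[3, "B"]     # row label 3, column label "B"
```

Because .iat and .at return one cell per call, they suit single lookups; for pulling many rows in one go, .iloc (or .loc for labels) is likely the more practical choice.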
Option 2 - while reading (Recommended)
We can write a predicate that keeps only the selected indices as the file is being read, so unselected rows are never loaded into memory:
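wanted = set(indices)  # set membership is O(1); a list is scanned for every row
pred = lambda x: x not in wanted  # skip every row number that was not selected
data = pd.read_csv("test.csv", skiprows=pred, index_col=0, names=list("ABCD"))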
print(data)
See also the issue that led to extending skiprows.
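Two details matter once the list of indices grows to tens of thousands. The callable passed to skiprows is evaluated once per row, and it receives the 0-based row number in the file rather than the value in the first column. The sketch below illustrates both points on a hypothetical in-memory file; `csv_text`, `wanted`, and the column names are assumptions made for this example.

```python
import io
import pandas as pd

# Hypothetical header-less file whose first column happens to equal the row number.
csv_text = "\n".join(f"{i},{i * 10},{i * 100}" for i in range(1000))

# A set keeps each membership test O(1); a list would be scanned for every row.
wanted = {1, 2, 3, 10, 20, 900}

# The callable gets the 0-based row number and returns True for rows to skip.
picked = pd.read_csv(
    io.StringIO(csv_text),
    skiprows=lambda row: row not in wanted,
    index_col=0,
    names=["idx", "A", "B"],
)
```

If the file has a header line, or if the identifiers in the first column do not match row numbers, the predicate would need to be offset or replaced by a lookup from identifier to row number.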
Results
Both options produce the same output:
        A   B   C   D
1      74  95  28   4
2      87   3  49  94
3      53  54  34  97
10     58  41  48  15
20     86  20  92  11
30     36  59  22   5
67     49  23  86  63
78     98  63  60  75
900    26  11  71  85
2176   12  73  58  91
78776  42  30  97  96
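One way to check that the two options agree is to compare their row labels and values directly. This is a sketch under assumptions: it uses a smaller frame, a fixed seed, and an in-memory buffer in place of test.csv, and the index column name "idx" is introduced only for this example.

```python
import io
import numpy as np
import pandas as pd

indices = [1, 2, 3, 10, 20, 30]
rng = np.random.default_rng(0)  # fixed seed, chosen only for repeatability
df = pd.DataFrame(rng.integers(0, 100, size=(1000, 4)), columns=list("ABCD"))

# Write to an in-memory buffer so the sketch does not touch the file system.
buffer = io.StringIO()
df.to_csv(buffer, header=False)
buffer.seek(0)

# Option 1: select by position after the data is loaded.
after_reading = df.iloc[indices, :]

# Option 2: skip unselected row numbers while reading.
wanted = set(indices)
while_reading = pd.read_csv(
    buffer,
    skiprows=lambda row: row not in wanted,
    index_col=0,
    names=["idx", "A", "B", "C", "D"],
)

# Compare labels and values only; integer dtypes can differ between the two paths.
same = (
    after_reading.index.tolist() == while_reading.index.tolist()
    and np.array_equal(after_reading.to_numpy(), while_reading.to_numpy())
)
```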