
Reading random rows of a large csv file, python, pandas

Could you help me? I am facing a problem reading random rows from a large csv file using pandas 0.18.1 and Python 2.7.10 on Windows (8 GB RAM).

In Read a small random sample from a big CSV file into a Python data frame I saw an approach; however, it turned out to be very memory-consuming on my PC. The relevant part of the code:

import random as rnd
import pandas as pd

n = 100    # total number of data rows in the file
s = 10     # desired sample size
skip = sorted(rnd.sample(xrange(1, n), n - s))  # skip n-s random rows from *.csv
data = pd.read_csv(path, usecols=['Col1', 'Col2'],
                   dtype={'Col1': 'int32', 'Col2': 'int32'}, skiprows=skip)

So if I want to take random rows from the file, considering not just 100 rows but 100,000, it becomes hard. Taking non-random rows from the file, on the other hand, works almost fine:

skip = xrange(100000)    # skip the first 100000 data rows
data = pd.read_csv(path, usecols=['Col1', 'Col2'],
                   dtype={'Col1': 'int32', 'Col2': 'int32'}, skiprows=skip, nrows=10000)

So the question is: how can I read a large number of random rows from a large csv file with pandas? Since I can't read the entire csv file, even with chunking, I'm interested in exactly the random rows. Thanks.

If memory is the biggest issue, a possible solution might be to use chunks, and randomly select from the chunks:

import random
import pandas as pd

n = 100
s = 10            # total sample size wanted
factor = 1        # should be an integer
chunksize = int(s / factor)

reader = pd.read_csv(path, usecols=['Col1', 'Col2'],
                     dtype={'Col1': 'int32', 'Col2': 'int32'}, chunksize=chunksize)

out = []
tot = 0
for df in reader:
    # draw a random number of rows from this chunk
    nsample = random.randint(factor, chunksize)
    tot += nsample
    if tot > s:
        # trim the last draw so the total sample size is exactly s
        nsample = s - (tot - nsample)
    out.append(df.sample(nsample))
    if tot >= s:
        break

data = pd.concat(out)

And you can use factor to control the size of the chunks.
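For instance (a hypothetical illustration, not part of the original answer), doubling factor halves chunksize, so the sample is drawn from more, smaller chunks spread further through the file:

s = 10
factor = 2                    # hypothetical value
chunksize = int(s / factor)   # 5 rows per chunk instead of 10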

I think this is faster than the other methods shown here and may be worth trying.

Say we have already chosen the rows to be skipped and stored them in a list skipped. First, I convert it to a boolean lookup table.

import numpy as np

# Some preparation:
skipped = np.asarray(skipped)
# MAX >= number of rows in the file
bool_skipped = np.zeros((MAX,), dtype=bool)
bool_skipped[skipped] = True

Main stuff:

import pandas as pd
from io import StringIO
# in Python 2 use
# from StringIO import StringIO

def load_with_buffer(filename, bool_skipped, **kwargs):
    s_buf = StringIO()
    with open(filename) as file:
        count = -1
        for line in file:
            count += 1
            # copy only the lines that are not marked as skipped
            if bool_skipped[count]:
                continue
            s_buf.write(line)
    s_buf.seek(0)
    df = pd.read_csv(s_buf, **kwargs)
    return df

I tested it as follows:

df = pd.DataFrame(np.random.rand(100000, 100))
df.to_csv('test.csv')

df1 = load_with_buffer('test.csv', bool_skipped, index_col=0)

with 90% of the rows skipped. It performs comparably to

pd.read_csv('test.csv', skiprows=skipped, index_col=0)

and is about 3-4 times faster than using dask or reading in chunks.
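For reference, here is a rough sketch of how the comparison above could be reproduced; the 90% skip set and the timing harness below are assumptions for illustration, not part of the original answer:

import time
import numpy as np
import pandas as pd

# test.csv from above has 1 header line + 100000 data rows = 100001 lines
MAX = 100001
# skip a random 90% of the data rows (line 0 is the header and is kept)
skipped = np.random.choice(np.arange(1, MAX), size=90000, replace=False)
bool_skipped = np.zeros((MAX,), dtype=bool)
bool_skipped[skipped] = True

t0 = time.time()
df1 = load_with_buffer('test.csv', bool_skipped, index_col=0)
t1 = time.time()
df2 = pd.read_csv('test.csv', skiprows=skipped, index_col=0)
t2 = time.time()
print('buffered read: %.2f s, skiprows read: %.2f s' % (t1 - t0, t2 - t1))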
