
Reading last N rows of a large csv in Pandas

I have a file with 50 GB of data. I know how to use Pandas for my data analysis.
I only need the last 1000 lines or rows, not the complete 50 GB.
Hence, I thought of using the nrows option in read_csv().
I have written the code like this:

import pandas as pd
df = pd.read_csv("Analysis_of_50GB.csv",encoding="utf-16",nrows=1000,index_col=0)

But it has taken the top 1000 rows. I am in need of the last 1000 rows. So I did this and received an error:

df = pd.read_csv("Analysis_of_50GB.csv",encoding="utf-16",nrows=-1000,index_col=0)
ValueError: 'nrows' must be an integer >=0

I have even tried using chunksize in read_csv(). But it still loads the complete file. And even then the output was not a DataFrame but an iterator.
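For context, a minimal illustration (on a small in-memory file) of what chunksize actually returns: an iterator that yields DataFrame chunks lazily, not one DataFrame.

```python
import io
import pandas as pd

# chunksize makes read_csv return an iterator of DataFrame chunks,
# not a single DataFrame -- each chunk is parsed lazily as you iterate.
csv_data = io.StringIO("a,b\n1,2\n3,4\n5,6\n7,8\n")
reader = pd.read_csv(csv_data, chunksize=2)
chunks = [chunk for chunk in reader]
print(len(chunks))          # 4 data rows in chunks of 2 -> 2 chunks
print(chunks[-1].shape[0])  # the last chunk holds 2 rows
```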

Hence, please let me know what I can do in this scenario.

Please NOTE THAT I DO NOT WANT TO OPEN THE COMPLETE FILE...

A pure pandas method:

import pandas as pd
line = 0
chksz = 1000
# usecols must be list-like, not a bare integer; reading only the
# first column keeps the parsing cost of this counting pass low
for chunk in pd.read_csv("Analysis_of_50GB.csv", encoding="utf-16", chunksize=chksz, usecols=[0]):
    line += chunk.shape[0]

So this just counts the number of rows; we read just the first column for performance reasons.

Once we have the total number of rows we just subtract from this the number of rows we want from the end:

# skip all but the last 1000 data rows, keeping the header (line 0)
df = pd.read_csv("Analysis_of_50GB.csv", encoding="utf-16", skiprows=range(1, line - 999), index_col=0)
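A self-contained sketch of this two-pass approach on a small throwaway file (the file contents and the tail size of 3 are placeholders for illustration):

```python
import os
import tempfile

import pandas as pd

# Build a tiny stand-in for the 50 GB file: a header plus 10 data rows.
path = os.path.join(tempfile.mkdtemp(), "demo.csv")
with open(path, "w") as f:
    f.write("id,value\n")
    f.writelines(f"{i},{i * 10}\n" for i in range(10))

# Pass 1: count the data rows cheaply, one column at a time, in chunks.
total = sum(chunk.shape[0]
            for chunk in pd.read_csv(path, usecols=[0], chunksize=4))

# Pass 2: skip everything except the last 3 data rows, keeping the header.
want = 3
df = pd.read_csv(path, skiprows=range(1, total - want + 1), index_col=0)
print(df)
```

Note that `skiprows=range(1, ...)` starts at 1 so that line 0, the header, survives the skip.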

I think you need to use skiprows and nrows together. Assuming that your file has 1000 rows, then,

df = pd.read_csv("Analysis_of_50GB.csv", encoding="utf-16", skiprows=lambda x: 0 < x <= 900, nrows=1000-900, index_col=0)

reads all the rows from 901 to 1000.
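A runnable sketch of the same idea on a small in-memory file (10 data rows standing in for 1000; the skip bound of 7 and nrows of 3 are scaled-down placeholders):

```python
import io
import pandas as pd

# A header plus 10 data rows. The lambda skips data lines 1..7
# (line 0 is the header, so it is never skipped), and nrows=3
# then reads exactly the remaining rows 8, 9 and 10.
csv_data = io.StringIO("n,sq\n" + "".join(f"{i},{i * i}\n" for i in range(1, 11)))
df = pd.read_csv(csv_data, skiprows=lambda x: 0 < x <= 7, nrows=3, index_col=0)
print(df)
```

Note that this assumes you already know the total number of rows, and the callable skiprows still has to scan every line of the file.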

You should consider using dask, which does chunking under the hood and allows you to work with very large data frames. It has a very similar workflow to pandas, and the most important functions are already implemented.

The normal way would be to read the whole file and keep 1000 lines in a deque, as suggested in the accepted answer to Efficiently Read last 'n' rows of CSV into DataFrame. But it may be suboptimal for a really huge file of 50 GB.

In that case I would try a simple pre-processing:

  • open the file
  • read and discard 1000 lines
  • use ftell to get an approximation of what has been read so far
  • seek that size back from the end of the file and read the end of the file into a large buffer (if you have enough memory)
  • store the positions of the '\n' characters in the buffer in a deque of size 1001 (the file probably has a terminal '\n'); let us call it deq
  • ensure that you have 1001 newlines, else iterate with a larger offset
  • load the dataframe with the 1000 lines contained in the buffer:

     df = pd.read_csv(io.StringIO(buffer[deq[0]+1:])) 

Code could be (beware: untested):

import collections
import io
import itertools
import os

import pandas as pd

# open in binary mode: Python 3 text-mode files cannot seek relative to the end
with open("Analysis_of_50GB.csv", "rb") as fd:
    for i in itertools.islice(fd, 1250):      # read a bit more than 1000 lines...
        pass
    offset = fd.tell()
    while True:
        offset += offset % 2                  # stay aligned to 2-byte UTF-16 code units
        fd.seek(-offset, os.SEEK_END)
        buffer = fd.read().decode("utf-16-le")  # the tail carries no BOM; assumes little-endian
        deq = collections.deque(maxlen=1001)
        for i, c in enumerate(buffer):
            if c == '\n':
                deq.append(i)
        if len(deq) == 1001:
            break
        offset = offset * 1250 // len(deq)    # too few newlines: scale the guess up and retry

df = pd.read_csv(io.StringIO(buffer[deq[0]+1:]))
