Only read certain rows in a CSV file with Python

I want to read only a certain number of rows, starting from a certain row in a CSV file, without iterating over the whole file to reach that point.

Let's say I have a CSV file with 100 rows and I want to read only rows 50 to 60. I don't want to iterate from row 1 to 49 to reach row 50 before I start reading. Can I somehow achieve this with seek()?

For example: seek to row 50, read rows 50 to 60.

Next time: seek to row 27, read rows 27 to 34, and so on.

So I need to seek not only continuously forward through the file but also backwards.

Thanks a lot.

One option is to use pandas. For example:

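import pandas as pd

# Path to the CSV file
infile = r'path/file'
# skiprows chooses the starting point and nrows the number of rows to read.
# Assuming the file has a header line: range(1, 50) skips data rows 1-49 and
# keeps line 0 as the header (a plain skiprows=50 would skip the header too).
data = pd.read_csv(infile, skiprows=range(1, 50), nrows=10)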

You can use chunksize:

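import pandas as pd

filename = 'path/file'  # placeholder path
chunksize = 10 ** 6     # number of rows per chunk

# With chunksize set, read_csv returns an iterator of DataFrames
for chunk in pd.read_csv(filename, chunksize=chunksize):
    process(chunk)  # process() stands for your own handling code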

As others have said, the most obvious solution is to use pandas' read_csv. The function has a parameter called skiprows.

The documentation says:

skiprows : list-like, int or callable, optional. Line numbers to skip (0-indexed) or number of lines to skip (int) at the start of the file.

If callable, the callable function will be evaluated against the row indices, returning True if the row should be skipped and False otherwise. An example of a valid callable argument would be lambda x: x in [0, 2].

You can do something like this:

import pandas as pd

# Assuming line 0 is a header: keep it, plus lines 50-59; skip everything else
data = pd.read_csv('path/to/your/file',
                   skiprows=lambda x: x != 0 and x not in range(50, 60))

Since you say that memory is your problem, you can use the chunksize parameter, as described in this tutorial.

The author writes:

The parameter essentially means the number of rows to be read into a dataframe at any single time in order to fit into the local memory. Since the data consists of more than 70 million rows, I specified the chunksize as 1 million rows each time, which broke the large data set into many smaller pieces.

df_chunk = pd.read_csv(r'../input/data.csv', chunksize=1000000)

You can try this and iterate over the chunks to retrieve only the rows you are looking for.
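A minimal sketch of that idea, assuming the file has a header line and that rows are counted from 0 after the header. The helper name `read_row_range` and its parameters are hypothetical, not part of pandas. Note that this keeps memory use low but still parses every row before the requested range.

```python
import pandas as pd


def read_row_range(path, start, stop, chunksize=1000):
    """Return data rows start..stop-1 (0-indexed, header excluded) as a DataFrame."""
    pieces = []
    offset = 0  # position of the first data row of the current chunk
    for chunk in pd.read_csv(path, chunksize=chunksize):
        end = offset + len(chunk)
        if end > start:
            # Keep only the part of this chunk that overlaps [start, stop)
            lo = max(start - offset, 0)
            hi = min(stop - offset, len(chunk))
            pieces.append(chunk.iloc[lo:hi])
        offset = end
        if offset >= stop:
            break  # nothing further is needed, so stop reading
    return pd.concat(pieces) if pieces else pd.DataFrame()
```

For example, `read_row_range('data.csv', 50, 60)` would return ten rows; the chunk size only affects how much is held in memory at once.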

The function should return True if the row number is in the specified list.

If the number of columns or the line lengths are variable, it isn't possible to find the line you want without "reading" (i.e., processing) every character of the file that comes before it and counting the line terminators. And the fastest way to process them in Python is to use iteration.

As for the fastest way to do that with a large file, I do not know whether it is faster to iterate line by line, like this:

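from itertools import islice

with open(file_name) as f:
    # islice discards the first 50 lines, then yields the next 10.
    # (zip(f, range(50)) would pull one extra line from f before stopping.)
    lines = list(islice(f, 50, 60))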

...or to read a character at a time and count newline characters to find the offset to seek to. But the first option is certainly MUCH more convenient.
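For comparison, the newline-counting variant could look roughly like this. It is only a sketch: it reads fixed-size blocks rather than single characters, works on bytes, and assumes lines end in `\n` and that no quoted CSV field contains a newline. The names `offset_of_line` and `read_lines` are made up for illustration.

```python
from itertools import islice


def offset_of_line(path, line_num, block_size=1 << 16):
    """Return the byte offset at which line line_num (0-indexed) starts."""
    if line_num == 0:
        return 0
    seen = 0  # newlines counted in earlier blocks
    pos = 0   # byte offset of the start of the current block
    with open(path, 'rb') as f:
        while True:
            block = f.read(block_size)
            if not block:
                raise IndexError('file has fewer lines than requested')
            count = block.count(b'\n')
            if seen + count >= line_num:
                # The newline that ends line line_num - 1 is in this block
                idx = -1
                for _ in range(line_num - seen):
                    idx = block.index(b'\n', idx + 1)
                return pos + idx + 1
            seen += count
            pos += len(block)


def read_lines(path, start, count):
    """Seek to line start (0-indexed) and return up to count lines as text."""
    offset = offset_of_line(path, start)
    with open(path, 'rb') as f:
        f.seek(offset)
        return [line.decode() for line in islice(f, count)]
```

Everything before the target line is still read once per call; only the line splitting is avoided.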

However, if the file is read often, iterating over the lines every time adds up. If the file contents do not change, you could instead read the whole thing once and build a dict of cumulative line lengths ahead of time:

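from itertools import accumulate

# Binary mode, so that len(line) counts bytes, which is what seek() expects
with open(file_name, 'rb') as f:
    # cum_lens[i] is the byte offset at which line i (0-indexed) starts
    cum_lens = dict(enumerate(accumulate((len(line) for line in f), initial=0)))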

This lets you seek to any line number in the file without ever processing the whole thing again:

def seek_line(path, line_num, cum_lens):
    """Return line line_num (0-indexed) by seeking straight to its offset."""
    # Binary mode, because seek() works with byte offsets
    with open(path, 'rb') as f:
        f.seek(cum_lens[line_num], 0)
        return f.readline().decode()

class LineX:
    """A file reading object that can quickly obtain any line number."""
    def __init__(self, path, cum_lens):
        self.cum_lens = cum_lens
        self.path = path
    def __getitem__(self, i):
        return seek_line(self.path, i, self.cum_lens)

linex = LineX(file_name, cum_lens)
line50 = linex[50]  # line 50 counting from 0, i.e. the 51st line
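To fetch a range such as rows 50 to 60 and later jump back to rows 27 to 34, the same offset idea can be combined with the csv module. This is a self-contained sketch under a few assumptions: the file does not change after indexing, no quoted field contains a newline, and the helper names `build_line_index` and `read_rows` are invented for this example.

```python
import csv
import io


def build_line_index(path):
    """Scan the file once and return the byte offset of every line start."""
    offsets = [0]
    with open(path, 'rb') as f:
        for line in f:
            offsets.append(offsets[-1] + len(line))
    return offsets  # offsets[i] is where line i (0-indexed) starts


def read_rows(path, offsets, start, stop, encoding='utf-8'):
    """Parse lines start..stop-1 (0-indexed) without reading earlier lines."""
    with open(path, 'rb') as f:
        f.seek(offsets[start])
        raw = f.read(offsets[stop] - offsets[start])
    return list(csv.reader(io.StringIO(raw.decode(encoding), newline='')))
```

After one call to `build_line_index`, calls such as `read_rows(path, offsets, 50, 60)` followed by `read_rows(path, offsets, 27, 34)` each read only the bytes of the requested lines, in any order.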

But at this point, you might be better off loading the file contents into some kind of database. It depends on what you're trying to do and what kind of data the file contains.
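As one possible illustration of the database route, the sketch below uses the standard library's sqlite3 and stores each raw line under its line number. The table name `lines`, its columns, and the helper names are assumptions made for this example; a real schema would depend on the data.

```python
import csv
import sqlite3


def load_csv(conn, csv_path):
    """Store every line of the file in a table keyed by 0-indexed line number."""
    conn.execute(
        'CREATE TABLE IF NOT EXISTS lines (num INTEGER PRIMARY KEY, raw TEXT)'
    )
    with open(csv_path, newline='') as f:
        conn.executemany('INSERT INTO lines VALUES (?, ?)', enumerate(f))
    conn.commit()


def fetch_rows(conn, start, stop):
    """Return lines start..stop-1 parsed as CSV rows."""
    cur = conn.execute(
        'SELECT raw FROM lines WHERE num >= ? AND num < ? ORDER BY num',
        (start, stop),
    )
    return list(csv.reader(row[0] for row in cur))
```

The file is read in full once during `load_csv`; after that, `fetch_rows(conn, 50, 60)` or `fetch_rows(conn, 27, 34)` is an indexed lookup on the primary key.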

It's that easy:

with open("file.csv", "r") as file:
    # readlines() loads the whole file into memory before slicing
    print(file.readlines()[50:60])
