
How to read a csv in Python to get a dataframe, but with only one row out of every 3?

I have a very large csv file. I would like to get one row out of every 3 into a dataframe. It is more or less like resampling the csv.

Let's say I have a csv file like this:

4  5
9  2
3  7
1  5
2  4
9  10

And I want my dataframe to be:

4  5
1  5

If I read the whole csv and then keep only 1 row out of every 3, it is useless because it takes too much time. Does someone have an idea? :) (By the way, I am using Python)

Cheers

If I understood correctly, you want to cut your read time to (at most) 1/3 of the total time. Pandas has many options to customize how a csv is read, but none of them avoids reading (and then discarding) the whole file, since it is stored in contiguous blocks on your disk.

I think that if your constraint is time (and not memory), cutting the time to 1/3 is not going to be enough in any case, whatever the size of your file. What you can do is:

  • read the whole csv
  • filter it, keeping just 1 row out of every 3
  • store the result in another file
  • on following runs, read the filtered csv
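
The steps above could be sketched roughly as follows. This is only a minimal sketch: the helper name `load_every_third` and the file names `file.csv` and `file_filtered.csv` are hypothetical, and it assumes a comma-separated file with no header row.

```python
import os
import pandas as pd

def load_every_third(source="file.csv", cache="file_filtered.csv"):
    """Return every 3rd row of `source`, caching the filtered rows in `cache`."""
    if os.path.exists(cache):
        # Later runs: read only the small, already filtered file.
        return pd.read_csv(cache, header=None)
    # First run: read everything once and keep rows 0, 3, 6, ...
    df = pd.read_csv(source, header=None).iloc[::3].reset_index(drop=True)
    # Store the filtered rows so that the big file is not needed next time.
    df.to_csv(cache, header=False, index=False)
    return df
```

The first call still pays the full read cost; only the following calls benefit from the smaller file.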

You need to create a csv reader object first, then create a generator that yields only every nth item from the iterator, then use it as the dataframe source. Doing it that way avoids excessive memory usage.

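import csv
import pandas as pd

with open('file.csv', newline='') as f:
    reader = csv.reader(f)  # the default delimiter is a comma
    # Lazily keep rows 0, 3, 6, ...
    data = (x for i, x in enumerate(reader) if i % 3 == 0)
    # The generator is lazy, so build the DataFrame while the file is still open
    df = pd.DataFrame(data)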
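
The same idea can also be written with `itertools.islice`, which steps through the reader without an explicit index test. This is a minimal sketch, assuming a comma-separated file with no header row; the function name `read_every_nth` is hypothetical. Every line is still read and parsed by `csv.reader`, but only the kept rows are held in memory, and the values arrive as strings.

```python
import csv
from itertools import islice

import pandas as pd

def read_every_nth(path, n=3):
    """Build a DataFrame from rows 0, n, 2n, ... of the csv file at `path`."""
    with open(path, newline="") as f:
        reader = csv.reader(f)
        # islice(reader, 0, None, n) yields rows 0, n, 2n, ...
        # Materialise the kept rows while the file is still open.
        rows = list(islice(reader, 0, None, n))
    return pd.DataFrame(rows)
```

If numeric columns are needed, the result can be converted afterwards, for example with `df.astype(int)` when every value is an integer.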

It looks like there is also a simpler way: passing a lambda to the skiprows argument of read_csv:

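import pandas as pd

# skiprows calls fn with each row index and skips the row when fn returns True,
# so this keeps rows 0, 3, 6, ...
fn = lambda x: x % 3 != 0
# header=None assumes the file has no header row, as in the sample above
df = pd.read_csv('file.csv', skiprows=fn, header=None)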
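
To check this on the sample from the question, here is a small sketch that reads the data from an in-memory string instead of a real file. It assumes the values are separated by whitespace (hence `sep=r"\s+"`) and that there is no header row (hence `header=None`); both are guesses about the real file.

```python
import io

import pandas as pd

# The six sample rows from the question, whitespace-separated.
sample = "4 5\n9 2\n3 7\n1 5\n2 4\n9 10\n"

df = pd.read_csv(
    io.StringIO(sample),
    sep=r"\s+",                      # assumption: whitespace-delimited columns
    header=None,                     # assumption: no header row
    skiprows=lambda i: i % 3 != 0,   # skip every row whose index is not a multiple of 3
)
print(df)  # rows 0 and 3 of the sample are kept
```

The file is still scanned from start to end, but the skipped rows never end up in the dataframe.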
