
What is the most efficient way to read and augment (copy samples and change some values) a large dataset in .csv?

Currently I have managed to solve this, but it is slower than what I need. It takes approximately 1 hour for 500k samples; the entire dataset is ~100M samples, which would require ~200 hours.

Hardware/software specs: 8 GB RAM, Windows 11 64-bit, Python 3.8.8

The problem:
I have a dataset in .csv (~13 GB) where each sample has a value and a respective start-end period of a few months. I want to create a dataset where each sample has the same value but refers to each specific month in that period.

For example:

from:

idx | start date | end date   | month | year | value
0   | 20/05/2022 | 20/07/2022 | 0     | 0    | X

to:

0 | 20/05/2022 | 20/07/2022 | 5 | 2022 | X
1 | 20/05/2022 | 20/07/2022 | 6 | 2022 | X
2 | 20/05/2022 | 20/07/2022 | 7 | 2022 | X

Ideas: Manage to do it in parallel (something like Dask, but I am not sure how for this task).

My implementation:
Chunk-read in pandas, augment in dictionaries, append to a CSV. I use a function that, given a df, calculates for each sample the months from start date to end date and creates a copy of the sample for each month, appending it to a dictionary. It then returns the final dictionary.

The calculations are done in dictionaries because they were found to be way faster than doing them in pandas. I then iterate through the original CSV in chunks, apply the function to each chunk, and append the resulting augmented data to another csv.

The function:

import csv

import pandas as pd


def augment_to_monthly_dict(chunk):
    '''
    Takes a df (or sub-df) and creates and returns an augmented dataset with
    monthly data in dictionary form (for efficiency).
    '''
    dict = {}
    l = 1
    for i in range(len(chunk)):  # iterate through every sample
        # Find the month and year period
        mst = int(float(str(chunk.iloc[i].start)[4:6]))  # start month
        mend = int(str(chunk.iloc[i].end)[4:6])          # end month
        yst = int(str(chunk.iloc[i].start)[:4])          # start year
        yend = int(str(chunk.iloc[i].end)[:4])           # end year

        if yend==yst:
            months=[ m for m in range(mst,mend+1)]   
            years=[yend for i in range(len(months))]         
        elif yend==yst+1:# year change at same sample
            months=[m for m in range(mst,13)]
            years=[yst for i in range(mst,13)]
            months= months+[m for m in range(1, mend+1)]
            years= years+[yend for i in range(1, mend+1)]
        else:
            continue
        #months is a list of each month in the period of the sample and years is a same 
        #length list of the respective years eg months=[11,12,1,2] , years= 
        #[2021,2022,2022,2022]

        for j in range(len(months)):#iterate through list of months
            #copy the original sample make it a dictionary
            tmp=pd.DataFrame(chunk.iloc[i]).transpose().to_dict(orient='records')

            #change the month and year values accordingly (they were 0 for initiation)

            tmp[0]['month'] = months[j]
            tmp[0]['year'] = years[j]
            # Here could add more calcs e.g. drop irrelevant columns, change datatypes etc 
            #to reduce size
            #
            #-------------------------------------
            #Append new row to the Augmented data
            dict[l] = tmp[0]
            l+=1
    return dict

Reading the original dataset (.csv, ~13 GB), augmenting it with the function, and appending the result to a new .csv:

chunk_count=0
for chunk in pd.read_csv('enc_star_logar_ek.csv', delimiter=';', chunksize=10000):

  chunk.index = chunk.reset_index().index

  aug_dict = augment_to_monthly_dict(chunk)#make chunk dictionary to work faster
  chunk_count+=1  

  if chunk_count ==1: #get the column names and open csv write headers and 1st chunk

       #Find the dicts keys, the column names only from the first dict(not reading all data)
       for kk in aug_dict.values():
            key_names = [i for i in kk.keys()] 
            print(key_names)
            break #break after first input dict

       #Open csv file and write ';' separated data
       with open('dic_to_csv2.csv', 'w', newline='') as csvfile:
            writer = csv.DictWriter(csvfile,delimiter=';', fieldnames=key_names)
            writer.writeheader()
            writer.writerows(aug_dict.values())

  else: # Save the rest of the data chunks
       print('added chunk: ', chunk_count)
       with open('dic_to_csv2.csv', 'a', newline='') as csvfile:
            writer = csv.DictWriter(csvfile,delimiter=';', fieldnames=key_names)
            writer.writerows(aug_dict.values())

I suggest you use pandas (or even dask) to build the list of months between the two date columns of a huge dataset (e.g., a ~13 GB .csv). First you need to convert the two columns to datetimes using pandas.to_datetime. Then you can use pandas.date_range to get your list.

Try this:

import pandas as pd
from io import StringIO

s = """start date   end date    month   year    value
20/05/2022  20/07/2022  0   0   X
"""

df = pd.read_csv(StringIO(s), sep='\t')

df['start date'] = pd.to_datetime(df['start date'], format = "%d/%m/%Y")
df['end date'] = pd.to_datetime(df['end date'], format = "%d/%m/%Y")

df["month"] = df.apply(lambda x: pd.date_range(start=x["start date"], end=x["end date"] + pd.DateOffset(months=1), freq="M").month.tolist(), axis=1)
df['year'] = df['start date'].dt.year

out = df.explode('month').reset_index(drop=True)

>>> print(out)

  start date   end date month  year value
0 2022-05-20 2022-07-20     5  2022     X
1 2022-05-20 2022-07-20     6  2022     X
2 2022-05-20 2022-07-20     7  2022     X

Note: I tested the code above on a 1-million-row .csv dataset and it took ~10 min to get the output.
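
Since the full ~13 GB file will not fit in 8 GB of RAM at once, a minimal sketch of the same idea applied chunk by chunk is shown below. It assumes the column names from the example above and ';' as the separator; the input file name is taken from the question and the output name is just a placeholder:

import pandas as pd

first = True
for chunk in pd.read_csv("enc_star_logar_ek.csv", sep=";", chunksize=100_000):
    chunk["start date"] = pd.to_datetime(chunk["start date"], format="%d/%m/%Y")
    chunk["end date"] = pd.to_datetime(chunk["end date"], format="%d/%m/%Y")
    # list of months covered by each sample, then explode to one row per month
    chunk["month"] = chunk.apply(
        lambda x: pd.date_range(
            start=x["start date"],
            end=x["end date"] + pd.DateOffset(months=1),
            freq="M",
        ).month.tolist(),
        axis=1,
    )
    chunk["year"] = chunk["start date"].dt.year
    out = chunk.explode("month")
    # write the header only once, then append
    out.to_csv("augmented.csv", sep=";", mode="w" if first else "a",
               header=first, index=False)
    first = False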

You can read a very large csv file with dask, then process it (same API as pandas), then convert it to a pandas dataframe if you need to. dask is perfect when pandas fails due to data size or computation speed. But for data that fits into RAM, pandas can often be faster and easier to use than a Dask DataFrame.

import dask.dataframe as dd

#1. read the large csv

dff = dd.read_csv('path_to_big_csv_file.csv') #return Dask.DataFrame

#if that is still not enough, try reducing IO costs further:
dff = dd.read_csv('largefile.csv', blocksize=25e6) #blocksize = number of bytes by which to cut up larger files
dff = dd.read_csv('largefile.csv', usecols=["a", "b", "c"]) #read only columns a, b and c

#2. work with dff, dask has the same api as pandas:
#https://docs.dask.org/en/stable/dataframe-api.html

#3. then, finally, convert dff to pandas dataframe if you want
df = dff.compute() #return pandas dataframe
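
As a rough, untested sketch of how step 2 could look for this particular task, you could apply the per-chunk logic to each partition with map_partitions. It assumes the column names from the question's example, a ';'-separated file, and mirrors the date_range/explode approach from the first answer:

import dask.dataframe as dd
import pandas as pd

def add_months(pdf):
    # runs on each pandas partition
    pdf = pdf.copy()
    pdf["start date"] = pd.to_datetime(pdf["start date"], format="%d/%m/%Y")
    pdf["end date"] = pd.to_datetime(pdf["end date"], format="%d/%m/%Y")
    pdf["month"] = pdf.apply(
        lambda r: pd.date_range(r["start date"],
                                r["end date"] + pd.DateOffset(months=1),
                                freq="M").month.tolist(),
        axis=1,
    )
    pdf["year"] = pdf["start date"].dt.year.astype("int64")
    return pdf.explode("month")

dff = dd.read_csv("enc_star_logar_ek.csv", sep=";", blocksize=25e6)
# meta describes the output columns/dtypes so dask does not have to guess them
meta = {"idx": "int64", "start date": "datetime64[ns]", "end date": "datetime64[ns]",
        "month": "object", "year": "int64", "value": "object"}
out = dff.map_partitions(add_months, meta=meta)
out.to_csv("augmented-*.csv", sep=";", index=False)  # one output file per partition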

You can also try other alternatives for reading very large csv files efficiently with high speed and low memory usage: polars, modin, koalas. All of these packages, like dask, use an API similar to pandas.
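
For instance, a minimal polars sketch of a lazy read (the file name and separator are assumptions from the question, and the delimiter keyword has been renamed between polars versions, older releases use sep instead of separator):

import polars as pl

# lazy scan: nothing is read into memory until .collect()
lf = pl.scan_csv("enc_star_logar_ek.csv", separator=";")
print(lf.select(["start date", "end date", "value"]).head(5).collect())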

If you have a very big csv file, pandas read_csv with chunksize usually doesn't succeed, and even if it does, it is a waste of time and energy.

Pandas's efficiency comes into play when you need to manipulate columns of data, and to do that Pandas reads the input row by row, building up a series of data for each column; that's a lot of extra computation your problem doesn't benefit from, and in fact it just slows your solution down.

You actually need to manipulate rows, and for that the fastest way is to use the standard csv module; all you need to do is read a row in, write the derived rows out, and repeat:

import csv
import sys

from datetime import datetime


def parse_dt(s):
    return datetime.strptime(s, r"%d/%m/%Y")


def get_dt_range(beg_dt, end_dt):
    """
    Returns a range of (month, year) tuples, from beg_dt up-to-and-including end_dt.
    """
    if end_dt < beg_dt:
        raise ValueError(f"end {end_dt} is before beg {beg_dt}")

    mo, yr = beg_dt.month, beg_dt.year

    dt_range = []
    while True:
        dt_range.append((mo, yr))
        if mo == 12:
            mo = 1
            yr = yr + 1
        else:
            mo += 1
        if (yr, mo) > (end_dt.year, end_dt.month):
            break

    return dt_range


fname = sys.argv[1]
with open(fname, newline="") as f_in, open("output_csv.csv", "w", newline="") as f_out:
    reader = csv.reader(f_in)
    writer = csv.writer(f_out)
    writer.writerow(next(reader))  # transfer header

    for row in reader:
        beg_dt = parse_dt(row[1])
        end_dt = parse_dt(row[2])
        for mo, yr in get_dt_range(beg_dt, end_dt):
            row[3] = mo
            row[4] = yr
            writer.writerow(row)
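
As a quick check of get_dt_range on a range that crosses a year boundary (using the parse_dt helper defined above):

print(get_dt_range(parse_dt("20/12/2022"), parse_dt("20/01/2023")))
# [(12, 2022), (1, 2023)]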

And, to compare with Pandas in general, let's examine @abokey's specific Pandas solution. I'm not sure if there is a better Pandas implementation, but this one kinda does the right thing:

import sys
import pandas as pd

fname = sys.argv[1]
df = pd.read_csv(fname)

df["start date"] = pd.to_datetime(df["start date"], format="%d/%m/%Y")
df["end date"] = pd.to_datetime(df["end date"], format="%d/%m/%Y")

df["month"] = df.apply(
    lambda x: pd.date_range(
        start=x["start date"], end=x["end date"] + pd.DateOffset(months=1), freq="M"
    ).month.tolist(),
    axis=1,
)
df["year"] = df["start date"].dt.year

out = df.explode("month").reset_index(drop=True)

out.to_csv("output_pd.csv")

Let's start with the basics, though: do the programs actually do the right thing? Given this input:

idx,start date,end date,month,year,value
0,20/05/2022,20/05/2022,0,0,X
0,20/05/2022,20/07/2022,0,0,X
0,20/12/2022,20/01/2023,0,0,X

My program, ./main.py input.csv, produces:

idx,start date,end date,month,year,value
0,20/05/2022,20/05/2022,5,2022,X
0,20/05/2022,20/07/2022,5,2022,X
0,20/05/2022,20/07/2022,6,2022,X
0,20/05/2022,20/07/2022,7,2022,X
0,20/12/2022,20/01/2023,12,2022,X
0,20/12/2022,20/01/2023,1,2023,X

I believe that's what you're looking for.

The Pandas solution, ./main_pd.py input.csv, produces:

,idx,start date,end date,month,year,value
0,0,2022-05-20,2022-05-20,5,2022,X
1,0,2022-05-20,2022-07-20,5,2022,X
2,0,2022-05-20,2022-07-20,6,2022,X
3,0,2022-05-20,2022-07-20,7,2022,X
4,0,2022-12-20,2023-01-20,12,2022,X
5,0,2022-12-20,2023-01-20,1,2022,X

Ignoring the added column for the frame index, and the fact that the date format has been changed (I'm pretty sure that can be fixed with some Pandas directive I don't know about), it still does the right thing with regard to creating new rows with the appropriate date range.

So, both do the right thing. Now, on to performance. I duplicated your initial sample, just the 1 row, for 1_000_000 and 10_000_000 rows:

import sys

nrows = int(sys.argv[1])
with open(f"input_{nrows}.csv", "w") as f:
    f.write("idx,start date,end date,month,year,value\n")
    for _ in range(nrows):
        f.write("0,20/05/2022,20/07/2022,0,0,X\n")

I'm running a 2020 M1 MacBook Air with the 2TB SSD (which gives very good read/write speeds):

           | 1M rows (sec, RAM) | 10M rows (sec, RAM)
csv module | 7.8s, 6MB          | 78s, 6MB
Pandas     | 75s, 569MB         | 750s, 5.8GB

You can see that both programs show a linear increase in run time that follows the increase in the number of rows. The csv module's memory use remains constant and essentially non-existent because it streams data in and out, holding on to virtually nothing; Pandas's memory rises with the number of rows it has to hold so that it can do the actual date-range computations, again on whole columns. Also, not shown, but for the 10M-row Pandas test, Pandas spent nearly 2 minutes just writing the CSV, longer than the csv-module approach took to complete the entire task.

Now, for all my putting down of Pandas, that solution is far fewer lines and is probably bug-free from the get-go. I did have a problem writing get_dt_range(), and had to spend about 5 minutes thinking about what it actually needed to do and debugging it.

You can view my setup with the small test harness, and the results, here.

There's a Table helper in the convtools library (I must confess, a lib of mine). This helper processes csv files as a stream, using a simple csv.reader under the hood:

from datetime import datetime

from convtools import conversion as c
from convtools.contrib.tables import Table


def dt_range_to_months(dt_start, dt_end):
    return tuple(
        (year_month // 12, year_month % 12 + 1)
        for year_month in range(
            dt_start.year * 12 + dt_start.month - 1,
            dt_end.year * 12 + dt_end.month,
        )
    )


(
    Table.from_csv("tmp/in.csv", header=True)
    .update(
        year_month=c.call_func(
            dt_range_to_months,
            c.call_func(datetime.strptime, c.col("start date"), "%d/%m/%Y"),
            c.call_func(datetime.strptime, c.col("end date"), "%d/%m/%Y"),
        )
    )
    .explode("year_month")
    .update(
        year=c.col("year_month").item(0),
        month=c.col("year_month").item(1),
    )
    .drop("year_month")
    .into_csv("tmp/out.csv")
)
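
As a quick sanity check of the dt_range_to_months helper above (independent of convtools itself), it reproduces the cross-year case from the question:

from datetime import datetime

print(dt_range_to_months(datetime(2021, 11, 1), datetime(2022, 2, 1)))
# ((2021, 11), (2021, 12), (2022, 1), (2022, 2))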

On my M1 Mac, on a file where each row explodes into three, it processes 100K rows per second. For 100M rows of the same structure it should take ~1000 s (< 17 min). Of course, it depends on how deep the inner by-month cycles are.
