
Dask dataframe saving to_csv for incremental data - Efficient Writing to csv

I have existing code that reads streaming data and stores it in a pandas DataFrame (new data arrives every 5 minutes). I then capture this data category-wise (~350 categories).

Next, I write all the new data (as it is to be stored incrementally) using to_csv in a loop.

The pseudocode is given below:

    for row in parentdf.itertuples():  # insert into <tbl>
        mycat = row.category  # the ONLY parameter passed to the key function below
        try:
            df = FnforExtractingNParsingData(mycat, NumericParam1, NumericParam1)

            df.insert(0, 'NewCol', sym)
            df = df.assign(calculatedCol=functions1(params))
            df = df.assign(calculatedCol1=functions2(params, 20))
            df = df.assign(calculatedCol3=functions3(more_params, 20))
            df[20:].to_csv(outfile, mode='a', header=False, index=False)
        except Exception:
            pass  # exception handling elided in this pseudocode; handled in the real code

The category-wise reading and storing to csv takes about 2 minutes per cycle, which is close to 0.34 seconds for each of the 350 incremental category writes. I am wondering whether I can make the above process faster and more efficient by using Dask DataFrames.

I looked up dask.org, including the use cases, but didn't find a clear answer.

Additional details: I am using Python 3.7 and Pandas 0.25. The above code doesn't return any errors, and we have already implemented a good amount of exception handling around it. My key function, FnforExtractingNParsingData, is fairly resilient and has been working as desired for a long time.

Sounds like you're reading data into a Pandas DataFrame every 5 minutes and then writing it to disk. The question doesn't mention some key facts:

  • how much data is ingested every 5 minutes (10MB or 10TB)?
  • where is the code being executed (AWS Lambda or a big cluster of machines)?
  • what data operations does FnforExtractingNParsingData perform?

Dask DataFrames can be written to disk as multiple CSV files in parallel, which can be a lot faster than writing a single file with Pandas, but it depends. Dask is overkill for a tiny dataset. Dask can leverage all the CPUs of a single machine, so it can scale up on a single machine better than most people realize. For large datasets, Dask will help a lot. Feel free to provide more details in your question and I can give more specific suggestions.
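For illustration, here is a minimal sketch of the parallel-CSV approach, assuming the ingested data already fits in a pandas DataFrame; `parentdf`, the partition count, and the output file pattern are placeholders rather than anything taken from the question:

    # A sketch, not the poster's code: write the partitions of a Dask
    # DataFrame to separate CSV files in parallel.
    import dask.dataframe as dd

    # Convert the in-memory pandas DataFrame (parentdf in the question)
    # into a Dask DataFrame with several partitions; each partition can
    # be processed and written by a separate worker/CPU core.
    ddf = dd.from_pandas(parentdf, npartitions=8)

    # A '*' in the path produces one file per partition, written in
    # parallel, e.g. output-0.csv, output-1.csv, ...
    ddf.to_csv('output-*.csv', index=False)

Note that appending to a single file with mode='a' is inherently serial; the speed-up here comes from writing a separate file per partition, and the pieces can be read back later with dd.read_csv('output-*.csv').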
