
Dask dataframe saving to_csv for incremental data - Efficient Writing to csv

I have existing code that reads streaming data and stores it in a pandas DataFrame (new data arrives every 5 minutes). I then capture this data category-wise (~350 categories).

Next, I write all the new data (as it is to be stored incrementally) using to_csv in a loop.

The pseudocode is given below:

    for row in parentdf.itertuples():  # insert into <tbl>
        mycat = row.category  # the ONLY parameter passed to the key function below
        try:
            df = FnforExtractingNParsingData(mycat, NumericParam1, NumericParam1)

            df.insert(0, 'NewCol', sym)
            df = df.assign(calculatedCol=functions1(params))
            df = df.assign(calculatedCol1=functions2(params, 20))
            df = df.assign(calculatedCol3=functions3(more_params, 20))
            df[20:].to_csv(outfile, mode='a', header=False, index=False)
        except Exception:
            pass  # exception handling elided in this pseudocode; handled in the real code

The category-wise reading and storing to csv takes about 2 minutes per cycle, which is close to 0.34 seconds for each of the 350 incremental category writes. I am wondering whether I can make the above process faster and more efficient by using Dask DataFrames.

I looked up dask.org, including the use cases, but didn't find a clear answer.

Additional details: I am using Python 3.7 and Pandas 0.25. The above code doesn't return any errors, and we have already implemented a good amount of exception handling around it. My key function, FnforExtractingNParsingData, is fairly resilient and has been working as desired for a long time.

Sounds like you're reading data into a Pandas DataFrame every 5 minutes and then writing it to disk. The question doesn't mention some key facts:

  • how much data is ingested every 5 minutes (10MB or 10TB)?
  • where is the code being executed (AWS Lambda or a big cluster of machines)?
  • what data operations does FnforExtractingNParsingData perform?

Dask DataFrames can be written to disk as multiple CSV files in parallel, which can be a lot faster than writing a single file with Pandas, but it depends. Dask is overkill for a tiny dataset. Dask can leverage all the CPUs of a single machine, so it can scale up on a single machine better than most people realize. For large datasets, Dask will help a lot. Feel free to provide more details in your question and I can give more specific suggestions.
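For illustration, here is a minimal sketch of the parallel-CSV approach, assuming the ingested data already fits in a pandas DataFrame; `parentdf`, the partition count, and the output file pattern are placeholders rather than anything taken from the question:

    # A sketch, not the poster's code: write the partitions of a Dask
    # DataFrame to separate CSV files in parallel.
    import dask.dataframe as dd

    # Convert the in-memory pandas DataFrame (parentdf in the question)
    # into a Dask DataFrame with several partitions; each partition can
    # be processed and written by a separate worker/CPU core.
    ddf = dd.from_pandas(parentdf, npartitions=8)

    # A '*' in the path produces one file per partition, written in
    # parallel, e.g. output-0.csv, output-1.csv, ...
    ddf.to_csv('output-*.csv', index=False)

Note that appending to a single file with mode='a' is inherently serial; the speed-up here comes from writing a separate file per partition, and the pieces can be read back later with dd.read_csv('output-*.csv').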
