pandas: splitting a DataFrame by column value using parallel processing

Question

I have a very large pandas DataFrame, and I am trying to split it into several DataFrames by stock name and save each one to CSV.

 stock     date     time   spread  time_diff 
  VOD      01-01    9:05    0.01     0:07     
  VOD      01-01    9:12    0.03     0:52     
  VOD      01-01   10:04    0.02     0:11
  VOD      01-01   10:15    0.01     0:10     
  BAT      01-01   10:25    0.03     0:39  
  BAT      01-01   11:04    0.02    22:00 
  BAT      01-02    9:04    0.02     0:05
  BAT      01-01   10:15    0.01     0:10     
  BOA      01-01   10:25    0.03     0:39  
  BOA      01-01   11:04    0.02    22:00 
  BOA      01-02    9:04    0.02     0:05

I know how to do this the conventional way:

def split_save(df):
    ids = df['stock'].unique()
    for id in ids:
        # Use a separate name so the full DataFrame is not overwritten on the first pass
        subset = df[df['stock'] == id]
        subset.to_csv(f'{my_path}/{id}.csv')

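However, since the DataFrame is very large and contains thousands of stocks, I would like to use multiprocessing to speed this up.

Any ideas? (I may also try pyspark later.)

Thanks!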

Answer 1

Since I/O is involved, I don't expect selecting rows from the DataFrame to be the main bottleneck.
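One way to check this on your own data is to time the split and the CSV serialization separately. The following is a minimal sketch on synthetic data: the tickers, sizes, and column values are made up for illustration, and the CSV is written to memory so that disk speed is left out of the measurement.

```python
import io
import time

import numpy as np
import pandas as pd

# Synthetic stand-in for the real data (hypothetical tickers and values).
rng = np.random.default_rng(0)
tickers = [f"T{i:03d}" for i in range(50)]
df = pd.DataFrame({
    "stock": rng.choice(tickers, size=20_000),
    "spread": rng.random(20_000).round(2),
})

# Step 1: time only the split into per-ticker frames.
start = time.perf_counter()
groups = {name: group for name, group in df.groupby("stock")}
split_seconds = time.perf_counter() - start

# Step 2: time only the CSV serialization (in memory, so the disk is excluded).
start = time.perf_counter()
for group in groups.values():
    group.to_csv(io.StringIO())
write_seconds = time.perf_counter() - start

print(f"split: {split_seconds:.4f}s, to_csv: {write_seconds:.4f}s")
```

The numbers depend on the machine and on the shape of the data, so treat them as a rough guide to where the time goes rather than as a benchmark.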

So far I can offer you two ways to speed this up:

Threads: simply launch the work for each stock in a separate thread, or in a ThreadPoolExecutor.

import concurrent.futures

def dump_csv(df, ticker):
    # Select the rows for one ticker and write them to their own CSV file
    df[df['stock'] == ticker].to_csv(f'{my_path}/{ticker}.csv')

# We can use a with statement to ensure threads are cleaned up promptly
with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
    futures = {executor.submit(dump_csv, df, ticker): ticker for ticker in df['stock'].unique()}
    for future in concurrent.futures.as_completed(futures):
        future.result()  # re-raises any exception from the worker thread
        print(f"Dumped ticker {futures[future]}")

(The code is untested and adapted from an example.)

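Working in a ZIP file: a zip archive is a good option for storing many files, but whatever reads the data later has to support it.

For the sake of completeness: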

import zipfile

with zipfile.ZipFile('stocks.zip', 'w', compression=zipfile.ZIP_DEFLATED) as zf:
    # One archive member per ticker; to_csv() with no path returns the CSV as a string
    for ticker, group in df.groupby('stock'):
        zf.writestr(f'{ticker}.csv', group.to_csv())
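On the "reader" side, a single ticker can be loaded straight from the archive without extracting everything. This is a minimal sketch with a tiny made-up DataFrame and an in-memory archive (both are assumptions for illustration); with a real file you would pass the archive path to zipfile.ZipFile instead of the buffer.

```python
import io
import zipfile

import pandas as pd

# Small stand-in frame (hypothetical values) so the sketch runs on its own.
df = pd.DataFrame({
    "stock": ["VOD", "VOD", "BAT"],
    "spread": [0.01, 0.03, 0.02],
})

# Write one CSV member per ticker into an in-memory archive.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED) as zf:
    for ticker, group in df.groupby("stock"):
        zf.writestr(f"{ticker}.csv", group.to_csv(index=False))

# Read a single ticker back without extracting the whole archive.
with zipfile.ZipFile(buffer) as zf:
    names = sorted(zf.namelist())
    with zf.open("VOD.csv") as member:
        vod = pd.read_csv(member)

print(names)
print(vod)
```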

Answer 2

I doubt that groupby is what is holding you back, but for the writing step we can speed things up with multithreading like this:

from concurrent.futures import ThreadPoolExecutor

# Number of cores/threads your CPU has/that you want to use.
workers = 4 

def save_group(grouped):
    name, group = grouped
    group.to_csv(f'{name}.csv')

with ThreadPoolExecutor(workers) as pool:
    processed = pool.map(save_group, df.groupby('stock'))
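Note that pool.map returns its results lazily, so an exception raised inside a worker only surfaces when the results are consumed. Here is a self-contained variation of the snippet above that wraps the call in list() for that reason. The sample data and the temporary output directory are assumptions for illustration.

```python
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

import pandas as pd

# Tiny stand-in for the real data (hypothetical values).
df = pd.DataFrame({
    "stock": ["VOD", "VOD", "BAT", "BOA"],
    "spread": [0.01, 0.03, 0.02, 0.03],
})

out_dir = Path(tempfile.mkdtemp())  # hypothetical output location

def save_group(grouped):
    # Each item produced by groupby is a (name, sub-frame) pair.
    name, group = grouped
    path = out_dir / f"{name}.csv"
    group.to_csv(path, index=False)
    return path.name

with ThreadPoolExecutor(max_workers=4) as pool:
    # list() consumes the results, so a worker exception is re-raised here.
    written = list(pool.map(save_group, df.groupby("stock")))

print(sorted(written))
```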

Answer 1: 2, 2022-05-21 21:44:02
Answer 2: 2, accepted, 2022-05-21 23:04:06
