Efficient way to run str.contains on a Pandas DataFrame for 10 million patterns and get the match count for each pattern

Question

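I have a pandas DataFrame named subset, and for each pattern in a set named motifs I want to count the number of rows whose 'sequence' column contains that pattern. I have done this with a for loop that iterates over the set of motifs and counts the matches for each one. However, the set is large (about 10 million motifs), and this step takes a very long time. Is there a more efficient way to run str.contains for 10 million patterns?

Here is my code: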

motif_background = {}
for motif in motifs:  # loop through the set of 12,000,000 motifs
    # number of rows whose 'sequence' column contains the motif
    match = subset['sequence'].str.contains(motif).sum()
    motif_background[motif] = match  # assign; .append on a missing key raises KeyError
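
To make the desired result concrete, here is a minimal sketch on toy data. The three sequences and three motifs are made up for illustration. It also assumes the motifs are literal strings rather than regular expressions, which is why it passes regex=False; str.contains treats patterns as regular expressions by default.

```python
import pandas as pd

# Toy stand-ins for the real data (hypothetical values).
subset = pd.DataFrame({"sequence": ["ACGTAC", "TTACGG", "GGGTTT"]})
motifs = {"ACG", "TT", "CCC"}

# One entry per motif: the number of rows whose 'sequence' contains it.
# regex=False matches the motif as a plain substring (an assumption about the data).
motif_background = {
    motif: int(subset["sequence"].str.contains(motif, regex=False).sum())
    for motif in motifs
}
```

Each motif still requires one full pass over the column, so this only illustrates the output shape; it does not remove the cost of looping over millions of motifs.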

Answer 1

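For a large dataset like yours, you can use multiprocessing to compute the results in parallel across multiple cores, which should be faster.

Here is the general pattern (df and fn_to_execute are placeholders for your DataFrame and the function applied to each chunk):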

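import os
from multiprocessing import Pool

import numpy as np
import pandas as pd

# Placeholders: df is your DataFrame and fn_to_execute is the function applied
# to each chunk; it must be defined at module level so workers can import it.
if __name__ == "__main__":
    num_cores = os.cpu_count()
    # Split the rows into one chunk per core and process the chunks in parallel.
    with Pool(num_cores) as pool:
        split_df_results = pool.map(fn_to_execute, np.array_split(df, num_cores))
    df = pd.concat(split_df_results)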
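
Applied to the question, one option is to split the motifs (rather than the DataFrame rows) across workers, so each worker returns complete counts for its share of the motifs. The following is only a sketch under stated assumptions: the helper names count_rows_containing, chunked, and count_motifs_parallel are hypothetical, the motifs are treated as literal substrings (not regular expressions), and the sequences are passed as a plain list such as subset['sequence'].tolist().

```python
import os
from functools import partial
from multiprocessing import Pool


def count_rows_containing(motif_chunk, sequences):
    """Return {motif: number of sequences containing it as a literal substring}."""
    return {m: sum(m in s for s in sequences) for m in motif_chunk}


def chunked(items, n_chunks):
    """Split a list into at most n_chunks consecutive slices of similar size."""
    size = max(1, -(-len(items) // n_chunks))  # ceiling division
    return [items[i:i + size] for i in range(0, len(items), size)]


def count_motifs_parallel(sequences, motifs, processes=None):
    """Count, for every motif, how many sequences contain it, using a process pool."""
    processes = processes or os.cpu_count() or 1
    chunks = chunked(list(motifs), processes)
    worker = partial(count_rows_containing, sequences=sequences)
    motif_background = {}
    with Pool(processes) as pool:
        # Each worker handles one chunk of motifs and returns a partial dict.
        for partial_counts in pool.map(worker, chunks):
            motif_background.update(partial_counts)
    return motif_background


if __name__ == "__main__":
    # Toy stand-ins for subset['sequence'].tolist() and motifs (hypothetical values).
    sequences = ["ACGTAC", "TTACGG", "GGGTTT"]
    motifs = {"ACG", "TT", "CCC"}
    print(count_motifs_parallel(sequences, motifs, processes=2))
```

The list of sequences is sent to the workers along with each chunk, so for a very large DataFrame it may be worth using more, smaller motif chunks or sharing the sequences through a pool initializer. The total work is still proportional to the number of motifs times the number of rows; multiprocessing only divides it across cores.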

Solution 1
0 2020-11-19 22:36:10
