[英]How to use a dictionary to speed up the task of look up and counting?
考慮以下代碼段:
data = {"col1":["aaa","bbb","ccc","aaa","ddd","bbb"],
"col2":["fff","aaa","ggg","eee","ccc","ttt"]}
df = pd.DataFrame(data,columns=["col1","col2"]) # my actual dataframe has
# 20,00,000 such rows
list_a = ["ccc","aaa","mmm","nnn","ccc"]
list_b = ["ggg","fff","eee","ooo","ddd"]
# After doing a combination of 2 elements between the 2 lists in both orders,
# we get a list that resembles something like this:
new_list = ["ccc-ggg", "ggg-ccc", "aaa-fff", "fff-aaa", ..."ccc-fff", "fff-ccc", ...]
給定一個巨大的 dataframe 和 2 個列表,我想計算 new_list 中與 dataframe 相同的元素數量。 在上面的偽示例中,結果將為 3,因為:“aaa-fff”、“ccc-ggg”和“ddd-ccc”在 dataframe 的同一行中。
現在,我正在使用線性搜索算法,但它非常慢,因為我必須掃描整個 dataframe。
df['col3']=df['col1']+"-"+df['col2']
for a in list_a:
c1 = 0
for b in list_b:
str1=a+"-"+b
str2=b+"-"+a
str1=a+"-"+b
c2 = (df['col3'].str.contains(str1).sum())+(df['col3'].str.contains(str2).sum())
c1+=c2
return c1
有人可以幫我實現一個更快的算法,最好使用字典數據結構嗎?
注意:我必須遍歷另一個 dataframe 的 7,000 行並動態創建 2 個列表,並獲取每行的聚合計數。
這是另一種方式。 首先,我使用了您對 df(有 2 列)、list_a 和 list_b 的定義。
# combine two columns in the data frame
df['col3'] = df['col1'] + '-' + df['col2']
# create set with list_a and list_b pairs
s = ({ f'{a}-{b}' for a, b in zip(list_a, list_b)} |
{ f'{b}-{a}' for a, b in zip(list_a, list_b)})
# find intersection
result = set(df['col3']) & s
print(len(result), '\n', result)
3
{'ddd-ccc', 'ccc-ggg', 'aaa-fff'}
更新處理重復值。
# build list (not set) from list_a and list_b
idx = ([ f'{a}-{b}' for a, b in zip(list_a, list_b) ] +
[ f'{b}-{a}' for a, b in zip(list_a, list_b) ])
# create `col3`, and do `value_counts()` to preserve info about duplicates
df['col3'] = df['col1'] + '-' + df['col2']
tmp = df['col3'].value_counts()
# use idx to sub-select from to value counts:
tmp[ tmp.index.isin(idx) ]
# results:
ddd-ccc 1
aaa-fff 1
ccc-ggg 1
Name: col3, dtype: int64
嘗試這個:
from itertools import product
# all combinations of the two lists as tuples
all_list_combinations = list(product(list_a, list_b))
# tuples of the two columns
dftuples = [x for x in df.itertuples(index=False, name=None)]
# take the length of hte intersection of the two sets and print it
print(len(set(dftuples).intersection(set(all_list_combinations))))
產量
3
首先在循環之前加入列,然后不是循環傳遞一個可選的正則表達式來包含所有可能的字符串。
joined = df.col1+ '-' + df.col2
pat = '|'.join([f'({a}-{b})' for a in list_a for b in list_b] +
[f'({b}-{a})' for a in list_a for b in list_b]) # substitute for itertools.product
ct = joined.str.contains(pat).sum()
要使用 dicts 而不是數據框,您可以使用filter(re, joined)
,如this question
import re
data = {"col1":["aaa","bbb","ccc","aaa","ddd","bbb"],
"col2":["fff","aaa","ggg","eee","ccc","ttt"]}
list_a = ["ccc","aaa","mmm","nnn","ccc"]
list_b = ["ggg","fff","eee","ooo","ddd"]
### build the regex pattern
pat_set = set('-'.join(combo) for combo in set(
list(itertools.product(list_a, list_b)) +
list(itertools.product(list_b, list_a))))
pat = '|'.join(pat_set)
# use itertools to generalize with many colums, remove duplicates with set()
### join the columns row-wise
joined = ['-'.join(row) for row in zip(*[vals for key, vals in data.items()])]
### filter joined
match_list = list(filter(re.compile(pat).match, joined))
ct = len(match_list)
series.isin series.isin()
的第三個選項受jsmart 的回答啟發
joined = df.col1 + '-' + df.col2
ct = joined.isin(pat_set).sum()
速度測試
我重復數據 100,000 次以進行可擴展性測試。 series.isin()
花了一天時間,而 jsmart 的答案很快,但沒有找到所有出現的地方,因為它從joined
中刪除了重復項
with dicts: 400000 matches, 1.00 s
with pandas: 400000 matches, 1.77 s
with series.isin(): 400000 matches, 0.39 s
with jsmart answer: 4 matches, 0.50 s
聲明:本站的技術帖子網頁,遵循CC BY-SA 4.0協議,如果您需要轉載,請注明本站網址或者原文地址。任何問題請咨詢:yoyou2525@163.com.