Regex in a Python dataframe: counting occurrences of a pattern

Question

I want to count how often a regular expression (the preceding and following characters are needed to identify the pattern) occurs across multiple dataframe columns. I found a solution, but it seems a little slow. Is there a more sophisticated way?

column_A	column_B	column_C
Test • test abc	winter • sun	snow rain blank
blabla • summer abc	break • Data	test letter • stop.

So far I have come up with this solution, which is slow:

print(df["column_A"].str.count("(?<=[A-Za-z]) • (?=[A-Za-z])").sum() + df["column_B"].str.count("(?<=[A-Za-z]) • (?=[A-Za-z])").sum() + df["column_C"].str.count("(?<=[A-Za-z]) • (?=[A-Za-z])").sum())
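For anyone who wants to reproduce this, here is a minimal sketch that rebuilds the sample data and runs the same count as a loop over the columns. It assumes the table above is the whole dataframe; the name PATTERN is introduced here for readability and is not part of the original snippet.

```python
import pandas as pd

# Sample data copied from the table above; `df` matches the name used in the question.
df = pd.DataFrame({
    "column_A": ["Test • test abc", "blabla • summer abc"],
    "column_B": ["winter • sun", "break • Data"],
    "column_C": ["snow rain blank", "test letter • stop."],
})

# A bullet surrounded by single spaces, with an ASCII letter directly before and after.
PATTERN = r"(?<=[A-Za-z]) • (?=[A-Za-z])"

# Same computation as the one-liner, written as a loop over the columns.
total = sum(df[col].str.count(PATTERN).sum() for col in df.columns)
print(total)
# 5
```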

Answer 1

You can use a generator expression with re.search. On the sample data this reduces the runtime from 938 µs to about 26 µs. (Make sure not to build a list; use a generator instead.)

import re

res = sum(sum(True for item in df[col] if re.search("(?<=[A-Za-z]) • (?=[A-Za-z])", item))
       for col in ['column_A', 'column_B','column_C'])
print(res)
# 5
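One thing to keep in mind: re.search stops at the first match, so the expression above counts cells that contain the pattern, not the total number of occurrences. The two agree on the sample data because no cell contains the pattern twice. Below is a rough sketch of both counts using a precompiled pattern; the helper names and the test data are hypothetical and only there for illustration.

```python
import re

# Compile once so the pattern is not looked up again for every cell.
PATTERN = re.compile(r"(?<=[A-Za-z]) • (?=[A-Za-z])")

def count_occurrences(columns):
    """Total number of matches across an iterable of string iterables."""
    return sum(len(PATTERN.findall(cell)) for col in columns for cell in col)

def count_matching_cells(columns):
    """Number of cells with at least one match (what re.search counts)."""
    return sum(1 for col in columns for cell in col if PATTERN.search(cell))

# Hypothetical data: the first cell contains the pattern twice.
columns = [["a • b • c", "no bullet here"], ["x • y"]]
print(count_occurrences(columns))     # 3
print(count_matching_cells(columns))  # 2
```

With a dataframe, something like count_occurrences(df[col] for col in ['column_A', 'column_B', 'column_C']) should work the same way, since iterating over a column yields its cell values.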

Benchmark:

%%timeit 
sum(sum(True for item in df[col] if re.search("(?<=[A-Za-z]) • (?=[A-Za-z])", item)) for col in ['column_A', 'column_B','column_C'])
# 26 µs ± 2.2 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

%%timeit 
df["column_A"].str.count("(?<=[A-Za-z]) • (?=[A-Za-z])").sum() + df["column_B"].str.count("(?<=[A-Za-z]) • (?=[A-Za-z])").sum() + df["column_C"].str.count("(?<=[A-Za-z]) • (?=[A-Za-z])").sum()
# 938 µs ± 149 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Answer 2

str.count can be applied to the whole dataframe without hard-coding each column this way. Try:

sum(df.apply(lambda x: x.str.count("(?<=[A-Za-z]) • (?=[A-Za-z])").sum()))

I tested this on a 1000 * 1000 dataframe. Here is a benchmark for your reference.

%timeit sum(df.apply(lambda x: x.str.count("(?<=[A-Za-z]) • (?=[A-Za-z])").sum()))
1.97 s ± 54.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
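A possible variation, offered as a sketch rather than a measured improvement: reshape the frame into a single column first, so that str.count is called once instead of once per column. This assumes every column holds strings; the small frame below is a stand-in for illustration, and whether this is faster on your data is something to check with %timeit.

```python
import pandas as pd

PATTERN = r"(?<=[A-Za-z]) • (?=[A-Za-z])"

# Small stand-in frame; every column is assumed to hold strings.
df = pd.DataFrame({
    "column_A": ["Test • test abc", "blabla • summer abc"],
    "column_B": ["winter • sun", "break • Data"],
})

# melt() reshapes wide to long: all cell values end up in one "value" column.
total = df.melt()["value"].str.count(PATTERN).sum()
print(total)
```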
