Regex in python dataframe: count occurrences of pattern
I want to count how often a regex pattern (the preceding and following characters are needed to identify it) occurs across multiple dataframe columns. I found a solution, but it seems a little slow. Is there a more sophisticated way?
column_A | column_B | column_C
---|---|---
Test • test abc | winter • sun | snow rain blank
blabla • summer abc | break • Data | test letter • stop.
So far I have come up with this solution, which is slow:
print(df["column_A"].str.count("(?<=[A-Za-z]) • (?=[A-Za-z])").sum() + df["column_B"].str.count("(?<=[A-Za-z]) • (?=[A-Za-z])").sum() + df["column_C"].str.count("(?<=[A-Za-z]) • (?=[A-Za-z])").sum())
You can use a comprehension with re.search. This reduces the run time from 938 µs to 26.7 µs. (Make sure you don't build a list; use a generator expression instead.)
import re

res = sum(sum(True for item in df[col] if re.search("(?<=[A-Za-z]) • (?=[A-Za-z])", item))
          for col in ['column_A', 'column_B', 'column_C'])
print(res)
# 5
Benchmark:
%%timeit
sum(sum(True for item in df[col] if re.search("(?<=[A-Za-z]) • (?=[A-Za-z])", item)) for col in ['column_A', 'column_B','column_C'])
# 26 µs ± 2.2 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%%timeit
df["column_A"].str.count("(?<=[A-Za-z]) • (?=[A-Za-z])").sum() + df["column_B"].str.count("(?<=[A-Za-z]) • (?=[A-Za-z])").sum() + df["column_C"].str.count("(?<=[A-Za-z]) • (?=[A-Za-z])").sum()
# 938 µs ± 149 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
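One caveat worth checking before you swap one for the other: re.search only tells you whether a cell contains the pattern, so the generator above counts matching cells, while str.count counts every occurrence. On the sample data both give 5 because no cell holds the pattern twice. Here is a minimal sketch of the difference, assuming the columns are represented as plain lists (a stand-in for the dataframe so it runs without pandas) and with one cell deliberately altered to contain two bullets. The names PATTERN, columns, count_cells and count_occurrences are made up for the illustration.

```python
import re

# Compile once so the pattern is not looked up again for every cell.
PATTERN = re.compile(r"(?<=[A-Za-z]) • (?=[A-Za-z])")

# Stand-in for the question's dataframe: column name -> list of cell strings.
# Assumption: the last cell of column_C is altered to hold two bullets.
columns = {
    "column_A": ["Test • test abc", "blabla • summer abc"],
    "column_B": ["winter • sun", "break • Data"],
    "column_C": ["snow rain blank", "test • letter • stop."],
}

def count_cells(cols):
    """Number of cells with at least one match (what re.search counts)."""
    return sum(1 for cells in cols.values() for cell in cells if PATTERN.search(cell))

def count_occurrences(cols):
    """Total number of non-overlapping matches (what str.count counts)."""
    return sum(len(PATTERN.findall(cell)) for cells in cols.values() for cell in cells)

print(count_cells(columns))        # 5
print(count_occurrences(columns))  # 6
```

With a real dataframe you would iterate over df[col] instead of the lists. If you need occurrences rather than matching cells, the findall variant is the one that should agree with str.count.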
# --------------------------------------------------------------------#
str.count can be applied to the whole dataframe without hard-coding each column this way. Try
sum(df.apply(lambda x: x.str.count("(?<=[A-Za-z]) • (?=[A-Za-z])").sum()))
I tried this with a 1000 * 1000 dataframe. Here is a benchmark for your reference.
%timeit sum(df.apply(lambda x: x.str.count("(?<=[A-Za-z]) • (?=[A-Za-z])").sum()))
1.97 s ± 54.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
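Another way to avoid hard-coding the columns is to stack them into one long Series and call str.count once. This is only a sketch under a few assumptions: the frame is rebuilt from the question's sample rows, the names PATTERN, cols and total are made up for the example, and dtype=object is set so the strings stay plain Python objects and the lookaround pattern is handled by Python's re module. I have not benchmarked it against the apply version, so time it on your own data before relying on it.

```python
import pandas as pd

PATTERN = r"(?<=[A-Za-z]) • (?=[A-Za-z])"

# Sample rows from the question; dtype=object is an assumption of this sketch.
df = pd.DataFrame(
    {
        "column_A": ["Test • test abc", "blabla • summer abc"],
        "column_B": ["winter • sun", "break • Data"],
        "column_C": ["snow rain blank", "test letter • stop."],
    },
    dtype=object,
)

# Columns to search; adjust this list for your own frame.
cols = ["column_A", "column_B", "column_C"]

# Stack the selected columns into a single Series, count the matches in each
# cell, then add up the per-cell counts.
total = int(df[cols].stack().str.count(PATTERN).sum())
print(total)  # 5
```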