Counting regex matches in one column, grouped by the values in another column, with pandas

Question

I am using pandas and have a dataframe that contains a list of sentences and the people who said them, like this:

 sentence                 person
 'hello world'              Matt
 'cake, delicious cake!'    Matt
 'lovely day'               Maria
 'i like cake'              Matt
 'a new day'                Maria
 'a new world'              Maria

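I would like to count, per person, the non-overlapping matches of regex strings (e.g. cake, world, day) in sentence. Note that a single row of sentence may contain more than one match (e.g. cake):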

person        'day'        'cake'       'world'
Matt            0            3             1
Maria           2            0             1

So far, I am doing this:

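# keep only the rows whose sentence contains at least one match
rows_cake = df[df['sentence'].str.contains(r"cake")]
# count how many such rows each person has
counts_cake = rows_cake['person'].value_counts()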

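However, str.contains only gives me the rows that contain cake, not the individual occurrences of cake.

I know I could use str.count(r"cake") on rows_cake. In practice, however, my dataframe is very large (> 10 million rows) and the regexes I use are quite complex, so I am looking for a more efficient solution if possible.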
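For reference, the str.count baseline mentioned above can be sketched as follows. This is only a minimal illustration on the sample data: the `patterns` dict and the names `per_row` and `counts` are made up for this example, and the simple literal patterns stand in for the more complex regexes.

```python
import pandas as pd

df = pd.DataFrame({
    "sentence": ["hello world", "cake, delicious cake!", "lovely day",
                 "i like cake", "a new day", "a new world"],
    "person": ["Matt", "Matt", "Maria", "Matt", "Maria", "Maria"],
})

# Hypothetical patterns; the real ones would be more complex regexes.
patterns = {"day": r"day", "cake": r"cake", "world": r"world"}

# Series.str.count gives the number of non-overlapping regex matches per row.
per_row = pd.DataFrame(
    {name: df["sentence"].str.count(pat) for name, pat in patterns.items()}
)

# Sum the per-row counts within each person.
counts = per_row.groupby(df["person"]).sum()
print(counts)
```

Each pattern scans the whole column once, so the cost grows with both the number of rows and the number of patterns.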

Answer 1

Maybe you should first get hold of the sentences themselves and then use re to run an optimized regex, like this:

for row in df.itertuples(index=False):
    # in this case row[0] is a sentence and row[1] is the person
    do_some_regex_stuff(row[0], row[1])

As far as I know, itertuples is quite fast (see note 1 here). So your only optimization problem is the regex itself.
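One possible way to fill in the placeholder `do_some_regex_stuff` is sketched below. It assumes the patterns are compiled once up front and that per-person totals are kept in a `Counter`; the helper `count_matches` and the `patterns` dict are hypothetical names introduced for this example, not part of the answer.

```python
import re
from collections import Counter, defaultdict

import pandas as pd

df = pd.DataFrame({
    "sentence": ["hello world", "cake, delicious cake!", "lovely day",
                 "i like cake", "a new day", "a new world"],
    "person": ["Matt", "Matt", "Maria", "Matt", "Maria", "Maria"],
})

# Hypothetical patterns, compiled once so the loop does not recompile them.
patterns = {"day": r"day", "cake": r"cake", "world": r"world"}
compiled = {name: re.compile(pat) for name, pat in patterns.items()}


def count_matches(frame, compiled_patterns):
    """Return {person: Counter({pattern_name: number_of_matches})}."""
    totals = defaultdict(Counter)
    rows = frame[["sentence", "person"]].itertuples(index=False, name=None)
    for sentence, person in rows:
        for name, pattern in compiled_patterns.items():
            # findall returns the non-overlapping matches; only their number matters.
            totals[person][name] += len(pattern.findall(sentence))
    return totals


totals = count_matches(df, compiled)
result = pd.DataFrame.from_dict(
    {person: dict(c) for person, c in totals.items()}, orient="index"
)
print(result)
```

Whether this beats the vectorized str.count on a large frame is something to measure rather than assume.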

Answer 2

I came up with a fairly simple solution, though I cannot claim that it is the fastest or the most efficient.

import pandas as pd
import numpy as np

# to be used with read_clipboard()
'''
sentence    person
'hello world'   Matt
'cake, delicious cake!' Matt
'lovely day'    Maria
'i like cake'   Matt
'a new day' Maria
'a new world'   Maria
'''

df = pd.read_clipboard()
# print(df)

Output:

                  sentence person
0            'hello world'   Matt
1  'cake, delicious cake!'   Matt
2             'lovely day'  Maria
3            'i like cake'   Matt
4              'a new day'  Maria
5            'a new world'  Maria


# if the list of keywords is fixed and relatively small
keywords = ['day', 'cake', 'world']

# for each keyword and each string, count the occurrences
for key in keywords:
    df[key] = [(len(val.split(key)) - 1) for val in df['sentence']]

# print(df)

Output:

                 sentence person  day  cake  world
0            'hello world'   Matt    0     0      1
1  'cake, delicious cake!'   Matt    0     2      0
2             'lovely day'  Maria    1     0      0
3            'i like cake'   Matt    0     1      0
4              'a new day'  Maria    1     0      0
5            'a new world'  Maria    0     0      1


# create a simple pivot with just the data that is needed
df_pivot = pd.pivot_table(
    df,
    values=['day', 'cake', 'world'],
    columns=['person'],
    aggfunc=np.sum,
).T

# print(df_pivot)

Final output:

        cake  day  world
person
Maria      0    2      1
Matt       3    0      1

Suggestions are welcome on whether this looks like a good approach, especially given the volume of data. Eager to learn.

Answer 3

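Since this mostly involves strings, I would suggest taking the computation out of Pandas; in most cases, Python is faster than Pandas when it comes to string manipulation: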

import pandas as pd

# read in data (columns are separated by two or more spaces)
df = pd.read_clipboard(sep=r'\s{2,}', engine='python')

#create a dictionary of persons and sentences : 
from collections import defaultdict, ChainMap
d = defaultdict(list)
for k,v in zip(df.person, df.sentence):
    d[k].append(v)


d = {k:",".join(v) for k,v in d.items()}

#search words
strings = ("cake", "world", "day")

#get count of words and create a dict
m = defaultdict(list)
for k,v in d.items():
    for st in strings:
        m[k].append({st:v.count(st)})

res = {k:dict(ChainMap(*v)) for k,v in m.items()}


print(res)
{'Matt': {'day': 0, 'world': 1, 'cake': 3},
 'Maria': {'day': 2, 'world': 1, 'cake': 0}}

output = pd.DataFrame(res).T

       day  world   cake
Matt    0     1     3
Maria   2     1     0
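The counting step above relies on the string method count, which looks for literal substrings. If the search terms are real regexes, one option is to join them into a single alternation with named groups, so each sentence is scanned once instead of once per pattern. The sketch below assumes that each pattern name is a valid Python identifier; `patterns`, `combined` and `sentences_by_person` are illustrative names, not part of the answer. Note that when patterns can match the same text, a span taken by one alternative is not counted for another, so the result may differ from counting each pattern separately.

```python
import re
from collections import Counter, defaultdict

# Sentences grouped per person, as built in the first step of the answer.
sentences_by_person = {
    "Matt": ["hello world", "cake, delicious cake!", "i like cake"],
    "Maria": ["lovely day", "a new day", "a new world"],
}

# Hypothetical patterns; names must be valid identifiers to be group names.
patterns = {"cake": r"cake", "world": r"world", "day": r"day"}

# One combined regex: (?P<cake>cake)|(?P<world>world)|(?P<day>day)
combined = re.compile(
    "|".join(f"(?P<{name}>{pat})" for name, pat in patterns.items())
)

res = defaultdict(Counter)
for person, texts in sentences_by_person.items():
    for text in texts:
        for match in combined.finditer(text):
            # lastgroup is the name of the alternative that produced this match.
            res[person][match.lastgroup] += 1

print({person: dict(counter) for person, counter in res.items()})
```

Scanning sentence by sentence, rather than joining them into one string, also avoids matches that would span two sentences.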

Test the speed and see which one is better. That would be useful to me and to others as well.
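A rough way to run such a comparison is sketched below with timeit. The frame size (the sample repeated 1000 times), the single literal pattern and the function names `with_str_count` and `with_python_loop` are arbitrary choices for illustration; timings on the real data and real regexes may look quite different.

```python
import timeit

import pandas as pd

base = pd.DataFrame({
    "sentence": ["hello world", "cake, delicious cake!", "lovely day",
                 "i like cake", "a new day", "a new world"],
    "person": ["Matt", "Matt", "Maria", "Matt", "Maria", "Maria"],
})

# Repeat the sample to get a slightly larger frame; the factor is arbitrary.
df = pd.concat([base] * 1000, ignore_index=True)


def with_str_count(frame):
    """Vectorized pandas version: regex count per row, then sum per person."""
    return frame["sentence"].str.count(r"cake").groupby(frame["person"]).sum()


def with_python_loop(frame):
    """Plain Python version: literal substring count in a list comprehension."""
    counts = [s.count("cake") for s in frame["sentence"]]
    return pd.Series(counts, index=frame.index).groupby(frame["person"]).sum()


runs = 5
for fn in (with_str_count, with_python_loop):
    seconds = timeit.timeit(lambda: fn(df), number=runs)
    print(f"{fn.__name__}: {seconds / runs:.4f} s per run")
```

It is worth checking that the candidates return the same counts before comparing their timings.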
