How to apply a regex function to a dataframe column to return a value

Question

I am trying to apply a regex function to a column of a dataframe to determine gender pronouns. This is what my dataframe looks like:

    name                                            Descrip
0  Sarah           she doesn't like this because her mum...
1  David                 he does like it because his dad...
2    Sam  they generally don't like it because their par...

This is the code I ran to build the dataframe:

import pandas as pd

list_label = ["Sarah", "David", "Sam"]
list_descriptions = ["she doesn't like this because her mum...", "he does like it because his dad...", "they generally don't like it because their parent..."]

data3 = {'name':list_label, 'Descrip':list_descriptions}
test_df = pd.DataFrame(data3)

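I am trying to determine the person's gender by applying a regex function to the "Descrip" column. Specifically, these are the patterns I want to implement: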

"male":"(he |his |him )",
"female":"(she |her |hers )",
"plural, or singular non-binary":"(they |them |their )"

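The full code I wrote is below.

The function tries to match each pattern and returns the name of the gender pronoun mentioned most often in the row's description. Each gender has several keywords in its pattern string (for example he, she, they). The idea is to determine max_gender, the gender associated with the pattern group mentioned most often in the value of the Descrip column. max_gender can therefore take one of three values: male | female | plural, or singular non-binary. If no pattern is found anywhere in the row's description, "unknown" is returned.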

import re
def get_pronouns(text):
    patterns = {
        "male":"(he |his |him )",
        "female":"(she |her |hers )",
        "plural, or singular non-binary":"(they |them |their )"
    }
    max_gender = "unknown"
    max_gender_count = 0
    for gender in patterns:
        pattern = re.compile(gender)
        mentions = re.findall(pattern, text)
        count_mentions = len(mentions)
        if count_mentions > max_gender_count:
            max_gender_count = count_mentions
            max_gender = gender
    return max_gender

test_df["pronoun"] = test_df.loc[:, "Descrip"].apply(get_pronouns)
print(test_df)

However, when I run the code, it clearly fails to determine the gender pronoun, as the following output shows:

    name                                            Descrip  pronoun
0  Sarah           she doesn't like this because her mum...  unknown
1  David                 he does like it because his dad...  unknown
2    Sam  they generally don't like it because their par...  unknown

Does anyone know what is wrong with my code?

Answer 1

If you want to discover why your code does not work, add a print statement to the function, like this:

    for gender in patterns:
        print(gender)
        pattern = re.compile(gender)
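
What that print reveals can be sketched as follows. This is a minimal illustration using the dictionary from the question; the names buggy_counts and fixed_counts are made up for the example. Iterating over a dict yields its keys, so re.compile receives the label (for example "male") rather than the pattern. Iterating over patterns.items() is one possible way to get at the pattern itself.

```python
import re

patterns = {
    "male": "(he |his |him )",
    "female": "(she |her |hers )",
    "plural, or singular non-binary": "(they |them |their )",
}

text = "he does like it because his dad..."

# As in the question: the loop variable is the dict key, so the label is compiled.
buggy_counts = {gender: len(re.findall(re.compile(gender), text)) for gender in patterns}

# One possible fix: unpack key and value so the actual pattern is used.
fixed_counts = {gender: len(re.findall(pattern, text)) for gender, pattern in patterns.items()}

print(buggy_counts)  # every count is 0, hence "unknown"
print(fixed_counts)  # the "male" pattern now matches
```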

Your regexes also need some adjustment. For example, in the first line of Pink Floyd's song Breathe, "Breathe, breathe in the air", your regex finds two male pronouns.

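There may be other problems as well; I am not sure.

Here is a solution that is very similar to yours. The regexes are fixed, the dictionary is replaced with a list of tuples, and so on.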

Solution code

import pandas as pd
import numpy as np
import re
import operator as op

names_list = ['Sarah', 'David', 'Sam']
descs_list = ["she doesn't like this because her mum...", 'he does like it because his dad...',
              "they generally don't like it because their parent..."]

df_1 = pd.DataFrame(data=zip(names_list, descs_list), columns=['Name', 'Desc'])

pronoun_re_list = [('male', re.compile(r"\b(?:he|his|him)\b", re.IGNORECASE)),
                   ('female', re.compile(r"\b(?:she|her|hers)\b", re.IGNORECASE)),
                   ('plural/nb', re.compile(r"\b(?:they|them|their)\b", re.IGNORECASE))]


def detect_pronouns(str_in: str) -> str:
    match_results = ((curr_pron, len(curr_patt.findall(str_in))) for curr_pron, curr_patt in pronoun_re_list)
    max_pron, max_counts = max(match_results, key=op.itemgetter(1))
    if max_counts == 0:
        return np.nan  # np.NaN was removed in NumPy 2.0; np.nan works in all versions
    else:
        return max_pron


df_1['Pronouns'] = df_1['Desc'].map(detect_pronouns)

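Explanation

Code

match_results is a generator expression. curr_pron stands for "current pronoun" and curr_patt for "current pattern". It may be clearer if I rewrite it as a for loop that builds a list: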

    match_results = []
    for curr_pron, curr_patt in pronoun_re_list:
        match_counts = len(curr_patt.findall(str_in))
        match_results.append((curr_pron, match_counts))
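
How max then picks the winner from those (pronoun, count) pairs can be sketched like this. The counts below are hand-written for illustration, not produced by the regexes, and the name tied is an assumption of this example. op.itemgetter(1) tells max to compare the pairs by their second element, and on a tie max returns the first maximal pair it encounters.

```python
import operator as op

# Hypothetical counts, in the same order as pronoun_re_list.
match_results = [("male", 0), ("female", 2), ("plural/nb", 1)]

# Compare the tuples by their count (index 1), not by their label.
max_pron, max_counts = max(match_results, key=op.itemgetter(1))
print(max_pron, max_counts)

# On a tie, max keeps the first maximal element it sees.
tied = [("male", 1), ("female", 1), ("plural/nb", 0)]
print(max(tied, key=op.itemgetter(1)))
```

Because of that tie rule, the order of the entries in the list decides which label is returned when two pronoun groups are mentioned equally often.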

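for curr_pron, curr_patt in ... makes use of something that goes by a few different names, usually multiple assignment or tuple unpacking. You can find a nice article on it here. In this case, it is just another way of writing: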

    for curr_tuple in pronoun_re_list:
        curr_pron = curr_tuple[0]
        curr_patt = curr_tuple[1]

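Regex

Time for everyone's favorite subject: regex! I use a great site called RegEx101 where you can play around with patterns, and it makes things much easier to understand. I have set up a page with some test data and the regexes covered below: https://regex101.com/r/Y1onRC/2 .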

Now, let's look at the regex I used: \b(?:he|his|him)\b .

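The he|his|him part is exactly the same as yours; it matches the word "he", "his", or "him". In your regex it is surrounded by parentheses; mine also includes ?: right after the opening parenthesis. (pattern stuff) is a capturing group which, as the name suggests, captures whatever it matches. Since here we do not actually care what was matched, only whether there was a match, we add ?: to create a non-capturing group that does not capture (or save) the content.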
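
The difference between the two kinds of group can be sketched with the standard re module. The sample string and variable names here are made up for the illustration.

```python
import re

capturing = re.compile(r"(he|his|him)")
non_capturing = re.compile(r"(?:he|his|him)")

sample = "his dad"

cap_match = capturing.search(sample)
non_match = non_capturing.search(sample)

# Pattern.groups is the number of capturing groups in the pattern.
print(capturing.groups, cap_match.group(0), cap_match.groups())
print(non_capturing.groups, non_match.group(0), non_match.groups())
```

Both patterns find the same text; only the capturing version also stores it as group 1.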

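I said the he|his|him part of the regex is the same as yours, but that is not quite true. You included a space after each pronoun, presumably to keep it from matching the letters he in the middle of a word. Unfortunately, as I mentioned above, it finds two matches in the sentence "Breathe, breathe in the air". Our savior is \b, which matches a word boundary. This means we catch the he in "Words words words he.", whereas (he |his |him ) does not.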
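
A small sketch comparing the two patterns on the sentences mentioned here, using only the standard re module. The names loose and strict are chosen for this example.

```python
import re

loose = re.compile(r"(he |his |him )")        # trailing-space version from the question
strict = re.compile(r"\b(?:he|his|him)\b")    # word-boundary version

lyric = "Breathe, breathe in the air"
sentence = "Words words words he."

# The loose pattern hits the "he " inside "breathe " and inside "the ".
print(loose.findall(lyric))
print(strict.findall(lyric))

# The loose pattern misses "he" before the full stop; \b does not need a space.
print(loose.findall(sentence))
print(strict.findall(sentence))
```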

Finally, we compile the pattern with the re.IGNORECASE flag, which I do not think needs much explanation, but let me know if I am wrong.
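
For completeness, a short sketch of what the flag changes, on a made-up sentence where a pronoun starts the sentence with a capital letter:

```python
import re

pattern = r"\b(?:he|his|him)\b"
text = "He said his name; then he left."

case_sensitive = re.findall(pattern, text)
case_insensitive = re.findall(pattern, text, flags=re.IGNORECASE)

print(case_sensitive)    # misses the capitalised "He"
print(case_insensitive)  # counts it as well
```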

Here is how I would describe the two patterns in plain English:

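(he |his |him ) matches the letters he followed by a space, his followed by a space, or him followed by a space, and returns the full match plus one group.
\b(?:he|his|him)\b with the re.IGNORECASE flag matches the words he, his, or him, regardless of case, and returns the full match.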

I hope that is clear enough; let me know!

Resulting output

    Name    Desc                                                  Pronouns
--  ------  ----------------------------------------------------  ----------
 0  Sarah   she doesn't like this because her mum...              female
 1  David   he does like it because his dad...                    male
 2  Sam     they generally don't like it because their parent...  plural/nb

If you have any questions, let me know :)

Answer 2

Try this, it should work:

def apply_pronouns_to_col(hm):
    dff = {"male": [r'^he', r' he$', r'.* he .*', r'^his', r' his$', r'.* his .*',],
           "female": [r'^she', r' she$', r'.* she .*', r'^her', r' her$', r'.* her .*', r'^hers', r' hers$', r'.* hers .*'],
           "plural": [r'^they', r' they$', r'.* they .*', r'^them', r' them$', r'.* them .*', r'^their', r' their$', r'.* their .*']}
    strmatch = pd.Series(hm)
    for key, words in dff.items():
        for word in words:
            if strmatch.str.match(word)[0]:
                return key

df['pronoun'] = df.Descrip.apply(apply_pronouns_to_col)

df
#Out[2009]: 
#    name                                     Descrip pronoun
#0  Sarah           she doesn't like this because her  female
#1  David                 he does like it because his    male
#2    Sam  they generally don't like it because their   plural

Answer 1: score 2, accepted, 2019-11-22 00:38:47

Answer 2: score -1, 2019-11-22 00:32:37
