
How to extract strings from a list in a column in a python pandas dataframe?

Suppose I have a list

lst = ["fi", "ap", "ko", "co", "ex"]

and we have this Series

       Explanation 

a      "fi doesn't work correctly" 
b      "apples are cool" 
c      "this works but translation is ko" 

I am looking for something like this:

        Explanation                         Explanation Extracted

a      "fi doesn't work correctly"          "fi"
b      "apples are cool"                    "N/A"
c      "this works but translation is ko"   "ko"

With a dataframe like

df = pd.DataFrame(
    {"Explanation": ["fi doesn't co work correctly",
                     "apples are cool",
                     "this works but translation is ko"]},
    index=["a", "b", "c"]
)

you can use .str.extract():

lst = ["fi", "ap", "ko", "co", "ex"]

pattern = r"(?:^|\s+)(" + "|".join(lst) + r")(?:\s+|$)"
df["Explanation Extracted"] = df.Explanation.str.extract(pattern, expand=False)

to get

                        Explanation Explanation Extracted
a      fi doesn't co work correctly                    fi
b                   apples are cool                   NaN
c  this works but translation is ko                    ko
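
The question asks for the literal string "N/A" rather than NaN. A minimal sketch, assuming that placeholder is what you want, is to chain .fillna() onto the result (the name extracted is only for illustration):

```python
import pandas as pd

lst = ["fi", "ap", "ko", "co", "ex"]
df = pd.DataFrame(
    {"Explanation": ["fi doesn't co work correctly",
                     "apples are cool",
                     "this works but translation is ko"]},
    index=["a", "b", "c"],
)

pattern = r"(?:^|\s+)(" + "|".join(lst) + r")(?:\s+|$)"
# str.extract() yields NaN for rows without a match; fillna() swaps in the placeholder.
extracted = df["Explanation"].str.extract(pattern, expand=False).fillna("N/A")
df["Explanation Extracted"] = extracted
```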

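The regex pattern r"(?:^|\s+)(" + "|".join(lst) + r")(?:\s+|$)" looks for an occurrence of one of the lst items either at the beginning of the string followed by whitespace, in the middle with whitespace before and after, or at the end preceded by whitespace. str.extract() extracts the capture group (the part in the middle, inside the parentheses). If there is no match, the result is NaN.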
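
To make the three cases concrete, here is a small sketch that runs the same pattern through the standard re module on single strings. The helper first_match is a hypothetical name, and the re.escape() call is my addition (it only matters if a list item contains regex metacharacters):

```python
import re

lst = ["fi", "ap", "ko", "co", "ex"]
# re.escape() is an addition here: it keeps items with regex metacharacters literal.
pattern = r"(?:^|\s+)(" + "|".join(map(re.escape, lst)) + r")(?:\s+|$)"

def first_match(text):
    """Return the first captured item, or None when nothing matches."""
    m = re.search(pattern, text)
    return m.group(1) if m else None

start = first_match("fi doesn't work correctly")         # item at the start
middle = first_match("it says co in the middle")         # item between spaces
end = first_match("this works but translation is ko")    # item at the end
partial = first_match("apples are cool")                 # "ap" is only a prefix
```

Here start, middle and end pick up "fi", "co" and "ko", while partial stays None because "ap" is immediately followed by another letter.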

If you want to extract multiple matches, you could use .str.findall() and then ", ".join the results:

pattern = r"(?:^|\s+)(" + "|".join(lst) + r")(?:\s+|$)"
df["Explanation Extracted"] = (
    df.Explanation.str.findall(pattern).str.join(", ").replace({"": None})
)
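
One caveat worth checking against your own data: the pattern above consumes the whitespace around each match, so two list items that sit directly next to each other share a single space and findall reports only the first. A sketch of a variant that uses lookarounds instead (my substitution, not part of the answer above), which check the surrounding whitespace without consuming it:

```python
import re

lst = ["fi", "ap", "ko", "co", "ex"]
alternatives = "|".join(map(re.escape, lst))

consuming = r"(?:^|\s+)(" + alternatives + r")(?:\s+|$)"
# (?<!\S) = not preceded by a non-space; (?!\S) = not followed by a non-space.
lookaround = r"(?<!\S)(" + alternatives + r")(?!\S)"

text = "fi co work correctly"
with_consuming = re.findall(consuming, text)    # misses the adjacent "co"
with_lookaround = re.findall(lookaround, text)  # finds both items
```

The lookaround pattern can be passed to df.Explanation.str.findall() in the same way as the original one.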

An alternative without regex:

df.index = df.index.astype("category")
matches = df.Explanation.str.split().explode().loc[lambda s: s.isin(lst)]
df["Explanation Extracted"] = (
    matches.groupby(level=0).agg(set).str.join(", ").replace({"": None})
)

If you only want to match at the beginning or end of the sentences, replace the first part with:

df.index = df.index.astype("category")
splitted = df.Explanation.str.split()
matches = (
    (splitted.str[:1] + splitted.str[-1:]).explode().loc[lambda s: s.isin(lst)]
)
...

I think this solves your problem.

import pandas as pd

lst = ["fi", "ap", "ko", "co", "ex"]
df = pd.DataFrame(
    [["fi doesn't work correctly"],
     ["apples are cool"],
     ["this works but translation is ko"]],
    columns=["Explanation"],
)

extracted = []
for index, row in df.iterrows():
    # Keep only the space-separated words that appear in lst
    tempList = [val for val in row['Explanation'].split(" ") if val in lst]
    # Join multiple matches with commas; fall back to 'N/A' when there are none
    extracted.append(','.join(tempList) if tempList else 'N/A')

df['Explanation Extracted'] = extracted

The apply function of Pandas might be helpful

def extract_explanation(row):
    custom_substring = ["fi", "ap", "ko", "co", "ex"]
    # With axis=1 each call receives one row; the column name is case-sensitive
    substrings = row['Explanation'].split(" ")
    explanation = "N/A"
    for string in substrings:
        if string in custom_substring:
            explanation = string  # a later match overwrites an earlier one
    return explanation

df['Explanation Extracted'] = df.apply(extract_explanation, axis=1)

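The catch here is that it assumes only one explanation per row (if several words match, the last one wins); if multiple explanations are needed, the function can be changed to return a list.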
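
A minimal sketch of that list-returning version, written as a plain function on a single string so it can be tried without a dataframe (extract_all and its candidates parameter are hypothetical names, not part of the answer):

```python
lst = ["fi", "ap", "ko", "co", "ex"]

def extract_all(text, candidates=lst):
    """Return every whitespace-separated token of `text` found in `candidates`."""
    return [token for token in text.split() if token in candidates]

single = extract_all("this works but translation is ko")
several = extract_all("fi doesn't co work correctly")
nothing = extract_all("apples are cool")
```

On a dataframe it could be used as df['Explanation'].apply(extract_all); rows without a match then hold an empty list rather than "N/A".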

Option 1

Assuming one wants to extract the strings in the list lst, one can start by creating a regex

regex = f'\\b({"|".join(lst)})\\b'

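where \b is a word boundary (the beginning or end of a word), meaning that no word character (letter, digit, or underscore) directly precedes or follows the match. So, given that the string ap is in the list lst, the word apple in the dataframe will not be matched.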
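
To see what \b does and does not treat as a boundary, here is a small sketch using the re module directly (first_word is a hypothetical helper). Note that punctuation counts as a boundary, so "ko." still matches, which a purely whitespace-based pattern would not do:

```python
import re

lst = ["fi", "ap", "ko", "co", "ex"]
regex = f'\\b({"|".join(lst)})\\b'

def first_word(text):
    """Return the first list item that appears as a whole word, else None."""
    m = re.search(regex, text, flags=re.IGNORECASE)
    return m.group(1) if m else None

inside_word = first_word("apples are cool")            # "ap" is followed by "p"
with_period = first_word("translation is ko.")         # "." is a non-word character
upper_case = first_word("FI doesn't work correctly")   # matched case-insensitively
```

With re.IGNORECASE the returned text keeps the casing found in the sentence, so upper_case holds "FI" rather than "fi".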

Then, use pandas.Series.str.extract, and, to make it case-insensitive, pass re.IGNORECASE

import re

df['Explanation Extracted'] = df['Explanation'].str.extract(regex, flags=re.IGNORECASE, expand=False)

[Out]:
   ID                       Explanation Explanation Extracted
0   1         fi doesn't work correctly                    fi
1   2                 cap ples are cool                   NaN
2   3  this works but translation is ko                    ko

Option 2

One can also use pandas.Series.apply with a custom lambda function.

df['Explanation Extracted'] = df['Explanation'].apply(lambda x: next((i for i in lst if i.lower() in x.lower().split()), 'N/A'))

[Out]:
   ID                       Explanation Explanation Extracted
0   1         fi doesn't work correctly                    fi
1   2                 cap ples are cool                   N/A
2   3  this works but translation is ko                    ko

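Notes:

  • .lower() is there to make the comparison case-insensitive.

  • .split() is one way to make sure that the string apple does not produce a match in the Explanation Extracted column, even though ap is in the list.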
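
One more detail about the lambda in Option 2: it iterates over lst, not over the words of the sentence, so when several items are present it returns whichever comes first in lst. A sketch with the lambda pulled out into a named function (first_in_list_order is a hypothetical name):

```python
lst = ["fi", "ap", "ko", "co", "ex"]

def first_in_list_order(text, candidates=lst):
    """Scan `candidates` in order and return the first one found among the words."""
    words = text.lower().split()
    return next((c for c in candidates if c.lower() in words), "N/A")

result = first_in_list_order("co comes before fi here")  # "fi" precedes "co" in lst
no_match = first_in_list_order("apples are cool")
```

If the first match in sentence order is needed instead, the regex in Option 1 behaves that way.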
