How to extract keywords (strings) from a list in a column in a Python pandas DataFrame?
Suppose I have a list:
lst = ["fi", "ap", "ko", "co", "ex"]
and this Series:
   Explanation
a  "fi doesn't work correctly"
b  "apples are cool"
c  "this works but translation is ko"
I am looking for something like this:
   Explanation                          Explanation Extracted
a  "fi doesn't work correctly"          "fi"
b  "apples are cool"                    "N/A"
c  "this works but translation is ko"   "ko"
With a DataFrame like

import pandas as pd

df = pd.DataFrame(
    {"Explanation": ["fi doesn't co work correctly",
                     "apples are cool",
                     "this works but translation is ko"]},
    index=["a", "b", "c"]
)
you can use .str.extract():
lst = ["fi", "ap", "ko", "co", "ex"]
pattern = r"(?:^|\s+)(" + "|".join(lst) + r")(?:\s+|$)"
df["Explanation Extracted"] = df.Explanation.str.extract(pattern, expand=False)
This gives:
   Explanation                       Explanation Extracted
a  fi doesn't co work correctly      fi
b  apples are cool                   NaN
c  this works but translation is ko  ko
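The regex pattern r"(?:^|\s+)(" + "|".join(lst) + r")(?:\s+|$)" looks for an item of lst in one of three positions: at the beginning of the string followed by whitespace, in the middle with whitespace before and after, or at the end preceded by whitespace. str.extract() returns the capture group (the part inside the middle pair of parentheses) of the first match. Where nothing matches, the result is NaN.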
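To see what the pattern accepts and rejects without building a DataFrame, here is a minimal sketch using the standard re module. The helper name first_keyword is made up for illustration; it is meant to mirror how .str.extract() returns the first match's capture group for each row, with None standing in for NaN.

```python
import re

lst = ["fi", "ap", "ko", "co", "ex"]
pattern = r"(?:^|\s+)(" + "|".join(lst) + r")(?:\s+|$)"

def first_keyword(text):
    """Hypothetical helper: capture group of the first match, or None."""
    match = re.search(pattern, text)
    return match.group(1) if match else None

print(first_keyword("fi doesn't co work correctly"))      # fi
print(first_keyword("apples are cool"))                   # None ("ap" is not a whole word)
print(first_keyword("this works but translation is ko"))  # ko
```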
If you want to extract multiple matches, you can use .str.findall() and then ", ".join the results:
pattern = r"(?:^|\s+)(" + "|".join(lst) + r")(?:\s+|$)"
df["Explanation Extracted"] = (
    df.Explanation.str.findall(pattern).str.join(", ").replace({"": None})
)
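One caveat with findall: because the pattern consumes the whitespace around each hit, two keywords separated by a single space cannot both match, since the space after the first keyword is already used up. The sketch below compares the pattern above with a lookaround variant. The lookaround pattern and the re.escape call are my additions, not part of the original answer, and the comparison uses re.findall directly, which is what .str.findall() applies to each element.

```python
import re

lst = ["fi", "ap", "ko", "co", "ex"]
alternatives = "|".join(map(re.escape, lst))

# Pattern from the answer: surrounding whitespace is part of the match.
consuming = r"(?:^|\s+)(" + alternatives + r")(?:\s+|$)"
# Assumed variant: lookarounds only check that no non-whitespace
# character touches the keyword, without consuming anything.
lookaround = r"(?<!\S)(" + alternatives + r")(?!\S)"

text = "fi co work correctly"
consuming_hits = re.findall(consuming, text)
lookaround_hits = re.findall(lookaround, text)

print(consuming_hits)   # ['fi']
print(lookaround_hits)  # ['fi', 'co']
```

If adjacent keywords can occur in your data, the lookaround string can be passed to .str.findall() in place of pattern.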
An alternative without regex:
df.index = df.index.astype("category")
matches = df.Explanation.str.split().explode().loc[lambda s: s.isin(lst)]
df["Explanation Extracted"] = (
    matches.groupby(level=0).agg(set).str.join(", ").replace({"": None})
)
If you only want to match at the beginning or end of the sentence, replace the first part with:
df.index = df.index.astype("category")
splitted = df.Explanation.str.split()
matches = (
    (splitted.str[:1] + splitted.str[-1:]).explode().loc[lambda s: s.isin(lst)]
)
...
I think this solves your problem.
import pandas as pd

lst = ["fi", "ap", "ko", "co", "ex"]
df = pd.DataFrame(
    [["fi doesn't work correctly"], ["apples are cool"], ["this works but translation is ko"]],
    columns=["Explanation"],
)

extracted = []
for index, row in df.iterrows():
    tempList = []
    rowSplit = row['Explanation'].split(" ")
    # Collect every whole word of the row that appears in lst
    for val in rowSplit:
        if val in lst:
            tempList.append(val)
    if len(tempList) > 0:
        extracted.append(','.join(tempList))
    else:
        extracted.append('N/A')
df['Explanation Extracted'] = extracted
The pandas apply function might be helpful:
def extract_explanation(dataframe):
    # With axis=1, `dataframe` is a single row of the DataFrame
    custom_substring = ["fi", "ap", "ko", "co", "ex"]
    substrings = dataframe['Explanation'].split(" ")
    explanation = "N/A"
    for string in substrings:
        if string in custom_substring:
            # Overwritten on every hit, so the last matching word wins
            explanation = string
    return explanation

df['Explanation Extracted'] = df.apply(extract_explanation, axis=1)
The catch is that this assumes there is only one keyword per explanation. If several are needed, it can be changed to return a list.
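A minimal sketch of that list-returning variant, written as a plain function on one string so it can be tried without pandas. The names extract_all and custom_substrings are placeholders of mine, not from the answer.

```python
custom_substrings = ["fi", "ap", "ko", "co", "ex"]

def extract_all(text):
    """Return every whitespace-separated word of `text` found in the keyword list."""
    return [word for word in text.split() if word in custom_substrings]

print(extract_all("fi doesn't co work correctly"))  # ['fi', 'co']
print(extract_all("apples are cool"))               # []
```

Assuming the DataFrame from the answer, it would be applied per cell with df['Explanation'].apply(extract_all), which avoids the row-wise axis=1 call.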
Option 1
Assuming you want to extract the strings contained in the list lst, you can start by building a regex
regex = f'\\b({"|".join(lst)})\\b'
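where \b is a word boundary: the position at the start or end of a word, where a word character meets a non-word character or the edge of the string. So although the string ap is in lst, the word apple in the DataFrame will not be matched.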
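The effect of \b can be checked directly with the re module. The sample sentences below are invented for illustration. The last one shows a difference from whitespace-based matching: punctuation also counts as a boundary, so co is found inside co-op.

```python
import re

lst = ["fi", "ap", "ko", "co", "ex"]
regex = f'\\b({"|".join(lst)})\\b'

inside_word = re.findall(regex, "apples are cool")
standalone = re.findall(regex, "an AP is fine", flags=re.IGNORECASE)
hyphenated = re.findall(regex, "the co-op is open")

print(inside_word)  # []      "ap" inside "apples" is not a whole word
print(standalone)   # ['AP']  case-insensitive whole-word match
print(hyphenated)   # ['co']  the hyphen counts as a word boundary
```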
Then use pandas.Series.str.extract, passing re.IGNORECASE to make it case insensitive:
import re
df['Explanation Extracted'] = df['Explanation'].str.extract(regex, flags=re.IGNORECASE, expand=False)
[Out]:
   ID  Explanation                       Explanation Extracted
0  1   fi doesn't work correctly         fi
1  2   cap ples are cool                 NaN
2  3   this works but translation is ko  ko
Option 2
You can also use pandas.Series.apply with a custom lambda function.
df['Explanation Extracted'] = df['Explanation'].apply(lambda x: next((i for i in lst if i.lower() in x.lower().split()), 'N/A'))
[Out]:
   ID  Explanation                       Explanation Extracted
0  1   fi doesn't work correctly         fi
1  2   cap ples are cool                 N/A
2  3   this works but translation is ko  ko
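Notes:
.lower() makes the comparison case insensitive.
.split() ensures that only whole words are compared, so the string apple does not end up in the Explanation Extracted column even though ap is in the list.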
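The lambda from Option 2 can be unpacked into a named function to make its behaviour easier to inspect. This is only a sketch: first_listed_keyword is a name I made up, and the sample strings are invented. Note that the generator iterates over lst, so when several keywords are present the result follows the order of lst, not the order of the words in the sentence.

```python
lst = ["fi", "ap", "ko", "co", "ex"]

def first_listed_keyword(text, default="N/A"):
    """Same logic as the Option 2 lambda, spelled out step by step."""
    words = text.lower().split()  # whole words only, lower-cased
    return next((kw for kw in lst if kw.lower() in words), default)

print(first_listed_keyword("Apples are cool"))  # N/A  ("ap" is only a substring)
print(first_listed_keyword("AP works"))         # ap   (case-insensitive match)
print(first_listed_keyword("ko then fi"))       # fi   ("fi" comes first in lst)
```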