How to loop through pandas df column, finding if string contains any string from a separate pandas df column?
I have two pandas DataFrames in Python. DF A contains a column of sentence-length strings.
|---------------------|------------------|
| sentenceCol | other column |
|---------------------|------------------|
|'this is from france'| 15 |
|---------------------|------------------|
DF B contains a column with a list of countries.
|---------------------|------------------|
| country | other column |
|---------------------|------------------|
|'france' | 33 |
|---------------------|------------------|
|'spain' | 34 |
|---------------------|------------------|
How can I loop through DF A and assign the country that each string contains? This is how I imagine DF A would look after the assignment:
|---------------------|------------------|-----------|
| sentenceCol | other column | country |
|---------------------|------------------|-----------|
|'this is from france'| 15 | 'france' |
|---------------------|------------------|-----------|
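An additional complication is that a sentence can contain more than one country, so ideally every applicable country would be assigned to that sentence: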
|-------------------------------|------------------|-----------|
| sentenceCol | other column | country |
|-------------------------------|------------------|-----------|
|'this is from france and spain'| 16 | 'france' |
|-------------------------------|------------------|-----------|
|'this is from france and spain'| 16 | 'spain' |
|-------------------------------|------------------|-----------|
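No loop is needed here. Looping over a DataFrame is slow, and pandas or numpy already has an optimized method for almost every problem of this kind.
In this case, for your first question, you are looking for Series.str.extract: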
dfa['country'] = dfa['sentenceCol'].str.extract(f"({'|'.join(dfb['country'])})")
sentenceCol other column country
0 this is from france 15 france
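If the country names could contain regex metacharacters such as . or +, or should only match whole words, the pattern can be built more defensively. This is a minimal sketch rather than part of the original answer: the pattern variable and the extra non-matching sentence are illustrative, and the word boundaries assume every country name begins and ends with a word character.

```python
import re

import pandas as pd

dfa = pd.DataFrame({'sentenceCol': ['this is from france', 'nothing here'],
                    'other column': [15, 20]})
dfb = pd.DataFrame({'country': ['france', 'spain']})

# Escape each name so it is matched literally, and wrap the alternation
# in word boundaries so "france" does not match inside a longer word.
pattern = r'\b(' + '|'.join(map(re.escape, dfb['country'])) + r')\b'

# expand=False returns a Series; rows without a match get NaN.
dfa['country'] = dfa['sentenceCol'].str.extract(pattern, expand=False)
print(dfa)
```

Sentences that mention no listed country end up with NaN in the country column instead of raising an error.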
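For your second question, you need Series.str.extractall together with DataFrame.drop_duplicates and to_numpy. This relies on the sentence rows in dfa already being duplicated, as in the sample data at the end of this answer: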
dfa['country'] = (
dfa['sentenceCol'].str.extractall(f"({'|'.join(dfb['country'])})")
.drop_duplicates()
.to_numpy()
)
sentenceCol other column country
0 this is from france and spain 15 france
1 this is from france and spain 15 spain
Edit
Or, if your sentenceCol is not duplicated, we have to put the extracted values on a single row. We use GroupBy.agg:
dfa['country'] = (
dfa['sentenceCol'].str.extractall(f"({'|'.join(dfb['country'])})")
.groupby(level=0)
.agg(', '.join)
.to_numpy()
)
sentenceCol other column country
0 this is from france and spain 15 france, spain
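Converting with to_numpy requires the number of extracted rows to equal the number of rows in dfa, which breaks as soon as one sentence contains no country. A hedged alternative, assuming the same column names, is to assign the aggregated Series directly so that pandas aligns it on the index. The second sample sentence and the names pattern, matches and joined are illustrative additions:

```python
import pandas as pd

dfa = pd.DataFrame({'sentenceCol': ['this is from france and spain',
                                    'no country mentioned'],
                    'other column': [15, 16]})
dfb = pd.DataFrame({'country': ['france', 'spain']})

pattern = f"({'|'.join(dfb['country'])})"

# extractall returns one row per match, indexed by (original row, match number).
# Column 0 holds the single capture group.
matches = dfa['sentenceCol'].str.extractall(pattern)[0]

# Join all matches that belong to the same original row.
joined = matches.groupby(level=0).agg(', '.join)

# Assigning a Series aligns on the index, so unmatched rows become NaN.
dfa['country'] = joined
print(dfa)
```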
Edit 2
To duplicate the original rows, we join the dataframe to our extraction:
extraction = (
dfa['sentenceCol'].str.extractall(f"({'|'.join(dfb['country'])})")
.rename(columns={0: 'country'})
)
dfa = extraction.droplevel(1).join(dfa).reset_index(drop=True)
country sentenceCol other column
0 france this is from france and spain 15
1 spain this is from france and spain 15
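Another way to get one row per country, sketched here as an alternative rather than as the answer's own method, is Series.str.findall followed by DataFrame.explode (available from pandas 0.25). The sample rows and the result variable are illustrative:

```python
import pandas as pd

dfa = pd.DataFrame({'sentenceCol': ['this is from france and spain',
                                    'this is from france'],
                    'other column': [16, 15]})
dfb = pd.DataFrame({'country': ['france', 'spain']})

pattern = '|'.join(dfb['country'])

# findall returns a list with every match found in each sentence.
dfa['country'] = dfa['sentenceCol'].str.findall(pattern)

# explode turns each list element into its own row and repeats the other columns.
result = dfa.explode('country').reset_index(drop=True)
print(result)
```

This keeps the original column order, and a sentence with no match stays in the result as a single row with NaN in the country column.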
Dataframes used:
import pandas as pd

dfa = pd.DataFrame({'sentenceCol':['this is from france and spain']*2,
                    'other column':[15]*2})
dfb = pd.DataFrame({'country':['france', 'spain']})
You can iterate over a dataframe with the iterrows() method. You could try this:
import pandas as pd

# Dataframes definition
df_1 = pd.DataFrame({"sentence": ["this is from france and spain", "this is from france", "this is from germany"], "other": [15, 12, 33]})
df_2 = pd.DataFrame({"country": ["spain", "france", "germany"], "other_column": [7, 7, 8]})
# Create the new dataframe
df_3 = pd.DataFrame(columns=["sentence", "other_column", "country"])
count = 0
# Iterate through the dataframes, first through the country dataframe and inside through the sentence one.
for index, row in df_2.iterrows():
    country = row.country
    for index_2, row_2 in df_1.iterrows():
        # Plain substring test: the country name appears anywhere in the sentence
        if country in row_2.sentence:
            df_3.loc[count] = (row_2.sentence, row_2.other, country)
            count += 1
So the output is:
sentence other_column country
0 this is from france and spain 15 spain
1 this is from france and spain 15 france
2 this is from france 12 france
3 this is from germany 33 germany
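Growing a DataFrame one row at a time with .loc gets slow as the result grows. A possible variant of the same nested loop, offered as a sketch, collects the matches in a plain list and builds the DataFrame once at the end. The rows list is an illustrative name, and itertuples is swapped in for iterrows:

```python
import pandas as pd

df_1 = pd.DataFrame({"sentence": ["this is from france and spain",
                                  "this is from france",
                                  "this is from germany"],
                     "other": [15, 12, 33]})
df_2 = pd.DataFrame({"country": ["spain", "france", "germany"],
                     "other_column": [7, 7, 8]})

# Collect the matching rows in a plain Python list first.
rows = []
for country in df_2["country"]:
    for row in df_1.itertuples(index=False):
        if country in row.sentence:
            rows.append((row.sentence, row.other, country))

# Build the result in one step instead of appending row by row.
df_3 = pd.DataFrame(rows, columns=["sentence", "other_column", "country"])
print(df_3)
```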