How to iterate over a pandas DataFrame column and check whether each string contains any string from a separate pandas DataFrame column?

Question

I have two pandas DataFrames in Python. DF A contains a column of sentence-length strings.

|---------------------|------------------|
|        sentenceCol  |    other column  |
|---------------------|------------------|
|'this is from france'|         15       |
|---------------------|------------------|

DF B contains a column listing countries.

|---------------------|------------------|
|        country      |    other column  |
|---------------------|------------------|
|'france'             |         33       |
|---------------------|------------------|
|'spain'              |         34       |
|---------------------|------------------|

How can I iterate over DF A and assign the country that each string contains? This is how I imagine DF A looking after the assignment:

|---------------------|------------------|-----------|
|        sentenceCol  |    other column  | country   |
|---------------------|------------------|-----------|
|'this is from france'|         15       |  'france' |
|---------------------|------------------|-----------|

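A further complication is that a sentence can mention more than one country, so ideally every applicable country would be assigned to that sentence.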

|-------------------------------|------------------|-----------|
|        sentenceCol            |    other column  | country   |
|-------------------------------|------------------|-----------|
|'this is from france and spain'|         16       |  'france' |
|-------------------------------|------------------|-----------|
|'this is from france and spain'|         16       |  'spain'  |
|-------------------------------|------------------|-----------|

Answer 1

No loop is needed here. Looping over a DataFrame is slow, and pandas and numpy offer optimized methods for almost every problem.

In this case, for your first question, you are looking for Series.str.extract:

dfa['country'] = dfa['sentenceCol'].str.extract(f"({'|'.join(dfb['country'])})")

           sentenceCol  other column country
0  this is from france            15  france
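
One caveat, offered as a minimal sketch rather than part of the original answer: the pattern above joins the country names as-is, so a name containing regex metacharacters would be misread. The version below assumes the same column names, escapes each name with re.escape, and passes expand=False so that str.extract returns a Series. The second sentence is made up to show that rows without a match get NaN.

```python
import re

import pandas as pd

dfa = pd.DataFrame({
    "sentenceCol": ["this is from france", "no country here"],  # second row is illustrative
    "other column": [15, 20],
})
dfb = pd.DataFrame({"country": ["france", "spain"]})

# One alternation pattern with a single capture group: (france|spain)
pattern = "(" + "|".join(re.escape(c) for c in dfb["country"]) + ")"

# expand=False returns a Series holding the first match per row, NaN if none
dfa["country"] = dfa["sentenceCol"].str.extract(pattern, expand=False)
print(dfa)
```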

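For your second question, you need Series.str.extractall, followed by drop_duplicates and to_numpy. Note that the assignment below only works when the number of distinct matches equals the number of rows in dfa, as it does with the sample data at the end of this answer: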

dfa['country'] = (
    dfa['sentenceCol'].str.extractall(f"({'|'.join(dfb['country'])})")
        .drop_duplicates()
        .to_numpy()
)

                     sentenceCol  other column country
0  this is from france and spain            15  france
1  this is from france and spain            15   spain
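
To see what extractall returns before it is reshaped, here is a small sketch on a made-up Series (the sentences are illustrative, not from the question). The result is a DataFrame with a two-level index: the original row label and a match counter.

```python
import pandas as pd

sentences = pd.Series(["this is from france and spain", "this is from spain"])

# One row per match; column 0 holds the single capture group
matches = sentences.str.extractall(r"(france|spain)")

# Index levels: (original row label, match number within that row)
print(matches)
print(matches.index.names)
```

This index layout is what makes the later steps possible: grouping on level 0 collects all matches of one sentence, and dropping level 1 leaves the original row labels.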

Edit

Alternatively, if your sentenceCol has no duplicates, the extracted values have to be combined into a single row. For that we use GroupBy.agg:

dfa['country'] = (
    dfa['sentenceCol'].str.extractall(f"({'|'.join(dfb['country'])})")
        .groupby(level=0)
        .agg(', '.join)
        .to_numpy()
)

                     sentenceCol  other column        country
0  this is from france and spain            15  france, spain
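
The to_numpy() assignment above relies on every row having at least one match. As a hedged variant, assuming the same column names and some made-up extra sentences, the aggregated Series can be assigned directly so that pandas aligns it on the index and leaves unmatched rows as NaN:

```python
import pandas as pd

dfa = pd.DataFrame({
    "sentenceCol": [
        "this is from france and spain",
        "nothing to see",  # illustrative row with no country
        "only spain",
    ],
    "other column": [15, 16, 17],
})
dfb = pd.DataFrame({"country": ["france", "spain"]})

pattern = f"({'|'.join(dfb['country'])})"

joined = (
    dfa["sentenceCol"].str.extractall(pattern)[0]  # column 0 is the capture group
    .groupby(level=0)                              # group by original row label
    .agg(", ".join)
)

# Assigning a Series aligns on the index, so row 1 (no match) becomes NaN
dfa["country"] = joined
print(dfa)
```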

Edit 2

To duplicate the original rows, one per matched country, we join the dataframe onto our extraction:

extraction = (
    dfa['sentenceCol'].str.extractall(f"({'|'.join(dfb['country'])})")
        .rename(columns={0: 'country'})
)

dfa = extraction.droplevel(1).join(dfa).reset_index(drop=True)

  country                    sentenceCol  other column
0  france  this is from france and spain            15
1   spain  this is from france and spain            15
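
A different route to the same one-row-per-country layout, not the method used in this answer: Series.str.findall collects every match into a list, and DataFrame.explode turns each list element into its own row. This is a sketch under the assumption that the column names match the question; the second sentence is added for illustration.

```python
import pandas as pd

dfa = pd.DataFrame({
    "sentenceCol": ["this is from france and spain", "this is from france"],
    "other column": [16, 15],
})
dfb = pd.DataFrame({"country": ["france", "spain"]})

pattern = "|".join(dfb["country"])

result = (
    dfa.assign(country=dfa["sentenceCol"].str.findall(pattern))  # list of matches per row
       .explode("country")                                       # one row per list element
       .reset_index(drop=True)
)
print(result)
```

Unlike the join above, this keeps the original column order and does not need the index level to be dropped.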

DataFrames used:

import pandas as pd

dfa = pd.DataFrame({'sentenceCol':['this is from france and spain']*2,
                   'other column':[15]*2})

dfb = pd.DataFrame({'country':['france', 'spain']})

Answer 2

You can iterate over a DataFrame with the iterrows() method. You could try this:

import pandas as pd

# Dataframes definition
df_1 = pd.DataFrame({"sentence": ["this is from france and spain", "this is from france", "this is from germany"], "other": [15, 12, 33]})
df_2 = pd.DataFrame({"country": ["spain", "france", "germany"], "other_column": [7, 7, 8]})


# Create the new dataframe
df_3 = pd.DataFrame(columns = ["sentence", "other_column", "country"])
count=0

# Iterate through the dataframes, first through the country dataframe and inside through the sentence one.
for index, row in df_2.iterrows():
    country = row.country

    for index_2, row_2 in df_1.iterrows():
        if country in row_2.sentence:
            df_3.loc[count] = (row_2.sentence, row_2.other, country)
            count+=1

So the output is:

sentence                            other_column    country
0   this is from france and spain   15              spain
1   this is from france and spain   15              france
2   this is from france             12              france
3   this is from germany            33              germany
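
Growing df_3 one row at a time with .loc gets slow as the frames grow. A minimal sketch of the same nested iteration, assuming the same df_1 and df_2 as above, collects the matching tuples in a list and builds the DataFrame once at the end:

```python
import pandas as pd

df_1 = pd.DataFrame({
    "sentence": ["this is from france and spain", "this is from france", "this is from germany"],
    "other": [15, 12, 33],
})
df_2 = pd.DataFrame({"country": ["spain", "france", "germany"], "other_column": [7, 7, 8]})

# Same loop order as above: countries outside, sentences inside
rows = [
    (sentence, other, country)
    for country in df_2["country"]
    for sentence, other in zip(df_1["sentence"], df_1["other"])
    if country in sentence  # plain substring test, as in the original loop
]

df_3 = pd.DataFrame(rows, columns=["sentence", "other_column", "country"])
print(df_3)
```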


Answer 1
3, accepted, 2019-12-29 19:08:52

Answer 2
0, 2019-12-29 19:07:12
