
How to loop through a pandas df column, finding if a string contains any string from a separate pandas df column?

I have two pandas DataFrames in Python. DF A contains a column of sentence-length strings.

|---------------------|------------------|
|        sentenceCol  |    other column  |
|---------------------|------------------|
|'this is from france'|         15       |
|---------------------|------------------|

DF B contains a column that lists countries.

|---------------------|------------------|
|        country      |    other column  |
|---------------------|------------------|
|'france'             |         33       |
|---------------------|------------------|
|'spain'              |         34       |
|---------------------|------------------|

How can I loop through DF A and assign the country that each string contains? Here's what I imagine DF A would look like after assignment:

|---------------------|------------------|-----------|
|        sentenceCol  |    other column  | country   |
|---------------------|------------------|-----------|
|'this is from france'|         15       |  'france' |
|---------------------|------------------|-----------|

One additional complication is that there can be more than one country per sentence, so ideally this would assign every applicable country to that sentence, one row per country.

|-------------------------------|------------------|-----------|
|        sentenceCol            |    other column  | country   |
|-------------------------------|------------------|-----------|
|'this is from france and spain'|         16       |  'france' |
|-------------------------------|------------------|-----------|
|'this is from france and spain'|         16       |  'spain'  |
|-------------------------------|------------------|-----------|

There's no need for a loop here. Looping over a DataFrame is slow, and pandas and NumPy provide optimized, vectorized methods for almost all problems of this kind.

In this case, for your first problem, you are looking for Series.str.extract:

dfa['country'] = dfa['sentenceCol'].str.extract(f"({'|'.join(dfb['country'])})")

           sentenceCol  other column country
0  this is from france            15  france
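
One caveat with joining the raw values into a pattern: a name containing regex metacharacters, or one that is also part of a longer word, can give surprising matches. The following is a minimal sketch under my own assumptions (the sample rows, the `pattern` variable, the `\b` word boundaries and `expand=False` are not part of the answer above) of a slightly more defensive variant:

```python
import re
import pandas as pd

# Illustrative data; the second sentence deliberately contains no country.
dfa = pd.DataFrame({'sentenceCol': ['this is from france', 'no country here'],
                    'other column': [15, 20]})
dfb = pd.DataFrame({'country': ['france', 'spain']})

# Escape each name so characters such as "." or "(" are matched literally,
# and wrap the alternation in word boundaries (an added assumption).
pattern = r'\b(' + '|'.join(map(re.escape, dfb['country'])) + r')\b'

# expand=False returns a Series; rows without a match become NaN.
dfa['country'] = dfa['sentenceCol'].str.extract(pattern, expand=False)
print(dfa)
```

Rows without any matching country end up with NaN in the new column rather than raising an error.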

For your second problem, you need Series.str.extractall together with DataFrame.drop_duplicates and to_numpy:

dfa['country'] = (
    dfa['sentenceCol'].str.extractall(f"({'|'.join(dfb['country'])})")
        .drop_duplicates()
        .to_numpy()
)

                     sentenceCol  other column country
0  this is from france and spain            15  france
1  this is from france and spain            15   spain
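
To see why this works, it can help to inspect what Series.str.extractall returns before anything is assigned. The sketch below uses illustrative data of my own (two different sentences rather than a duplicated one) and a hypothetical `matches` variable. As far as I can tell, the to_numpy assignment above is positional, so it relies on the number of de-duplicated matches equalling the number of rows in dfa.

```python
import pandas as pd

# Illustrative data: two different sentences instead of duplicated rows.
dfa = pd.DataFrame({'sentenceCol': ['this is from france and spain',
                                    'this is from france'],
                    'other column': [16, 15]})
dfb = pd.DataFrame({'country': ['france', 'spain']})

matches = dfa['sentenceCol'].str.extractall(f"({'|'.join(dfb['country'])})")

# The result is a DataFrame with one row per match. Its index has two levels:
# the original row label and a 'match' counter within that row.
print(matches)
print(matches.index.names)
```

With data like this, the first index level shows which sentence each match came from, which is the information a positional assignment discards.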

Edit

Or, if your sentenceCol is not duplicated, we have to collect the extracted values into a single row. We use GroupBy.agg:

dfa['country'] = (
    dfa['sentenceCol'].str.extractall(f"({'|'.join(dfb['country'])})")
        .groupby(level=0)
        .agg(', '.join)
        .to_numpy()
)

                     sentenceCol  other column        country
0  this is from france and spain            15  france, spain
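
If some sentences contain no country at all, the aggregated result has fewer rows than dfa, and a positional to_numpy assignment would no longer fit. A minimal sketch, assuming illustrative data and a hypothetical `joined` variable, that assigns the aggregated Series directly so pandas aligns it on the index:

```python
import pandas as pd

# Illustrative data; the middle sentence contains no country.
dfa = pd.DataFrame({'sentenceCol': ['this is from france and spain',
                                    'nothing to see here',
                                    'this is from spain'],
                    'other column': [16, 17, 18]})
dfb = pd.DataFrame({'country': ['france', 'spain']})

joined = (
    dfa['sentenceCol'].str.extractall(f"({'|'.join(dfb['country'])})")[0]
        .groupby(level=0)      # group matches by original row label
        .agg(', '.join)        # one comma-separated string per row
)

# Assigning the Series aligns on the index, so the row without a match
# gets NaN instead of causing a length mismatch.
dfa['country'] = joined
print(dfa)
```

The trade-off is that unmatched rows carry NaN, which may need a fillna('') depending on what happens downstream.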

Edit 2

To duplicate the original rows instead, we join the dataframe back to our extraction:

extraction = (
    dfa['sentenceCol'].str.extractall(f"({'|'.join(dfb['country'])})")
        .rename(columns={0: 'country'})
)

dfa = extraction.droplevel(1).join(dfa).reset_index(drop=True)

  country                    sentenceCol  other column
0  france  this is from france and spain            15
1   spain  this is from france and spain            15
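
An alternative way to get one row per country, offered here as a sketch rather than as part of the answer, is Series.str.findall followed by DataFrame.explode (available since pandas 0.25). The sample data and the `pattern` and `result` names are assumptions for illustration:

```python
import pandas as pd

# Illustrative data with two different sentences.
dfa = pd.DataFrame({'sentenceCol': ['this is from france and spain',
                                    'this is from france'],
                    'other column': [16, 15]})
dfb = pd.DataFrame({'country': ['france', 'spain']})

pattern = '|'.join(dfb['country'])

result = (
    dfa.assign(country=dfa['sentenceCol'].str.findall(pattern))  # list of matches per row
       .explode('country')                                       # one row per list element
       .reset_index(drop=True)
)
print(result)
```

This keeps the original column order and carries the other columns along without an explicit join.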

Dataframes used:

dfa = pd.DataFrame({'sentenceCol':['this is from france and spain']*2,
                   'other column':[15]*2})

dfb = pd.DataFrame({'country':['france', 'spain']})

You can iterate through a dataframe with the iterrows() method. You can try this:

import pandas as pd

# DataFrame definitions
df_1 = pd.DataFrame({"sentence": ["this is from france and spain", "this is from france", "this is from germany"], "other": [15, 12, 33]})
df_2 = pd.DataFrame({"country": ["spain", "france", "germany"], "other_column": [7, 7, 8]})


# Create the new dataframe
df_3 = pd.DataFrame(columns = ["sentence", "other_column", "country"])
count=0

# Iterate through the dataframes, first through the country dataframe and inside through the sentence one.
for index, row in df_2.iterrows():
    country = row.country

    for index_2, row_2 in df_1.iterrows():
        if country in row_2.sentence:
            df_3.loc[count] = (row_2.sentence, row_2.other, country)
            count+=1

So the output is:

                        sentence other_column  country
0  this is from france and spain           15    spain
1  this is from france and spain           15   france
2            this is from france           12   france
3           this is from germany           33  germany
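
For comparison, the same sentence/country pairing can be sketched without iterrows() by building every pair with a cross merge and filtering. This assumes pandas 1.2 or later for how="cross". The `pairs` and `mask` names are hypothetical, and the row order differs from the loop above because sentences are visited first:

```python
import pandas as pd

df_1 = pd.DataFrame({"sentence": ["this is from france and spain",
                                  "this is from france",
                                  "this is from germany"],
                     "other": [15, 12, 33]})
df_2 = pd.DataFrame({"country": ["spain", "france", "germany"]})

# Pair every sentence with every country (needs pandas >= 1.2).
pairs = df_1.merge(df_2, how="cross")

# Keep only the pairs where the country appears in the sentence.
mask = [country in sentence
        for sentence, country in zip(pairs["sentence"], pairs["country"])]
df_3 = pairs[mask].reset_index(drop=True)
print(df_3)
```

The cross merge creates len(df_1) * len(df_2) rows, so this sketch suits small lookup tables like a country list better than two large frames.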


 