Pandas: compare strings in two columns of the same data frame and conditionally output to a new column

Question

I have two columns within a data frame containing strings. For example:

import pandas as pd
import numpy as np

data = [['Oct-2019', 'Oranges + Grapes + Pears + Salmon', 'Grapes + Pears'],
        ['Nov-2019', 'Oranges + Grapes + Pears + Trout', 'Oranges + Grapes + Pears']]

df = pd.DataFrame(data, columns =['Date', 'Previous shopping list', 'Recent shopping list'])
print(df)

Fish = ['Salmon', 'Trout']
Fruit = ['Oranges', 'Grapes', 'Pears']

     Date     PSL                 RSL
0  Oct-2019   Oranges + Grapes    Grapes + Pears
              + Pears + Salmon                     

1  Nov-2019   Oranges + Grapes    Oranges + Grapes
              + Pears + Trout     + Pears

I want to compare the strings in both columns and write a text output to a new column that says what has changed between the two lists. For example, I want to create a column that checks for the strings related to "Fruit" and outputs which fruit has been dropped from the recent shopping list when compared to the previous shopping list. See the desired output below:

     Date     PSL                 RSL               Fruit lost   Fish Lost
0  Oct-2019   Oranges + Grapes    Grapes + Pears    Oranges      Salmon
              + Pears + Salmon                     

1  Nov-2019   Oranges + Grapes    Oranges + Grapes               Trout
              + Pears + Trout     + Pears

How would I be able to achieve this using pandas? Apologies if this was not clear the first time!

Thank you for any suggestion/help!

Answer 1

The exact function that you use to process the data depends on the exact output that you require for each combination. Hopefully the code below will give you enough to create a solution for your problem:

# process data so each row contains a list of elements
# (split on '+' and drop the surrounding spaces so items compare cleanly)
df['PSL_processed'] = df['Previous shopping list'].str.split(r'\s*\+\s*')
df['RSL_processed'] = df['Recent shopping list'].str.split(r'\s*\+\s*')

def compare_items(x):
    if set(x.PSL_processed) == set(x.RSL_processed):
        return 'No change'
    elif set(x.PSL_processed) - set(x.RSL_processed):
        # a non-empty set difference means items were dropped
        return 'Lost'
    # add in conditional logic here, to meet specification

df.apply(compare_items, axis=1)

The official documentation for DataFrame.apply() is well written.
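As a rough illustration of how the conditional logic above could be filled in, here is a minimal sketch. It assumes the items are always separated by '+', and it uses the question's sample data with Salmon and Trout added to the previous list, as in the question's printed table. The helper name describe_change and the 'Change' column are hypothetical, chosen only for this example.

```python
import pandas as pd

# Sample data modelled on the question (assumption: fish items included)
df = pd.DataFrame(
    {
        "Previous shopping list": [
            "Oranges + Grapes + Pears + Salmon",
            "Oranges + Grapes + Pears + Trout",
        ],
        "Recent shopping list": ["Grapes + Pears", "Oranges + Grapes + Pears"],
    }
)


def describe_change(row):
    """Hypothetical helper: summarise what differs between the two lists."""
    previous = {item.strip() for item in row["Previous shopping list"].split("+")}
    recent = {item.strip() for item in row["Recent shopping list"].split("+")}
    lost = sorted(previous - recent)    # in the previous list only
    gained = sorted(recent - previous)  # in the recent list only
    if not lost and not gained:
        return "No change"
    parts = []
    if lost:
        parts.append("Lost: " + ", ".join(lost))
    if gained:
        parts.append("Gained: " + ", ".join(gained))
    return "; ".join(parts)


# axis=1 passes each row to the function as a Series
df["Change"] = df.apply(describe_change, axis=1)
print(df["Change"].tolist())
```

Returning a string per row keeps the new column easy to read; returning the sets themselves would be the better choice if later steps need to filter by item.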

Answer 2

To check whether the string "Oranges" is present in "Recent shopping list" and create a new column "Oranges Lost" based on the result:

df['Oranges Lost'] = np.where(df['Recent shopping list'].str.contains('Oranges'), 'No Change', 'Lost')
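The same idea can probably be extended to every fruit and made to look at both columns, so that an item only counts as lost if it was on the previous list in the first place. The following is a sketch under that assumption. The per-item column names are made up for the example, and the sample data adds Salmon and Trout to the previous list as in the question's printed table.

```python
import numpy as np
import pandas as pd

Fruit = ["Oranges", "Grapes", "Pears"]

# Sample data modelled on the question (assumption: fish items included)
df = pd.DataFrame(
    {
        "Previous shopping list": [
            "Oranges + Grapes + Pears + Salmon",
            "Oranges + Grapes + Pears + Trout",
        ],
        "Recent shopping list": ["Grapes + Pears", "Oranges + Grapes + Pears"],
    }
)

for item in Fruit:
    # regex=False treats the item as a plain substring, not a pattern
    in_previous = df["Previous shopping list"].str.contains(item, regex=False)
    in_recent = df["Recent shopping list"].str.contains(item, regex=False)
    # "Lost" only when the item was present before and is missing now
    df[f"{item} Lost"] = np.where(in_previous & ~in_recent, "Lost", "No Change")

print(df[[f"{item} Lost" for item in Fruit]])
```

Note that str.contains does substring matching, so an item name that is part of another (for example "Pear" and "Pears") would match both. Splitting the strings into items first avoids that.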

Answer 3

So Mark's solution works well to grab the difference between the lists:

# process data so each row contains a list of elements
df['PSL_processed'] = df['Previous shopping list'].str.split()
df['RSL_processed'] = df['Recent shopping list'].str.split()

def compare_items(x):
    return set(x.PSL_processed) - set(x.RSL_processed)
    # add in conditional logic here, to meet specification
df['Products_lost'] = df.apply(compare_items, axis=1)

print(df)

On top of that, to find which of the lost products are fruit and which are fish, I used the following:

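# df.ix was removed in pandas 1.0; df.loc is the label-based replacement
for idx, row in df.iterrows():
    for c in Fruit:
        if c in row['Products_lost']:
            df.loc[idx, 'Fruit lost'] = c
    # check fish independently, so it is found even when no fruit was lost
    for c in Fish:
        if c in row['Products_lost']:
            df.loc[idx, 'Fish lost'] = c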

Seems to work well!
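For what it's worth, the row-by-row loop keeps only one item per category when several are lost. A possible alternative without iterrows is sketched below. It assumes '+' is the only separator and that the previous list contains the fish items, as in the question's printed table. The split_items helper is hypothetical.

```python
import pandas as pd

Fish = ["Salmon", "Trout"]
Fruit = ["Oranges", "Grapes", "Pears"]

# Sample data modelled on the question (assumption: fish items included)
df = pd.DataFrame(
    {
        "Previous shopping list": [
            "Oranges + Grapes + Pears + Salmon",
            "Oranges + Grapes + Pears + Trout",
        ],
        "Recent shopping list": ["Grapes + Pears", "Oranges + Grapes + Pears"],
    }
)


def split_items(column):
    """Hypothetical helper: turn 'A + B' strings into sets like {'A', 'B'}."""
    return column.str.split("+").apply(lambda items: {i.strip() for i in items})


# One set of lost items per row: on the previous list but not the recent one
lost = [
    previous - recent
    for previous, recent in zip(
        split_items(df["Previous shopping list"]),
        split_items(df["Recent shopping list"]),
    )
]

# Keep every lost item per category, in the order given by Fruit / Fish
df["Fruit lost"] = [", ".join(f for f in Fruit if f in items) for items in lost]
df["Fish lost"] = [", ".join(f for f in Fish if f in items) for items in lost]

print(df[["Fruit lost", "Fish lost"]])
```

Rows with nothing lost in a category get an empty string here rather than NaN, which matches the blank cell in the desired output.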

Answer 1    1    2020-01-22 21:03:41
Answer 2    1    2020-01-22 21:07:53
Answer 3    0    accepted    2020-01-23 14:08:12
