Pandas: compare strings in two columns of the same DataFrame, with conditional output to a new column
I have two columns within a data frame containing strings. For example:
import pandas as pd
import numpy as np

data = [['Oct-2019', 'Oranges + Grapes + Pears + Salmon', 'Grapes + Pears'],
        ['Nov-2019', 'Oranges + Grapes + Pears + Trout', 'Oranges + Grapes + Pears']]
df = pd.DataFrame(data, columns=['Date', 'Previous shopping list', 'Recent shopping list'])
print(df)

Fish = ['Salmon', 'Trout']
Fruit = ['Oranges', 'Grapes', 'Pears']
(PSL = 'Previous shopping list', RSL = 'Recent shopping list')

       Date  PSL                                RSL
0  Oct-2019  Oranges + Grapes + Pears + Salmon  Grapes + Pears
1  Nov-2019  Oranges + Grapes + Pears + Trout   Oranges + Grapes + Pears
I want to compare the strings in both columns and write a text output to a new column that says what has changed between the two lists. For example, a column that checks for the strings related to "Fruit" and outputs which fruit has been dropped from the recent shopping list compared to the previous shopping list. See the desired output below:
       Date  PSL                                RSL                       Fruit lost  Fish lost
0  Oct-2019  Oranges + Grapes + Pears + Salmon  Grapes + Pears            Oranges     Salmon
1  Nov-2019  Oranges + Grapes + Pears + Trout   Oranges + Grapes + Pears              Trout
How can I achieve this using pandas? Apologies if this was not clear the first time!
Thank you for any suggestions or help!
The exact function you use to process the data depends on the output you require for each combination. Hopefully the code below gives you enough to build a solution for your problem:
# process data so each row contains a list of elements
# (the regex splits on '+' and drops the surrounding spaces)
df['PSL_processed'] = df['Previous shopping list'].str.split(r'\s*\+\s*')
df['RSL_processed'] = df['Recent shopping list'].str.split(r'\s*\+\s*')

def compare_items(x):
    if set(x.PSL_processed) == set(x.RSL_processed):
        return 'No change'
    elif set(x.PSL_processed) - set(x.RSL_processed):
        # a non-empty difference means something was dropped
        return 'Lost'
    # add in conditional logic here, to meet specification

df.apply(compare_items, axis=1)
The official documentation for DataFrame.apply() is well written.
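One way the remaining conditional logic could be filled in is sketched below. This is only an illustration under some assumptions: the helper to_items, the 'Status' column, and the extra labels 'Gained' and 'Changed' are made up here and are not part of the question, and the sample data is a simplified version of the question's data.

```python
import pandas as pd

# Simplified sample data (assumption: items are separated by '+')
data = [['Oct-2019', 'Oranges + Grapes + Pears + Salmon', 'Grapes + Pears'],
        ['Nov-2019', 'Oranges + Grapes + Pears', 'Oranges + Grapes + Pears']]
df = pd.DataFrame(data, columns=['Date', 'Previous shopping list', 'Recent shopping list'])

def to_items(text):
    """Hypothetical helper: split 'A + B + C' into a set of stripped item names."""
    return {item.strip() for item in text.split('+') if item.strip()}

def compare_items(row):
    previous = to_items(row['Previous shopping list'])
    recent = to_items(row['Recent shopping list'])
    lost = previous - recent      # on the previous list only
    gained = recent - previous    # on the recent list only
    if not lost and not gained:
        return 'No change'
    if lost and gained:
        return 'Changed'
    return 'Lost' if lost else 'Gained'

# 'Status' is an illustrative column name
df['Status'] = df.apply(compare_items, axis=1)
print(df[['Date', 'Status']])
```

Working on sets keeps the comparison independent of item order, so 'Grapes + Pears' and 'Pears + Grapes' count as the same list.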
To check whether the string "Oranges" is present in 'Recent shopping list' and create a new column 'Oranges Lost' based on the result:
df['Oranges Lost'] = np.where(df['Recent shopping list'].str.contains('Oranges'), 'No Change', 'Lost')
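Note that checking only the recent list reports 'Lost' even when "Oranges" was never on the previous list. A possible variant that looks at both columns is sketched below; the sample frame is invented for illustration, with a second row that never contained oranges.

```python
import numpy as np
import pandas as pd

# Illustrative data: the second row never had Oranges on either list
df = pd.DataFrame({
    'Previous shopping list': ['Oranges + Grapes + Pears + Salmon', 'Grapes + Pears'],
    'Recent shopping list': ['Grapes + Pears', 'Grapes + Pears'],
})

# regex=False treats 'Oranges' as a plain substring rather than a pattern
in_previous = df['Previous shopping list'].str.contains('Oranges', regex=False)
in_recent = df['Recent shopping list'].str.contains('Oranges', regex=False)

# An item is lost only if it was on the previous list and is missing from the recent one
df['Oranges Lost'] = np.where(in_previous & ~in_recent, 'Lost', 'No Change')
print(df['Oranges Lost'])
```

Because this is a substring match, an item name that is contained in another item's name would also match, so splitting the lists into items first is safer when names can overlap.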
So Mark's solution works well to grab the difference between the lists:
# process data so each row contains a list of elements
# (splitting on whitespace keeps '+' as a token, but it appears in both
# lists, so it cancels out in the set difference)
df['PSL_processed'] = df['Previous shopping list'].str.split()
df['RSL_processed'] = df['Recent shopping list'].str.split()

def compare_items(x):
    # items on the previous list that are missing from the recent list
    return set(x.PSL_processed) - set(x.RSL_processed)

df['Products_lost'] = df.apply(compare_items, axis=1)
print(df)
On top of that, to find which of the lost products are fruit and which are fish, I used the following:
# df.ix was removed in pandas 1.0; df.loc does the same label-based assignment
for idx, row in df.iterrows():
    for c in Fruit:
        if c in row['Products_lost']:
            df.loc[idx, 'Fruit lost'] = c
    for c in Fish:
        if c in row['Products_lost']:
            df.loc[idx, 'Fish lost'] = c
Seems to work well!
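For comparison, here is a sketch that builds the same two columns without iterrows. It also keeps every lost item when more than one item of a category is dropped, whereas the loop above keeps only the last match. The helpers split_items and lost_in_category are hypothetical names, and the code assumes items are always separated by '+'.

```python
import pandas as pd

Fish = ['Salmon', 'Trout']
Fruit = ['Oranges', 'Grapes', 'Pears']

data = [['Oct-2019', 'Oranges + Grapes + Pears + Salmon', 'Grapes + Pears'],
        ['Nov-2019', 'Oranges + Grapes + Pears + Trout', 'Oranges + Grapes + Pears']]
df = pd.DataFrame(data, columns=['Date', 'Previous shopping list', 'Recent shopping list'])

def split_items(text):
    """Hypothetical helper: split 'A + B + C' into a list of stripped item names."""
    return [item.strip() for item in text.split('+') if item.strip()]

def lost_in_category(row, category):
    """Return the items of `category` that are on the previous list but not the recent one."""
    recent = set(split_items(row['Recent shopping list']))
    lost = [item for item in split_items(row['Previous shopping list'])
            if item in category and item not in recent]
    return ' + '.join(lost)  # empty string when nothing was lost

# Extra keyword arguments to DataFrame.apply are passed through to the function
df['Fruit lost'] = df.apply(lost_in_category, axis=1, category=Fruit)
df['Fish lost'] = df.apply(lost_in_category, axis=1, category=Fish)
print(df[['Date', 'Fruit lost', 'Fish lost']])
```

Rows with nothing lost in a category get an empty string here rather than NaN, which is a design choice made for this sketch and easy to change.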