
How to recognize the order of values in a column of a pandas DataFrame?

I have a pandas DataFrame like the one below:

import pandas as pd
import re
df = pd.DataFrame()
df["ADRESAT"] = ["Kowal Jan", "Nowak Adam PHU"]
df["NADAWCA"] = ["Jan Kowal", "Adam Nowak"]

I then created 2 new columns:

  • col1 - the value from column "NADAWCA" that also appears in column "ADRESAT"

  • col2 - the remaining values (the values from column "ADRESAT" other than those that also appear in column "NADAWCA")

    df["col2"] = df.apply(lambda r: re.sub(r["NADAWCA"], '', r["ADRESAT"], flags=re.IGNORECASE).strip(), axis=1)
    df["col1"] = df["NADAWCA"].str.title()

However, the result is the df shown below, and as you can see, there is a mistake in the second row.

  • col1 is fine (the value from column "NADAWCA" that also appears in column "ADRESAT"), but
  • in col2 I need only PHU (that is, the values from column "ADRESAT" other than those that also appear in column "NADAWCA")

[image: screenshot of the current result]
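A minimal sketch of why the `re.sub` approach leaves the second row untouched, using the values from the example frame. The pattern has to appear in "ADRESAT" as one contiguous run of characters, so a reordered name is not matched. The string "Adam Nowak PHU" is a hypothetical input added only for contrast; it is not in the original data.

```python
import re

# Values taken from the second row of the example frame.
adresat = "Nowak Adam PHU"
nadawca = "Adam Nowak"

# The pattern must match as one contiguous substring,
# so the reordered name is not found and nothing is removed.
unchanged = re.sub(nadawca, "", adresat, flags=re.IGNORECASE).strip()

# Hypothetical input with the words in the same order as NADAWCA, for contrast.
stripped = re.sub(nadawca, "", "Adam Nowak PHU", flags=re.IGNORECASE).strip()

print(repr(unchanged))
print(repr(stripped))
```

Comparing the two printed values shows that the substitution only works when the words appear in the same order in both columns.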

My question: how can I modify my code so that it recognizes Adam Nowak and Nowak Adam as the same value?

I need the result below:

[image: screenshot of the expected result]

Since the order of the words matters, a plain set operation will not do, so we need to check each word one by one:

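# x is one row: x['ADRESAT'] is the recipient, x['NADAWCA'] is the sender.
# Label-based access avoids the deprecated positional x[0] / x[1] lookup on a Series.

# Words of NADAWCA (in their original order) that also occur in ADRESAT
intersection = lambda x: ' '.join([x1 for x1 in x['NADAWCA'].split()
                             if x1.lower() in x['ADRESAT'].lower().split()])

# Words of ADRESAT (in their original order) that do not occur in NADAWCA
difference = lambda x: ' '.join([x0 for x0 in x['ADRESAT'].split()
                           if x0.lower() not in x['NADAWCA'].lower().split()])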

df['col1'] = df[['ADRESAT', 'NADAWCA']].apply(intersection, axis='columns')
df['col2'] = df[['ADRESAT', 'NADAWCA']].apply(difference, axis='columns')
>>> df
          ADRESAT     NADAWCA        col1 col2
0       Kowal Jan   Jan Kowal   Jan Kowal
1  Nowak Adam PHU  Adam Nowak  Adam Nowak  PHU
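The same idea can also be sketched as a named function, which may be easier to test than the two lambdas. This is only an illustration under the assumptions of the example above: `split_words` is a hypothetical helper name, words are separated by whitespace, and matching is case-insensitive. Sets are used here only for membership tests; the word order still comes from iterating over the original strings.

```python
from typing import Tuple

import pandas as pd


def split_words(adresat: str, nadawca: str) -> Tuple[str, str]:
    """Hypothetical helper: return (shared, rest) for one row.

    shared: words of NADAWCA, in their original order, that occur in ADRESAT.
    rest:   words of ADRESAT, in their original order, that do not occur in NADAWCA.
    """
    # Lower-cased sets are used only to test membership, not to build the output.
    adresat_words = {w.lower() for w in adresat.split()}
    nadawca_words = {w.lower() for w in nadawca.split()}
    shared = " ".join(w for w in nadawca.split() if w.lower() in adresat_words)
    rest = " ".join(w for w in adresat.split() if w.lower() not in nadawca_words)
    return shared, rest


df = pd.DataFrame({
    "ADRESAT": ["Kowal Jan", "Nowak Adam PHU"],
    "NADAWCA": ["Jan Kowal", "Adam Nowak"],
})

# Compute both values once per row, then unpack them into the two columns.
pairs = [split_words(a, n) for a, n in zip(df["ADRESAT"], df["NADAWCA"])]
df["col1"] = [shared for shared, _ in pairs]
df["col2"] = [rest for _, rest in pairs]
print(df)
```

Like the lambdas, this compares whole words, so a name that appears only as part of a longer word is not treated as a match.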
