Check values between two columns

Question

I need to perform the following steps on two columns, A and B, of my df and output the result in C:

1) check whether the value from B is present in A (same row, at any position)
2) if it is present but in a different format (e.g. different casing), remove it from A
3) add the value from B to A and output the result in C



A                          B                C
tshirt for women           TSHIRT           TSHIRT for women 
Zaino Estensibile          SJ Gang          SJ Gang Zaino Estensibile 
Air Optix plus             AIR OPTIX        AIR OPTIX plus
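
For anyone who wants to reproduce this, the sample above could be built roughly as follows. This is only a sketch: the name expected_c is introduced here for illustration and is not part of my actual code.

```python
import pandas as pd

# Sample rows copied from the table above (columns A and B only).
df = pd.DataFrame(
    {
        "A": ["tshirt for women", "Zaino Estensibile", "Air Optix plus"],
        "B": ["TSHIRT", "SJ Gang", "AIR OPTIX"],
    }
)

# Desired content of column C, kept separately for comparison.
# (expected_c is a hypothetical helper name.)
expected_c = ["TSHIRT for women", "SJ Gang Zaino Estensibile", "AIR OPTIX plus"]
```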

My workarounds concatenate A and B and then remove the duplicates:

Version 1

def uniqueList(row):
    words = str(row).split(" ")
    unique = words[0]
    for w in words:
        if w.lower() not in unique.lower():
            if w.lower() not in my_list:  # my_list is not defined in this snippet
                unique = unique + " " + w

    return unique
    
df["C"] = df["C"].apply(uniqueList)

Version 2

import itertools
import re

sentences = df["B"].to_list()
for s in sentences:
    s_split = s.split(' ')  # keep original sentence split by ' '
    s_split_without_comma = [i.strip(',') for i in s_split]
    # method 1: re
    compare_words = re.split(' |-', s)
    # method 2: itertools
    compare_words = list(itertools.chain.from_iterable([i.split('-') for i in s_split]))
    # method 3: DIY
    compare_words = []
    for i in s_split:
        compare_words += i.split('-')

    # strip ','
    compare_words_without_comma = [i.strip(',') for i in compare_words]

    # start to compare
    need_removed_index = []
    for word in compare_words_without_comma:
        matched_indexes = []
        for idx, w in enumerate(s_split_without_comma):
            if word.lower() in w.lower().split('-'):
                matched_indexes.append(idx)
        if len(matched_indexes) > 1:  # has_duplicates
            need_removed_index += matched_indexes[1:]
    need_removed_index = list(set(need_removed_index))

    # keep remain and join with ' '
    print(" ".join([i for idx, i in enumerate(s_split) if idx not in need_removed_index]))
    # print(sentences)

print(sentences)

Neither of these works properly, and I suspect this is not the best way to approach the problem.

Answer 1

Using sets, get the strings that are in A but not in B, and put them in column C as a set:

df['C'] = [set(a).difference(b) for a, b in zip(df['A'].str.upper().str.split(r'\s'), df['B'].str.upper().str.split(r'\s'))]

Then join each set in column C into a space-separated string (which drops the set brackets and commas) and prepend column B, provided B is a substring of A when case is ignored. If it is not, just concatenate B and A. Keep in mind that Python sets are unordered, so the word order of the remainder is not guaranteed.

Code below:

import numpy as np

df['C'] = np.where(
    # True where B occurs in A, ignoring case
    [b in a for b, a in zip(df.B.str.lower(), df.A.str.lower())],
    df['B'] + ' ' + df['C'].str.join(',').str.replace(',', ' ').str.lower(),
    df['B'] + ' ' + df['A'],
)

print(df)

Output

               A          B                          C
0   tshirt for women     TSHIRT           TSHIRT for women
1  Zaino Estensibile    SJ Gang  SJ Gang Zaino Estensibile
2     Air Optix plus  AIR OPTIX             AIR OPTIX plus
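
Because sets do not preserve order, a variant that filters the words of A with a list comprehension may be more predictable. The following is a minimal sketch rather than a drop-in replacement: merge_keep_order is a hypothetical helper name, and it compares whole words instead of substrings, so it behaves differently when B only partially overlaps a word in A.

```python
import pandas as pd


def merge_keep_order(a: str, b: str) -> str:
    """Prepend b to the words of a that do not already occur in b (case-insensitive)."""
    b_words = {w.lower() for w in b.split()}
    # The list comprehension keeps the original order and casing of a's words.
    remainder = [w for w in a.split() if w.lower() not in b_words]
    return " ".join([b] + remainder)


df = pd.DataFrame(
    {
        "A": ["tshirt for women", "Zaino Estensibile", "Air Optix plus"],
        "B": ["TSHIRT", "SJ Gang", "AIR OPTIX"],
    }
)
df["C"] = [merge_keep_order(a, b) for a, b in zip(df["A"], df["B"])]
print(df)
```

On the three sample rows this yields the same column C as the output above, while keeping the remaining words of A in their original casing.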

Answer 2

Here's a solution using regular expressions, assuming that df is the name of the dataframe.

The idea is simple: if the value of B occurs in A (ignoring case), replace that occurrence with B's value. Otherwise, return the string B + ' ' + A.

import re

def create_c(row):
    if re.sub(row['B'], row['B'], row['A'], flags=re.IGNORECASE) == row['A']:
        return row['B'] + ' ' + row['A']
    return re.sub(row['B'], row['B'], row['A'], flags=re.IGNORECASE)


df['C'] = df.apply(create_c, axis=1)

Edit #1: I forgot to add the return keyword before the re.sub() statement.

Here is the code running in the shell:

>>> import pandas as pd
>>> data = [['tshirt for women', 'TSHIRT'], ['Zaino Estensibile', 'SJ Gang']]
>>> df = pd.DataFrame(data, columns=['A', 'B'])
>>> df
                   A        B
0   tshirt for women   TSHIRT
1  Zaino Estensibile  SJ Gang
>>> 
>>>
>>> import re
>>> def create_c(row):
...     if re.sub(row['B'], row['B'], row['A'], flags=re.IGNORECASE) == row['A']:
...         return row['B'] + ' ' + row['A']
...     return re.sub(row['B'], row['B'], row['A'], flags=re.IGNORECASE)
... 
>>> 
>>> df['C'] = df.apply(create_c, axis=1)
>>> df
                   A        B                          C
0   tshirt for women   TSHIRT           TSHIRT for women
1  Zaino Estensibile  SJ Gang  SJ Gang Zaino Estensibile
>>>
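
Two edge cases are worth keeping in mind with the function above. First, B is used directly as a regex pattern, so characters such as + or ( in B would be interpreted as pattern syntax. Second, if A already contains B in exactly the same casing, the substitution leaves A unchanged and B gets prepended a second time. A possible hardened variant is sketched below; create_c_safe is a hypothetical name, and it takes the two strings directly instead of a row.

```python
import re


def create_c_safe(a: str, b: str) -> str:
    """Return a with any case-insensitive occurrence of b replaced by b, or b + ' ' + a."""
    # re.escape makes every character of b match literally.
    pattern = re.compile(re.escape(b), flags=re.IGNORECASE)
    if pattern.search(a) is None:
        # b does not occur in a at all: prepend it.
        return b + " " + a
    # A function replacement avoids interpreting backslashes in b as escapes.
    return pattern.sub(lambda match: b, a)


print(create_c_safe("tshirt for women", "TSHIRT"))
print(create_c_safe("Zaino Estensibile", "SJ Gang"))
```

It could be applied to the dataframe with df['C'] = [create_c_safe(a, b) for a, b in zip(df['A'], df['B'])].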

2 answers

Answer 1
2 2021-09-27 23:24:46

Answer 2
0 accepted 2021-09-27 22:18:10
