简体   繁体   中英

How to check whether the string value of a Pandas Column is contained in the string value of another Pandas Column?

I have a large dataframe with a format like this:

| ID       | A                 | B                       |
| -------- | ----------------- | ----------------------- |
| 0        | Tenure: Leasehold | Modern;Tenure: Leasehold|
| 1        | First Floor       | Refurbished             |
| 2        | NaN               | Modern                  |
| 3        | First Floor       | NaN                     |

I want to remove redundancies between columns A and B before merging them. So I would like to check if the value in column A is contained in column B: if yes, column A should take the value of column B, if no, the value of column A should remain the same.

I tried the following lambda function:

df['A'] = df['A'].apply(lambda x: df.B if df.A in df.B else df.A)

But I get this error: TypeError: 'Series' objects are mutable, thus they cannot be hashed

Then I tried with the np.where method like this:

df['A'] = np.where((df.A.values in df.B.values), df.B, df.A)

I can run the code, but it returns false for all the columns, so I don't get any modification in my DataFrame.

If I run the following code, it returns True however, so I know that the problem does not come from the data:

df.loc[0, 'A'] in df.loc[0, 'B']

I tried to modify this code and use it like that:

df['A'] = np.where((df.loc[:, 'A'] in df.loc[:, 'B']), df.B, df.A)

But then I get the same TypeError as above.

How can I solve this issue?

df["A"] = df.apply(lambda x: x["B"] if x["A"] in x["B"] else x["A"], axis=1)
print(df)

Prints:

   ID                         A                         B
0   0  Modern;Tenure: Leasehold  Modern;Tenure: Leasehold
1   1               First Floor               Refurbished

EDIT: To handle NaN s:

df["A"] = df.apply(
    lambda x: x["B"]
    if pd.notna(x["A"]) and pd.notna(x["B"]) and x["A"] in x["B"]
    else x["A"],
    axis=1,
)
print(df)

Prints:

   ID                         A                         B
0   0  Modern;Tenure: Leasehold  Modern;Tenure: Leasehold
1   1               First Floor               Refurbished
2   2                       NaN                    Modern
3   3               First Floor                       NaN

If you want to fill NaN s in column "A" :

df.loc[df["A"].isna(), "A"] = df.loc[df["A"].isna(), "B"]
print(df)

Prints:

   ID                         A                         B
0   0  Modern;Tenure: Leasehold  Modern;Tenure: Leasehold
1   1               First Floor               Refurbished
2   2                    Modern                    Modern
3   3               First Floor                       NaN

I would do a list comprehension with zip which would be quite fast when compared to pandas apply as the size of the dataframe increases:

df["A"] = [b if a in b else a for a,b in zip(df['A'],df['B'])]

print(df)

   ID                         A                         B
0   0  Modern;Tenure: Leasehold  Modern;Tenure: Leasehold
1   1               First Floor               Refurbished

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM