
Clean pandas dataframe column, remove parts from strings that are present in some other dataframe

I am trying to clean up data in one dataframe using the values from a column of another dataframe. The first dataframe contains semicolon-separated lists of values, the second dataframe contains single words. After cleaning, the first dataframe must not contain any words from the second dataframe.

data df1                                       data df2

x1;x2;x3                                       x1
key2;key6;key7;key8                            x2
                                               key6  
                                               key8

I need to remove from the data column of df1 every value that is present in the data column of df2. My current approach is to convert the two columns into two lists and remove from the df1 list the values present in the df2 list.

Is there a faster way of doing this without a loop, considering that the df2 column may have over 1M rows and that each row of the df1 column holds more than one value?

You can essentially do this by splitting your dataframe's column into enough columns and replacing the unwanted values:

import pandas as pd

df1 = pd.DataFrame({"a": ["x1;x2;x3", "key2;key6;key7;key8"]})
df2 = pd.DataFrame({"tbd": "x1,x2,key6,key8".split(",")})

print(df1)
print(df2)
# create a new df that contains splitted values as columns
df3 = df1["a"].str.split(";", expand=True).fillna(value="")
print(df3)

# replace unwanted values with ""; pass a plain list, because a Series
# given as to_replace is treated as a dict-like mapping, not a list of values
df3 = df3.replace(df2["tbd"].tolist(), "")
print(df3)

Output:

# df1
    a
0             x1;x2;x3
1  key2;key6;key7;key8

# df2
    tbd
0    x1
1    x2
2  key6
3  key8

# df3 (df1 column "a" after splitting into new df)
      0     1     2     3
0    x1    x2    x3  
1  key2  key6  key7  key8

# df3 after replacing all values that are in df2["tbd"] with ""
      0  1     2  3
0             x3   
1  key2     key7   

The result is still spread over several columns, so you may need to join the remaining values back into a single semicolon-separated column.
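Collecting the data again could look like the sketch below. It assumes the same example frames and the column names "a" and "tbd" used above. Note that the row-wise apply is still a Python-level loop internally, so this is for illustration rather than speed.

```python
import pandas as pd

df1 = pd.DataFrame({"a": ["x1;x2;x3", "key2;key6;key7;key8"]})
df2 = pd.DataFrame({"tbd": ["x1", "x2", "key6", "key8"]})

# split into one column per value; short rows are padded with ""
df3 = df1["a"].str.split(";", expand=True).fillna("")

# blank out every value that appears in df2["tbd"]
df3 = df3.replace(df2["tbd"].tolist(), "")

# join the non-empty cells of each row back into one string
df1["a"] = df3.apply(lambda row: ";".join(v for v in row if v), axis=1)

print(df1)
```

Skipping the empty cells during the join avoids stray or doubled semicolons in the result.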


To clean df1 in one go you can use a list comprehension like so:

import pandas as pd

df1 = pd.DataFrame({"a": ["x1;x2;x3", "key2;key6;key7;key8"]})
df2 = pd.DataFrame({"tbd": "x1,x2,key6,key8".split(",")})


unwanted = frozenset(df2["tbd"])                          # build the lookup set once, not per item

df1["a"] = [';'.join([i for i in v.split(";")             # split and recombine again
                      if i not in unwanted])              # keep i only if it is not in df2
            for v in df1["a"]]                            # v == each row of the column

print(df1)


          a
0         x3
1  key2;key7

This solution could have been found as a combination of the answers to "splitting a column by delimiter pandas python" and "Remove unwanted parts from strings in a column", but it is not a pure duplicate of either.
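Since the question asks for a way without an explicit loop, here is one possible alternative sketch built on pandas' own Series.explode, Series.isin and groupby. It is not taken from the answers above, it assumes the same example frames and column names, and whether it is actually faster on 1M+ rows has not been measured here.

```python
import pandas as pd

df1 = pd.DataFrame({"a": ["x1;x2;x3", "key2;key6;key7;key8"]})
df2 = pd.DataFrame({"tbd": ["x1", "x2", "key6", "key8"]})

# one token per row; the original row label is repeated in the index
tokens = df1["a"].str.split(";").explode()

# keep only tokens that do not occur in df2["tbd"]
kept = tokens[~tokens.isin(df2["tbd"])]

# glue the surviving tokens back together per original row;
# rows that lost all of their tokens become ""
df1["a"] = kept.groupby(level=0).agg(";".join).reindex(df1.index, fill_value="")

print(df1)
```

The reindex step matters when every value of a row is removed: without it that row would end up as NaN after the assignment instead of an empty string.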
