Clean a pandas DataFrame column by removing parts of strings that are present in another DataFrame
I am trying to clean up the data in one dataframe using the values from a column of another dataframe. The first dataframe contains semicolon-separated lists of values; the second dataframe contains single words. After cleaning, the first dataframe must not contain any words from the second dataframe.
data df1               data df2
x1;x2;x3               x1
key2;key6;key7;key8    x2
                       key6
                       key8
I need to remove from data df1 the values present in data df2. I am trying to convert the two columns, which come from different dataframes, into two lists and then remove from list1 (built from df1) the values present in list2 (built from df2).
Is there a faster way of doing this without a loop, considering that the data df2 column may have over 1M rows and that each row of the data df1 column holds more than one value?
You can essentially do this by splitting your dataframe's column into enough columns and replacing the values:
import pandas as pd

df1 = pd.DataFrame({"a": ["x1;x2;x3", "key2;key6;key7;key8"]})
df2 = pd.DataFrame({"tbd": "x1,x2,key6,key8".split(",")})
print(df1)
print(df2)

# create a new df that contains the split values as columns
df3 = df1["a"].str.split(";", expand=True).fillna(value="")
print(df3)

# replace unwanted values; pass a plain list, because a Series given as
# to_replace is interpreted like a dict (column label -> value), so only
# the values in matching columns would be replaced
df3 = df3.replace(df2["tbd"].tolist(), "")
print(df3)

Output:
# df1
                     a
0             x1;x2;x3
1  key2;key6;key7;key8
# df2
    tbd
0    x1
1    x2
2  key6
3  key8
# df3 (df1 column "a" after splitting into new df)
      0     1     2     3
0    x1    x2    x3
1  key2  key6  key7  key8
# df3 after replacing all values that are in df2["tbd"]
      0  1     2  3
0             x3
1  key2     key7
You may then need to collect the remaining values back into a single column.
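Collecting the data again could look like the following minimal sketch. It is not part of the original answer: the intermediate names `parts` and `cleaned` are made up for illustration, and it masks unwanted cells with `DataFrame.mask` rather than replacing them with empty strings, so that `dropna` can discard them before joining.

```python
import pandas as pd

df1 = pd.DataFrame({"a": ["x1;x2;x3", "key2;key6;key7;key8"]})
df2 = pd.DataFrame({"tbd": ["x1", "x2", "key6", "key8"]})

# Split into one column per value; shorter rows are padded with missing values.
parts = df1["a"].str.split(";", expand=True)

# Turn every cell whose value appears in df2["tbd"] into a missing value.
# .tolist() matters here: DataFrame.isin() given a Series matches on the index.
cleaned = parts.mask(parts.isin(df2["tbd"].tolist()))

# Join the surviving values of each row back into one semicolon-separated string.
df1["a"] = cleaned.apply(lambda row: ";".join(row.dropna()), axis=1)
print(df1)
```

The row-wise `apply` is still a Python-level loop over rows, so this is about convenience rather than speed.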
To clean df1 in one go, you can use a list comprehension like so:
import pandas as pd

df1 = pd.DataFrame({"a": ["x1;x2;x3", "key2;key6;key7;key8"]})
df2 = pd.DataFrame({"tbd": "x1,x2,key6,key8".split(",")})

# build the lookup set once instead of once per checked value
unwanted = frozenset(df2["tbd"])

df1["a"] = [";".join([i for i in v.split(";")   # split and recombine again
                      if i not in unwanted])     # drop i if it is in df2
            for v in df1["a"]]                   # v == each row of the column
print(df1)

Output:
           a
0         x3
1  key2;key7
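If you want to avoid writing the loop yourself, one possible alternative is to explode the split values into one row per token and filter with `Series.isin`. This is only a sketch under the same toy data, not something from the original answer; the names `tokens`, `kept` and `result` are illustrative, and whether it is actually faster than the list comprehension on your data would have to be measured.

```python
import pandas as pd

df1 = pd.DataFrame({"a": ["x1;x2;x3", "key2;key6;key7;key8"]})
df2 = pd.DataFrame({"tbd": ["x1", "x2", "key6", "key8"]})

# One row per token; the index still holds the label of the originating row.
tokens = df1["a"].str.split(";").explode()

# Keep only the tokens that do not occur in df2["tbd"].
kept = tokens[~tokens.isin(df2["tbd"])]

# Re-join the tokens per original row; rows that lost every token become "".
result = kept.groupby(level=0).agg(";".join).reindex(df1.index, fill_value="")
df1["a"] = result
print(df1)
```

The `reindex` step is there so that a row whose values were all removed is kept as an empty string instead of silently disappearing.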
This solution could have been found as a combination of the answers to "splitting a column by delimiter pandas python" and "Remove unwanted parts from strings in a column", but it is not a pure duplicate of either.