Remove duplicates in a row based on column value
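Hi, I couldn't find anything specific about this, so apologies if it's a duplicate...
How can I drop columns whose values repeat the same information within each row (with some exceptions)?
Example: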
    Name  Age  Job  How_Old  Occupation  Happy  Married?
0   John   35  Dev       35         Dev   True      True
1  Sally   42   CA       42          CA  False     False
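I want to drop columns that carry the same information under a different name, except for columns that will obviously contain repeats, such as binary columns.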
Output:
    Name  Age  Job  Happy  Married?
0   John   35  Dev   True      True
1  Sally   42   CA  False     False
Thanks. Please also note that I need to do this on massive flattened and normalized JSON files, so looping would be very time-consuming.
First exclude the boolean columns with DataFrame.select_dtypes, transpose, and flag duplicates across all rows with DataFrame.duplicated. Then invert the mask with ~ and add the removed boolean columns back with Series.reindex. Finally, filter with DataFrame.loc, using : to select all rows and the mask to select the columns by name:
m = (~df.select_dtypes(exclude=bool).T.duplicated()).reindex(df.columns, fill_value=True)
Another idea is to convert each column's values to a tuple and call Series.duplicated:
m = ((~df.select_dtypes(exclude=bool).apply(tuple).duplicated())
.reindex(df.columns, fill_value=True))
df = df.loc[:, m]
print (df)
    Name  Age  Job  Happy  Married?
0   John   35  Dev   True      True
1  Sally   42   CA  False     False
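For anyone who wants to try this end to end, here is a minimal self-contained sketch of the transpose-based variant. The DataFrame is rebuilt by hand from the question's example, and the variable names (non_bool, is_dup, keep, result) are only illustrative:

```python
import pandas as pd

# Sample data rebuilt from the example in the question.
df = pd.DataFrame({
    "Name": ["John", "Sally"],
    "Age": [35, 42],
    "Job": ["Dev", "CA"],
    "How_Old": [35, 42],
    "Occupation": ["Dev", "CA"],
    "Happy": [True, False],
    "Married?": [True, False],
})

# Leave boolean columns out of the comparison entirely.
non_bool = df.select_dtypes(exclude=bool)

# After transposing, each original column is a row, so duplicated()
# flags columns whose values repeat an earlier column.
is_dup = non_bool.T.duplicated()

# Invert to get a "keep" mask; the boolean columns are absent from the
# mask, so reindex adds them back with True.
keep = (~is_dup).reindex(df.columns, fill_value=True)

result = df.loc[:, keep]
print(result)
```

Note that duplicated() keeps the first occurrence by default, so the leftmost of two identical columns is the one that survives.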
Details:
#exclude boolean columns
print (df.select_dtypes(exclude=bool))
    Name  Age  Job  How_Old  Occupation
0   John   35  Dev       35         Dev
1  Sally   42   CA       42          CA
#transpose
print (df.select_dtypes(exclude=bool).T)
               0      1
Name        John  Sally
Age           35     42
Job          Dev     CA
How_Old       35     42
Occupation   Dev     CA
#check for duplicates across all rows of the transposed frame
print (df.select_dtypes(exclude=bool).T.duplicated())
Name          False
Age           False
Job           False
How_Old        True
Occupation     True
dtype: bool
#invert mask True->False, False->True
print ((~df.select_dtypes(exclude=bool).T.duplicated()))
Name           True
Age            True
Job            True
How_Old       False
Occupation    False
dtype: bool
#add back the removed boolean columns as True
print ((~df.select_dtypes(exclude=bool).T.duplicated())
         .reindex(df.columns, fill_value=True))
Name           True
Age            True
Job            True
How_Old       False
Occupation    False
Happy          True
Married?       True
dtype: bool
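The mask says which columns get dropped, but not which earlier column each one repeats. If that is useful for checking the result, something like the following sketch could report it. It is not part of the approach above: the names first_seen and duplicate_of are made up for illustration, and it loops over columns (not rows), comparing plain Python tuples.

```python
import pandas as pd

# Sample data rebuilt from the example in the question.
df = pd.DataFrame({
    "Name": ["John", "Sally"],
    "Age": [35, 42],
    "Job": ["Dev", "CA"],
    "How_Old": [35, 42],
    "Occupation": ["Dev", "CA"],
    "Happy": [True, False],
    "Married?": [True, False],
})

non_bool = df.select_dtypes(exclude=bool)

first_seen = {}    # tuple of column values -> first column holding them
duplicate_of = {}  # duplicated column -> the earlier column it repeats

for name in non_bool.columns:
    key = tuple(non_bool[name])
    if key in first_seen:
        duplicate_of[name] = first_seen[key]
    else:
        first_seen[key] = name

print(duplicate_of)
```

One caveat: columns containing NaN may not be matched by this tuple comparison, since NaN does not compare equal to itself, so treat it as a diagnostic aid rather than a replacement for the mask.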
Define the following function, which returns a list of the column names to delete:
def chkColToDel(df):
    # Column names excluding bool columns
    cols = df.select_dtypes(exclude=bool).columns.tolist()
    colsToDel = []
    while len(cols) > 1:
        cn1 = cols.pop(0)               # Column name, left side
        if cn1 not in colsToDel:        # Not marked for deletion earlier
            c1 = df[cn1]                # The column itself, left side
            t1 = c1.dtype.name          # Type name
            for cn2 in cols:            # Check remaining column names
                c2 = df[cn2]            # The column itself, right side
                if t1 == c2.dtype.name and c1.equals(c2):
                    # Same types and equal values
                    colsToDel.append(cn2)   # Mark for deletion
    return colsToDel
Then call it:
colsToDel = chkColToDel(df)
The only thing left is to drop the returned columns, if any:
if len(colsToDel) > 0:
    df.drop(columns=colsToDel, inplace=True)
I assumed that the exceptions mentioned in your post actually mean bool columns. If the list of exceptions is broader, change my code accordingly.
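If the exceptions do go beyond bool columns, one possible adaptation is sketched below. The function name cols_to_drop and its exempt parameter are hypothetical, not something from the question: exempt is assumed to be a collection of column names that should never be compared or dropped.

```python
import pandas as pd

def cols_to_drop(df, exempt=()):
    """Return names of columns that repeat an earlier column.

    Boolean columns and any names listed in `exempt` (a hypothetical
    parameter) are skipped entirely.
    """
    candidates = [c for c in df.select_dtypes(exclude=bool).columns
                  if c not in exempt]
    to_drop = []
    for i, left in enumerate(candidates):
        if left in to_drop:          # Already marked as a duplicate
            continue
        for right in candidates[i + 1:]:
            if right in to_drop:
                continue
            # Same dtype and equal values
            if df[left].dtype == df[right].dtype and df[left].equals(df[right]):
                to_drop.append(right)
    return to_drop

# Sample data rebuilt from the example in the question.
df = pd.DataFrame({
    "Name": ["John", "Sally"],
    "Age": [35, 42],
    "Job": ["Dev", "CA"],
    "How_Old": [35, 42],
    "Occupation": ["Dev", "CA"],
    "Happy": [True, False],
    "Married?": [True, False],
})

all_dups = cols_to_drop(df)
some_dups = cols_to_drop(df, exempt=["How_Old"])

# drop() without inplace returns a new frame and leaves df untouched.
trimmed = df.drop(columns=some_dups)
print(all_dups, some_dups, list(trimmed.columns))
```

Returning a new frame instead of using inplace=True is just a design preference here; it keeps the original data available for comparison.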