
How to compare three columns of a CSV file and determine the missing data with Python?

I have a CSV file (screenshot attached).

It has three different columns, and the first one, "Set of Ids known for transformers", is the master column. I need to compare the other two columns with the master column and filter out the missing values in the remaining two columns.

Can anyone please tell me how to do this in Python with the pandas library?

Thanks in advance.

You can remove any row in which both non-master columns are missing by selecting rows with a condition:

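import pandas as pd

# The file name comes from the original answer; for a CSV file, use pd.read_csv instead.
df = pd.read_excel('soxl.xlsx')
# Keep rows where at least one of the two non-master columns has a value.
df = df[df['IDs of phase known'].notnull() | df['Ids of distance known'].notnull()]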
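To see what the condition above does, here is a minimal sketch on a small in-memory frame. The column names are taken from the question; the ID values are made up for illustration and are not from the original data.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; only the column names come from the question.
df = pd.DataFrame({
    "Set of Ids known for transformers": [101, 102, 103, 104],
    "IDs of phase known": [101, np.nan, np.nan, 104],
    "Ids of distance known": [101, 102, np.nan, np.nan],
})

# True where at least one of the two non-master columns has a value.
has_any = df["IDs of phase known"].notnull() | df["Ids of distance known"].notnull()

# Only the row where both non-master columns are missing (ID 103) is dropped.
filtered = df[has_any]
print(filtered)
```

Rows with a value in just one of the two columns are kept, which is the difference from a plain dropna().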

If you want to remove every row that has any missing data, you can use the built-in dropna method:

df = df.dropna()

which removes any row with missing values (these are otherwise imported as NaN values).
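As a rough comparison, dropna also accepts subset and how arguments, so the "both non-master columns missing" filter can be written with it as well. This is a sketch on made-up sample values, not the original data.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; only the column names come from the question.
df = pd.DataFrame({
    "Set of Ids known for transformers": [101, 102, 103, 104],
    "IDs of phase known": [101, np.nan, np.nan, 104],
    "Ids of distance known": [101, 102, np.nan, np.nan],
})

# Strict: drop a row if any column is missing.
strict = df.dropna()

# Lenient: drop a row only if both non-master columns are missing.
others = ["IDs of phase known", "Ids of distance known"]
lenient = df.dropna(subset=others, how="all")

print(strict.index.tolist())
print(lenient.index.tolist())
```

With this sample, the strict version keeps only the fully populated row, while the lenient version keeps every row that has at least one of the two values.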

If you want to replace missing values in the second column with the master column values, you can first run

df = df.where((pd.notnull(df)), None)

to replace the NaNs with None (useful in the next step), and then

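# pd.notna treats both None and NaN as missing, so the check also works if NaN values remain.
df['IDs of phase known'] = df.apply(lambda r: r['IDs of phase known'] if pd.notna(r['IDs of phase known']) else r['Set of Ids known for transformers'], axis=1)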

And of course you can do the same with the third column. If you want to replace only when both the second and third column values are missing, you can do something similar, but check both columns for None:

df['IDs of phase known'] = df.apply(lambda r: r['Set of Ids known for transformers'] if pd.isna(r['IDs of phase known']) and pd.isna(r['Ids of distance known']) else r['IDs of phase known'], axis=1)
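The two row-wise replacements above can probably also be written without apply, using the pandas methods fillna and mask. This is an alternative sketch rather than the approach from the answer, and the sample values and the MASTER, PHASE, and DIST constants are assumptions for illustration.

```python
import numpy as np
import pandas as pd

MASTER = "Set of Ids known for transformers"
PHASE = "IDs of phase known"
DIST = "Ids of distance known"

# Hypothetical sample data; only the column names come from the question.
df = pd.DataFrame({
    MASTER: [101.0, 102.0, 103.0, 104.0],
    PHASE: [101.0, np.nan, np.nan, 104.0],
    DIST: [101.0, 102.0, np.nan, np.nan],
})

# Variant 1: fill every missing phase ID from the master column.
filled_any = df[PHASE].fillna(df[MASTER])

# Variant 2: fill from the master column only where both other columns are missing.
both_missing = df[PHASE].isna() & df[DIST].isna()
filled_both = df[PHASE].mask(both_missing, df[MASTER])

print(filled_any.tolist())
print(filled_both.tolist())
```

In the second variant, a row that has a distance ID but no phase ID keeps its missing phase value, because the replacement applies only when both are absent.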

You can also do this, arguably a bit more easily, with NumPy:

import numpy as np
df['IDs of phase known'] = np.where(df['IDs of phase known'].isna(), df['Set of Ids known for transformers'], df['IDs of phase known'])

if you just want to replace missing values with the master column values, or

df['IDs of phase known'] = np.where(df['IDs of phase known'].isna() & df['Ids of distance known'].isna(), df['Set of Ids known for transformers'], df['IDs of phase known'])

if you only want to replace in cases where both columns are missing.
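The question can also be read differently: the three columns may be independent lists of IDs that are not aligned row by row, and the goal is to find which master IDs never appear in each of the other columns. Under that assumption, a membership test with isin is one way to sketch it. The missing_from helper and the sample IDs below are hypothetical, not part of the original answer.

```python
import pandas as pd

MASTER = "Set of Ids known for transformers"
PHASE = "IDs of phase known"
DIST = "Ids of distance known"

# Hypothetical sample data: shorter columns are padded with None, as happens
# when columns of unequal length are read from a spreadsheet or CSV file.
df = pd.DataFrame({
    MASTER: ["T1", "T2", "T3", "T4"],
    PHASE: ["T4", "T1", None, None],
    DIST: ["T2", None, None, None],
})

def missing_from(frame, master_col, other_col):
    """Return the master IDs that never appear in other_col (missing cells ignored)."""
    master = frame[master_col].dropna()
    known = frame[other_col].dropna()
    return master[~master.isin(known)].tolist()

missing_phase = missing_from(df, MASTER, PHASE)
missing_dist = missing_from(df, MASTER, DIST)

print(missing_phase)
print(missing_dist)
```

Which reading applies depends on how the data in the screenshot is laid out, so it is worth checking whether the IDs line up row by row before choosing between this and the row-wise approaches.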
