I need to drop all rows where a certain column has no value or is "null": Using Python and Pandas
I need to drop all rows where a certain column has no value, i.e. where it is "null". The problem is that I do not know the name of the column. I only know that it is the 5th column across, so I have tried combining iloc with methods like "notna" and "notnull" (see below). I have included a sample image of the type of data I am working with.
The reason I am trying to do this is that there is a varying number of junk rows at the top of my csv file/dataframe that I want to get rid of. The number of junk rows is different each time, so I cannot use something that just drops a known number of header rows. That is why I am trying to drop all rows that are null in that column: I know it will also get rid of all the junk rows at the top of my dataset.
These are some methods I have tried, but they didn't work:
df = df[df[df.iloc[:, 4]].notna()]
df = df[pd.notnull(df[df.iloc[:, 4])]
df = df.dropna(subset=[df.iloc[:, 5]])
So for example, in this image I am trying to drop all rows where column 5 (the Date column) is null, but that column's name is not "Date" yet because of the junk rows at the top. I am trying to get rid of all the rows that are null in column 5 so that only the populated rows remain and the junk rows at the top are eliminated:
See the table here
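Your first two versions have an extra df[]. df.iloc[:, 4] already returns the fifth column as a Series, so wrapping it in another df[...] tries to use that column's values as column labels. You can use either: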
df = df[df.iloc[:, 4].notna()]
Or:
df = df[pd.notnull(df.iloc[:, 4])]
To break it down more explicitly, these use boolean indexing. For example, the first one uses df.iloc[:, 4].notna() to get a boolean index that is True wherever the fifth column is not null, and then filters df with it:
notna_boolean_index = df.iloc[:, 4].notna()
df = df.loc[notna_boolean_index] # can also leave out `.loc` for boolean indexes
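As a rough, self-contained illustration of the boolean indexing described above, here is a minimal sketch. The DataFrame contents (two junk rows followed by a header row and two records) are made up for the example and are not the asker's actual data.

```python
import numpy as np
import pandas as pd

# Hypothetical data: two junk rows at the top, then the real header row
# and two records. Only the junk rows are null in the fifth column.
raw = pd.DataFrame([
    ["Report", np.nan, np.nan, np.nan, np.nan],
    ["Generated by X", np.nan, np.nan, np.nan, np.nan],
    ["ID", "Name", "Qty", "Price", "Date"],
    [1, "a", 2, 3.5, "2021-01-01"],
    [2, "b", 4, 1.0, "2021-01-02"],
])

# Boolean Series: True where the fifth column (position 4) is not null.
mask = raw.iloc[:, 4].notna()

# Keep only the rows where the mask is True.
cleaned = raw[mask]

# The pd.notnull form builds the same mask and gives the same result.
same = raw[pd.notnull(raw.iloc[:, 4])]

print(cleaned)
```

Because the column is selected by position with iloc, its name never needs to be known.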
You can simply parse your data by passing na_values and then call dropna. To handle the junk rows at the top, you can use skiprows while reading the csv. Below is sample code that might help you achieve the above.
Read the csv:
df = pd.read_csv('/tmp/test.csv', na_values=['null'], keep_default_na=True, skiprows=3)
Although I believe "null" is treated as an NA value by default, you can pass it explicitly as above to be safe.
Then you can simply drop the NA rows based on a column:
df = df.dropna(subset=[column_name])
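When the column name is not known, one possible way to apply the same dropna idea is to look up the label by position with df.columns[4]. The sketch below is only an illustration under stated assumptions: the CSV text is invented, it is read from an in-memory string rather than a file, and header=None is used instead of skiprows so that the number of junk rows does not have to be known in advance.

```python
import io

import pandas as pd

# Hypothetical CSV content: two junk lines, then the real header and records.
# One record has the literal text "null" in the fifth column.
csv_text = """Some report title,,,,
Exported on Monday,,,,
ID,Name,Qty,Price,Date
1,a,2,3.5,2021-01-01
2,b,4,1.0,null
3,c,1,2.0,2021-01-03
"""

# header=None keeps every line as data, so columns get integer labels 0..4.
df = pd.read_csv(
    io.StringIO(csv_text),
    header=None,
    na_values=["null"],
    keep_default_na=True,
)

# Look up the label of the fifth column by position instead of by name.
fifth_column = df.columns[4]

# Drop every row that is NA in that column: the junk rows and the "null" record.
df = df.dropna(subset=[fifth_column])

print(df)
```

Note that dropna expects column labels in subset, which is why the label is taken from df.columns rather than passing the column itself.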