Unable to remove rows from dataframe based on condition
So I have a dataframe, df:
Rank Name Platform ... JP_Sales Other_Sales Global_Sales
0 1 Wii Sports Wii ... 3.77 8.46 82.74
1 2 Super Mario Bros. NES ... 6.81 0.77 40.24
2 3 Mario Kart Wii Wii ... 3.79 3.31 35.82
3 4 Wii Sports Resort Wii ... 3.28 2.96 33.00
4 5 Pokemon Red/Pokemon Blue GB ... 10.22 1.00 31.37
... ... ... ... ... ... ... ...
16593 16596 Woody Woodpecker in Crazy Castle 5 GBA ... 0.00 0.00 0.01
16594 16597 Men in Black II: Alien Escape GC ... 0.00 0.00 0.01
16595 16598 SCORE International Baja 1000: The Official Game PS2 ... 0.00 0.00 0.01
16596 16599 Know How 2 DS ... 0.00 0.00 0.01
16597 16600 Spirits & Spells GBA ... 0.00 0.00 0.01
I used df.describe and it shows that the Year count is lower than the counts for the other columns.
So I thought that some values in Year are empty. I tried doing df.dropna() but that didn't work.
I then tried printing the values of the Year column that were not numbers, along with their type(), using this code (probably not the best code, but it works):
with open("vgsales.csv", "r") as csv_file:
    rows = csv_file.read().split("\n")
    row_components = [row.split(",") for row in rows if len(row) > 0]
    data_dict = {header:[] for header in row_components[0]}
    for header_index, header in enumerate(row_components[0]):
        print("header_index: ", header_index)
        for row_index, row in enumerate(row_components[1:]):
            data_dict[header].append(row[header_index])
    for i in data_dict["Year"]:
        if not i.isdigit():
            print(i, type(i))
The output (the same line repeated many times):
N/A <class 'str'>
So then I tried the answers I found in this Stack Overflow question: df = df[df.Year != "N/A"] and it didn't work either.
I also tried df = df.drop(df[(df.Year == "N/A")].index) and it didn't work.
So then I thought, why don't I open the file in Excel and see what values are there when it is not a year? Indeed it was N/A.
Any ideas what I can do? I want to clean the data so that all the columns have the same count for a machine learning project.
First off, it's important to know why you're missing data, and to see whether you can impute the missing values rather than just dropping the rows.
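To make the "impute rather than drop" option concrete, here is a minimal sketch. The three-row CSV is a made-up stand-in for vgsales.csv, and filling Year with the column median is only one possible choice that I am assuming for illustration; whether it makes sense depends on why the years are missing.

```python
import io
import pandas as pd

# Hypothetical three-row sample standing in for vgsales.csv.
csv_text = "Name,Year,Global_Sales\nA,2006,82.74\nB,N/A,40.24\nC,2008,35.82\n"
df = pd.read_csv(io.StringIO(csv_text))

# Count the missing values per column before deciding what to do with them.
missing_per_column = df.isna().sum()
print(missing_per_column)

# One possible imputation (an assumption, not a recommendation):
# fill missing years with the median of the known years.
imputed = df.copy()
imputed["Year"] = imputed["Year"].fillna(imputed["Year"].median())
print(imputed)
```

This keeps every row, at the cost of putting a guessed value into Year.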
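If you still want to drop, you can use df = df.dropna(how='any'). Note the assignment: dropna() returns a new DataFrame rather than modifying df in place, so calling df.dropna() on its own leaves df unchanged.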
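A likely reason the bare df.dropna() call seemed not to work is that its result was never assigned back. The sketch below shows the difference on a made-up three-row sample (the data is hypothetical; only the column names come from the question).

```python
import io
import pandas as pd

# Hypothetical three-row sample standing in for vgsales.csv.
csv_text = "Name,Year,Global_Sales\nA,2006,82.74\nB,N/A,40.24\nC,2008,35.82\n"
df = pd.read_csv(io.StringIO(csv_text))

# dropna() returns a new DataFrame; without assignment df keeps all rows.
df.dropna()
rows_after_bare_call = len(df)

# Assigning the result back actually removes the rows with missing values.
df = df.dropna(how="any")
rows_after_assignment = len(df)

# After dropping, every column has the same non-null count.
counts = df.count()
print(rows_after_bare_call, rows_after_assignment)
print(counts)
```

If only the Year column matters, df.dropna(subset=["Year"]) limits the check to that column.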
The reason why Excel shows "N/A" as the value for missing data is that this is how the missing data is displayed there. It doesn't mean that the value of the cell that is missing data is N/A; that would be a string containing an N, a slash, and an A. Instead, you can try df = df[~df['Year'].isnull()] as an alternative method for selecting non-null values.
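One more detail that may explain the string comparison failing: by default, pandas.read_csv treats the text N/A as a missing-value marker and stores it as NaN, so nothing in df.Year ever equals the string "N/A". A small sketch on a made-up sample (the data is hypothetical, and I am assuming the file was loaded with read_csv using default settings):

```python
import io
import pandas as pd

# Hypothetical three-row sample standing in for vgsales.csv.
csv_text = "Name,Year,Global_Sales\nA,2006,82.74\nB,N/A,40.24\nC,2008,35.82\n"

# Default parsing: "N/A" becomes NaN and Year becomes a float column.
df = pd.read_csv(io.StringIO(csv_text))
string_filtered = df[df.Year != "N/A"]    # no value equals "N/A", so nothing is removed
null_filtered = df[~df["Year"].isnull()]  # removes the row whose Year is NaN

# With keep_default_na=False the literal text is kept as a string,
# and only then does the string comparison filter anything.
raw = pd.read_csv(io.StringIO(csv_text), keep_default_na=False)
raw_filtered = raw[raw.Year != "N/A"]

print(len(string_filtered), len(null_filtered), len(raw_filtered))
```

In the raw variant Year stays a column of strings, so it would still need converting to numbers before modelling.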