Lost data in pre-processing data using Python, numpy and pandas
I am working with a UCI data set file URL, and using Python to remove the lines that contain "?". The data contains 303 instances, and 6 lines contain "?". My code for cleaning the data is below:
import numpy as np
import pandas as pd
import scipy as sp

def dataGen():
    infile = open('d:\\Data\\processed.cleveland.data', "r")
    outfile = open('d:\\Data\\clean.processed.cleveland.data', "w")
    lines = infile.readlines()
    for line in lines:
        if '?' not in line:
            outfile.write(line)
    dataset = np.asarray(pd.read_csv('d:\\Data\\clean.processed.cleveland.data', header=None))
    return dataset
However, I got only 269 instances after the cleaning. I printed the dataset and found that the last line (269th line) is:
268 46 1 4 140 311 0 0 120 1 1.0 NaN NaN NaN NaN
I do not know what is going wrong in the program. I checked the output file, and it shows the data correctly.
All you're doing is skipping rows that contain "?", so you can just filter these out using apply:
In [11]:
import io
import pandas as pd
t="""56.0,1.0,2.0,130.0,221.0,0.0,2.0,163.0,0.0,0.0,1.0,0.0,7.0,0
58.0,1.0,2.0,125.0,220.0,0.0,0.0,144.0,0.0,0.4,2.0,?,7.0,0
57.0,0.0,2.0,130.0,236.0,0.0,2.0,174.0,0.0,0.0,2.0,1.0,3.0,1
38.0,1.0,3.0,138.0,175.0,0.0,0.0,173.0,0.0,0.0,1.0,?,3.0,0"""
df = pd.read_csv(io.StringIO(t), header=None)
df
Out[11]:
0 1 2 3 4 5 6 7 8 9 10 11 12 13
0 56 1 2 130 221 0 2 163 0 0.0 1 0.0 7 0
1 58 1 2 125 220 0 0 144 0 0.4 2 ? 7 0
2 57 0 2 130 236 0 2 174 0 0.0 2 1.0 3 1
3 38 1 3 138 175 0 0 173 0 0.0 1 ? 3 0
In [14]:
df[~df.apply(lambda x: x== '?', axis=1).any(axis=1)]
Out[14]:
0 1 2 3 4 5 6 7 8 9 10 11 12 13
0 56 1 2 130 221 0 2 163 0 0 1 0.0 7 0
2 57 0 2 130 236 0 2 174 0 0 2 1.0 3 1
So in your case the following should work:
infile = pd.read_csv('d:\\Data\\processed.cleveland.data', header=None)
outfile = infile[~infile.apply(lambda x: x== '?', axis=1).any(axis=1)]
dataset = np.asarray(outfile)
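As a possible alternative to the apply filter, read_csv can be told to treat "?" as a missing value through its na_values parameter, after which dropna removes the affected rows. The sketch below reuses the four sample rows from above; the variable name clean is only illustrative. A side effect is that column 11 is parsed as numeric instead of as strings.

```python
import io

import numpy as np
import pandas as pd

# Same four sample rows as in the example above.
t = """56.0,1.0,2.0,130.0,221.0,0.0,2.0,163.0,0.0,0.0,1.0,0.0,7.0,0
58.0,1.0,2.0,125.0,220.0,0.0,0.0,144.0,0.0,0.4,2.0,?,7.0,0
57.0,0.0,2.0,130.0,236.0,0.0,2.0,174.0,0.0,0.0,2.0,1.0,3.0,1
38.0,1.0,3.0,138.0,175.0,0.0,0.0,173.0,0.0,0.0,1.0,?,3.0,0"""

# Parse "?" as NaN so every column stays numeric.
df = pd.read_csv(io.StringIO(t), header=None, na_values="?")

# Drop every row that has at least one missing value.
clean = df.dropna()

dataset = np.asarray(clean)
print(dataset.shape)
```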
I found the problem. I should have added the statement outfile.close() before reading the new file. Now the problem is solved.
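The rows most likely went missing because the output file was still open when read_csv read it, so the tail of the written data was still in the write buffer and had not reached the disk yet. That would also explain the truncated last row padded with NaN. A minimal sketch of the same cleaning step using with blocks, which close both files automatically; the function name clean_file and the sample rows are made up for illustration:

```python
import os
import tempfile

import numpy as np
import pandas as pd


def clean_file(src_path, dst_path):
    """Copy src_path to dst_path, skipping every line that contains '?'."""
    # Both files are closed (and the output flushed) when the with block exits.
    with open(src_path, "r") as infile, open(dst_path, "w") as outfile:
        for line in infile:
            if "?" not in line:
                outfile.write(line)
    # Read the cleaned file only after it has been closed.
    return np.asarray(pd.read_csv(dst_path, header=None))


# Small demonstration on a temporary file with made-up rows.
sample = "1.0,2.0,3.0\n4.0,?,6.0\n7.0,8.0,9.0\n"
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "raw.data")
    dst = os.path.join(tmp, "clean.data")
    with open(src, "w") as f:
        f.write(sample)
    dataset = clean_file(src, dst)

print(dataset.shape)
```

With the real data, the call would be clean_file('d:\\Data\\processed.cleveland.data', 'd:\\Data\\clean.processed.cleveland.data').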