Lost data in pre-processing data using Python, numpy and pandas

I am working with a UCI data set file (URL) and using Python to remove the lines that contain "?". The data contains 303 instances, and 6 lines contain "?". My code for cleaning the data is below:

import numpy as np 
import pandas as pd
import scipy as sp

def dataGen():
    infile = open('d:\\Data\\processed.cleveland.data',"r")
    outfile = open('d:\\Data\\clean.processed.cleveland.data',"w")
    lines = infile.readlines()
    for line in lines:
        if '?' not in line:
            outfile.write(line)
    dataset = np.asarray(pd.read_csv('d:\\Data\\clean.processed.cleveland.data', header=None))
    return dataset

However, I got only 269 instances after the cleaning instead of 297 (303 minus 6). I printed dataset and found that the last line (the 269th line) is:

268  46   1   4  140  311   0   0  120   1  1.0 NaN NaN NaN NaN

I do not know what is going wrong in the program. I checked the output file, and it shows the data correctly.

All you're doing is skipping rows that contain "?", so you can just filter these out using apply:

In [11]:
import io
import pandas as pd

t="""56.0,1.0,2.0,130.0,221.0,0.0,2.0,163.0,0.0,0.0,1.0,0.0,7.0,0
58.0,1.0,2.0,125.0,220.0,0.0,0.0,144.0,0.0,0.4,2.0,?,7.0,0
57.0,0.0,2.0,130.0,236.0,0.0,2.0,174.0,0.0,0.0,2.0,1.0,3.0,1
38.0,1.0,3.0,138.0,175.0,0.0,0.0,173.0,0.0,0.0,1.0,?,3.0,0"""

df = pd.read_csv(io.StringIO(t), header=None)
df

Out[11]:
   0   1   2    3    4   5   6    7   8    9   10   11  12  13
0  56   1   2  130  221   0   2  163   0  0.0   1  0.0   7   0
1  58   1   2  125  220   0   0  144   0  0.4   2    ?   7   0
2  57   0   2  130  236   0   2  174   0  0.0   2  1.0   3   1
3  38   1   3  138  175   0   0  173   0  0.0   1    ?   3   0

In [14]:
df[~df.apply(lambda x: x== '?', axis=1).any(axis=1)]

Out[14]:
   0   1   2    3    4   5   6    7   8   9   10   11  12  13
0  56   1   2  130  221   0   2  163   0   0   1  0.0   7   0
2  57   0   2  130  236   0   2  174   0   0   2  1.0   3   1

So in your case the following should work:

infile = pd.read_csv('d:\\Data\\processed.cleveland.data', header=None)
outfile = infile[~infile.apply(lambda x: x== '?', axis=1).any(axis=1)]
dataset = np.asarray(outfile)
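
As an alternative sketch (not the method used in the answer above), pandas can treat "?" as a missing value while parsing, via the na_values parameter of read_csv, and dropna() then removes the affected rows. The sample rows are the same four lines used in the example above; the helper name load_without_missing is hypothetical and only for illustration.

```python
import io

import numpy as np
import pandas as pd

# Same four sample rows as in the example above.
SAMPLE = """56.0,1.0,2.0,130.0,221.0,0.0,2.0,163.0,0.0,0.0,1.0,0.0,7.0,0
58.0,1.0,2.0,125.0,220.0,0.0,0.0,144.0,0.0,0.4,2.0,?,7.0,0
57.0,0.0,2.0,130.0,236.0,0.0,2.0,174.0,0.0,0.0,2.0,1.0,3.0,1
38.0,1.0,3.0,138.0,175.0,0.0,0.0,173.0,0.0,0.0,1.0,?,3.0,0"""


def load_without_missing(source):
    """Hypothetical helper: parse CSV, read '?' as NaN, drop incomplete rows."""
    df = pd.read_csv(source, header=None, na_values="?")
    return df.dropna()


clean = load_without_missing(io.StringIO(SAMPLE))
dataset = np.asarray(clean)
print(dataset.shape)
```

Because "?" is converted to NaN during parsing, the column that contained it is read as numeric rather than as strings, which may be convenient if the array is used for numerical work afterwards.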

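I found the problem. I needed to add the statement outfile.close() before reading the new file. Without it, the buffered output had presumably not been fully flushed to disk when pd.read_csv opened the file, so the last rows were missing or truncated. Now the problem is solved.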
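
One way to make sure the output file is closed before it is read back is to open both files in a with block, which closes (and flushes) them when the block exits. The following is a minimal sketch under that assumption; the function name data_gen, its parameters, and the temporary file names are illustrative and not from the original post.

```python
import os
import tempfile

import numpy as np
import pandas as pd


def data_gen(in_path, out_path):
    """Copy lines without '?' to out_path, then load the cleaned file."""
    # Both files are closed when the with block exits, so the cleaned
    # file is complete on disk before pandas reads it.
    with open(in_path, "r") as infile, open(out_path, "w") as outfile:
        for line in infile:
            if "?" not in line:
                outfile.write(line)
    return np.asarray(pd.read_csv(out_path, header=None))


# Small self-contained demo in a temporary directory (illustrative data).
with tempfile.TemporaryDirectory() as tmp:
    raw = os.path.join(tmp, "raw.data")
    cleaned = os.path.join(tmp, "clean.data")
    with open(raw, "w") as f:
        f.write("56.0,1.0,0.0,0\n58.0,1.0,?,0\n57.0,0.0,1.0,1\n")
    dataset = data_gen(raw, cleaned)
    print(dataset.shape)
```

The read_csv call sits outside the with block on purpose: reading the file only after it has been closed is what avoids the truncated last row.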
