Preprocessing missing data with Python, numpy and pandas

Question

I am working with a UCI data set file (URL) and using Python to remove the lines that contain "?". The data contains 303 instances, and 6 lines contain "?". My code for cleaning the data is below:

import numpy as np 
import pandas as pd
import scipy as sp

def dataGen():
    infile = open('d:\\Data\\processed.cleveland.data',"r")
    outfile = open('d:\\Data\\clean.processed.cleveland.data',"w")
    lines = infile.readlines()
    for line in lines:
        if '?' not in line:
            outfile.write(line)
    dataset = np.asarray(pd.read_csv('d:\\Data\\clean.processed.cleveland.data', header=None))
    return dataset

However, I got only 269 instances after the cleaning. I printed the dataset and found that the last line (the 269th line) is:

268  46   1   4  140  311   0   0  120   1  1.0 NaN NaN NaN NaN

I do not know what is going wrong in the program. I checked the output file, and it shows the data correctly.

Answer 1

All you're doing is skipping rows that contain ?, so you can just filter these out using apply:

In [11]:
import io
import pandas as pd

t="""56.0,1.0,2.0,130.0,221.0,0.0,2.0,163.0,0.0,0.0,1.0,0.0,7.0,0
58.0,1.0,2.0,125.0,220.0,0.0,0.0,144.0,0.0,0.4,2.0,?,7.0,0
57.0,0.0,2.0,130.0,236.0,0.0,2.0,174.0,0.0,0.0,2.0,1.0,3.0,1
38.0,1.0,3.0,138.0,175.0,0.0,0.0,173.0,0.0,0.0,1.0,?,3.0,0"""

df = pd.read_csv(io.StringIO(t), header=None)
df

Out[11]:
   0   1   2    3    4   5   6    7   8    9   10   11  12  13
0  56   1   2  130  221   0   2  163   0  0.0   1  0.0   7   0
1  58   1   2  125  220   0   0  144   0  0.4   2    ?   7   0
2  57   0   2  130  236   0   2  174   0  0.0   2  1.0   3   1
3  38   1   3  138  175   0   0  173   0  0.0   1    ?   3   0

In [14]:
df[~df.apply(lambda x: x== '?', axis=1).any(axis=1)]

Out[14]:
   0   1   2    3    4   5   6    7   8   9   10   11  12  13
0  56   1   2  130  221   0   2  163   0   0   1  0.0   7   0
2  57   0   2  130  236   0   2  174   0   0   2  1.0   3   1

So in your case the following should work:

infile = pd.read_csv('d:\\Data\\processed.cleveland.data', header=None)
outfile = infile[~infile.apply(lambda x: x== '?', axis=1).any(axis=1)]
dataset = np.asarray(outfile)
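
As an alternative to the apply filter above, here is a minimal sketch that lets pandas treat "?" as missing while parsing, using the na_values parameter of read_csv followed by dropna. The sample rows and the variable names (sample, clean) are only illustrative, and this is not the answerer's method; with the real file you would pass its path instead of the StringIO object.

```python
import io

import pandas as pd

# Hypothetical sample rows in the same layout as the example above.
sample = """56.0,1.0,2.0,130.0,221.0,0.0,2.0,163.0,0.0,0.0,1.0,0.0,7.0,0
58.0,1.0,2.0,125.0,220.0,0.0,0.0,144.0,0.0,0.4,2.0,?,7.0,0
57.0,0.0,2.0,130.0,236.0,0.0,2.0,174.0,0.0,0.0,2.0,1.0,3.0,1
38.0,1.0,3.0,138.0,175.0,0.0,0.0,173.0,0.0,0.0,1.0,?,3.0,0"""

# Treat "?" as a missing value while parsing, then drop incomplete rows.
df = pd.read_csv(io.StringIO(sample), header=None, na_values="?")
clean = df.dropna()

# Every column is numeric now, so the array has a float dtype.
dataset = clean.to_numpy()
```

One side effect of this approach is that the column which held "?" is parsed as numbers rather than as strings, so the resulting array does not end up with an object dtype.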

Answer 2

I found the problem. I should add the statement outfile.close() before reading the new file. Now the problem is solved.
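
The truncated last row is consistent with buffered output that had not yet been flushed to disk when read_csv opened the file. A minimal sketch of the same fix using a with block, which closes both files automatically, might look like this. The function name data_gen, its path parameters, and the temporary demo data are assumptions made for illustration only.

```python
import os
import tempfile

import numpy as np
import pandas as pd


def data_gen(src_path, dst_path):
    """Copy lines without '?' to dst_path, then load the result as an array."""
    # The with block closes both files on exit, which flushes buffered
    # writes to disk before read_csv opens the cleaned file.
    with open(src_path, "r") as infile, open(dst_path, "w") as outfile:
        for line in infile:
            if "?" not in line:
                outfile.write(line)
    return np.asarray(pd.read_csv(dst_path, header=None))


# Hypothetical demo data written to a temporary directory.
tmp_dir = tempfile.mkdtemp()
src = os.path.join(tmp_dir, "raw.data")
dst = os.path.join(tmp_dir, "clean.data")
with open(src, "w") as f:
    f.write("1.0,2.0,3.0\n4.0,?,6.0\n7.0,8.0,9.0\n")

result = data_gen(src, dst)
```

With the original paths, the call would be data_gen('d:\\Data\\processed.cleveland.data', 'd:\\Data\\clean.processed.cleveland.data').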

Answer 1 (accepted), score 1, posted 2015-11-30 09:20:06
Answer 2, score 0, posted 2015-11-30 08:54:53
