
Python: remove special values from a data set

I have a large data set of movies, and I am using the pandas package.

The variable "budget" (object dtype) often contains the "?" character.

Now I want to remove all movies that contain a "?" in the budget variable.

In the end I want to convert the budget variable to an integer and run a regression with the variable "quality".

[Picture of the data set]

I have tried one method, but it did not work:

while "?" in df.budget: df.remove(?)

Compare the NumPy array returned by values with '?' using not equal ( != ), or first convert all values to strings with astype(str). Then call all(axis=1) to check that every value in a row is True, and finally filter by boolean indexing:

import pandas as pd

df = pd.DataFrame({'A':list('abcde?'),
                   'B':[4,5,4,5,5,4],
                   'C':[7,8,9,4,2,3],
                   'D':[4,3,5,7,1,0],
                   'E':[5,3,6,9,2,4],
                   'F':list('aaabbb')})

print (df)
   A  B  C  D  E  F
0  a  4  7  4  5  a
1  b  5  8  3  3  a
2  c  4  9  5  6  a
3  d  5  4  7  9  b
4  e  5  2  1  2  b
5  ?  4  3  0  4  b

df = df[(df.values != '?').all(axis=1)]
#alternative
#df = df[(df.astype(str) != '?').all(axis=1)]
print (df)
   A  B  C  D  E  F
0  a  4  7  4  5  a
1  b  5  8  3  3  a
2  c  4  9  5  6  a
3  d  5  4  7  9  b
4  e  5  2  1  2  b

Details:

print (df.values != '?')
[[ True  True  True  True  True  True]
 [ True  True  True  True  True  True]
 [ True  True  True  True  True  True]
 [ True  True  True  True  True  True]
 [ True  True  True  True  True  True]
 [False  True  True  True  True  True]]

print ((df.values != '?').all(axis=1))
[ True  True  True  True  True False]

EDIT: Alternatively, replace the ? values with NaN in selected columns and fill them with the column means:

df = pd.DataFrame({'A':list('abcde?'),
                   'B':[4,5,4,5,5,4],
                   'C':[7,8,9,4,2,3],
                   'D':[4,3,5,7,1,'?'],
                   'E':['?',3,6,'?',2,4]}).astype(str)

print (df)
   A  B  C  D  E
0  a  4  7  4  ?
1  b  5  8  3  3
2  c  4  9  5  6
3  d  5  4  7  ?
4  e  5  2  1  2
5  ?  4  3  ?  4

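import numpy as np

#replace only in columns from list
cols = ['C','D','E']
#if ? is the only non-numeric value, replace it with NaN and convert to float
df[cols] = df[cols].replace('?', np.nan).astype(float)

#alternative: convert every non-numeric value to NaN
#df[cols] = df[cols].apply(lambda x: pd.to_numeric(x, errors='coerce'))

#replace NaNs with the column means (numeric_only skips the string columns A and B)
df = df.fillna(df.mean(numeric_only=True))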
print (df)
   A  B    C    D     E
0  a  4  7.0  4.0  3.75
1  b  5  8.0  3.0  3.00
2  c  4  9.0  5.0  6.00
3  d  5  4.0  7.0  3.75
4  e  5  2.0  1.0  2.00
5  ?  4  3.0  4.0  4.00
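
If rows with an unusable budget should be dropped rather than filled, pd.to_numeric with errors='coerce' followed by dropna handles both steps, and the cleaned column can then feed a regression against quality as the question intends. The sketch below uses np.polyfit for a simple linear fit. That choice is an assumption, since the question does not say which regression tool is wanted, and the data is invented:

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: column names follow the question, values are invented.
movies = pd.DataFrame({'budget': ['10', '?', '20', '30', '?', '40'],
                       'quality': [2.0, 9.9, 4.0, 6.0, 1.1, 8.0]})

# Anything that cannot be parsed as a number (including '?') becomes NaN.
movies['budget'] = pd.to_numeric(movies['budget'], errors='coerce')

# Drop the rows without a usable budget, then cast the column to int.
movies = movies.dropna(subset=['budget']).astype({'budget': int})

# Least-squares fit of quality = slope * budget + intercept.
slope, intercept = np.polyfit(movies['budget'].to_numpy(),
                              movies['quality'].to_numpy(), deg=1)
print(slope, intercept)
```

For more than one predictor or for summary statistics, a dedicated package such as statsmodels or scikit-learn would be the more usual choice.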
