
Replacing blank values (white space) with NaN in pandas

I want to find all values in a Pandas dataframe that contain whitespace (any arbitrary amount) and replace those values with NaNs.

Any ideas how this can be improved?

Basically I want to turn this:

                   A    B    C
2000-01-01 -0.532681  foo    0
2000-01-02  1.490752  bar    1
2000-01-03 -1.387326  foo    2
2000-01-04  0.814772  baz     
2000-01-05 -0.222552         4
2000-01-06 -1.176781  qux     

Into this:

                   A     B     C
2000-01-01 -0.532681   foo     0
2000-01-02  1.490752   bar     1
2000-01-03 -1.387326   foo     2
2000-01-04  0.814772   baz   NaN
2000-01-05 -0.222552   NaN     4
2000-01-06 -1.176781   qux   NaN

I've managed to do it with the code below, but man is it ugly. It's not Pythonic and I'm sure it's not the most efficient use of pandas either. I loop through each column and do boolean replacement against a column mask generated by applying a function that does a regex search of each value, matching on whitespace.

for i in df.columns:
    df[i][df[i].apply(lambda v: True if re.search(r'^\s*$', str(v)) else False)] = None

It could be optimized a bit by only iterating through fields that could contain empty strings:

if df[i].dtype == np.dtype('object'):

But that's not much of an improvement.

And finally, this code sets the target strings to None, which works with pandas functions like fillna(), but it would be nice for completeness if I could actually insert a NaN directly instead of None.
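For instance, one way to insert np.nan directly (rather than None) is to build a boolean mask of whitespace-only cells and assign through it. A minimal sketch with made-up data, using Series.str.fullmatch (available since pandas 1.1):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'B': ['foo', 'bar', '   ', 'baz'],
                   'C': [0, 1, 4, '  ']})

# True for cells that are empty or contain only whitespace
mask = df.apply(lambda col: col.astype(str).str.fullmatch(r'\s*'))
df[mask] = np.nan  # assigns a real NaN, not None
```

Assigning through a boolean DataFrame like this avoids the per-column loop entirely.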

I think df.replace() does the job, since pandas 0.13:

df = pd.DataFrame([
    [-0.532681, 'foo', 0],
    [1.490752, 'bar', 1],
    [-1.387326, 'foo', 2],
    [0.814772, 'baz', ' '],     
    [-0.222552, '   ', 4],
    [-1.176781,  'qux', '  '],         
], columns='A B C'.split(), index=pd.date_range('2000-01-01','2000-01-06'))

# replace field that's entirely space (or empty) with NaN
print(df.replace(r'^\s*$', np.nan, regex=True))

Produces:

                   A    B   C
2000-01-01 -0.532681  foo   0
2000-01-02  1.490752  bar   1
2000-01-03 -1.387326  foo   2
2000-01-04  0.814772  baz NaN
2000-01-05 -0.222552  NaN   4
2000-01-06 -1.176781  qux NaN

As Temak pointed out, use df.replace(r'^\s+$', np.nan, regex=True) in case your valid data contains white spaces.
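The difference is easy to check on a throwaway Series: `\s*` also matches the empty string, while `\s+` requires at least one whitespace character:

```python
import numpy as np
import pandas as pd

s = pd.Series(['foo', '', '   '])

star = s.replace(r'^\s*$', np.nan, regex=True)  # '' and '   ' become NaN
plus = s.replace(r'^\s+$', np.nan, regex=True)  # only '   ' becomes NaN
```

So pick `\s+` only if genuinely empty strings should survive the replacement.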

If you want to replace empty strings and records containing only spaces, the correct answer is:

df = df.replace(r'^\s*$', np.nan, regex=True)

The accepted answer

df.replace(r'\s+', np.nan, regex=True)

does not replace an empty string! You can try it yourself with the given example, slightly updated:

df = pd.DataFrame([
    [-0.532681, 'foo', 0],
    [1.490752, 'bar', 1],
    [-1.387326, 'fo o', 2],
    [0.814772, 'baz', ' '],     
    [-0.222552, '   ', 4],
    [-1.176781,  'qux', ''],         
], columns='A B C'.split(), index=pd.date_range('2000-01-01','2000-01-06'))

Note also that 'fo o' is not replaced with NaN, though it contains a space. Further note that a simple:

df.replace(r'', np.nan)

does not work either - try it out.

How about:

d = d.applymap(lambda x: np.nan if isinstance(x, basestring) and x.isspace() else x)

The applymap function applies a function to every cell of the dataframe.
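In Python 3, basestring no longer exists, so the same idea can be checked against str instead. A small sketch (note that ''.isspace() is False, so genuinely empty strings are left alone by this variant):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'B': ['foo', '   ', '', 'baz']})

# Replace whitespace-only strings (but not empty strings) with NaN
out = df.applymap(lambda x: np.nan if isinstance(x, str) and x.isspace() else x)
```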

I did this:

df = df.apply(lambda x: x.str.strip()).replace('', np.nan)  # assumes every column is of string dtype

or

df = df.apply(lambda x: x.str.strip() if x.dtype == 'object' else x).replace('', np.nan)

You can strip all strings, then replace empty strings with np.nan. (Note the dtype check in the second variant: df.apply passes whole columns to the lambda, so an isinstance(x, str) check there would never be true.)

If you are importing the data from a CSV file, it can be as simple as this:

df = pd.read_csv(file_csv, na_values=' ')

This will create the data frame as well as replace blank values with NaN.
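A self-contained check, reading the CSV from an in-memory string. Note that na_values=' ' only matches a cell that is exactly one space; for other widths, pass a list such as na_values=[' ', '  ']:

```python
import io

import pandas as pd

csv_data = "A,B\n1,foo\n2, \n3,bar\n"

# The single-space cell in column B is parsed as NaN
df = pd.read_csv(io.StringIO(csv_data), na_values=' ')
```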

The simplest of all solutions:

df = df.replace(r'^\s+$', np.nan, regex=True)

For a very fast and simple solution where you check equality against a single value, you can use the mask method.

df.mask(df == ' ')
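A quick check of the mask approach on a throwaway frame. Note the equality comparison only catches cells that are exactly one space:

```python
import pandas as pd

df = pd.DataFrame({'B': ['foo', ' ', 'bar'], 'C': [0, 4, 2]})

# Cells equal to ' ' are replaced with NaN; everything else is kept
out = df.mask(df == ' ')
```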

These are all close to the right answer, but I wouldn't say any solve the problem while remaining most readable to others reading your code. I'd say that answer is a combination of BrenBarn's answer and tuomasttik's comment below that answer. BrenBarn's answer utilizes the isspace builtin, but does not support removing empty strings, as the OP requested, and I would tend to attribute that as the standard use case of replacing strings with null.

I rewrote it with .applymap so it runs cell-wise on a pd.DataFrame (DataFrame.apply passes whole columns to the function, so an isinstance(x, str) check there never fires; on a pd.Series, use .apply instead).


Python 3:

To replace empty strings or strings of entirely spaces:

df = df.applymap(lambda x: np.nan if isinstance(x, str) and (x.isspace() or not x) else x)

To replace strings of entirely spaces:

df = df.applymap(lambda x: np.nan if isinstance(x, str) and x.isspace() else x)

To use this in Python 2, you'll need to replace str with basestring.

Python 2:

To replace empty strings or strings of entirely spaces:

df = df.applymap(lambda x: np.nan if isinstance(x, basestring) and (x.isspace() or not x) else x)

To replace strings of entirely spaces:

df = df.applymap(lambda x: np.nan if isinstance(x, basestring) and x.isspace() else x)

This worked for me. When I import my csv file I added na_values = ' '. Spaces are not included in the default NaN values.

df= pd.read_csv(filepath,na_values = ' ')
print(df.isnull().sum()) # check numbers of null value in each column

modifiedDf=df.fillna("NaN") # Replace empty/null values with "NaN"

# modifiedDf = df.dropna() # Remove rows with empty values

print(modifiedDf.isnull().sum()) # check numbers of null value in each column

This is not an elegant solution, but what does seem to work is saving to XLSX and then importing it back. The other solutions on this page did not work for me, unsure why.

data.to_excel(filepath, index=False)
data = pd.read_excel(filepath)

This should work

df.loc[df.Variable == '', 'Variable'] = 'Value'

or

df.loc[df.Variable1 == '', 'Variable2'] = 'Value'
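As a runnable sketch of the first form (the column name here is a placeholder), replacing empty strings in one column via .loc:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Variable': ['foo', '', 'bar']})

# Rows where Variable is the empty string get NaN in that column
df.loc[df.Variable == '', 'Variable'] = np.nan
```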

I tried this code and it worked for me:

df.applymap(lambda x: "NaN" if x == "" else x)

You can also use a filter to do it.

df = pd.DataFrame([
    [-0.532681, 'foo', 0],
    [1.490752, 'bar', 1],
    [-1.387326, 'foo', 2],
    [0.814772, 'baz', ' '],
    [-0.222552, '   ', 4],
    [-1.176781, 'qux', '  '],
], columns='A B C'.split())

# Boolean filter: mark cells that are blank after stripping whitespace
mask = df.applymap(lambda x: isinstance(x, str) and x.strip() == '')
df[mask] = np.nan
df['C'] = df['C'].astype(float)
