Pandas dataframe replace em-dash with nan

Question

I am trying to read a large number of .xls and .xlsx files with predominantly numeric data into Python using pd.read_excel. However, the files use em-dashes for missing values. I am trying to get Python to replace all of these em-dashes with NaNs, but I can't seem to find a way to get Python to even recognize the character, let alone replace it. I tried the following, which did not work:

df['var'].apply(lambda x: re.sub(u'\2014','',x))

I also tried simply:

df['var'].astype('float')

What would be the best way to convert all the em-dashes in a dataframe to NaNs while keeping the numeric data as floats?

Answer 1

You should catch the error at an earlier stage. Tell pd.read_excel() to treat em-dashes as NaNs:

df = pd.read_excel(..., na_values=['–','—'])
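As a rough illustration of how na_values behaves, here is a minimal sketch. It uses pd.read_csv on an in-memory string instead of pd.read_excel so that it runs without an Excel file or an Excel engine installed; both readers accept the na_values parameter. The column names and sample values are made up for the example.

```python
import io
import pandas as pd

# Hypothetical sample data: "\u2014" is the em-dash, "\u2013" the en-dash.
csv_text = "var,other\n1.5,\u2014\n\u2013,2\n3,4\n"

# List both dash characters as missing-value markers at read time.
df = pd.read_csv(io.StringIO(csv_text), na_values=["\u2013", "\u2014"])

# With the dashes parsed as NaN, both columns come out as floats.
print(df.dtypes)
print(df)
```

Because the dashes never reach the dataframe as strings, no cleanup pass should be needed afterwards.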

Answer 2

I think the most straightforward way to do this would be pd.to_numeric with the argument errors='coerce':

df['var'] = pd.to_numeric(df['var'], errors='coerce')

From the docs:

If 'coerce', then invalid parsing will be set as NaN
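A small sketch of the coercion on made-up data (the column name and values are assumptions for illustration only):

```python
import pandas as pd

# Hypothetical column mixing numbers, numeric strings, and dash placeholders.
df = pd.DataFrame({"var": ["1.5", "\u2014", 3, "\u2013"]})

# Anything that cannot be parsed as a number becomes NaN.
df["var"] = pd.to_numeric(df["var"], errors="coerce")

print(df["var"].dtype)
print(df)
```

Note that errors='coerce' does not single out dashes: any other unparsable cell, such as a stray text note, is turned into NaN as well, so it may hide data problems you would rather know about.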

Answer 3

df.replace({'-': None}) is what you are looking for. Found in another post on Stack Overflow.
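The snippet above matches the ASCII hyphen '-', which is a different character from the em-dash (U+2014) or en-dash (U+2013). A variation that targets the dash characters and uses np.nan might look like the sketch below; the sample frame is made up, and the extra pd.to_numeric step is an addition, included because replace alone may leave the columns with object dtype.

```python
import numpy as np
import pandas as pd

# Hypothetical frame where dashes stand in for missing values.
df = pd.DataFrame({"a": [1.0, "\u2014", 2.5], "b": ["\u2013", 4, 5]})

# Swap both dash characters for NaN across the whole dataframe.
cleaned = df.replace({"\u2014": np.nan, "\u2013": np.nan})

# Convert each column to a numeric dtype after the replacement.
cleaned = cleaned.apply(pd.to_numeric)

print(cleaned.dtypes)
print(cleaned)
```

Unlike errors='coerce', this only touches cells that are exactly a dash, so other unexpected strings would still raise an error in pd.to_numeric instead of disappearing silently.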

Answer 4

Not sure exactly what was going on with those dashes (which showed up as u'\u2013' when I ran df.get_value(0, 'var')), but I did find a solution that worked: it converted the dashes to NaNs and kept the numeric data as numbers.

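import unicodedata
import pandas as pd

# Python 2 code: unicode() does not exist in Python 3 (use str there).
df['var'] = df['var'].map(unicode)
# NFKD-normalize, then drop every non-ASCII character (including the dash).
df['var'] = df['var'].apply(lambda x: unicodedata.normalize('NFKD', x).encode('ascii', 'ignore'))
# Cells that held only a dash are now empty strings and become NaN.
df['var'] = pd.to_numeric(df['var'])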
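For Python 3, a rough equivalent of the same idea could look like the following sketch. The helper name strip_non_ascii and the sample data are hypothetical, and errors='coerce' is an addition here so that the emptied cells reliably end up as NaN.

```python
import unicodedata
import pandas as pd

# Hypothetical column where an en-dash marks a missing value.
df = pd.DataFrame({"var": ["1.5", "\u2013", "3"]})

def strip_non_ascii(value):
    """Normalize to NFKD and drop every character outside ASCII."""
    text = unicodedata.normalize("NFKD", str(value))
    # encode() returns bytes in Python 3, so decode back to str.
    return text.encode("ascii", "ignore").decode("ascii")

# Dash-only cells become empty strings, which are then coerced to NaN.
df["var"] = pd.to_numeric(df["var"].map(strip_non_ascii), errors="coerce")

print(df)
```

One caveat with this approach: it removes all non-ASCII characters, so a cell such as "1\u20142" would silently turn into the number 12.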

4 answers

Answer 1
4 2018-05-29 18:53:42

Answer 2
1 2018-05-29 18:51:14

Answer 3
0 2018-05-29 18:56:50

Answer 4
0 2018-05-29 21:12:28
