
Pandas dataframe replace em-dash with nan

I am trying to read a large number of .xls and .xlsx files with predominantly numeric data into Python using pd.read_excel. However, the files use em-dashes for missing values. I am trying to get Python to replace all these em-dashes with NaNs. I can't seem to find a way to get Python to even recognize the character, let alone replace it. I tried the following, which did not work:

df['var'].apply(lambda x: re.sub(u'\2014','',x))

I also tried simply:

df['var'].astype('float')

What would be the best way to convert all the em-dashes in a dataframe to NaNs, while keeping the numeric data as floats?

You should catch the error at an earlier stage. Tell pd.read_excel() to treat both en-dashes and em-dashes as NaNs:

df = pd.read_excel(..., na_values=['–','—'])
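As a minimal sketch of how na_values behaves, the example below uses pd.read_csv on an in-memory string so that it runs without an Excel file; pd.read_excel accepts the same na_values argument. The column names and values are made up for illustration.

```python
import io

import pandas as pd

# Hypothetical sample data: an en dash (U+2013) and an em dash (U+2014)
# stand in for missing values.
raw = "a,b\n1.5,\u2014\n\u2013,2\n3,4.25\n"

# na_values marks the listed strings as missing while parsing,
# so both columns come out numeric straight away.
df = pd.read_csv(io.StringIO(raw), na_values=["\u2013", "\u2014"])

print(df.dtypes)
print(df)
```

Writing the dashes as \u2013 and \u2014 escapes avoids any doubt about which character ends up in the source file.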

I think the most straightforward way to do this would be pd.to_numeric with the argument errors='coerce':

df['var'] = pd.to_numeric(df['var'], errors='coerce')

From the docs:

If 'coerce', then invalid parsing will be set as NaN
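A small illustration of the coercion, assuming the column arrives as a mix of real numbers, numeric strings, and dash strings (the sample values are hypothetical):

```python
import pandas as pd

# Hypothetical column as it might come out of read_excel:
# numbers mixed with dash strings, so the dtype is object.
s = pd.Series([1.5, "\u2014", 2, "\u2013", "3.25"])

# errors='coerce' turns anything that cannot be parsed into NaN.
converted = pd.to_numeric(s, errors="coerce")

print(converted)
```

Note that 'coerce' also silently turns any other unparseable value into NaN, so it may hide genuine data-entry problems along with the dashes.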

df.replace({'-': None}) is what you are looking for. Found in another post on Stack Overflow.
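Note that '-' in the line above is an ASCII hyphen, which would not match an en dash or em dash cell. A hedged sketch of the same idea applied to the dash characters from the question, with a made-up two-column frame and np.nan as the replacement to make the intent explicit:

```python
import numpy as np
import pandas as pd

# Hypothetical frame: numeric data with dash strings for missing values.
df = pd.DataFrame({"x": [1.0, "\u2014", 3.0], "y": ["\u2013", 2.5, 4.0]})

# Replace both dash characters everywhere in the frame.
cleaned = df.replace({"\u2013": np.nan, "\u2014": np.nan})

# The columns may still have object dtype, so convert each one explicitly.
cleaned = cleaned.apply(pd.to_numeric)

print(cleaned.dtypes)
print(cleaned)
```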

Not sure exactly what was going on with those dashes (which showed up as u'\u2013' when I ran df.get_value(0,'var')), but I did find a solution that worked, which converted the dashes to NaNs and kept the numeric data as numbers.

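import unicodedata
import pandas as pd

# Python 3 port (the original used Python 2's unicode type).
df['var'] = df['var'].map(str)
# The ASCII encode with 'ignore' drops the dash, leaving an empty string.
df['var'] = df['var'].apply(lambda x: unicodedata.normalize('NFKD', x).encode('ascii', 'ignore').decode('ascii'))
# errors='coerce' turns the emptied cells into NaN.
df['var'] = pd.to_numeric(df['var'], errors='coerce')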
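For anyone unsure which dash a file contains, the standard library can name the character. The sketch below also suggests one likely reason the re.sub attempt in the question matched nothing: '\2014' is parsed as an octal escape followed by a digit, not as the em dash, and the cells apparently held an en dash (U+2013) anyway.

```python
import unicodedata

en_dash = "\u2013"
em_dash = "\u2014"

# Look up the official Unicode names of the two characters.
print(unicodedata.name(en_dash))
print(unicodedata.name(em_dash))

# '\2014' is not the em dash: Python reads \201 as an octal escape
# (code point 0x81) followed by the literal character '4'.
octal_form = "\2014"
print(len(octal_form), [hex(ord(c)) for c in octal_form])
```

Calling unicodedata.name on a suspicious cell value is a quick way to decide what to pass to na_values or replace.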
