Pandas dataframe replace em-dash with nan
I am trying to read a large number of .xls and .xlsx files with predominantly numeric data into Python using pd.read_excel. However, the files use em-dashes for missing values. I am trying to get Python to replace all these em-dashes with NaNs. I can't seem to find a way to get Python to even recognize the character, let alone replace it. I tried the following, which did not work:
df['var'].apply(lambda x: re.sub(u'\2014','',x))
I also tried simply:
df['var'].astype('float')
What would be the best way to convert all the em-dashes in a dataframe to NaNs, while keeping the numeric data as floats?
You should catch the error at an earlier stage. Tell pd.read_excel() to treat em-dashes as NaNs:
df = pd.read_excel(..., na_values=['–','—'])
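To see the na_values idea in isolation, here is a rough sketch. It uses pd.read_csv on an in-memory string purely as a stand-in for pd.read_excel (both accept na_values), so it runs without an Excel file. The sample data and the column name var are made up for illustration.

```python
import io

import pandas as pd

# Made-up sample standing in for one spreadsheet column; the em dash (U+2014)
# and en dash (U+2013) mark the missing values.
raw = "var\n1.5\n\u2014\n2\n\u2013\n"

# read_csv is used only so the example needs no Excel file;
# pd.read_excel takes the same na_values argument.
df = pd.read_csv(io.StringIO(raw), na_values=["\u2013", "\u2014"])

print(df["var"].dtype)         # the column is parsed as floats
print(df["var"].isna().sum())  # number of cells treated as missing
```

Listing both dash characters is a cheap safeguard, since the two look nearly identical in a spreadsheet.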
I think the most straightforward way to do this would be pd.to_numeric with the argument errors='coerce':
df['var'] = pd.to_numeric(df['var'], errors='coerce')
If 'coerce', then invalid parsing will be set as NaN
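A small self-contained sketch of that behaviour, assuming a column that mixes numeric strings, plain numbers and dash placeholders (the sample values are invented):

```python
import pandas as pd

# Invented sample: numeric strings, a plain int, an em dash and an en dash.
s = pd.Series(["1.5", "\u2014", 2, "\u2013"])

# Anything that cannot be parsed as a number becomes NaN.
num = pd.to_numeric(s, errors="coerce")

print(num)
```

Keep in mind that errors='coerce' turns every unparseable value into NaN, not only the dashes, so genuine typos in the data will disappear silently as well.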
df.replace({'-': None}) is what you are looking for. Found in another post on Stack Overflow.
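A variation on the replace approach, sketched under two assumptions of mine: the cells hold the Unicode em dash or en dash rather than an ASCII hyphen, and np.nan is used in place of None so the column can be cast to float afterwards. The frame and its values are invented.

```python
import numpy as np
import pandas as pd

# Invented frame: floats mixed with an em dash (U+2014) and an en dash (U+2013).
df = pd.DataFrame({"var": [1.5, "\u2014", 2.0, "\u2013"]})

# Swap both dash characters for NaN, then cast the column to float.
cleaned = df.replace(["\u2014", "\u2013"], np.nan).astype(float)

print(cleaned["var"])
```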
Not sure exactly what was going on with those dashes (which showed up as u'\u2013' when I ran df.get_value(0,'var')), but I did find a solution that worked: it converted the dashes to NaNs and kept the numeric data as numbers.
import unicodedata
df['var']=df['var'].map(unicode)
df['var']=df['var'].apply(lambda x: unicodedata.normalize('NFKD', x).encode('ascii','ignore'))
df['var']=pd.to_numeric(df['var'])
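The snippet above is Python 2 (unicode no longer exists in Python 3). A possible Python 3 equivalent is sketched below; the helper name to_ascii and the sample data are mine. It relies on the dashes being reduced to empty strings, which pd.to_numeric then reads as NaN.

```python
import unicodedata

import pandas as pd


def to_ascii(value):
    """Normalize to NFKD, then drop every character that is not ASCII."""
    text = unicodedata.normalize("NFKD", str(value))
    return text.encode("ascii", "ignore").decode("ascii")


# Invented sample: numeric strings, an int, an en dash and an em dash.
df = pd.DataFrame({"var": ["1.5", "\u2013", 2, "\u2014"]})

# The dashes become empty strings, which to_numeric parses as NaN.
df["var"] = pd.to_numeric(df["var"].map(to_ascii))

print(df["var"])
```

Note that this strips every non-ASCII character, not just dashes, so it is a blunter tool than na_values or errors='coerce'.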