How to retain float precision with character values using pandas?

I have a data frame as shown below:

import numpy as np
import pandas as pd

df = pd.DataFrame({'source_code':['A250.00','C791.0','716.90','493.90','143.21','134.52'],
                   'source_description':['test1', 'test1','test2','test3','test4','test5'],
                   'key_id':[np.nan,np.nan,np.nan,np.nan,np.nan,np.nan]})

hash_file = pd.DataFrame({'source_id':['A250','C791','716.9','493.9','143.21','134.52'],
                          # one description per source_id, mirroring df
                          'source_code':['test1','test1','test2','test3','test4','test5'],
                          'hash_id':[911,512,713,814,616,717]})
id_file =  hash_file.set_index(['source_id','source_code'])['hash_id']

I would like to update the values of the key_id column by comparing the source_code and source_description columns of df with the source_id and source_code columns of hash_file.

So, I tried the following, based on this post:

df['key_id'] = df.set_index(['source_code','source_description']).index.map(id_file)

This works fine in normal scenarios, but it fails when the two values differ only in formatting, such as 250 versus 250.00 or 791.0 versus 791. In those cases the lookup finds no match and the output is incorrect.

So, I tried converting them to strings, but it still doesn't work.

I expect key_id to be filled in for every row whose code and description match, regardless of trailing zeros in the code.

If possible (that is, if every code is purely numeric, with no letter prefix such as A or C), convert the values to floats:

df['source_code'] = df['source_code'].astype(float)
hash_file['source_id'] = hash_file['source_id'].astype(float)

id_file =  hash_file.set_index(['source_id','source_code'])['hash_id']

df['key_id'] = df.set_index(['source_code','source_description']).index.map(id_file)


print (df)
   source_code source_description  key_id
0       250.00              test1     911
1       791.00              test1     512
2       716.90              test2     713
3       493.90              test3     814
4       143.21              test4     616
5       134.52              test5     717
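
The float conversion above only applies when every code parses as a number, which is not the case for values such as A250.00. As a rough sketch of how one might check this first, the snippet below uses pd.to_numeric with errors='coerce' to list the codes that cannot be converted. The variable names codes, as_float and non_numeric are illustrative, and the sample values are a subset of the question's data.

```python
import pandas as pd

# A few codes from the question, two of which carry a letter prefix.
codes = pd.Series(['A250.00', 'C791.0', '716.90', '493.90'])

# errors='coerce' turns anything that is not a number into NaN
# instead of raising, so the failures can be inspected.
as_float = pd.to_numeric(codes, errors='coerce')

# Codes that could not be converted to float.
non_numeric = codes[as_float.isna()].tolist()
print(non_numeric)  # ['A250.00', 'C791.0']
```

If this list is not empty, astype(float) would raise on the column, and a string-based cleanup is the safer route.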

However, float keys can suffer from precision problems. One possible trick is to multiply the values by a scalar such as 100 (enough for two decimal places) and then convert them to integers:

# round() before the cast, because astype(int) truncates toward zero
df['source_code'] = df['source_code'].astype(float).mul(100).round().astype(int)
hash_file['source_id'] = hash_file['source_id'].astype(float).mul(100).round().astype(int)

id_file =  hash_file.set_index(['source_id','source_code'])['hash_id']

df['key_id'] = df.set_index(['source_code','source_description']).index.map(id_file)


print (df)
   source_code source_description  key_id
0        25000              test1     911
1        79100              test1     512
2        71690              test2     713
3        49390              test3     814
4        14321              test4     616
5        13452              test5     717
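
The precision concern can be seen with plain Python floats. A minimal sketch follows; the value 0.29 is an illustrative choice and does not come from the original data. Casting to int truncates, so a product that lands just below a whole number loses one unit unless it is rounded first.

```python
# Binary floating point cannot represent most decimal fractions exactly.
print(0.1 + 0.2 == 0.3)  # False

scaled = 0.29 * 100          # slightly below 29 in binary floating point
truncated = int(scaled)      # int() drops the fractional part
rounded = int(round(scaled)) # rounding first recovers the intended value

print(truncated)  # 28
print(rounded)    # 29
```

This is why rounding before the integer cast is the safer habit when scaling codes to integers.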

EDIT:

If the only problem is trailing zeros or a trailing .0, use:

df['source_code'] = df['source_code'].str.replace(r'[\.]*[0]+$', '', regex=True)
print (df)
  source_code source_description  key_id
0        A250              test1     NaN
1        C791              test1     NaN
2       716.9              test2     NaN
3       493.9              test3     NaN
4      143.21              test4     NaN
5      134.52              test5     NaN

id_file =  hash_file.set_index(['source_id','source_code'])['hash_id']

df['key_id'] = df.set_index(['source_code','source_description']).index.map(id_file)

print (df)
  source_code source_description  key_id
0        A250              test1     911
1        C791              test1     512
2       716.9              test2     713
3       493.9              test3     814
4      143.21              test4     616
5      134.52              test5     717
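
One caveat with this simple pattern: it strips any run of zeros at the end of the string, including zeros that are not after a decimal point. The sketch below applies the same pattern with the standard re module; the extra sample values '250' and 'A100' are hypothetical and are not in the question's data.

```python
import re

# Same pattern as above: optional dots followed by zeros at the end.
simple = re.compile(r'[\.]*[0]+$')

samples = ['A250.00', '716.90', '143.21', '250', 'A100']
stripped = [simple.sub('', s) for s in samples]

print(stripped)  # ['A250', '716.9', '143.21', '25', 'A1']
```

The first three values come out as intended, but '250' and 'A100' lose significant zeros, so this pattern is only safe when every code contains a decimal point.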

A better (I hope) regex, which removes zeros only after the decimal point, if present:

import re

#https://stackoverflow.com/a/44111202/2901002
rgx = re.compile(r'(?:(\.)|(\.\d*?[1-9]\d*?))0+(?=\b|[^0-9])')
df['source_code'] = df['source_code'].str.replace(rgx, r'\2', regex=True)
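
Putting the pieces together, here is a self-contained sketch of the whole lookup using this regex. It assumes the description list in hash_file has six entries mirroring df, so that every row can match. The helper name strip_trailing_zeros is hypothetical and is introduced only for illustration.

```python
import re

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'source_code': ['A250.00', 'C791.0', '716.90', '493.90', '143.21', '134.52'],
    'source_description': ['test1', 'test1', 'test2', 'test3', 'test4', 'test5'],
    'key_id': [np.nan] * 6,
})

# Assumption: one description per source_id, mirroring df.
hash_file = pd.DataFrame({
    'source_id': ['A250', 'C791', '716.9', '493.9', '143.21', '134.52'],
    'source_code': ['test1', 'test1', 'test2', 'test3', 'test4', 'test5'],
    'hash_id': [911, 512, 713, 814, 616, 717],
})

rgx = re.compile(r'(?:(\.)|(\.\d*?[1-9]\d*?))0+(?=\b|[^0-9])')


def strip_trailing_zeros(codes: pd.Series) -> pd.Series:
    """Hypothetical helper: drop zeros after the decimal point only."""
    # Group 2 keeps the significant decimals; a bare ".00" is removed entirely.
    return codes.str.replace(rgx, r'\2', regex=True)


# Normalize both sides the same way so the keys are comparable.
df['source_code'] = strip_trailing_zeros(df['source_code'])
hash_file['source_id'] = strip_trailing_zeros(hash_file['source_id'])

# Look up each (code, description) pair in the hash table.
id_file = hash_file.set_index(['source_id', 'source_code'])['hash_id']
df['key_id'] = df.set_index(['source_code', 'source_description']).index.map(id_file)

print(df)
```

Normalizing both frames with the same function avoids the case where only one side is cleaned and the keys still differ.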
