How to retain float precision with character values using pandas?
I have a data frame as shown below:
import numpy as np
import pandas as pd
df = pd.DataFrame({'source_code':['A250.00','C791.0','716.90','493.90','143.21','134.52'],
'source_description':['test1', 'test1','test2','test3','test4','test5'],
'key_id':[np.nan,np.nan,np.nan,np.nan,np.nan,np.nan]})
hash_file = pd.DataFrame({'source_id':['A250','C791','716.9','493.9','143.21','134.52'],
'source_code':['test1','test1','test2','test3','test4','test5'],
'hash_id':[911,512,713,814,616,717]})
id_file = hash_file.set_index(['source_id','source_code'])['hash_id']
I would like to update the values of the key_id column by comparing the source_code and source_description columns of df with the source_id and source_code columns of hash_file.
So, I tried the below based on this post:
df['key_id'] = df.set_index(['source_code','source_description']).index.map(id_file)
While this works fine in normal scenarios, it fails when the two sides are formatted differently, for example 250 vs. 250.00 or 791.0 vs. 791, and produces incorrect output.
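The mismatch is easy to reproduce in isolation. The following is a minimal sketch on two made-up keys (the names lookup and keys are illustrative, not from the post above): the index values are strings, so 'A250.00' and 'A250' are simply different keys and the lookup finds nothing for the first one.

```python
import pandas as pd

# Lookup Series indexed by (source_id, source_code), shaped like id_file.
lookup = pd.Series(
    [911, 713],
    index=pd.MultiIndex.from_tuples(
        [("A250", "test1"), ("716.9", "test2")],
        names=["source_id", "source_code"],
    ),
    name="hash_id",
)

# Keys as they would come from df: the trailing zeros make the first string differ.
keys = pd.MultiIndex.from_tuples([("A250.00", "test1"), ("716.9", "test2")])

# Matching is exact, so the first key maps to NaN and the second to 713.
mapped = list(keys.map(lookup))
print(mapped)
```

Casting the column with astype(str) does not help here, because the values are already strings; the text itself has to be normalized.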
So, I tried converting them to strings, but it still doesn't work.
I expect key_id to be filled for these mismatched rows as well.
If possible (that is, if every code is purely numeric, with no letter prefix), convert the values to floats:
df['source_code'] = df['source_code'].astype(float)
hash_file['source_id'] = hash_file['source_id'].astype(float)
id_file = hash_file.set_index(['source_id','source_code'])['hash_id']
df['key_id'] = df.set_index(['source_code','source_description']).index.map(id_file)
print (df)
   source_code source_description  key_id
0       250.00              test1     911
1       791.00              test1     512
2       716.90              test2     713
3       493.90              test3     814
4       143.21              test4     616
5       134.52              test5     717
However, floats can run into precision problems. One possible trick is to multiply the values by a scalar such as 100 and then convert them to integers:
df['source_code'] = df['source_code'].astype(float).mul(100).astype(int)
hash_file['source_id'] = hash_file['source_id'].astype(float).mul(100).astype(int)
id_file = hash_file.set_index(['source_id','source_code'])['hash_id']
df['key_id'] = df.set_index(['source_code','source_description']).index.map(id_file)
print (df)
   source_code source_description  key_id
0        25000              test1     911
1        79100              test1     512
2        71690              test2     713
3        49390              test3     814
4        14321              test4     616
5        13452              test5     717
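One caveat with this trick: astype(int) truncates toward zero, so a product that lands just below a whole number loses one unit. The sketch below uses a made-up value, '4.35', that is not in the original data, and rounds before casting. Treat it as an illustration of the pitfall rather than part of the original answer.

```python
import pandas as pd

# Hypothetical sample values; '4.35' is chosen because 4.35 * 100 is not exact in binary floats.
codes = pd.Series(["4.35", "143.21", "250.00"])

scaled = codes.astype(float).mul(100)

truncated = scaled.astype(int)        # drops the fractional part
rounded = scaled.round().astype(int)  # rounds to the nearest integer first

print(truncated.tolist())
print(rounded.tolist())
```

Rounding first keeps 4.35 at 435 instead of 434, which matters because both sides of the lookup must land on the same integer.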
EDIT:
If the problem is only a trailing 0 or a trailing .0, use:
df['source_code'] = df['source_code'].str.replace(r'[\.]*[0]+$','', regex=True)
print (df)
  source_code source_description  key_id
0        A250              test1     NaN
1        C791              test1     NaN
2       716.9              test2     NaN
3       493.9              test3     NaN
4      143.21              test4     NaN
5      134.52              test5     NaN
id_file = hash_file.set_index(['source_id','source_code'])['hash_id']
df['key_id'] = df.set_index(['source_code','source_description']).index.map(id_file)
print (df)
  source_code source_description  key_id
0        A250              test1     911
1        C791              test1     512
2       716.9              test2     713
3       493.9              test3     814
4      143.21              test4     616
5      134.52              test5     717
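Putting the string approach together, here is a self-contained sketch that normalizes both frames the same way before building the lookup. The helper name strip_trailing_zeros and its regex are my own illustration, not part of the original answer, and the hash_file descriptions are assumed to line up with those in df.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "source_code": ["A250.00", "C791.0", "716.90", "493.90", "143.21", "134.52"],
    "source_description": ["test1", "test1", "test2", "test3", "test4", "test5"],
    "key_id": [np.nan] * 6,
})
hash_file = pd.DataFrame({
    "source_id": ["A250", "C791", "716.9", "493.9", "143.21", "134.52"],
    # Assumed descriptions, chosen to match df above.
    "source_code": ["test1", "test1", "test2", "test3", "test4", "test5"],
    "hash_id": [911, 512, 713, 814, 616, 717],
})

def strip_trailing_zeros(s: pd.Series) -> pd.Series:
    """Drop zeros at the end of the decimal part, then a bare trailing dot."""
    s = s.str.replace(r"(\.\d*?)0+$", r"\1", regex=True)
    return s.str.replace(r"\.$", "", regex=True)

# Normalize both sides so that equal codes become equal strings.
df["source_code"] = strip_trailing_zeros(df["source_code"])
hash_file["source_id"] = strip_trailing_zeros(hash_file["source_id"])

id_file = hash_file.set_index(["source_id", "source_code"])["hash_id"]
df["key_id"] = df.set_index(["source_code", "source_description"]).index.map(id_file)
print(df)
```

Applying the same function to both frames means the lookup does not depend on which side happened to carry the extra zeros.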
A better (I hope) regex for removing a trailing .0, if it exists:
import re
#https://stackoverflow.com/a/44111202/2901002
rgx = re.compile(r'(?:(\.)|(\.\d*?[1-9]\d*?))0+(?=\b|[^0-9])')
df['source_code'] = df['source_code'].str.replace(rgx, r'\2', regex=True)
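To see why the second pattern is safer, the sketch below runs both regexes on a few sample codes. The values '100' and 'A250' are added here for illustration and do not come from the original data. The simple pattern strips zeros from the end of any string, including codes without a decimal point, while the second pattern only touches zeros that follow a decimal point.

```python
import re
import pandas as pd

# Sample codes; '100' and 'A250' are hypothetical additions with no decimal part.
codes = pd.Series(["A250.00", "C791.0", "716.90", "100", "A250"])

# Simple pattern from the EDIT above: removes any run of zeros at the end.
simple = codes.str.replace(r"[\.]*[0]+$", "", regex=True)

# Pattern from the linked Stack Overflow answer: requires a decimal point.
rgx = re.compile(r"(?:(\.)|(\.\d*?[1-9]\d*?))0+(?=\b|[^0-9])")
careful = codes.str.replace(rgx, r"\2", regex=True)

print(pd.DataFrame({"original": codes, "simple": simple, "careful": careful}))
```

With the simple pattern, '100' becomes '1' and 'A250' becomes 'A25', so it is only appropriate when every code is known to contain a decimal point.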