
How to get rid of noise (redundant commas/dots) in decimal values - Python

I have a dataset df with two columns, ID and Value. Both are of dtype "object". However, I would like to convert the column Value to dtype "double" with a dot as the decimal separator. The problem is that the values of this column contain noise due to the presence of too many commas (e.g. 0,1,,) or, after replacement, too many dots (e.g. 0.1..). As a result, when I try to convert the dtype to double, I get the error message: could not convert string to float: '0.2.'

Example code:

#required packages
import pandas as pd
import numpy as np
  
# initialize list of lists
data = [[1, '0,1'], [2, '0,2,'], [3, '0,01,,']]
  
# Create the pandas DataFrame
df = pd.DataFrame(data, columns=['ID', 'Value'])

#replace comma with dot as separator
df = df.replace(',', '.', regex=True)

#examine dtype per column
df.info()

#convert dtype from object to double
df = df.astype({'Value': np.double}) #this is where the error message appears

The preferred outcome is to have the values within the column Value as 0.1, 0.2 and 0.01 respectively.

How can I get rid of the redundant commas or, after replacement, the redundant dots in the values of the column Value?

One option: use string functions to convert and strip the values. For example:

# required packages
import pandas as pd
import numpy as np

# initialize list of lists
data = [[1, '0,1'], [2, '0,2,'], [3, '0,01,,']]

# create the pandas DataFrame
df = pd.DataFrame(data, columns=['ID', 'Value'])

# replace only the first comma with a dot (the decimal separator),
# then strip the leftover trailing commas
df['Value'] = df['Value'].str.replace(',', '.', n=1, regex=False).str.rstrip(',')

# examine dtype per column (Value is still object at this point)
df.info()

# convert dtype from object to double
df = df.astype({'Value': np.double})

print("------ df:")
print(df)

This prints:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   ID      3 non-null      int64 
 1   Value   3 non-null      object
dtypes: int64(1), object(1)
memory usage: 176.0+ bytes
------ df:
   ID  Value
0   1   0.10
1   2   0.20
2   3   0.01
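
A variation on the same idea, offered only as a sketch: strip the trailing separators first, then replace the remaining comma, and let pd.to_numeric do the conversion. It assumes the noise only ever appears at the end of a value and that each value contains a single decimal comma; errors="coerce" is an optional choice that turns anything still unparseable into NaN instead of raising.

```python
import pandas as pd

# same sample data as in the question
df = pd.DataFrame(
    [[1, "0,1"], [2, "0,2,"], [3, "0,01,,"]],
    columns=["ID", "Value"],
)

cleaned = (
    df["Value"]
    .str.rstrip(",.")                      # drop trailing commas/dots (assumed to be noise)
    .str.replace(",", ".", regex=False)    # remaining comma becomes the decimal point
)

# convert to float64; values that still cannot be parsed become NaN
df["Value"] = pd.to_numeric(cleaned, errors="coerce")

print(df)
print(df["Value"].dtype)
```

Because the trailing characters are stripped before the replacement, this works whether the stray characters are commas or dots, so it can also be applied after the df.replace(',', '.', regex=True) step from the question.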


 