How to get rid of noise (redundant commas/dots) in decimal values - Python
I have a dataset df with two columns, ID and Value. Both are of dtype "object". However, I would like to convert the column Value to dtype "double" with a dot as the decimal separator. The problem is that the values in this column contain noise: too many commas (e.g. 0,1,,) or, after replacement, too many dots (e.g. 0.1..). As a result, when I try to convert the dtype to double, I get the error message: could not convert string to float: '0.2.'
Example code:
#required packages
import pandas as pd
import numpy as np
# initialize list of lists
data = [[1, '0,1'], [2, '0,2,'], [3, '0,01,,']]
# Create the pandas DataFrame
df = pd.DataFrame(data, columns=['ID', 'Value'])
#replace comma with dot as separator
df = df.replace(',', '.', regex=True)
#examine dtype per column
df.info()
#convert dtype from object to double
df = df.astype({'Value': np.double}) #this is where the error message appears
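The failure can be reproduced without pandas, which suggests that the leftover separator is the problem rather than the DataFrame itself. A minimal sketch, assuming the values look as they do after the comma-to-dot replacement (the helper name try_float is made up for illustration):

```python
def try_float(text):
    """Hypothetical helper: return the parsed float, or the error message if parsing fails."""
    try:
        return float(text)
    except ValueError as exc:
        # float() rejects strings with more than one decimal point
        return str(exc)

# Values as they look after every comma has been replaced with a dot
for text in ['0.1', '0.2.', '0.01..']:
    print(repr(text), '->', try_float(text))
```

Only the first value parses; the other two fail because of the trailing dots.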
The preferred outcome is to have the values in the column Value as 0.1, 0.2 and 0.01, respectively.
How can I get rid of the redundant commas or, after replacement, the redundant dots in the values of the column Value?
One option: use string functions to convert and strip the values. For example:
#required packages
import pandas as pd
import numpy as np
# initialize list of lists
data = [[1, '0,1'], [2, '0,2,'], [3, '0,01,,']]
# Create the pandas DataFrame
df = pd.DataFrame(data, columns=['ID', 'Value'])
# replace only the first comma with a dot, then strip any trailing commas
df['Value'] = df['Value'].str.replace(',', '.', n=1, regex=False).str.rstrip(',')
#examine dtype per column
df.info()
#convert dtype from object to double
df = df.astype({'Value': np.double})
print("------ df:")
print(df)
prints:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   ID      3 non-null      int64
 1   Value   3 non-null      object
dtypes: int64(1), object(1)
memory usage: 176.0+ bytes
------ df:
   ID  Value
0   1   0.10
1   2   0.20
2   3   0.01
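If the noise is not limited to trailing commas (for instance a doubled separator in the middle, such as '0,,5'), a pattern-based cleanup may be more forgiving. The following is only a sketch of that variation: the helper name clean_decimal and the fourth sample value are made up for illustration, and it assumes each value has a single genuine decimal separator and no thousands separators.

```python
import pandas as pd

def clean_decimal(values: pd.Series) -> pd.Series:
    """Hypothetical helper: normalise comma/dot noise and convert to float."""
    cleaned = (
        values.astype(str)
        .str.strip()                             # drop surrounding whitespace
        .str.rstrip(',.')                        # drop trailing separators
        .str.replace(r'[.,]+', '.', regex=True)  # collapse each run of separators into one dot
    )
    # errors='coerce' turns anything still unparseable into NaN instead of raising
    return pd.to_numeric(cleaned, errors='coerce')

# '0,,5' is an extra, assumed sample value that is not in the original data
df = pd.DataFrame({'ID': [1, 2, 3, 4], 'Value': ['0,1', '0,2,', '0,01,,', '0,,5']})
df['Value'] = clean_decimal(df['Value'])
print(df.dtypes)
print(df)
```

Using pd.to_numeric with errors='coerce' is a design choice: rows that cannot be repaired show up as NaN and can be inspected afterwards, instead of the whole conversion failing on the first bad value.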