I have my phone_number column listed below.
phone_number
--------------
001 1234567890
380 1234567890
27 1234567890
001 +11234567890
2.56898E+11
1 1234567890
123-456-7890
+1 (123) 456-7890
(123) 456-7890
NaN
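The following step worked fine:
character = '[^0-9]+'
# regex=True is needed on pandas >= 2.0, where str.replace matches literally by default
df.phone_number.str.replace(character, '', regex=True)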
The result I got is
phone_number
--------------
11234567890
3.80123E+12
2.71234E+11
11234567890
2.56898E+11
11234567890
1234567890
11234567890
1234567890
NaN
Is there any elegant way to deal with the scientific notation format? I want them to be 11234567890 or longer because of the country code. From there I think I can figure out how to get both international and the US phone number formats. Thanks in advance.
You can use conversion to Int64 / string dtypes:
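import pandas as pd

# Values that parse as numbers (e.g. scientific notation) become integers, then strings; the rest become <NA>
s1 = (pd.to_numeric(df['phone_number'], errors='coerce')
        .astype('Int64').astype('string')
     )
# For every row, strip all non-digit characters
s2 = df['phone_number'].str.replace(r'\D+', '', regex=True)
# Prefer the numeric conversion and fall back to the digit-only string
df['phone_number_clean'] = s1.fillna(s2)
print(df)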
Output:
        phone_number  phone_number_clean
0     001 1234567890       0011234567890
1     380 1234567890       3801234567890
2      27 1234567890        271234567890
3   001 +11234567890      00111234567890
4        2.56898E+11        256898000000
5       1 1234567890         11234567890
6       123-456-7890          1234567890
7  +1 (123) 456-7890         11234567890
8     (123) 456-7890          1234567890
Note that depending on the float precision and the way the number was converted to scientific notation, you might lose important digits.
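To make that caveat concrete, here is a minimal sketch. The "original" number below is made up for illustration, and the assumption is that whatever produced the column (a spreadsheet export, for example) kept only six significant digits. Once the value is stored as 2.56898E+11, the trailing digits cannot be recovered by any conversion:

```python
import pandas as pd

# Hypothetical original number (not from the question's data)
original = "256898123456"

# Assumed export step: the value is written with six significant digits
exported = f"{float(original):.5E}"    # '2.56898E+11'

# Converting back only restores the leading digits; the rest become zeros
recovered = str(int(float(exported)))  # '256898000000'

# The same happens with the Int64 / string conversion used above
s = pd.Series([exported])
cleaned = pd.to_numeric(s, errors="coerce").astype("Int64").astype("string")

print(original, exported, recovered, cleaned.iloc[0])
```

If the source file still holds the full digits, it is probably safer to fix the problem there (for instance by keeping the column as text) than to repair it after the fact.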