
Extracting numerical value with special characters from a string but removing other occurrences of those characters

I am using Python and pandas and have a DataFrame column of strings. I want to keep the float number at the start of each string and get rid of the '- .' that is left over after it.

So far I have been using the regular expression below to strip letters and brackets from the original string, but it leaves the '-' and '.' from the non-numeric part of the string in place.

Example input string : 14,513.045Non-compliant with installation req.

When I try to modify it this is what I get: 14,513.045- . (example of positive number string)

I also want to be able to parse negative numbers, such as: -234.670

The leading - in that string marks a negative float. I would like to keep the first - and the first . but get rid of any later ones, that is, the ones which do not belong to the number.

This is the code that I tried to use to achieve that:

dataframe3['single_chainage2'] = dataframe3['single_chainage'].str.replace(r"[a-zA-Z*()]", '', regex=True)

But it leaves me with 14,513.045- .

I saw no way of doing the above using pandas alone and saw that regex was the recommended way.

You don't need replace here; I think you can use Series.str.extract instead to get the string you need.

In [1]: import pandas as pd

In [2]: ser = pd.Series(["14,513.045Non-compliant with installation req.", "14,513.045- .", "-234.670"])

In [3]: pat = r'^(?P<num>-?(\d+,)*\d+(\.\d+)?)'

In [5]: ser.str.extract(pat)['num']
Out[5]:
0    14,513.045
1    14,513.045
2      -234.670
Name: num, dtype: object

Note that str.extract needs at least one capture group in the regex pattern; naming it (num in this example) lets you select the result column by that name.

If you need to convert it to a numeric dtype:

In [7]: ser.str.extract(pat)['num'].str.replace(',', '').astype(float)
Out[7]:
0    14513.045
1    14513.045
2     -234.670
Name: num, dtype: float64
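
A minimal sketch of the same idea wrapped in a reusable helper, assuming the goal is a float column. The helper name leading_number and the third sample string are made up for illustration. The pattern uses non-capturing groups so that num is the only capture group, and pd.to_numeric with errors='coerce' turns rows without a leading number into NaN instead of raising.

```python
import pandas as pd

# Same pattern idea as above, but with non-capturing groups (?:...) so that
# "num" is the only capture group in the pattern.
PAT = r'^(?P<num>-?(?:\d+,)*\d+(?:\.\d+)?)'

def leading_number(ser: pd.Series) -> pd.Series:
    """Return the number at the start of each string as a float (NaN if none).

    Hypothetical helper, not part of pandas.
    """
    # One capture group + expand=False gives back a Series instead of a DataFrame.
    num = ser.str.extract(PAT, expand=False)
    # Drop the thousands separators before converting.
    num = num.str.replace(',', '', regex=False)
    # Rows where nothing was extracted are missing and stay NaN.
    return pd.to_numeric(num, errors='coerce')

ser = pd.Series([
    "14,513.045Non-compliant with installation req.",
    "-234.670",
    "no number here",   # made-up row to show the no-match case
])
result = leading_number(ser)
print(result)
```

The first two rows should come out as 14513.045 and -234.67, and the last row as NaN.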

Rather than removing the characters that you don't want, specify a pattern for what you do want and extract it. That should be much less error prone. You want to extract a positive or negative number that may have a fractional part:

import re
# raw string (r"...") so that \d and \. reach the regex engine unchanged
number_match = re.search(r"[+-]?(\d+,?)*(\.\d+)?", 'Your string.')
number = number_match.group(0)

Testing the code above:

test_string_positive='14,513.045Non-compliant with installation req.'
test_string_negative='-234.670Non-compliant with installation req.'

In [1]: test=re.search(r"[+-]?(\d+,?)*(\.\d+)?",test_string_positive)

In [2]: test.group(0)
Out[2]: '14,513.045'

In [3]: test=re.search(r"[+-]?(\d+,?)*(\.\d+)?",test_string_negative)

In [4]: test.group(0)
Out[4]: '-234.670'
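
One caveat worth illustrating: every part of the pattern above is optional, so re.search always succeeds at position 0, and it returns an empty string when the text does not start with a number. The sketch below shows that behaviour next to a stricter variant that requires at least one digit. The names NUMBER_RE and first_number and the sample string 'req. 12' are assumptions made for this example, not part of the answer.

```python
import re

# Hypothetical stricter variant: at least one digit is required, so the
# pattern can no longer match an empty string.
NUMBER_RE = re.compile(r'[+-]?\d+(?:,\d+)*(?:\.\d+)?')

def first_number(text):
    """Return the first number found in text as a string, or None if there is none."""
    match = NUMBER_RE.search(text)
    return match.group(0) if match else None

# The loose pattern matches the empty string at position 0 of 'req. 12'.
loose = re.search(r'[+-]?(\d+,?)*(\.\d+)?', 'req. 12').group(0)
print(repr(loose))

print(first_number('req. 12'))
print(first_number('14,513.045Non-compliant with installation req.'))
print(first_number('-234.670Non-compliant with installation req.'))
print(first_number('no digits'))
```

Returning None for the no-match case makes it easy to tell "no number" apart from a real value before converting to float.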

With this solution you do not call replace at all; you just assign the value of the regex match.

number_match = re.search(r"[+-]?(\d+,?)*(\.\d+)?", <YOUR_STRING>)
number = number_match.group(0)
dataframe3['single_chainage2'] = number

I split that into 3 lines to show you how it logically follows. Hopefully, that makes sense.

You should substitute <YOUR_STRING> with the string you want to parse. As for how to get a string value out of a pandas DataFrame, I'm not sure what your DataFrame actually looks like, but I guess something like df['single_chainage'][0] should work. Basically, indexing a DataFrame by column name returns a pandas Series rather than a plain string, so you have to index into that Series as well to retrieve the string itself.
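
Assigning a single number to dataframe3['single_chainage2'] gives every row that same value, so for a whole column the match has to be computed per row. A minimal sketch of one way to do that with Series.apply follows; the two-row DataFrame is a made-up stand-in for the real data, extract_number is a hypothetical helper name, and every value is assumed to be a string (a missing value would raise a TypeError).

```python
import re
import pandas as pd

NUMBER_RE = re.compile(r'[+-]?(\d+,?)*(\.\d+)?')

def extract_number(text):
    # group(0) is the whole match; it is '' when text does not start with a number
    return NUMBER_RE.search(text).group(0)

# Hypothetical stand-in for the asker's DataFrame
dataframe3 = pd.DataFrame({'single_chainage': [
    '14,513.045Non-compliant with installation req.',
    '-234.670Non-compliant with installation req.',
]})

# Run the regex once per row and store the matched text in a new column.
dataframe3['single_chainage2'] = dataframe3['single_chainage'].apply(extract_number)
print(dataframe3['single_chainage2'].tolist())
```

The new column holds '14,513.045' and '-234.670' as strings; stripping the commas and casting to float is a separate step.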

