
Remove non-numeric characters including numbers that form a URL

I have a string made up of a set of numbers followed by a URL. I need all the numeric characters except the ones that belong to the URL. Below is my code to remove all non-numeric characters, but it does not remove the digits that are part of the URL.

import re
test = '4758 11b98https://www.website11/111'
re.sub("[^0-9]","",test)  # returns '4758119811111': the URL's digits are kept too

expected result: 47581198

Original answer

Change strategy: it is much easier to keep only the leading digits and ignore the rest:

import re
test = '47581198https://www.website11/111'
re.findall(r'^\d+', test)[0]

Or use re.match if you cannot be sure that the leading digits are present:

m = re.match(r'\d+', test)
if m:
    m = m.group()

Output: '47581198'
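
The difference between the two variants shows up when the string has no leading digits: re.findall returns an empty list, so indexing it with [0] raises IndexError, while re.match simply returns None. A minimal sketch, assuming a helper named leading_digits (a hypothetical name, not part of the answer above):

```python
import re

def leading_digits(text):
    """Return the run of digits at the start of text, or None if there is none."""
    m = re.match(r'\d+', text)  # re.match only matches at the start of the string
    return m.group() if m else None

print(leading_digits('47581198https://www.website11/111'))  # 47581198
print(leading_digits('https://www.website11/111'))          # None
```

Returning None keeps the "no number found" case explicit, so the caller can decide how to handle it.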

Edit after question change

If you are sure that the string 'http://' (or 'https://') cannot appear inside the initial number, you need two passes: one to remove the URL and another to clean up the number.

test = '4758 11b98https://www.website11/1111'
re.sub(r'\D', '', re.sub(r'https?://.*', '', test))

Output: '47581198'
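
The two passes can also be written out step by step, which makes it easier to see what each substitution does. This is only a sketch: the function name digits_before_url is made up for illustration, and it assumes single-line input, since . does not match a newline by default.

```python
import re

def digits_before_url(text):
    # Pass 1: drop everything from the first http:// or https:// onward.
    without_url = re.sub(r'https?://.*', '', text)
    # Pass 2: drop every remaining non-digit character.
    return re.sub(r'\D', '', without_url)

print(digits_before_url('4758 11b98https://www.website11/1111'))  # 47581198
```

Note that the first pass discards everything after the start of the URL, so any digits that follow the URL after a space are lost as well.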

Try the expression below. It captures a run of digits only if 'http' appears somewhere later in the string:

import re
test = '4758 11b98https://www.website11/111'
y = re.compile(r'([0-9]+)(?=.*http)')
tokens = y.findall(test)
print(''.join(tokens))  # 47581198
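
One caveat with the lookahead: it only asks whether 'http' occurs later in the string, not whether the digits themselves sit inside a URL. The sketch below uses a made-up input with two URLs (the example.com-style addresses are placeholders) to show that digits inside the first URL are kept, because the second URL still follows them.

```python
import re

lookahead = re.compile(r'([0-9]+)(?=.*http)')

# Single URL, as in the question: only the digits before the URL are kept.
one_url = '4758 11b98https://www.website11/111'
print(''.join(lookahead.findall(one_url)))   # 47581198

# Two URLs (hypothetical input): '34' is inside the first URL but is still
# captured, because 'http' appears again later in the string.
two_urls = '12 http://a.example/34 56 http://b.example/78'
print(''.join(lookahead.findall(two_urls)))  # 123456
```

So this pattern fits the question's input, where there is exactly one URL at the end, but it should be used with care on strings that may contain several URLs.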

You could match the whole URL (anything starting with http:// or https://) without capturing it, so that its digits are consumed by that match, and use an alternation | to capture all other digits in group 1.

Then in the output, join all the digits from group 1 with an empty string.

https?://\S+|(\d+)


For example

import re

pattern = r"https?://\S+|(\d+)"
s = "4758 11b98https://www.website11/111"

print(''.join(re.findall(pattern, s)))

Output

47581198
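
To see why joining the result works, it helps to look at what re.findall actually returns: with a single capture group it yields group 1 for every match, and a match of the URL branch leaves group 1 empty. A small sketch, assuming a wrapper named digits_outside_urls and a second, made-up input with two URLs (both are illustrative, not from the answer above):

```python
import re

PATTERN = re.compile(r'https?://\S+|(\d+)')

def digits_outside_urls(text):
    # findall returns group 1 for every match; URL matches leave group 1
    # empty, so they contribute '' and vanish when the list is joined.
    return ''.join(PATTERN.findall(text))

s = '4758 11b98https://www.website11/111'
print(PATTERN.findall(s))      # ['4758', '11', '98', '']
print(digits_outside_urls(s))  # 47581198

# Hypothetical input with two URLs: digits between the URLs are kept.
print(digits_outside_urls('12 http://a.example/34 56 http://b.example/78'))  # 1256
```

Because \S+ stops at whitespace, each URL is consumed only up to the next space, so digits that stand on their own after a URL are still captured.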


 