Python regex: replace numbers and special characters except years

I want to replace all non-alphabetic characters with spaces, except for years between 1950 and 2029. For example:

ab-c 0123 4r. a2017 2010 -> ab c r a 2010

My attempt so far, trying to blacklist the dates via a negative look-ahead:

re.sub('(?!\b19[5-9][0-9]\b|\b20[0-2][0-9]\b)([^A-Za-z]+)', ' ', string)

Since this doesn't work, any help is greatly appreciated!

You could use a simple regex and pass a function to check if it's a year:

import re

def replace_non_year_numbers(m):
  number = int(m.group(0))
  if 1950 <= number <= 2029:
    return str(number)
  else:
    return ''

print(re.sub(r'\d+', replace_non_year_numbers, 'ab-c 0123 4r. a2017 2010'))
# ab-c  r. a2017 2010

To keep the regex and the logic simple, you could remove special characters in a second step:

only_years = re.sub(r'\d+', replace_non_year_numbers, 'ab-c 0123 4r. a2017 2010')
no_special_char = re.sub('[^A-Za-z0-9 ]', ' ', only_years)
print(re.sub(' +', ' ', no_special_char))
# ab c r a2017 2010
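
Note that this two-step version keeps the digits in a2017, whereas the expected output in the question drops them. A minimal single-pass sketch that keeps only standalone years, assuming "standalone" means not directly adjacent to a letter or digit (the names TOKEN and keep_letters_and_years are made up for this example):

```python
import re

# First alternative: a year 1950-2029 that is not glued to a letter or digit.
# Second alternative: any single non-letter character.
TOKEN = re.compile(
    r'(?<![A-Za-z0-9])(19[5-9][0-9]|20[0-2][0-9])(?![A-Za-z0-9])|[^A-Za-z]'
)

def keep_letters_and_years(text):
    # Keep a matched year as-is; turn every other non-letter into a space.
    spaced = TOKEN.sub(lambda m: m.group(1) or ' ', text)
    # Collapse runs of spaces and trim the ends.
    return ' '.join(spaced.split())

print(keep_letters_and_years('ab-c 0123 4r. a2017 2010'))
# ab c r a 2010
```

The second alternative matches one character at a time on purpose: a greedy [^A-Za-z]+ would swallow a following year before the first alternative gets a chance to match it.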

Let's select what you want to keep in your result. Look at the regex:

(
  (?<!\w)                       # neg. lookbehind: not a word char
  (1                            # read a '1'
     (?=9[5-9][0-9])            # lookahead: following 3 digits make it
                                #   a year between 1950 and 1999
     [0-9]{3}                   # THEN read these 3 digits
   |                            # - OR -
   2                            # read a '2'
     (?=0[0-2][0-9])            # lookahead: following 3 digits make it
                                #   a year between 2000 and 2029
     [0-9]{3}                   # THEN read these 3 digits 
  )
  |                             # - OR -
  [a-zA-Z]                      # read some letter
)+

As a one-liner:

((?<!\w)(1(?=9[5-9][0-9])[0-9]{3}|2(?=0[0-2][0-9])[0-9]{3})|[a-zA-Z])+

You can test it on regex101.

Let's put that in a Python script:

$ cat test.py
import re

pattern = r"(?:(?<!\w)(?:1(?=9[5-9][0-9])[0-9]{3}|2(?=0[0-2][0-9])[0-9]{3})|[a-zA-Z])+"

tests = ["ab-c 0123 4r. a2017 2010 a1955 1955 abc"]

for elt in tests:
    matches = re.findall(pattern, elt)
    print(' '.join(matches))

which gives:

$ python test.py
ab c r a 2010 a 1955 abc
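
The commented form of the pattern above can also be used directly in Python by compiling it with re.VERBOSE, which ignores whitespace and # comments inside the pattern. This is only a sketch of the same regex; the helper name extract is made up for the example:

```python
import re

# Same pattern as above, written with re.VERBOSE so the comments can stay.
# Groups are non-capturing so that re.findall returns whole matches.
pattern = re.compile(r"""
    (?:
        (?<!\w)                  # not preceded by a word character
        (?:
            1(?=9[5-9][0-9])     # a 1 followed by 9, 5-9, 0-9 (1950-1999)
            [0-9]{3}
          |
            2(?=0[0-2][0-9])     # a 2 followed by 0, 0-2, 0-9 (2000-2029)
            [0-9]{3}
        )
      |
        [a-zA-Z]                 # or a single letter
    )+
""", re.VERBOSE)

def extract(text):
    # Join every letter run or standalone year with single spaces.
    return ' '.join(pattern.findall(text))

print(extract("ab-c 0123 4r. a2017 2010 a1955 1955 abc"))
# ab c r a 2010 a 1955 abc
```

One caveat: the pattern checks what comes before a year but not what comes after it, so a year immediately followed by letters is kept as a single token (2010abc), and a longer number such as 20101 still yields 2010.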

It's not too pretty, but I would use multiple replacements:

import re

def check_if_year(m):
  number = int(m.group(0))
  if 1950 <= number <= 2029:
    return str(number)
  else:
    return ' '

s = 'ab-c 0123 4r. a2017 2010 1800'             # Added 1800 for testing
print(s)
print('ab c r a 2010')                          # Expected result, for comparison
t = re.sub(r'[^A-Za-z0-9 ]+', ' ', s)           # Only non-alphanumeric
t = re.sub(r'(?!\b\d{4}\b)(?<!\d)\d+', ' ', t)  # Only numbers that aren't standalone 4 digits
t = re.sub(r'\d+', check_if_year, t)            # Only standalone 4 digits number and test for year
t = re.sub(r' {2,}', ' ', t).strip()            # Clean up extra spaces
print(t)

(?!\b\d{4}\b)(?<!\d)\d+

This matches any run of digits unless it is a 4-digit number standing alone (with nothing but whitespace or the start/end of the string around it, which is all that can surround it after the first substitution). The (?<!\d) lookbehind stops the match from starting in the middle of a number.
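
To see which numbers that pattern picks out, here is a small check. The variable names and the extra 12345 in the input are added for illustration only:

```python
import re

# The pattern used by the second substitution.
not_standalone_4_digits = re.compile(r'(?!\b\d{4}\b)(?<!\d)\d+')

# Input as it looks after the first substitution (letters, digits, spaces only),
# with a 5-digit number appended as an extra test case.
text = 'ab c 0123 4r  a2017 2010 1800 12345'

matched = not_standalone_4_digits.findall(text)
print(matched)
# ['4', '2017', '12345']
```

The standalone 4-digit numbers 0123, 2010 and 1800 are left alone here; they are the ones the year check in the third substitution then decides on.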
