
remove r n r n from string

I want to remove extra r and n from this string. I tried regex. Not sure if regex or some other method would be helpful here.

This is the code I am trying to use:

import re

text = "r n r n r nFamily Medical History new r n  r n r r r  Roger nRobert n nDawson n49 nyears old , right shoulder"

regex_pattern = re.compile(r'\s[rn]\s')
matches = regex_pattern.findall(text)
for match in matches:
    text = text.replace(match," ")
print(text)

Current Output:

r nFamily Medical History new   Roger nRobert nDawson n49 nyears old , right shoulder 

We still see many r n. I am also wondering how to remove the 'n' from n49 and nyears, and the first 'n' from nDawson without removing the last 'n'.

Expected Output:

Family Medical History new Roger Robert Dawson 49 years old , right shoulder

I would suggest a bit of an NLP approach here as I do not see how regex can tell nyears (wrong spelling) from new (correct spelling).

First, remove all standalone r / n and those glued to capitalized words and numbers. Then split the string and check each word that starts with n or r with a spellchecker. The leading letter can be removed if word[1:] is spelled correctly and word is not. If neither is correct, I think it is safe to fall back to the original word.

To run the spellcheck you can use, for example, TextBlob.

Here is a Python code demo:

from textblob import TextBlob
from textblob import Word
import re

s = "r n r n r nFamily Medical History new r n  r n r r r  Roger nRobert n nDawson n49 nyears old , right shoulder"
s = re.sub(r'\b[rn](?=[A-Z0-9\s]|$)', '', s)
result = []
for w in s.split():
  if not w.startswith(('n','r')): # The w word does not start with n or r...
    result.append(w)              # Add it to the result
  else:
    if Word(w).correct() == w:    # If w is a correct word
      result.append(w)            # Add it to the result
    else:
      if Word(w[1:]).correct() == w[1:]: # If w[1:] is correct 
        result.append(w[1:])             # Add w[1:] to the result
      else:
        result.append(w)                 # Fallback: add w to the result
print(" ".join(result))
# => Family Medical History new Roger Robert Dawson 49 years old , right shoulder

The re.sub(r'\b[rn](?=[A-Z0-9\s]|$)', '', s) part removes r and n at the start of a word when immediately followed by an uppercase letter, a digit, whitespace, or the end of the string.

Then, for w in s.split(): iterates over the words in the sentence and replaces a word with w[1:] only if it starts with n or r, is misspelled, and w[1:] is spelled correctly.

DISCLAIMER: TextBlob is used as an example. You are free to use any other spellchecking library. TextBlob spellchecking "is based on Peter Norvig's 'How to Write a Spelling Corrector' as implemented in the pattern library. It is about 70% accurate."
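Since any spellchecker can be swapped in, the decision logic can be written against a simple "is this word known?" predicate. The following is only a minimal sketch, not the TextBlob-based code above: KNOWN_WORDS, is_known and clean are hypothetical names, and the tiny hard-coded vocabulary is an assumption standing in for a real spellchecker.

```python
import re

# Hypothetical stand-in vocabulary; a real spellchecker would replace it.
KNOWN_WORDS = {"new", "years", "old", "right", "shoulder"}

def is_known(word):
    """Assumed predicate: True if `word` is spelled correctly."""
    return word.lower() in KNOWN_WORDS

def clean(text, is_known=is_known):
    # Step 1: drop r/n that stand alone or precede an uppercase letter or digit.
    text = re.sub(r'\b[rn](?=[A-Z0-9\s]|$)', '', text)
    result = []
    for w in text.split():
        # Step 2: strip a leading r/n only when the word is unknown
        # and the remainder is a known word; otherwise keep w as is.
        if w[0] in "nr" and not is_known(w) and is_known(w[1:]):
            w = w[1:]
        result.append(w)
    return " ".join(result)

text = "r n r n r nFamily Medical History new r n  r n r r r  Roger nRobert n nDawson n49 nyears old , right shoulder"
print(clean(text))
# => Family Medical History new Roger Robert Dawson 49 years old , right shoulder
```

Passing a different is_known callable (for instance one backed by a dictionary file or another library) changes the spellchecking without touching the rest of the function.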

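Try something like this: \b[rn](?=[A-Z0-9 ])

The \b matches a word boundary, that is, a position between a word character and a non-word character, such as the start of the string or the position right after a space or newline.

The [rn] matches either 'r' or 'n'.

The (?=[A-Z0-9 ]) is a lookahead: it requires the next character to be an uppercase letter, a digit, or a space, but does not include that character in the match.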

Check out https://regex101.com/r/hSmYyi/1 to experiment with and test regexes.
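In Python, the pattern could be applied roughly as follows. This is a sketch: the variable names are made up for illustration, and the whitespace-collapsing step is an addition that is not part of the pattern itself.

```python
import re

text = "r n r n r nFamily Medical History new r n  r n r r r  Roger nRobert n nDawson n49 nyears old , right shoulder"

# Remove r/n at a word boundary when followed by an uppercase letter, digit, or space.
stripped = re.sub(r'\b[rn](?=[A-Z0-9 ])', '', text)

# The removed letters leave runs of spaces behind; collapse them.
cleaned = " ".join(stripped.split())
print(cleaned)
# => Family Medical History new Roger Robert Dawson 49 nyears old , right shoulder
```

Note that nyears is left untouched: the n is followed by a lowercase letter, which the regex cannot distinguish from a legitimate word such as new.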

Old school over here

>>> text = "r n r n r nFamily Medical History new r n  r n r r r  Roger nRobert n nDawson n49 nyears old , right shoulder"
>>> newText = []
>>> for word in text.split(' '):
...     if word and not (word == 'n' or word =='r'):
...         if not word[0] == 'n':
...             newText.append(word)
...         else:
...             newText.append(word[1:])
... 
>>> newText
['Family', 'Medical', 'History', 'ew', 'Roger', 'Robert', 'Dawson', '49', 'years', 'old', ',', 'right', 'shoulder']
>>> ' '.join(newText)
'Family Medical History ew Roger Robert Dawson 49 years old , right shoulder'
>>> 

Of course, you can refactor it as you please. Note that this strips the leading n from every word, so new becomes ew.
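One possible refactoring, offered as a sketch rather than a definitive version: the loop is wrapped in a function, and an optional keep parameter lists words whose leading n is real. Both strip_stray_markers and keep are hypothetical names introduced here for illustration.

```python
def strip_stray_markers(text, keep=frozenset()):
    """Drop standalone 'r'/'n' tokens and a leading 'n' on other tokens.

    `keep` is a hypothetical set of real words that start with 'n'
    and must be left untouched (for example {"new"}).
    """
    cleaned = []
    for word in text.split():      # split() with no argument skips empty strings
        if word in ("r", "n"):
            continue               # standalone marker: drop it
        if word.startswith("n") and word not in keep:
            word = word[1:]        # glued marker: strip the leading 'n'
        cleaned.append(word)
    return " ".join(cleaned)

text = "r n r n r nFamily Medical History new r n  r n r r r  Roger nRobert n nDawson n49 nyears old , right shoulder"

print(strip_stray_markers(text))
# => Family Medical History ew Roger Robert Dawson 49 years old , right shoulder
print(strip_stray_markers(text, keep={"new"}))
# => Family Medical History new Roger Robert Dawson 49 years old , right shoulder
```

With an empty keep set the function behaves like the original loop on this input; maintaining the set by hand is only practical when the vocabulary of n-words is small.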
