I want to remove the extra r and n from this string. I tried regex, but I am not sure whether regex or some other method would be helpful here.
This is the code I am trying to use:
import re
text = "r n r n r nFamily Medical History new r n r n r r r Roger nRobert n nDawson n49 nyears old , right shoulder"
regex_pattern = re.compile(r'\s[rn]\s')
matches = regex_pattern.findall(text)
for match in matches:
    text = text.replace(match, " ")
print(text)
Current Output:
r nFamily Medical History new Roger nRobert nDawson n49 nyears old , right shoulder
We still see many r n. I am also wondering how to remove the 'n' from n49 and nyears, and how to remove the first 'n' from nDawson without removing the last 'n'.
Expected Output:
Family Medical History new Roger Robert Dawson 49 years old , right shoulder
I would suggest a bit of an NLP approach here, as I do not see how regex can tell nyears (wrong spelling) from new (correct spelling).
First, remove all standalone r / n and those glued to capitalized words and numbers, then split the string and check each word that starts with n or r with a spellchecker. The first letter can be removed if word[1:] is correct and word is not. If neither is correct, I think it is safe to fall back to word.
To run the spellcheck, you can use TextBlob, for example.
Here is a Python code demo:
import re
from textblob import Word

s = "r n r n r nFamily Medical History new r n r n r r r Roger nRobert n nDawson n49 nyears old , right shoulder"
s = re.sub(r'\b[rn](?=[A-Z0-9\s]|$)', '', s)
result = []
for w in s.split():
    if not w.startswith(('n', 'r')):          # The w word does not start with n or r...
        result.append(w)                      # Add it to the result
    elif Word(w).correct() == w:              # If w is a correct word
        result.append(w)                      # Add it to the result
    elif Word(w[1:]).correct() == w[1:]:      # If w[1:] is correct
        result.append(w[1:])                  # Add w[1:] to the result
    else:
        result.append(w)                      # Fallback: add w to the result
print(" ".join(result))
# => Family Medical History new Roger Robert Dawson 49 years old , right shoulder
The re.sub(r'\b[rn](?=[A-Z0-9\s]|$)', '', s) part removes r and n at the start of words if they are immediately followed by an uppercase letter, a digit, whitespace, or the end of the string.
Then, for w in s.split(): iterates over the words in the sentence and replaces a word with w[1:] only if it starts with n or r, is itself misspelled, and w[1:] is spelled correctly.
DISCLAIMER: TextBlob is used as an example. You are free to use any other spellchecking library. TextBlob spellchecking "is based on Peter Norvig's “How to Write a Spelling Corrector” as implemented in the pattern library. It is about 70% accurate."
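Since the spellchecker is interchangeable, the same two-step idea can be sketched with a pluggable predicate. The KNOWN_WORDS set and the is_known and clean names below are hypothetical stand-ins for illustration, not part of TextBlob or any other library; a real spellchecker would take the place of the toy vocabulary.

```python
import re

# Hypothetical toy vocabulary; swap in a real spellchecker for actual use.
KNOWN_WORDS = {"new", "years", "old", "right", "shoulder"}

def is_known(word):
    """Assumed spellcheck predicate: True if the word is in the toy vocabulary."""
    return word.lower() in KNOWN_WORDS

def clean(text, is_known=is_known):
    # Step 1: drop r/n that stand alone or are glued to an uppercase letter or digit.
    text = re.sub(r'\b[rn](?=[A-Z0-9\s]|$)', '', text)
    result = []
    for w in text.split():
        # Step 2: strip a leading r/n only if that turns an unknown word into a known one.
        if w.startswith(('n', 'r')) and not is_known(w) and is_known(w[1:]):
            w = w[1:]
        result.append(w)
    return " ".join(result)

s = "r n r n r nFamily Medical History new r n r n r r r Roger nRobert n nDawson n49 nyears old , right shoulder"
print(clean(s))
# => Family Medical History new Roger Robert Dawson 49 years old , right shoulder
```

Passing the predicate as an argument keeps the cleaning logic independent of whichever spellchecking library is chosen.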
Try something like this: \b[rn](?=[A-Z0-9 ])
The \b matches at a word boundary (for example, at the start of the string or right after a space or newline).
The [rn] matches either 'r' or 'n'.
The (?=[A-Z0-9 ]) is a lookahead that requires the next character to be an uppercase letter, a digit, or a space, without including it in the match.
Check out https://regex101.com/r/hSmYyi/1 for experimenting with and testing regexes.
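As a rough sketch of how this pattern could be applied in Python, the snippet below removes the matches with re.sub and then collapses the leftover runs of spaces. The variable names are only illustrative.

```python
import re

text = "r n r n r nFamily Medical History new r n r n r r r Roger nRobert n nDawson n49 nyears old , right shoulder"

# Remove r/n at a word boundary when followed by an uppercase letter, digit, or space.
pattern = re.compile(r'\b[rn](?=[A-Z0-9 ])')
stripped = pattern.sub('', text)

# Collapse the runs of spaces left behind by the removed markers.
cleaned = " ".join(stripped.split())
print(cleaned)
# => Family Medical History new Roger Robert Dawson 49 nyears old , right shoulder
```

Note that nyears is left as-is: the n is followed by a lowercase letter, so the pattern cannot tell it apart from a word like new.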
Old school over here:
>>> text = "r n r n r nFamily Medical History new r n r n r r r Roger nRobert n nDawson n49 nyears old , right shoulder"
>>> newText = []
>>> for word in text.split(' '):
...     if word and not (word == 'n' or word == 'r'):
...         if not word[0] == 'n':
...             newText.append(word)
...         else:
...             newText.append(word[1:])
...
>>> newText
['Family', 'Medical', 'History', 'ew', 'Roger', 'Robert', 'Dawson', '49', 'years', 'old', ',', 'right', 'shoulder']
>>> ' '.join(newText)
'Family Medical History ew Roger Robert Dawson 49 years old , right shoulder'
>>>
Of course, you can refactor it as you please.
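One possible refactoring, offered only as a sketch: the loop above strips every leading n, which is why new comes out as ew. The hypothetical strip_markers function below strips a leading n only when it is glued to an uppercase letter or a digit, so ordinary words survive, at the cost of leaving nyears untouched.

```python
def strip_markers(text):
    """Drop standalone r/n tokens and a leading n glued to a capital or digit."""
    words = []
    for word in text.split():
        if word in ('n', 'r'):
            continue  # standalone marker, drop it
        # Strip the leading n only when the next character is uppercase or a digit.
        if len(word) > 1 and word[0] == 'n' and (word[1].isupper() or word[1].isdigit()):
            word = word[1:]
        words.append(word)
    return ' '.join(words)

text = "r n r n r nFamily Medical History new r n r n r r r Roger nRobert n nDawson n49 nyears old , right shoulder"
print(strip_markers(text))
# => Family Medical History new Roger Robert Dawson 49 nyears old , right shoulder
```

Handling nyears correctly would still need some dictionary or spellcheck lookup, since nothing in the characters alone distinguishes it from new.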