To address one of the comments, my overall goal is to understand how to write a regular expression that uses word boundaries inside a positive or negative look-behind, since it seems you cannot use quantifiers there.
So for my specific case, I want to be able to check that the word preceding a period ('.') is not a capitalized word. Therefore, I could approach this from two separate paths in my mind:
1) A positive look-behind asserting that the word preceding the '.' is all lowercase. However, I receive an error that the look-behind must be fixed-width, so I cannot use the quantifier '+' like so: (?<=[^A-Z][a-z]+)
2) A negative look-behind asserting that the word preceding the '.' does not begin with a capital letter, like so: (?<![A-Z][a-z])
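As a quick illustration of the restriction described in option 1, here is a minimal sketch. Python's re module refuses to compile a look-behind whose width can vary, while a fixed-width one compiles, and a word boundary \b is allowed inside it because it consumes no characters. The helper name compiles is made up for this example.

```python
import re

def compiles(pattern):
    """Return True if Python's re module accepts the pattern (hypothetical helper)."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

# '+' makes the look-behind variable-width, so re rejects it
print(compiles(r'(?<=[^A-Z][a-z]+)\.'))   # False

# exactly two characters wide, so re accepts it
print(compiles(r'(?<![A-Z][a-z])\.'))     # True

# \b is zero-width, so this look-behind is still exactly three characters wide
print(compiles(r'(?<=\b[a-z]{3})\.'))     # True
```

So word boundaries themselves are fine in a look-behind; the difficulty is that the word between the boundary and the period has no fixed length.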
I would prefer to move forward with some adaptation of option 1, since it makes more sense to me, but I am open to other suggestions. Would I be able to make use of word boundaries here?
I am using this to eventually split the paragraph into respective sentences, and I would like to stick with regex as opposed to using nltk. The issue mainly resides in dealing with initials or the abbreviations of first names.
CURRENT REGEX:
(?<=[^A-Z][a-z])\.(?=\s[A-Z]+)
INPUT:
Koehler rides the bus. Bowman was passed into the first grade; Koehler advanced to third grade. Jon. Williams walked down the road to school. Bowman decided to go fishing; Koehler did not. C. Robinson asked to go to recess, and the teacher said no.
DESIRED OUTPUT:
Koehler rides the bus.
Bowman was passed into the first grade; Koehler advanced to third grade.
Jon. Williams walked down the road to school.
Bowman decided to go fishing; Koehler did not.
C. Robinson asked to go to recess, and the teacher said no.
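To make the problem concrete, the following sketch lists the word in front of every period that the current regex treats as a sentence end. The variable names are chosen for this illustration only.

```python
import re

text = (
    "Koehler rides the bus. Bowman was passed into the first grade; "
    "Koehler advanced to third grade. Jon. Williams walked down the road "
    "to school. Bowman decided to go fishing; Koehler did not. "
    "C. Robinson asked to go to recess, and the teacher said no."
)

current = re.compile(r'(?<=[^A-Z][a-z])\.(?=\s[A-Z]+)')

# For each matched period, take the last whitespace-separated word before it
words_before = [text[:m.start()].split()[-1] for m in current.finditer(text)]
print(words_before)  # ['bus', 'grade', 'Jon', 'school', 'not']
```

The look-behind only inspects the two characters before the period, so the "on" in "Jon" satisfies it. "C." is skipped, and the final period is never matched because nothing follows it.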
I would recommend re.sub for your particular case. Your regex becomes much simpler this way, and you don't need a lookbehind at all, which avoids its restrictions (in Python's re module it has to be fixed-width, for example).
Code
print(re.sub(r'(\b[a-z]+\.\s*(?!$))', r'\1\n', text, flags=re.M))
Output
Koehler rides the bus.
Bowman was passed into the first grade; Koehler advanced to third grade.
Jon. Williams walked down the road to school.
Bowman decided to go fishing; Koehler did not.
C. Robinson asked to go to recess, and the teacher said no.
Regex Details
( # first capture group
\b # word boundary
[a-z]+ # lower case a-z
\. # literal period
\s* # any other whitespace characters (added for cosmetic effect)
(?!$) # negative lookahead - don't insert a newline at the very end of the text (or, with re.M, at the end of a line)
)
This pattern is replaced by:
\1 # reference to the first capture group
\n # a newline
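If a list of sentences is wanted rather than printed text, the same substitution can be followed by a split on the inserted newlines. This is a sketch under the assumption that the input itself contains no line breaks; the names marked and sentences are illustrative.

```python
import re

text = (
    "Koehler rides the bus. Bowman was passed into the first grade; "
    "Koehler advanced to third grade. Jon. Williams walked down the road "
    "to school. Bowman decided to go fishing; Koehler did not. "
    "C. Robinson asked to go to recess, and the teacher said no."
)

# flags must be passed by keyword: the fourth positional argument of re.sub is count
marked = re.sub(r'(\b[a-z]+\.\s*(?!$))', r'\1\n', text, flags=re.M)

# The whitespace captured by \s* sits before each newline, so strip it off
sentences = [line.strip() for line in marked.splitlines()]

for sentence in sentences:
    print(sentence)
```

For the sample input this prints the five sentences from the desired output, one per line.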
Try
mystr="Koehler rides the bus. Bowman was passed into the first grade; Koehler advanced to third grade. Jon. Williams walked down the road to school. Bowman decided to go fishing; Koehler did not. C. Robinson asked to go to recess, and the teacher said no."
lst=re.findall(r'.+?\b(?![A-Z])\w+\.',mystr)
If the text spans multiple lines, use the following instead:
lst=re.findall(r'.+?(?:$|\b(?![A-Z])\w+\b\.)',mystr,re.M)
For the single-line mystr above, both of them produce:
['Koehler rides the bus.', ' Bowman was passed into the first grade; Koehler advanced to third grade.', ' Jon. Williams walked down the road to school.', ' Bowman decided to go fishing; Koehler did not.', ' C. Robinson asked to go to recess, and the teacher said no.']
Explanation of '.+?\b(?![A-Z])\w+\.'
.+? #as few characters as possible after the end of the previous match, so the text is cut into as many distinct sentences as possible
\b #word boundary
(?![A-Z]) #negative lookahead => don't follow \b with [A-Z] => skip capitalized words
\w+ #the whole word
\. #followed by a dot
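Put together as a runnable sketch, with the leading space that findall leaves on every match after the first stripped off. The names raw and sentences are illustrative.

```python
import re

mystr = (
    "Koehler rides the bus. Bowman was passed into the first grade; "
    "Koehler advanced to third grade. Jon. Williams walked down the road "
    "to school. Bowman decided to go fishing; Koehler did not. "
    "C. Robinson asked to go to recess, and the teacher said no."
)

# Each match runs up to the first period that follows a word not starting with A-Z
raw = re.findall(r'.+?\b(?![A-Z])\w+\.', mystr)

# Matches after the first begin with the space that separated the sentences
sentences = [s.strip() for s in raw]
print(sentences)
```

Note that \w also matches digits and underscores, so a word such as "3rd." would end a sentence under this pattern as well.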
In case you want to create a list of the sentences, here's another option:
# Split into sentences (last word is split off too)
temp = re.split(r'( [a-z]+\.)', text)
temp = list(filter(bool, temp))  # list() is needed in Python 3, where filter returns an iterator
['Koehler rides the', ' bus.', ' Bowman was passed into the first grade; Koehler advanced to third', ' grade.', ' Jon. Williams walked down the road to', ' school.', ' Bowman decided to go fishing; Koehler did', ' not.', ' C. Robinson asked to go to recess, and the teacher said', ' no.']
# Join the pieces back together
sentences = [''.join([temp[i], temp[i + 1]]).strip() for i in range(0, len(temp), 2)]
['Koehler rides the bus.', 'Bowman was passed into the first grade; Koehler advanced to third grade.', 'Jon. Williams walked down the road to school.', 'Bowman decided to go fishing; Koehler did not.', 'C. Robinson asked to go to recess, and the teacher said no.']
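The split-and-rejoin idea can also be wrapped in a function. The sketch below is a variation rather than the exact code above: instead of filtering out empty strings, it relies on re.split with a capture group always returning alternating pieces (text, separator, text, ...), and it keeps any trailing text that does not end in a lowercase word plus period. The function name split_sentences is made up for this example.

```python
import re

def split_sentences(text):
    """Split text after each ' lowercaseword.' (hypothetical helper)."""
    # With one capture group, the result alternates: body, terminator, ..., tail
    parts = re.split(r'( [a-z]+\.)', text)
    bodies, terminators = parts[0::2], parts[1::2]

    # zip stops at the shorter list, so the final tail is handled separately
    sentences = [(body + end).strip() for body, end in zip(bodies, terminators)]

    tail = parts[-1].strip()
    if tail:
        sentences.append(tail)
    return sentences

text = (
    "Koehler rides the bus. Bowman was passed into the first grade; "
    "Koehler advanced to third grade. Jon. Williams walked down the road "
    "to school. Bowman decided to go fishing; Koehler did not. "
    "C. Robinson asked to go to recess, and the teacher said no."
)
print(split_sentences(text))
```

Pairing by position avoids the case where filtering removes an empty piece in the middle of the list and shifts every later pair by one.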