Using regex with a positive look-behind to split strings in Python

To address one of the comments: my overall goal is to understand how to write a regular expression that uses word boundaries inside a positive or negative look-behind, since it seems quantifiers are not allowed there.

For my specific case, I want to check that the word preceding a period ('.') is not a capitalized word. I can see two ways to approach this:

1) A positive look-behind asserting that the word preceding the '.' is all lowercase. However, Python reports that a look-behind requires a fixed-width pattern, so I cannot use the '+' quantifier like so: (?<=[^A-Z][a-z]+)

2) A negative look-behind asserting that the word preceding the '.' begins with a capital letter, like so: (?<![A-Z][a-z])

I would prefer to move forward with some adaptation of option 1, since it makes more sense to me, but I am open to other suggestions. Would I be able to make use of word boundaries here?
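To make the restriction concrete, here is a minimal sketch that checks which of these look-behinds Python's built-in re module will compile. The helper name compiles and the third pattern (a word boundary plus an exact {3} repeat) are illustrative assumptions, not part of the original question.

```python
import re

def compiles(pattern):
    """Return True if Python's re module accepts the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error as exc:
        print(f"{pattern!r} rejected: {exc}")
        return False

# Option 1: the '+' makes the look-behind variable-width, so re rejects it.
option_1 = compiles(r"(?<=[^A-Z][a-z]+)\.")

# Option 2: a two-character negative look-behind has a fixed width.
option_2 = compiles(r"(?<![A-Z][a-z])\.")

# \b is zero-width, so it can sit inside a look-behind as long as
# everything else has a fixed width (here exactly three letters).
with_boundary = compiles(r"(?<=\b[a-z]{3})\.")

print(option_1, option_2, with_boundary)
```

So word boundaries themselves are allowed inside a look-behind; it is the unbounded quantifier that the re module rejects.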

I am using this to eventually split the paragraph into respective sentences, and I would like to stick with regex as opposed to using nltk. The issue mainly resides in dealing with initials or the abbreviations of first names.

CURRENT REGEX:

(?<=[^A-Z][a-z])\.(?=\s[A-Z]+)

INPUT:

Koehler rides the bus. Bowman was passed into the first grade; Koehler advanced to third grade. Jon. Williams walked down the road to school. Bowman decided to go fishing; Koehler did not. C. Robinson asked to go to recess, and the teacher said no.

DESIRED OUTPUT:

Koehler rides the bus.
Bowman was passed into the first grade; Koehler advanced to third grade.
Jon. Williams walked down the road to school.
Bowman decided to go fishing; Koehler did not.
C. Robinson asked to go to recess, and the teacher said no.
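As a rough illustration of why the current regex falls short, the sketch below runs it through re.split on the input above (using re.split here is an assumption about how the pattern would be applied). The two-character look-behind only inspects the two characters before the period, so "Jon." still counts as a sentence end because "on" qualifies, and the period itself is consumed by the split.

```python
import re

text = (
    "Koehler rides the bus. Bowman was passed into the first grade; "
    "Koehler advanced to third grade. Jon. Williams walked down the road "
    "to school. Bowman decided to go fishing; Koehler did not. C. Robinson "
    "asked to go to recess, and the teacher said no."
)

# The regex from the question, unchanged.
current = r"(?<=[^A-Z][a-z])\.(?=\s[A-Z]+)"

parts = re.split(current, text)
for part in parts:
    print(repr(part))
```

The pieces include ' Jon' on its own, while "C." is left intact because the character before its period is a capital letter.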

I would recommend re.sub for your particular case. Your regex becomes much simpler this way, and you don't need a look-behind at all, which avoids its restrictions (in Python's re module, a look-behind pattern must be fixed-width).

Code

import re
print(re.sub(r'(\b[a-z]+\.\s*(?!$))', r'\1\n', text, flags=re.M))  # flags must be a keyword: the 4th positional argument is count

Output

Koehler rides the bus. 
Bowman was passed into the first grade; Koehler advanced to third grade. 
Jon. Williams walked down the road to school. 
Bowman decided to go fishing; Koehler did not. 
C. Robinson asked to go to recess, and the teacher said no.

Regex Details

(         # first capture group
\b        # word boundary
[a-z]+    # lower case a-z
\.        # literal period
\s*       # any trailing whitespace (added for cosmetic effect)
(?!$)     # negative lookahead - don't insert a newline at the end of the text (or of a line, with re.M)
)

This pattern is replaced by:

\1        # reference to the first capture group 
\n        # a newline
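The annotated pattern above can be kept in that commented form in real code by compiling it with re.VERBOSE. The following is a minimal sketch, not the answerer's exact code: the function name split_sentences and the splitlines/strip post-processing are assumptions added to turn the substituted text into a list.

```python
import re

# Same pattern as above, written with re.VERBOSE so the comments
# can live inside the regex itself.
SENTENCE_END = re.compile(r"""
    (          # first capture group
    \b         # word boundary
    [a-z]+     # an all-lowercase word
    \.         # literal period
    \s*        # any trailing whitespace
    (?!$)      # but not at the very end of the text
    )
""", re.VERBOSE)

def split_sentences(text):
    """Insert a newline after each sentence end, then split on newlines."""
    marked = SENTENCE_END.sub(r"\1\n", text)
    return [line.strip() for line in marked.splitlines()]

text = (
    "Koehler rides the bus. Bowman was passed into the first grade; "
    "Koehler advanced to third grade. Jon. Williams walked down the road "
    "to school. Bowman decided to go fishing; Koehler did not. C. Robinson "
    "asked to go to recess, and the teacher said no."
)

for sentence in split_sentences(text):
    print(sentence)
```

Stripping each line also removes the trailing space that the \s* in the capture group leaves before every inserted newline.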

Try

import re

mystr="Koehler rides the bus. Bowman was passed into the first grade; Koehler advanced to third grade. Jon. Williams walked down the road to school. Bowman decided to go fishing; Koehler did not. C. Robinson asked to go to recess, and the teacher said no."
lst=re.findall(r'.+?\b(?![A-Z])\w+\.',mystr)

If the text spans multiple lines, use this instead:

lst=re.findall(r'.+?(?:$|\b(?![A-Z])\w+\b\.)',mystr,re.M)

For the single-line mystr above, both of them produce:

['Koehler rides the bus.', ' Bowman was passed into the first grade; Koehler advanced to third grade.', ' Jon. Williams walked down the road to school.', ' Bowman decided to go fishing; Koehler did not.', ' C. Robinson asked to go to recess, and the teacher said no.']

Explanation of '.+?\b(?![A-Z])\w+\.'

.+?       #as few characters as possible after the end of the previous match, so the text is cut into as many separate sentences as possible
\b        #word boundary
(?![A-Z]) #negative lookahead => don't follow \b with [A-Z] => skip capitalized words
\w+       #the whole word
\.        #followed by a dot
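Putting the pieces of this explanation together, a small self-contained sketch might look like the following. The strip() call is an addition of mine to drop the leading space that each match after the first carries; everything else follows the pattern explained above.

```python
import re

text = (
    "Koehler rides the bus. Bowman was passed into the first grade; "
    "Koehler advanced to third grade. Jon. Williams walked down the road "
    "to school. Bowman decided to go fishing; Koehler did not. C. Robinson "
    "asked to go to recess, and the teacher said no."
)

# Lazily consume text up to the first word that does not start with a
# capital letter and is immediately followed by a period.
pattern = re.compile(r".+?\b(?![A-Z])\w+\.")

sentences = [match.strip() for match in pattern.findall(text)]

for sentence in sentences:
    print(sentence)
```

Note that \w also matches digits and underscores, so a token such as "3." would be treated as a sentence end by this pattern.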


In case you want to create a list of the sentences, here's another option:

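import re

# Split into pieces; the capture group keeps each " word." delimiter as its own item
temp = re.split(r'( [a-z]+\.)', text)
# Drop empty strings; list() is needed in Python 3, where filter() returns an iterator
temp = list(filter(bool, temp))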

['Koehler rides the', ' bus.', ' Bowman was passed into the first grade; Koehler advanced to third', ' grade.', ' Jon. Williams walked down the road to', ' school.', ' Bowman decided to go fishing; Koehler did', ' not.', ' C. Robinson asked to go to recess, and the teacher said', ' no.']

# Join the pieces back together
sentences = [''.join([temp[i], temp[i + 1]]).strip() for i in range(0, len(temp), 2)]

['Koehler rides the bus.', 'Bowman was passed into the first grade; Koehler advanced to third grade.', 'Jon. Williams walked down the road to school.', 'Bowman decided to go fishing; Koehler did not.', 'C. Robinson asked to go to recess, and the teacher said no.']
