Using regex with a positive look-behind to split strings in python

Question

To address one of the comments: my overall goal is to understand how to write a regular expression that lets me use word boundaries in a positive or negative look-behind, since it seems you cannot use quantifiers there.

For my specific case, I want to check that the word preceding a period ('.') is not a capitalized word. I can see two ways to approach this:

1) A positive look-behind asserting that the word preceding the '.' is all lowercase. However, I receive an error because a look-behind must be fixed-width, so I cannot use the quantifier '+' like so: (?<=[^A-Z][a-z]+)

2) A negative look-behind asserting that the word preceding the '.' does not begin with a capital letter, like so: (?<![A-Z][a-z])

I would prefer some adaptation of option 1, since it makes more sense to me, but I am open to other suggestions. Would I be able to make use of word boundaries here?

I am using this to eventually split the paragraph into respective sentences, and I would like to stick with regex as opposed to using nltk. The issue mainly resides in dealing with initials or the abbreviations of first names.

CURRENT REGEX:

(?<=[^A-Z][a-z])\.(?=\s[A-Z]+)

INPUT:

Koehler rides the bus. Bowman was passed into the first grade; Koehler advanced to third grade. Jon. Williams walked down the road to school. Bowman decided to go fishing; Koehler did not. C. Robinson asked to go to recess, and the teacher said no.

DESIRED OUTPUT:

Koehler rides the bus.
Bowman was passed into the first grade; Koehler advanced to third grade.
Jon. Williams walked down the road to school.
Bowman decided to go fishing; Koehler did not.
C. Robinson asked to go to recess, and the teacher said no.
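
The two problems described above can be reproduced with a short sketch. It assumes Python's standard re module and uses a shortened version of the input; the variable names are illustrative only. The quantified look-behind is rejected when the pattern is compiled, and the fixed-width version compiles but only inspects the two characters before the period, so "Jon." is still treated as a sentence end.

```python
import re

# Shortened version of the question's input (assumption: enough to show both issues).
text = ("Koehler rides the bus. Jon. Williams walked down the road to school. "
        "C. Robinson asked to go to recess.")

# Option 1 as written: a quantifier inside the look-behind is rejected
# when the pattern is compiled.
try:
    re.compile(r'(?<=[^A-Z][a-z]+)\.')
    lookbehind_error = None
except re.error as exc:
    lookbehind_error = str(exc)

# The fixed-width regex from the question compiles, but "on" in "Jon."
# satisfies [^A-Z][a-z], so the split also happens after "Jon".
current = re.compile(r'(?<=[^A-Z][a-z])\.(?=\s[A-Z]+)')
pieces = current.split(text)

print(lookbehind_error)
print(pieces)
```

Note that splitting on this pattern also consumes the period itself, because the period is the part of the pattern that is actually matched.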

Answer 1

I would recommend re.sub for your particular case. Your regex becomes much simpler this way, and you don't need a lookbehind at all, which avoids its restrictions (such as having to be fixed-width).

Code

import re
print(re.sub(r'(\b[a-z]+\.\s*(?!$))', r'\1\n', text, flags=re.M))  # flags must be passed by keyword; the 4th positional argument is count

Output

Koehler rides the bus. 
Bowman was passed into the first grade; Koehler advanced to third grade. 
Jon. Williams walked down the road to school. 
Bowman decided to go fishing; Koehler did not. 
C. Robinson asked to go to recess, and the teacher said no.

Regex Details

(         # first capture group
\b        # word boundary
[a-z]+    # lower case a-z
\.        # literal period
\s*       # any other whitespace characters (added for cosmetic effect)
(?!$)     # negative lookahead - don't insert a newline at the end of the text (or of a line, with re.M)
)

This pattern is replaced by:

\1        # reference to the first capture group 
\n        # a newline
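
If a list of sentences is wanted rather than printed text, the same substitution can be followed by a split on the inserted newlines. This is a minimal sketch, not part of the original answer: it moves \s* outside the capture group so the whitespace between sentences is dropped, and it assumes the input contains no newlines of its own.

```python
import re

text = ("Koehler rides the bus. Bowman was passed into the first grade; "
        "Koehler advanced to third grade. Jon. Williams walked down the road "
        "to school. Bowman decided to go fishing; Koehler did not. "
        "C. Robinson asked to go to recess, and the teacher said no.")

# Same idea as the substitution above, but \s* sits outside the capture group,
# so the whitespace between sentences is replaced by the newline.
marked = re.sub(r'(\b[a-z]+\.)\s*(?!$)', r'\1\n', text)

# Each line of the marked-up text is one sentence.
sentences = marked.split('\n')

for sentence in sentences:
    print(sentence)
```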

Answer 2

Try

import re
mystr = "Koehler rides the bus. Bowman was passed into the first grade; Koehler advanced to third grade. Jon. Williams walked down the road to school. Bowman decided to go fishing; Koehler did not. C. Robinson asked to go to recess, and the teacher said no."
lst = re.findall(r'.+?\b(?![A-Z])\w+\.', mystr)

If the input spans multiple lines, use this instead:

lst = re.findall(r'.+?(?:$|\b(?![A-Z])\w+\b\.)', mystr, re.M)

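For the single-line string above, both of them produce:

['Koehler rides the bus.', ' Bowman was passed into the first grade; Koehler advanced to third grade.', ' Jon. Williams walked down the road to school.', ' Bowman decided to go fishing; Koehler did not.', ' C. Robinson asked to go to recess, and the teacher said no.']

If the text contains a line break (for example after "advanced"), the multiline version returns the text on either side of the break as separate items: ' Bowman was passed into the first grade; Koehler advanced' and 'to third grade.'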

Explanation of '.+?\b(?![A-Z])\w+\.'

.+?       #as few characters as possible after the end of the previous match, so the text is cut into as many distinct sentences as possible
\b        #word boundary
(?![A-Z]) #negative lookahead => don't follow \b with [A-Z] => skip capitalized words
\w+       #the whole word
\.        #followed by a dot
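
As a runnable sketch of the pattern explained above (the strip() call is an addition for illustration, used to drop the leading space that each match after the first one carries):

```python
import re

mystr = ("Koehler rides the bus. Bowman was passed into the first grade; "
         "Koehler advanced to third grade. Jon. Williams walked down the road "
         "to school. Bowman decided to go fishing; Koehler did not. "
         "C. Robinson asked to go to recess, and the teacher said no.")

# Lazily consume text until a word that does not start with A-Z
# is immediately followed by a period.
pattern = re.compile(r'.+?\b(?![A-Z])\w+\.')

# findall returns each match; strip() removes the leading space.
sentences = [match.strip() for match in pattern.findall(mystr)]

for sentence in sentences:
    print(sentence)
```

Because \w also matches digits and underscores, a token such as "3." would be treated as a sentence end by this pattern as well.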


Answer 3

In case you want to create a list of the sentences, here's another option:

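import re

# Split into sentences (the last word of each sentence is split off too)
temp = re.split(r'( [a-z]+\.)', text)
# Drop empty strings; list() is needed in Python 3, where filter() returns an iterator
temp = list(filter(bool, temp))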

['Koehler rides the', ' bus.', ' Bowman was passed into the first grade; Koehler advanced to third', ' grade.', ' Jon. Williams walked down the road to', ' school.', ' Bowman decided to go fishing; Koehler did', ' not.', ' C. Robinson asked to go to recess, and the teacher said', ' no.']

# Join the pieces back together
sentences = [''.join([temp[i], temp[i + 1]]).strip() for i in range(0, len(temp), 2)]

['Koehler rides the bus.', 'Bowman was passed into the first grade; Koehler advanced to third grade.', 'Jon. Williams walked down the road to school.', 'Bowman decided to go fishing; Koehler did not.', 'C. Robinson asked to go to recess, and the teacher said no.']
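
The split-and-rejoin steps above could also be wrapped in a helper. The sketch below is one possible variation, not the original author's code; the function name split_sentences is hypothetical. It relies on the fact that re.split with a single capture group returns alternating pieces (body, delimiter, body, ..., tail), so bodies and endings can be paired by slicing instead of filtering out empty strings.

```python
import re

def split_sentences(text):
    """Split text after lowercase words that end in a period (hypothetical helper)."""
    parts = re.split(r'( [a-z]+\.)', text)
    # Even indexes hold the sentence bodies, odd indexes the captured endings.
    bodies, endings = parts[0::2], parts[1::2]
    sentences = [(body + ending).strip() for body, ending in zip(bodies, endings)]
    # Whatever follows the last ending is kept if it is not empty.
    tail = parts[-1].strip()
    if tail:
        sentences.append(tail)
    return sentences

text = ("Koehler rides the bus. Bowman was passed into the first grade; "
        "Koehler advanced to third grade. Jon. Williams walked down the road "
        "to school. Bowman decided to go fishing; Koehler did not. "
        "C. Robinson asked to go to recess, and the teacher said no.")

for sentence in split_sentences(text):
    print(sentence)
```

Pairing by position keeps bodies and endings aligned even when one of the bodies is an empty string, which filtering with bool would silently drop.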

3 answers

solution1
3 ACCEPTED 2017-09-18 04:11:08

solution2
1 2017-09-18 06:10:02

solution3
0 2017-09-18 04:57:51
