
Split by '.' when not preceded by digit

I want to split '10.1 This is a sentence. Another sentence.' as ['10.1 This is a sentence', 'Another sentence'] and split '10.1. This is a sentence. Another sentence.' as ['10.1. This is a sentence', 'Another sentence'].

I have tried

s.split(r'\D.\D')

It doesn't work. How can this be solved?

If you plan to split a string on a . char that is neither preceded nor followed by a digit and that is not at the end of the string, a splitting approach might work for you:

re.split(r'(?<!\d)\.(?!\d|$)', text)

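As a rough sketch of how this split behaves on the two strings from the question: the helper name split_sentences and the clean-up step (trimming whitespace and the final period) are assumptions added for illustration, not part of the original answer.

```python
import re

# Split on a '.' that is neither preceded nor followed by a digit
# and that is not the last character of the string.
SPLIT_RE = re.compile(r'(?<!\d)\.(?!\d|$)')

def split_sentences(text):
    # Hypothetical helper: split, then trim surrounding whitespace and
    # the trailing '.' that the pattern leaves on the last piece.
    parts = SPLIT_RE.split(text)
    return [p.strip().rstrip('.') for p in parts if p.strip()]

print(split_sentences('10.1 This is a sentence. Another sentence.'))
# ['10.1 This is a sentence', 'Another sentence']
print(split_sentences('10.1. This is a sentence. Another sentence.'))
# ['10.1. This is a sentence', 'Another sentence']
```

Without the clean-up step, re.split returns ' Another sentence.' as the second item, with the leading space and the final period still attached.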

If your strings can contain more special cases, you could use a more customizable extracting approach:

re.findall(r'(?:\d+(?:\.\d+)*\.?|[^.])+', text)

Details:

  • (?:\d+(?:\.\d+)*\.?|[^.])+ - a non-capturing group that matches one or more occurrences of
    • \d+(?:\.\d+)*\.? - one or more digits (\d+), then zero or more sequences of a . and one or more digits ((?:\.\d+)*), and then an optional . char (\.?)
    • | - or
    • [^.] - any char other than a . char.
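The extracting approach described above could be tried out roughly like this. The function name extract_sentences and the whitespace trimming are assumptions for illustration only.

```python
import re

# A "sentence" is a run of either dotted numbers (10.1, 10.1.) or
# any characters other than '.'.
SENTENCE_RE = re.compile(r'(?:\d+(?:\.\d+)*\.?|[^.])+')

def extract_sentences(text):
    # Hypothetical helper: collect all matches, trim whitespace,
    # and drop matches that consist of whitespace only.
    return [m.strip() for m in SENTENCE_RE.findall(text) if m.strip()]

print(extract_sentences('10.1 This is a sentence. Another sentence.'))
# ['10.1 This is a sentence', 'Another sentence']
print(extract_sentences('10.1. This is a sentence. Another sentence.'))
# ['10.1. This is a sentence', 'Another sentence']
```

Because the sentence-ending periods are never part of a match, no extra step is needed to remove them; only the leading space of later sentences has to be trimmed.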

You have multiple issues:

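  1. You're not using re.split(), you're using str.split(), which treats its argument as a literal string rather than a pattern.
  2. You haven't escaped the . , so it matches any character; use \. instead.
  3. You're not using lookaheads and lookbehinds, so the three characters matched by the pattern are consumed and removed from the result.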
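The three issues can be reproduced with a short sketch. The variable names are made up for illustration.

```python
import re

s = '10.1 This is a sentence. Another sentence.'

# Issue 1: str.split() looks for the literal text \D.\D, which never
# occurs in s, so the string comes back unsplit.
literal = s.split(r'\D.\D')
print(literal)

# Issue 2: with re.split() but an unescaped '.', the pattern matches
# almost any three characters, so nearly all of the text is consumed.
unescaped = re.split(r'\D.\D', s)
print(unescaped)

# Issue 3: with the '.' escaped but no lookarounds, the characters on
# both sides of the period are consumed along with it.
consumed = re.split(r'\D\.\D', s)
print(consumed)
# ['10.1 This is a sentenc', 'Another sentence.']
```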

Fixed code:

>>> import re
>>> s = '10.1 This is a sentence. Another sentence.'
>>> re.split(r"(?<=\D\.)(?=\D)", s)
['10.1 This is a sentence.', ' Another sentence.']

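Basically, (?<=\D\.) matches a position right after a . that is itself preceded by a non-digit character, and (?=\D) then makes sure a non-digit character follows that position. Because both parts are zero-width lookarounds, nothing is consumed: the string is split at that position and the period stays with the first piece.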
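To get from that result to the exact lists asked for in the question, the pieces still need a small clean-up. A minimal sketch follows; the helper name split_and_clean is hypothetical, and splitting on an empty match requires Python 3.7 or later.

```python
import re

def split_and_clean(text):
    # Zero-width split: after a non-digit followed by '.', before a non-digit.
    # Needs Python 3.7+, where re.split() can split on empty matches.
    pieces = re.split(r'(?<=\D\.)(?=\D)', text)
    # Trim whitespace and the sentence-ending period kept by the split.
    return [p.strip().rstrip('.') for p in pieces if p.strip()]

print(split_and_clean('10.1 This is a sentence. Another sentence.'))
# ['10.1 This is a sentence', 'Another sentence']
print(split_and_clean('10.1. This is a sentence. Another sentence.'))
# ['10.1. This is a sentence', 'Another sentence']
```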

All sentences (except the very last one) end with a period followed by a space, so split on that. Worrying about the clause number is backwards. You could potentially find all kinds of situations that you DON'T want, but it is generally much easier to describe the situation that you DO want. In this case, '. ' is that situation.

import re

doc = '10.1 This is a sentence. Another sentence.'

def sentences(doc):
    # split on a period followed by whitespace
    s = re.split(r'\.\s+', doc)

    # drop a trailing empty item, or strip the period from the last sentence
    if s[-1] == '':
        s = s[:-1]
    elif s[-1].endswith('.'):
        s[-1] = s[-1][:-1]

    return s

print(sentences(doc))

The way I structured the regex (\s+ matches any run of whitespace after the period), it should also eliminate arbitrary whitespace between paragraphs.
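The whitespace behaviour can be checked with a quick sketch. The sample text below is invented for illustration.

```python
import re

# Two "paragraphs" separated by a blank line and some indentation.
sample = 'First paragraph ends here.\n\n   Second paragraph starts. It has two sentences.'

# \s+ swallows the newlines and spaces that follow each period.
pieces = re.split(r'\.\s+', sample)
print(pieces)
# ['First paragraph ends here', 'Second paragraph starts', 'It has two sentences.']
```

Only the final period is left in place, which is why the function above strips it from the last item separately.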

