
How to split a sentence into words with a regular expression?

I want to split a sentence like this: "She's so nice!" -> ["She", "'", "s", "so", "nice", "!"]. I wrote the code below, but the result includes whitespace. How can I get this result using only a regular expression?

        words = re.findall(r'\W+|\w+', text)

-> ["She", "'", "s", " ", "so", " ", "nice", "!"]

        words = [word for word in words if not word.isspace()]

Regex: [A-Za-z]+|[^A-Za-z ]

Inside [^A-Za-z ], add any other characters you do not want to appear as tokens (the space is already excluded).

Details:

  • [] Match a single character present in the list
  • [^] Match a single character NOT present in the list
  • + Match the preceding item one or more times
  • | Alternation: match either the left or the right side

Python code:

import re

text = "She's so nice!"
matches = re.findall(r'[A-Za-z]+|[^A-Za-z ]', text)
print(matches)

Output:

['She', "'", 's', 'so', 'nice', '!']
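To illustrate the note about adding characters to the negated class, here is a minimal sketch. It assumes the input may also contain tabs and newlines; the sample string and the variable names basic and extended are made up for illustration.

```python
import re

# Hypothetical input containing a tab and a trailing newline.
text = "She's\tso nice!\n"

# Original class: only the space is excluded, so tab and newline show up as tokens.
basic = re.findall(r"[A-Za-z]+|[^A-Za-z ]", text)

# Tab and newline added to the negated class, so they are skipped as well.
extended = re.findall(r"[A-Za-z]+|[^A-Za-z \t\n]", text)

print(basic)     # ['She', "'", 's', '\t', 'so', 'nice', '!', '\n']
print(extended)  # ['She', "'", 's', 'so', 'nice', '!']
```

If every kind of whitespace should be dropped, writing the class as [^A-Za-z\s] is a shorter way to say the same thing.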


Before Python 3.7, the re module's split() could not split on zero-width matches, so a pattern built on assertions such as \b would not work there. You can use the regex package from PyPI instead; pass flags=regex.VERSION1, which makes it handle zero-width matches properly.


import regex

s = "She's so nice!"
x = regex.split(r"\s+|\b(?!^|$)", s, flags=regex.VERSION1)

print(x)

Output: ['She', "'", 's', 'so', 'nice', '!']

  • \s+|\b(?!^|$) Match either of the following options
    • \s+ Match one or more whitespace characters
    • \b(?!^|$) Assert position at a word boundary, but not at the start or end of the string
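On Python 3.7 or later, the standard re module can split on empty matches too, so something similar may work without the third-party package. The following is a minimal sketch under that assumption, not a drop-in replacement for the regex version: the names TOKEN_BREAK and split_tokens are made up for illustration, and because re.split can return an empty string where an empty match directly follows a whitespace match, the sketch filters empty strings out.

```python
import re

# Same pattern as in the regex example above: a run of whitespace, or a
# word boundary that is not at the very start or end of the string.
TOKEN_BREAK = re.compile(r"\s+|\b(?!^|$)")

def split_tokens(sentence):
    """Split on TOKEN_BREAK and drop empty strings (assumes Python 3.7+)."""
    return [piece for piece in TOKEN_BREAK.split(sentence) if piece]

print(split_tokens("She's so nice!"))
```

The list comprehension is the only addition compared with a bare split call; it keeps the result free of empty strings regardless of where the empty matches fall.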
