
How to split a sentence string into words, but also make punctuation a separate element

I'm currently trying to tokenize some language data using Python and was curious if there was an efficient or built-in method for splitting strings of sentences into separate words and also separate punctuation characters. For example:

'Hello, my name is John. What's your name?'

If I used split() on this sentence then I would get

['Hello,', 'my', 'name', 'is', 'John.', "What's", 'your', 'name?']

What I want to get is:

['Hello', ',', 'my', 'name', 'is', 'John', '.', "What's", 'your', 'name', '?']

I've tried to use methods such as searching the string, finding punctuation, storing their indices, removing them from the string and then splitting the string, and inserting the punctuation accordingly but this method seems too inefficient especially when dealing with large corpora.

Does anybody know if there's a more efficient way to do this?

Thank you.

You can do a trick:

text = "Hello, my name is John. What's your name?"
text = text.replace(",", " , ") # Add an space before and after the comma
text = text.replace(".", " . ") # Add an space before and after the point
text = text.replace("  ", " ") # Remove possible double spaces
mListtext.split(" ") # Generates your list

Or the same in one line, reading from input():

mList = input().replace(",", " , ").replace(".", " . ").replace("  ", " ").split(" ")
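If you want the same trick to cover more punctuation marks (the sample sentence also ends in a question mark), you can loop over the characters you care about. A minimal sketch of the same idea:

text = "Hello, my name is John. What's your name?"
for p in ",.?!":                     # punctuation marks to isolate; extend as needed
    text = text.replace(p, f" {p} ")
mList = text.split()                 # split() with no argument also removes the extra spaces
print(mList)
# ['Hello', ',', 'my', 'name', 'is', 'John', '.', "What's", 'your', 'name', '?']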

Here is an approach using re.finditer which at least seems to work with the sample data you provided:

import re

inp = "Hello, my name is John. What's your name?"
parts = []
for match in re.finditer(r'[^.,?!\s]+|[.,?!]', inp):
    parts.append(match.group())

print(parts)

Output:

['Hello', ',', 'my', 'name', 'is', 'John', '.', "What's", 'your', 'name', '?']

The idea here is to match one of the following two patterns:

[^.,?!\s]+    which matches one or more characters that are neither punctuation nor whitespace
[.,?!]        which matches a single punctuation character

Presumably, anything that is neither whitespace nor punctuation should count as a word/term in the sentence.

Note that the really elegant way to solve this problem would be a regex split on punctuation or whitespace. However, re.split only supports splitting on zero-width matches such as lookarounds in Python 3.7 and later, so re.finditer is used here instead.
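For completeness, on Python 3.7+ (where re.split can split on zero-width matches) the lookaround approach does work. A sketch along those lines:

import re

s = "Hello, my name is John. What's your name?"
# Split on whitespace, or at the zero-width position between a word
# character and one of the listed punctuation marks (in either order).
parts = [p for p in re.split(r"\s+|(?<=\w)(?=[.,?!])|(?<=[.,?!])(?=\w)", s) if p]
print(parts)
# ['Hello', ',', 'my', 'name', 'is', 'John', '.', "What's", 'your', 'name', '?']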

You can use re.sub to insert a space before every character from string.punctuation that is not directly followed by a word character (the \B is what keeps the apostrophe inside "What's" intact), and then use str.split to split on whitespace:

>>> s = "Hello, my name is John. What's your name?"
>>> 
>>> import string, re
>>> re.sub(fr'([{string.punctuation}])\B', r' \1', s).split()
['Hello', ',', 'my', 'name', 'is', 'John', '.', "What's", 'your', 'name', '?']

In Python 2:

>>> re.sub(r'([%s])\B' % string.punctuation, r' \1', s).split()
['Hello', ',', 'my', 'name', 'is', 'John', '.', "What's", 'your', 'name', '?']
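Note that string.punctuation contains characters that are special inside a regex character class (], \, ^, -); the class above happens to work as written, but if you want to be safe you can escape it first. A small variant of the same call, assuming Python 3:

>>> re.sub(fr'([{re.escape(string.punctuation)}])\B', r' \1', s).split()
['Hello', ',', 'my', 'name', 'is', 'John', '.', "What's", 'your', 'name', '?']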

Word tokenisation is not as trivial as it sounds. The previous answers using regular expressions or string replacement won't always deal with things such as acronyms or abbreviations (e.g. a.m., p.m., N.Y., D.I.Y., A.D., B.C., e.g., etc., i.e., Mr., Ms., Dr.). These will be split into separate tokens (e.g. B, ., C, .) by such approaches unless you write more complex patterns to handle such cases (but there will always be annoying exceptions). You will also have to decide what to do with other punctuation like " and ', $, %, things such as email addresses and URLs, sequences of digits (e.g. 5,000.99, 33.3%), hyphenated words (e.g. pre-processing, avant-garde), names that include punctuation (e.g. O'Neill), contractions (e.g. aren't, can't, let's), the English possessive marker ('s), etc., etc., etc.
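To see what this answer means, here is the finditer pattern from the earlier answer applied to a sentence with an abbreviation and a formatted number (an illustrative example, not from the original answers):

import re

s = "Dr. Smith paid $5,000.99, didn't he?"
# The simple pattern splits the abbreviation and the number into pieces:
print(re.findall(r"[^.,?!\s]+|[.,?!]", s))
# ['Dr', '.', 'Smith', 'paid', '$5', ',', '000', '.', '99', ',', "didn't", 'he', '?']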

I recommend using an NLP library to do this, as they should be set up to deal with most of these issues for you (although they do still make "mistakes" that you can try to fix). See:

The first three are full toolkits with many functionalities besides tokenisation. The last is a part-of-speech tagger that tokenises the text. These are just a few and there are other options out there, so try some out and see which works best for you. They will all tokenise your text differently, but in most cases (not sure about TreeTagger) you can modify their tokenisation decisions to correct mistakes.
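As a quick illustration, here is NLTK's standard word_tokenize on the sample sentence (a minimal sketch, assuming nltk and its punkt data are installed). Note that it follows the Penn Treebank convention and splits "What's" into "What" and "'s", which is exactly the kind of tokenisation decision that differs between libraries:

from nltk.tokenize import word_tokenize   # pip install nltk; then nltk.download('punkt')

print(word_tokenize("Hello, my name is John. What's your name?"))
# ['Hello', ',', 'my', 'name', 'is', 'John', '.', 'What', "'s", 'your', 'name', '?']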

TweetTokenizer from nltk can also be used for this.

from nltk.tokenize import TweetTokenizer

tokenizer = TweetTokenizer()
tokenizer.tokenize('''Hello, my name is John. What's your name?''')

# output
['Hello', ',', 'my', 'name', 'is', 'John', '.', "What's", 'your', 'name', '?']
