How to split a sentence string into words, but also make punctuation a separate element
I'm currently trying to tokenize some language data using Python, and I was curious whether there is an efficient or built-in method for splitting sentence strings into separate words while also making punctuation characters separate elements. For example:
"Hello, my name is John. What's your name?"
If I used split() on this sentence then I would get:
['Hello,', 'my', 'name', 'is', 'John.', "What's", 'your', 'name?']
What I want to get is:
['Hello', ',', 'my', 'name', 'is', 'John', '.', "What's", 'your', 'name', '?']
I've tried approaches such as searching the string for punctuation, storing the indices, removing the punctuation characters from the string, splitting the string, and then inserting the punctuation back in at the right places, but this seems too inefficient, especially when dealing with large corpora.
Does anybody know if there's a more efficient way to do this?
Thank you.
You can do a trick:
text = "Hello, my name is John. What's your name?"
text = text.replace(",", " , ") # Add an space before and after the comma
text = text.replace(".", " . ") # Add an space before and after the point
text = text.replace(" ", " ") # Remove possible double spaces
mListtext.split(" ") # Generates your list
Or just this, reading the sentence from input():

mList = input().replace(",", " , ").replace(".", " . ").replace("  ", " ").split(" ")
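Note that this trick only handles commas and periods, so the trailing ? in the sample sentence would stay glued to "name". A minimal extension of the same idea (the ? handling is my addition, not part of the original answer), using split() with no arguments so the double-space cleanup becomes unnecessary:

text = "Hello, my name is John. What's your name?"

# Pad each punctuation mark with spaces, then let split() with no
# arguments swallow any runs of whitespace.
for punct in (",", ".", "?"):
    text = text.replace(punct, f" {punct} ")

print(text.split())
# ['Hello', ',', 'my', 'name', 'is', 'John', '.', "What's", 'your', 'name', '?']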
Here is an approach using re.finditer which at least seems to work with the sample data you provided:
import re

inp = "Hello, my name is John. What's your name?"
parts = []
for match in re.finditer(r'[^.,?!\s]+|[.,?!]', inp):
parts.append(match.group())
print(parts)
Output:
['Hello', ',', 'my', 'name', 'is', 'John', '.', "What's", 'your', 'name', '?']
The idea here is to match one of the following two patterns:
[^.,?!\s]+ which matches a run of one or more non-punctuation, non-whitespace characters (a word)
[.,?!] which matches a single punctuation character
Presumably anything which is not whitespace or punctuation should be a matching word/term in the sentence.
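Since every token is itself a match, the explicit loop can also be collapsed with re.findall, which returns the matched strings directly; a minimal equivalent sketch:

import re

inp = "Hello, my name is John. What's your name?"

# re.findall returns each match as a string, so no loop is needed.
print(re.findall(r"[^.,?!\s]+|[.,?!]", inp))
# ['Hello', ',', 'my', 'name', 'is', 'John', '.', "What's", 'your', 'name', '?']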
Note that the really nice way to solve this problem would be to do a regex split on punctuation or whitespace. But re.split only gained the ability to split on zero-width lookarounds in Python 3.7, so on older versions we are forced to use re.finditer instead.
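For completeness, on Python 3.7 or later the zero-width split does work; a small sketch (my addition, not part of the original answer):

import re

s = "Hello, my name is John. What's your name?"

# Split on runs of whitespace, or at the empty position just before a
# punctuation mark (splitting on empty matches requires Python 3.7+).
print(re.split(r"\s+|(?=[.,?!])", s))
# ['Hello', ',', 'my', 'name', 'is', 'John', '.', "What's", 'your', 'name', '?']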
You can use re.sub to put a space in front of every character defined in string.punctuation, and finally use str.split to split the words. The \B (non-word-boundary) in the pattern restricts the substitution to punctuation that is not directly followed by a word character, so intra-word punctuation such as the apostrophe in "What's" is left alone:
>>> s = "Hello, my name is John. What's your name?"
>>>
>>> import string, re
>>> re.sub(fr'([{string.punctuation}])\B', r' \1', s).split()
['Hello', ',', 'my', 'name', 'is', 'John', '.', "What's", 'your', 'name', '?']
In Python 2:
>>> re.sub(r'([%s])\B' % string.punctuation, r' \1', s).split()
['Hello', ',', 'my', 'name', 'is', 'John', '.', "What's", 'your', 'name', '?']
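As a side note, string.punctuation contains characters that are special inside a regex character class (], \, ^, -); interpolating it raw happens to parse here, but a slightly more defensive variant (my variation, not from the original answer) runs it through re.escape first:

import re, string

s = "Hello, my name is John. What's your name?"

# re.escape backslash-escapes every metacharacter, making the class robust.
pattern = r"([%s])\B" % re.escape(string.punctuation)
print(re.sub(pattern, r" \1", s).split())
# ['Hello', ',', 'my', 'name', 'is', 'John', '.', "What's", 'your', 'name', '?']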
Word tokenisation is not as trivial as it sounds. The previous answers using regular expressions or string replacement won't always deal with such things as acronyms or abbreviations (e.g. a.m., p.m., N.Y., D.I.Y., A.D., B.C., e.g., etc., i.e., Mr., Ms., Dr.). These will be split into separate tokens (e.g. B.C. becomes B, ., C, .) by such approaches unless you write more complex patterns to deal with such cases (but there will always be annoying exceptions). You will also have to decide what to do with other punctuation like " and ', $, %, such things as email addresses and URLs, sequences of digits (e.g. 5,000.99, 33.3%), hyphenated words (e.g. pre-processing, avant-garde), names that include punctuation (e.g. O'Neill), contractions (e.g. aren't, can't, let's), the English possessive marker ('s), etc., etc.
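To make the failure mode concrete, here is the finditer pattern from above applied to a sentence with abbreviations (an illustrative example of my own):

import re

s = "Dr. O'Neill lived in 400 B.C."

# Abbreviation periods are treated like any other punctuation.
print(re.findall(r"[^.,?!\s]+|[.,?!]", s))
# ['Dr', '.', "O'Neill", 'lived', 'in', '400', 'B', '.', 'C', '.']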
I recommend using an NLP library to do this, as they should be set up to deal with most of these issues for you (although they do still make "mistakes" that you can try to fix). See:
The first three are full toolkits with many functionalities besides tokenisation. The last is a part-of-speech tagger that tokenises the text. These are just a few and there are other options out there, so try some out and see which works best for you. They will all tokenise your text differently, but in most cases (not sure about TreeTagger) you can modify their tokenisation decisions to correct mistakes.
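For example, NLTK's word_tokenize handles the abbreviation and contraction cases above reasonably well (a sketch assuming nltk is installed and its punkt tokenizer data downloaded; note that its Treebank-style decisions differ from the output earlier, e.g. "What's" becomes What and 's):

from nltk.tokenize import word_tokenize  # may first require nltk.download('punkt')

s = "Hello, my name is John. What's your name? Ask Dr. Smith."
print(word_tokenize(s))
# ['Hello', ',', 'my', 'name', 'is', 'John', '.', 'What', "'s", 'your',
#  'name', '?', 'Ask', 'Dr.', 'Smith', '.']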
TweetTokenizer from nltk can also be used for this:
from nltk.tokenize import TweetTokenizer
tokenizer = TweetTokenizer()
tokenizer.tokenize('''Hello, my name is John. What's your name?''')
# output:
['Hello', ',', 'my', 'name', 'is', 'John', '.', "What's", 'your', 'name', '?']