
Splitting a string after punctuation while including punctuation

I'm trying to split a string of words into a list of words with a regular expression. I'm still something of a beginner with regular expressions.

I'm using nltk.regexp_tokenize, which yields results that are close, but not quite what I want.

This is what I have so far:

>>> import re, codecs, nltk
>>> sentence = "détesté Rochard ! m'étais à... 'C'est hyper-cool.' :) :P"    
>>> pattern = r"""(?x)
    #words with internal hyphens
    | \w+(-\w+)*
    #ellipsis
    | \.\.\.
    #other punctuation tokens
    | [][.,;!?"'():-_`]
    """ 
>>> nltk.regexp_tokenize(sentence.decode("utf8"), pattern)
[u'd\xe9test\xe9', u'Rochard', u'!', u'm', u"'", u'\xe9tais', u'\xe0', u'qu', u"'", u'on', u'...', u"'", u'C', u"'", u'est', u'hyper-cool', u'.', u"'", u':', u')', u':', u'P']

I would like the output to be as follows:

[u'd\xe9test\xe9', u'Rochard', u'!', u"m'", u'\xe9tais', u'\xe0', u"qu'", u'on', u'...', u"'", u"C'", u'est', u'hyper-cool', u'.', u"'", u':)', u':P']

I have a workaround for the "emoticons", so what I'm most concerned with are quotes.

It seems that the desired output is not consistent with your input sentence:

  1. [u"qu'", u'on']: I can't figure out where these two matches come from; they do not appear in your sentence.
  2. Why is u'.' not part of u'hyper-cool'? (Assuming you want the punctuation as part of the word.)
  3. Why is u"'" not part of u"C'"? (Assuming you want the punctuation as part of the word.)

Also, if you just want a regex split, is there any reason why you are using nltk apart from splitting the lines? I have no experience with nltk, so I will propose just a regex solution.

>>> sentence
u"d\xe9test\xe9 Rochard ! m'\xe9tais \xe0... 'C'est hyper-cool.' :) :P"
>>> pattern=re.compile(
    u"(" #Capturing Group
    "(?:" #Non Capturing
    "[\.\.\.\]\[\.,;\!\?\"\'\(\):-_`]?" #0-1 punctuation
    "[\w\-]+"                           #Alphanumeric Unicode Word with hypen
    "[\.\.\.\]\[\.,;\!\?\"\'\(\):-_`]?" #0-1 punctuation
    ")"
    "|(?:[\.\.\.\]\[\.,;\!\?\"\'\(\):-_`]+)" #1- punctuation
     ")",re.UNICODE)
>>> pattern.findall(sentence)
[u'd\xe9test\xe9', u'Rochard', u'!', u"m'", u'\xe9tais', u'\xe0.', u'..', u"'C'", u'est', u'hyper-cool.', u"'", u':)', u':P']

See if this works for you.
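For readers on Python 3, here is a minimal sketch that reproduces the output the question asks for, apart from the u"qu'" and u'on' tokens, which are not in the input sentence. It is not the asker's or the answerer's pattern. The emoticon alternative is an illustrative assumption (the question says a separate workaround already exists), and the names TOKEN_RE and tokenize are made up for this example. The hyphen is placed last in the punctuation class so that it is a literal hyphen; written as `:-_`, as in the patterns above, it is a range from ':' to '_' that includes the uppercase letters and does not include '-'.

```python
import re

# Alternatives are tried left to right, so the more specific ones come first.
TOKEN_RE = re.compile(r"""
    [:;]-?[)(DPp]          # a few simple emoticons such as :) and :P (assumed set)
  | \.\.\.                 # ellipsis
  | \w+'(?=\w)             # elided word keeps its apostrophe: m', C', qu'
  | \w+(?:-\w+)*           # words, optionally with internal hyphens
  | [\]\[.,;!?"'():_`-]    # any other single punctuation mark (hyphen last = literal)
""", re.VERBOSE)


def tokenize(text):
    """Return the tokens of `text`; whitespace between tokens is skipped."""
    return TOKEN_RE.findall(text)


sentence = "détesté Rochard ! m'étais à... 'C'est hyper-cool.' :) :P"
print(tokenize(sentence))
```

In Python 3, `\w` matches Unicode letters in str patterns by default, so no re.UNICODE flag or decode step is needed. One trade-off of the elision rule: it also splits English contractions, so "don't" would become "don'" and "t".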

If you need more information on capturing groups, non-capturing groups, character classes, Unicode matching and findall, I suggest you take a look at the documentation for Python's re package. Also, I am not sure whether the way you are continuing the string across multiple lines is appropriate in this scenario. If you need more information on splitting a string across lines (not multi-line strings), please have a look at this.
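To illustrate the two ways of spreading a pattern over several lines, here is a small sketch in Python 3. The pattern is a cut-down example, not the one from the question, and the variable names are hypothetical. The first form relies on adjacent string literals being joined inside parentheses; the second uses a single triple-quoted string with re.VERBOSE, where whitespace and # comments inside the pattern are ignored.

```python
import re

# Form 1: adjacent string literals are concatenated by the parser,
# so each fragment can carry an ordinary Python comment.
concatenated = re.compile(
    r"\w+(?:-\w+)*"   # word with optional internal hyphens
    r"|\.\.\."        # ellipsis
)

# Form 2: one multi-line string; re.VERBOSE makes the regex engine
# ignore the layout whitespace and the inline comments.
verbose = re.compile(r"""
    \w+(?:-\w+)*      # word with optional internal hyphens
  | \.\.\.            # ellipsis
""", re.VERBOSE)

sample = "hyper-cool..."
print(concatenated.findall(sample))
print(verbose.findall(sample))
```

Both forms compile to equivalent patterns here. Without re.VERBOSE (or an inline `(?x)`), the newlines and comments in a triple-quoted pattern would be treated as literal characters to match.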
