
Splitting a string into words and punctuation

I'm trying to split a string up into words and punctuation, adding the punctuation to the list produced by the split.

For instance:

>>> c = "help, me"
>>> print c.split()
['help,', 'me']

What I really want the list to look like is:

['help', ',', 'me']

So, I want the string split at whitespace with the punctuation split from the words.

I've tried to parse the string first and then run the split:

>>> separatedPunctuation = ""
>>> for character in c:
...     if character in ".,;!?":
...             outputCharacter = " %s" % character
...     else:
...             outputCharacter = character
...     separatedPunctuation += outputCharacter
>>> print separatedPunctuation
help , me
>>> print separatedPunctuation.split()
['help', ',', 'me']

This produces the result I want, but is painfully slow on large files.

Is there a way to do this more efficiently?

This is more or less the way to do it:

>>> import re
>>> re.findall(r"[\w']+|[.,!?;]", "Hello, I'm a string!")
['Hello', ',', "I'm", 'a', 'string', '!']

The trick is not to think about where to split the string, but what to include in the tokens.

Caveats:

  • The underscore (_) is considered an inner-word character. Replace \w if you don't want that (see the sketch after this list).
  • This will not work with (single) quotes in the string.
  • Put any additional punctuation marks you want to use in the right half of the regular expression.
  • Anything not explicitly mentioned in the re is silently dropped.
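
For instance, a minimal sketch of the first caveat (the character class used here is an assumption; adjust it to your needs): swapping \w for an explicit class drops the underscore from the set of word characters, so the underscore itself is then silently dropped, per the last caveat.

>>> import re
>>> re.findall(r"[A-Za-z0-9']+|[.,!?;]", "snake_case, isn't it?")
['snake', 'case', ',', "isn't", 'it', '?']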

Here is a Unicode-aware version:

re.findall(r"\w+|[^\w\s]", text, re.UNICODE)

The first alternative catches sequences of word characters (as defined by Unicode, so "résumé" won't turn into ['r', 'sum']); the second catches individual non-word characters, ignoring whitespace.
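
A quick check of that claim (shown in Python 3, where patterns match Unicode word characters by default, so the re.UNICODE flag is implied):

>>> import re
>>> re.findall(r"\w+|[^\w\s]", "résumé, voilà!")
['résumé', ',', 'voilà', '!']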

Note that, unlike the top answer, this treats the single quote as separate punctuation (e.g. "I'm" -> ['I', "'", 'm']). This appears to be standard in NLP, so I consider it a feature.

If you are going to work in English (or some other common languages), you can use NLTK (there are many other tools to do this, such as FreeLing).

import nltk
nltk.download('punkt')
sentence = "help, me"
nltk.word_tokenize(sentence)
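
For the question's example, word_tokenize should return the desired list:

>>> nltk.word_tokenize("help, me")
['help', ',', 'me']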

Here's my entry.

I have my doubts as to how well this will hold up in terms of efficiency, or whether it catches all cases (note the "!!!" grouped together; this may or may not be a good thing).

>>> import re
>>> import string
>>> s = "Helo, my name is Joe! and i live!!! in a button; factory:"
>>> l = [item for item in map(string.strip, re.split(r"(\W+)", s)) if len(item) > 0]
>>> l
['Helo', ',', 'my', 'name', 'is', 'Joe', '!', 'and', 'i', 'live', '!!!', 'in', 'a', 'button', ';', 'factory', ':']
>>>

One obvious optimization would be to compile the regex beforehand (using re.compile) if you're going to be doing this on a line-by-line basis.
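
A minimal sketch of that optimization (the file name and the line-by-line loop are illustrative assumptions):

import re

word_re = re.compile(r"(\W+)")  # compiled once, reused for every line

with open("input.txt") as f:  # hypothetical input file
    for line in f:
        tokens = [t.strip() for t in word_re.split(line) if t.strip()]
        print(tokens)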


Here's a minor update to your implementation. If you're trying to do anything more detailed, I suggest looking into the NLTK that le dorfier suggested.

This might only be a little faster, since ''.join() is used in place of +=, and join is known to be faster.

import string

d = "Hello, I'm a string!"

result = []
word = ''

for char in d:
    if char not in string.whitespace:
        if char not in string.ascii_letters + "'":
            # punctuation: flush the current word, then emit the mark itself
            if word:
                result.append(word)
            result.append(char)
            word = ''
        else:
            word = ''.join([word, char])
    else:
        # whitespace ends the current word
        if word:
            result.append(word)
            word = ''

if word:
    # flush a trailing word in case the string ends in a letter
    result.append(word)

print result
['Hello', ',', "I'm", 'a', 'string', '!']

This worked for me:

import re

i = 'Sandra went to the hallway.!!'
l = re.split(r'(\W+?)', i)
print(l)

empty = ['', ' ']
l = [el for el in l if el not in empty]
print(l)

Output:
['Sandra', ' ', 'went', ' ', 'to', ' ', 'the', ' ', 'hallway', '.', '', '!', '', '!', '']
['Sandra', 'went', 'to', 'the', 'hallway', '.', '!', '!']

I came up with a way to tokenize all words and \W+ patterns using \b which doesn't need hardcoding:

>>> import re
>>> sentence = 'Hello, world!'
>>> tokens = [t.strip() for t in re.findall(r'\b.*?\S.*?(?:\b|$)', sentence)]
>>> tokens
['Hello', ',', 'world', '!']

Here .*?\S.*? is a pattern matching anything that is not a space, and $ is added to match the last token in a string if it's a punctuation symbol.

Note the following though -- this will group punctuation that consists of more than one symbol:

>>> print [t.strip() for t in re.findall(r'\b.*?\S.*?(?:\b|$)', '"Oh no", she said')]
['Oh', 'no', '",', 'she', 'said']

Of course, you can find and split such groups with:

>>> for token in [t.strip() for t in re.findall(r'\b.*?\S.*?(?:\b|$)', '"You can", she said')]:
...     print re.findall(r'(?:\w+|\W)', token)

['You']
['can']
['"', ',']
['she']
['said']

I think you can find all the help you can imagine in the NLTK, especially since you are using Python. There's a good comprehensive discussion of this issue in the tutorial.

Try this:

string_big = "One of Python's coolest features is the string format operator  This operator is unique to strings"
my_list = []
x = len(string_big)
position_ofspace = 0
while position_ofspace < x:
    # scan forward to the next space (or to the end of the string)
    for i in range(position_ofspace, x):
        if string_big[i] == ' ':
            break
    word = string_big[position_ofspace:(i + 1)].strip()
    if word:  # skip empty slices produced by consecutive spaces
        print word
        my_list.append(word)
    position_ofspace = i + 1

print my_list

Have you tried using a regex?

http://docs.python.org/library/re.html#re-syntax


By the way, why do you need the "," in the second one? You will know that it comes after each piece of text, i.e.

[0]

","

[1]

","

So if you want to add the "," you can just do it after each iteration when you use the array.

In case you are not allowed to import anything, use this!

word = "Hello,there"
word = word.replace("," , " ," )
word = word.replace("." , " .")
return word.split()
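
A sketch generalizing the same trick to a few more punctuation marks, still with no imports (the set of marks is an assumption; extend it as needed):

word = "help, me. now!"
for mark in ".,;!?":  # assumed punctuation set
    word = word.replace(mark, " " + mark)
print(word.split())
# ['help', ',', 'me', '.', 'now', '!']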
