
Extracting words from a string, removing punctuation and returning a list with separated words

I was wondering how to implement a function get_words() that returns the words in a string as a list, stripping away the punctuation.

The way I would like to implement it is to replace every character that is not in string.ascii_letters with '' and return the result of .split().

def get_words(text):
    '''The function should take one argument which is a string'''
    return text.split()

For example:

>>> get_words('Hello world, my name is...James!')

returns:

['Hello', 'world', 'my', 'name', 'is', 'James']

This has nothing to do with splitting and punctuation; you just care about the letters (and numbers), and just want a regular expression:

import re

def get_words(text):
    # raw string so that \w reaches the regex engine unchanged
    return re.compile(r'\w+').findall(text)

Demo:

>>> re.compile(r'\w+').findall('Hello world, my name is...James the 2nd!')
['Hello', 'world', 'my', 'name', 'is', 'James', 'the', '2nd']

If you don't care about numbers, replace \w with [A-Za-z] for just letters, or [A-Za-z'] to include contractions, etc. There are probably fancier ways to match alphabetic, non-numeric character classes (e.g. letters with accents) with other regexes.
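As a rough sketch of those alternatives, the patterns can be compared side by side. The `pattern` parameter and the sample sentence are made up for illustration; `[^\W\d_]` is one common way to say "a word character that is neither a digit nor an underscore", which in Python 3 covers accented letters as well.

```python
import re

def get_words(text, pattern=r"\w+"):
    """Return every non-overlapping match of `pattern` in `text`."""
    return re.findall(pattern, text)

# "\u00c9" is the precomposed character É (illustrative sample text).
sample = "Don't stop, \u00c9lise: it's the 2nd try_again!"

# Default \w+ : letters, digits and underscore (Unicode-aware for str in Python 3).
words_default = get_words(sample)
# ASCII letters only: digits, apostrophes and accented letters act as separators.
words_ascii = get_words(sample, r"[A-Za-z]+")
# ASCII letters plus the apostrophe keeps contractions together.
words_contractions = get_words(sample, r"[A-Za-z']+")
# Any Unicode letter: a word character that is not a digit or underscore.
words_letters = get_words(sample, r"[^\W\d_]+")
```

Which pattern is appropriate depends on what should count as a word; note that the ASCII-only classes cut "Élise" down to "lise".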


I almost answered this question here: Split Strings with Multiple Delimiters?

But your question is actually under-specified: do you want 'this is: an example' to be split into:

  • ['this', 'is', 'an', 'example']
  • or ['this', 'is', '', 'an', 'example']?

I assumed it was the first case.


['this', 'is', 'an', 'example'] is what I want. Is there a method without importing regex? If we just replace the non-ascii_letters with '' and then split the string into a list of words, would that work? – James Smith 2 mins ago

The regexp is the most elegant, but yes, you could do this as follows:

def get_words(text):
    """
        Returns a list of words, where a word is defined as a
        maximally connected substring of alphanumeric characters,
        as defined by str.isalnum()

        >>> get_words('Hello world, my name is... Élise!')  # works in python3
        ['Hello', 'world', 'my', 'name', 'is', 'Élise']
    """
    return ''.join((c if c.isalnum() else ' ') for c in text).split()

or use .isalpha() in place of .isalnum() if digits should not count as part of a word.
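The comment above asked about using string.ascii_letters directly. A minimal sketch of that variant follows; the name get_words_ascii is made up for illustration. The one change from the idea in the comment is that unwanted characters are replaced with a space rather than '', because otherwise 'is...James' would fuse into 'isJames' before the split.

```python
import string

# Set membership is faster than scanning the ascii_letters string each time.
ASCII_LETTERS = frozenset(string.ascii_letters)

def get_words_ascii(text):
    """Split text into runs of ASCII letters without importing re."""
    # Replace every non-letter with a space (not ''), so words stay separated.
    cleaned = ''.join(c if c in ASCII_LETTERS else ' ' for c in text)
    # split() with no argument collapses runs of whitespace.
    return cleaned.split()
```

With this definition, digits and accented letters are treated as separators, so 'the 2nd' yields ['the', 'nd'].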


Sidenote: You could also do the following, though it requires importing another standard-library module:

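from itertools import groupby

# groupby is generally always overkill and makes for unreadable code
# ... but is fun

def get_words(text):
    # groupby yields (key, run) pairs for consecutive characters sharing
    # the same isalnum() value; keep only the alphanumeric runs
    return [
        ''.join(chars)
        for is_word, chars in groupby(text, lambda c: c.isalnum())
        if is_word
    ]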

If this is homework, they're probably looking for something imperative, like a two-state finite state machine where the state is "was the last character a letter?" and, whenever the state changes from letter to non-letter, you output a word. Don't do that; it's not a good way to program (though sometimes the abstraction is useful).
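For illustration only, the two-state machine described above might look roughly like this. The function name get_words_fsm and the choice of str.isalpha() as the "letter" test are assumptions, not something the answer specifies.

```python
def get_words_fsm(text):
    """Two-state scanner: the state is whether the previous character was a letter."""
    words = []
    current = []      # characters of the word being built
    in_word = False   # state: was the last character a letter?
    for ch in text:
        if ch.isalpha():
            current.append(ch)
            in_word = True
        elif in_word:
            # transition letter -> non-letter: emit the finished word
            words.append(''.join(current))
            current = []
            in_word = False
    if in_word:
        # flush a word that runs to the end of the input
        words.append(''.join(current))
    return words
```

The final flush is the step that is easiest to forget: without it, a text that ends in a letter would lose its last word.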

Try using re:

>>> import re
>>> [w for w in re.split(r'\W', 'Hello world, my name is...James!') if w]
['Hello', 'world', 'my', 'name', 'is', 'James']

Although I'm not sure that it will catch all your use cases.

If you want to solve it another way, you can specify the characters that you want to appear in the result:

>>> import string
>>> re.findall('[%s]+' % string.ascii_letters, 'Hello world, my name is...James!')
['Hello', 'world', 'my', 'name', 'is', 'James']
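The two approaches in this answer agree on the example sentence but differ on digits and underscores. The sketch below wraps each in a function to make that visible; the names words_by_split and words_by_class and the `allowed` parameter are hypothetical. re.escape is added as a precaution in case the allowed set contains characters such as '-' or ']' that are special inside a character class; it changes nothing for plain ASCII letters.

```python
import re
import string

def words_by_split(text):
    # Split on each non-word character, then drop the empty strings
    # produced between adjacent delimiters.
    return [w for w in re.split(r'\W', text) if w]

def words_by_class(text, allowed=string.ascii_letters):
    # Keep only runs made of the characters listed in `allowed`.
    return re.findall('[%s]+' % re.escape(allowed), text)
```

On "James the 2nd, a.k.a. big_jim!" the first version keeps '2nd' and 'big_jim' intact, while the second yields 'nd', 'big' and 'jim', because digits and the underscore are not in string.ascii_letters.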

All you need is a tokenizer. Have a look at nltk, and especially at WordPunctTokenizer. Note that it returns runs of punctuation as separate tokens, so you would still need to filter those out.
