Splitting sentences into words in Python

Question

I am trying to split sentences into words.

words = content.lower().split()

This gives me a list of words like:

'evening,', 'and', 'there', 'was', 'morning--the', 'first', 'day.'

And with this code:

def clean_up_list(word_list):
    clean_word_list = []
    for word in word_list:
        symbols = "~!@#$%^&*()_+`{}|\"?><`-=\\][';/.,']"
        for i in range(0, len(symbols)):
            word = word.replace(symbols[i], "")
        if len(word) > 0:
            clean_word_list.append(word)
    return clean_word_list

I get something like:

'evening', 'and', 'there', 'was', 'morningthe', 'first', 'day'

If you look at the word "morningthe" in the list, it used to have "--" between the two words. Is there any way I can split it into two words, "morning" and "the"?

Answer 1

I would suggest a regex-based solution:

import re

def to_words(text):
    return re.findall(r'\w+', text)

This finds all words, that is, runs of word characters (letters, digits, and underscores), ignoring symbols, separators, and whitespace.

>>> to_words("The morning-the evening")
['The', 'morning', 'the', 'evening']

Note that if you're looping over the words, using re.finditer, which returns an iterator of match objects, is probably better, as you don't have to store the whole list of words at once.
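A minimal sketch of the re.finditer variant mentioned above. The helper name iter_words is made up for illustration, and the sample sentence is the one from the question. Each match is consumed one at a time, so the full list of words is never built.

```python
import re

def iter_words(text):
    # Hypothetical helper: lazily yield each run of word characters.
    for match in re.finditer(r'\w+', text):
        yield match.group()

# Words are produced one by one as the loop asks for them.
for word in iter_words("evening, and there was morning--the first day."):
    print(word)
```

If you need the list after all, list(iter_words(text)) gives the same result as re.findall.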

Answer 2

Alternatively, you can use itertools.groupby along with str.isalpha to extract the alphabetic-only words from the string:

>>> from itertools import groupby
>>> sentence = 'evening, and there was morning--the first day.'

>>> [''.join(j) for i, j in groupby(sentence, str.isalpha) if i]
['evening', 'and', 'there', 'was', 'morning', 'the', 'first', 'day']

PS: The regex-based solution is much cleaner. I mention this only as a possible alternative.

Specific to OP: If all you want is to also split on -- in the resulting list, you can first replace each hyphen '-' with a space ' ' before performing the split. Hence, your code should be:

words = content.lower().replace('-', ' ').split()

words will then hold 'morning' and 'the' as separate entries, ready to pass through your clean_up_list.
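A minimal sketch of this approach end to end, using the sample sentence from the question. The name split_words is hypothetical, and stripping with string.punctuation is an assumption standing in for the asker's clean_up_list; it only removes punctuation at the edges of each word.

```python
import string

def split_words(content):
    # Turn hyphens into spaces first so "morning--the" becomes two tokens.
    words = content.lower().replace('-', ' ').split()
    # Strip leading/trailing punctuation and drop tokens that end up empty.
    stripped = (word.strip(string.punctuation) for word in words)
    return [word for word in stripped if word]

print(split_words('evening, and there was morning--the first day.'))
# ['evening', 'and', 'there', 'was', 'morning', 'the', 'first', 'day']
```

Because only the edges are stripped, an internal apostrophe such as the one in "don't" is kept, which differs from the original clean_up_list.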

Answer 3

Trying to do this with regexes will drive you crazy, e.g.

>>> re.findall(r'\w+', "Don't read O'Rourke's books!")
['Don', 't', 'read', 'O', 'Rourke', 's', 'books']

Definitely look at the nltk package.
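To show what handling just this one case by hand involves, here is a rough sketch of a pattern that keeps internal apostrophes. It is an illustration only, not what nltk does: it assumes ASCII letters and the straight apostrophe, and the name words_with_apostrophes is made up for this example.

```python
import re

# Assumption: a word is a run of ASCII letters, optionally followed by
# more letter runs that are each joined on with a straight apostrophe.
WORD_PATTERN = re.compile(r"[A-Za-z]+(?:'[A-Za-z]+)*")

def words_with_apostrophes(text):
    return WORD_PATTERN.findall(text)

print(words_with_apostrophes("Don't read O'Rourke's books!"))
# ["Don't", 'read', "O'Rourke's", 'books']
```

Curly quotes, hyphenated words, abbreviations and so on would each need another tweak, which is the point being made here.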

Answer 4

Besides the solutions already given, you could also improve your clean_up_list function to do a better job.

def clean_up_list(word_list):
    clean_word_list = []
    # Define the symbols outside the loop so the string doesn't
    # have to be created again for every word.
    symbols = "~!@#$%^&*()_+`{}|\"?><`-=\\][';/.,']"

    for word in word_list:
        current_word = ''
        for index in range(len(word)):
            if word[index] in symbols:
                if current_word:
                    clean_word_list.append(current_word)
                    current_word = ''
            else:
                current_word += word[index]

        if current_word:
            # Append possible last current_word
            clean_word_list.append(current_word)

    return clean_word_list

Actually, you could apply the body of the for word in word_list: loop to the whole sentence and get the same result, provided whitespace is also treated as a separator (the symbols string contains no space).
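A minimal sketch of running that loop over a whole sentence. The function name split_sentence is hypothetical, and the extra char.isspace() check is an assumption added here: the symbols string has no space in it, so without that check the spaces would stay inside the words.

```python
SYMBOLS = "~!@#$%^&*()_+`{}|\"?><`-=\\][';/.,']"

def split_sentence(sentence):
    words = []
    current_word = ''
    for char in sentence:
        # A symbol or any whitespace ends the word collected so far.
        if char in SYMBOLS or char.isspace():
            if current_word:
                words.append(current_word)
                current_word = ''
        else:
            current_word += char
    if current_word:
        # Append a possible last word with no separator after it.
        words.append(current_word)
    return words

print(split_sentence('evening, and there was morning--the first day.'))
# ['evening', 'and', 'there', 'was', 'morning', 'the', 'first', 'day']
```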

Answer 5

You could also do this:

import re

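def word_list(text):
    # Raw string so \W reaches the regex engine unchanged; filter(None, ...)
    # drops the empty strings re.split leaves at the edges.
    return list(filter(None, re.split(r'\W+', text)))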

print(word_list("Here we go round the mulberry-bush! And even---this and!!!this."))

Returns:

['Here', 'we', 'go', 'round', 'the', 'mulberry', 'bush', 'And', 'even', 'this', 'and', 'this']

5 answers

Answer 1
3 2017-01-27 22:02:14

Answer 2
3 2017-01-27 22:05:44

Answer 3
1 2017-01-27 22:23:26

Answer 4
0 2017-01-27 22:33:06

Answer 5
0 2017-01-28 03:45:08
