Splitting a sentence into words in Python

Question

I am trying to split a sentence into words.

words = content.lower().split()

This gives me a list of words such as:

'evening,', 'and', 'there', 'was', 'morning--the', 'first', 'day.'

After cleaning it with this code:

def clean_up_list(word_list):
    clean_word_list = []
    for word in word_list:
        symbols = "~!@#$%^&*()_+`{}|\"?><`-=\\][';/.,']"
        # Delete every symbol from the word, one symbol at a time.
        for i in range(0, len(symbols)):
            word = word.replace(symbols[i], "")
        if len(word) > 0:
            clean_word_list.append(word)
    return clean_word_list

I get something like:

'evening', 'and', 'there', 'was', 'morningthe', 'first', 'day'

Notice the word "morningthe" in the list: it used to have "--" between the two words. Is there any way to split it into the two words "morning" and "the"?

Answer 1

I would suggest a regex-based solution:

import re

def to_words(text):
    return re.findall(r'\w+', text)

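This finds all words, that is, runs of word characters (letters, digits and underscores), ignoring symbols, separators and whitespace.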

>>> to_words("The morning-the evening")
['The', 'morning', 'the', 'evening']

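Note that if you are only looping over the words, re.finditer, which returns an iterator of match objects, may be a better choice because you never store the whole list of words at once.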
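As a minimal sketch of the re.finditer variant mentioned above (the helper name iter_words is made up for illustration and is not part of the original answer):

```python
import re

def iter_words(text):
    """Yield words one at a time instead of building a full list."""
    # re.finditer yields match objects lazily; group() returns the matched text.
    for match in re.finditer(r'\w+', text):
        yield match.group()

# Words are produced on demand while looping.
for word in iter_words("The morning-the evening"):
    print(word)
```

Each call to match.group() returns the same string that re.findall would have placed in its list, so the two approaches should differ only in memory use.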

Answer 2

Alternatively, you can use itertools.groupby together with str.isalpha to extract the alphabetic-only words from the string:

>>> from itertools import groupby
>>> sentence = 'evening, and there was morning--the first day.'

>>> [''.join(j) for i, j in groupby(sentence, str.isalpha) if i]
['evening', 'and', 'there', 'was', 'morning', 'the', 'first', 'day']

PS: The regex-based solution is cleaner. I mention this only as a possible alternative.

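Specific to the OP: if all you want is for "--" to split words in the resulting list as well, you can replace each hyphen '-' with a space ' ' before performing the split. Your code would then be: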

words = content.lower().replace('-', ' ').split()

words will then hold the value you want.
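A possible end-to-end sketch of this idea follows. It assumes that stripping punctuation from the ends of each token is enough, which is simpler than the question's clean_up_list (that one also deletes symbols inside a word); the function name split_words is hypothetical.

```python
import string

def split_words(content):
    # Turn hyphens into spaces first so "morning--the" becomes two tokens.
    words = content.lower().replace('-', ' ').split()
    # Assumption: only leading/trailing punctuation needs to be removed.
    stripped = (word.strip(string.punctuation) for word in words)
    # Drop tokens that were nothing but punctuation.
    return [word for word in stripped if word]

print(split_words('evening, and there was morning--the first day.'))
```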

Answer 3

Trying to do this with regular expressions will drive you crazy, for example:

>>> re.findall(r'\w+', "Don't read O'Rourke's books!")
['Don', 't', 'read', 'O', 'Rourke', 's', 'books']

Definitely take a look at the nltk package.
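For simple cases where installing a tokenizer is not an option, one stopgap is a pattern that tolerates apostrophes inside a word. This is only a sketch under narrow assumptions (ASCII letters, straight apostrophes); the names below are invented for illustration, and it is no substitute for a real tokenizer such as the ones in nltk.

```python
import re

# One or more ASCII letters, optionally followed by apostrophe + letters groups.
WORD_WITH_APOSTROPHE = re.compile(r"[A-Za-z]+(?:'[A-Za-z]+)*")

def to_words_keep_apostrophes(text):
    # Apostrophes are kept only when they sit between letters.
    return WORD_WITH_APOSTROPHE.findall(text)

print(to_words_keep_apostrophes("Don't read O'Rourke's books!"))
```

Digits, accented letters and typographic apostrophes are deliberately left out of this pattern, which is exactly the sort of edge case that keeps piling up with hand-written regexes.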

Answer 4

In addition to the solutions already given, you could also improve your clean_up_list function so that it does a better job.

def clean_up_list(word_list):
    clean_word_list = []
    # Move the symbols out of the loop so that the string doesn't
    # have to be created on every iteration.
    symbols = "~!@#$%^&*()_+`{}|\"?><`-=\\][';/.,']"

    for word in word_list:
        current_word = ''
        for char in word:
            if char in symbols:
                # A symbol ends the current word, if there is one.
                if current_word:
                    clean_word_list.append(current_word)
                    current_word = ''
            else:
                current_word += char

        if current_word:
            # Append possible last current_word
            clean_word_list.append(current_word)

    return clean_word_list

In fact, you could apply the block inside for word in word_list: to the whole sentence and get the same result, provided whitespace is treated as a separator as well.
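A rough sketch of that whole-sentence variant is shown here. The name split_sentence and the choice to treat any whitespace as a separator are assumptions added for illustration; the symbol set is the one from the question with duplicates dropped.

```python
SYMBOLS = "~!@#$%^&*()_+`{}|\"?><-=\\][';/.,"

def split_sentence(sentence, symbols=SYMBOLS):
    words = []
    current_word = ''
    for char in sentence:
        # Assumption: whitespace ends a word just like a symbol does.
        if char in symbols or char.isspace():
            if current_word:
                words.append(current_word)
                current_word = ''
        else:
            current_word += char
    if current_word:
        # Append the last word if the sentence did not end on a separator.
        words.append(current_word)
    return words

print(split_sentence('evening, and there was morning--the first day.'))
```

Working on the raw sentence removes the need for the initial split() call, since the loop finds word boundaries itself.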

Answer 5

You could also do this:

import re

def word_list(text):
    # Split on runs of non-word characters; filter(None, ...) drops empty strings.
    return list(filter(None, re.split(r'\W+', text)))

print(word_list("Here we go round the mulberry-bush! And even---this and!!!this."))

Output:

['Here', 'we', 'go', 'round', 'the', 'mulberry', 'bush', 'And', 'even', 'this', 'and', 'this']

5 answers

Answer 1  3  2017-01-27 22:02:14
Answer 2  3  2017-01-27 22:05:44
Answer 3  1  2017-01-27 22:23:26
Answer 4  0  2017-01-27 22:33:06
Answer 5  0  2017-01-28 03:45:08
