Splitting the sentences in python
I am trying to split the sentences into words.
words = content.lower().split()
This gives me a list of words like:
'evening,', 'and', 'there', 'was', 'morning--the', 'first', 'day.'
and with this code:
def clean_up_list(word_list):
    clean_word_list = []
    for word in word_list:
        symbols = "~!@#$%^&*()_+`{}|\"?><`-=\][';/.,']"
        for i in range(0, len(symbols)):
            word = word.replace(symbols[i], "")
        if len(word) > 0:
            clean_word_list.append(word)
I get something like:
'evening', 'and', 'there', 'was', 'morningthe', 'first', 'day'
If you see the word "morningthe" in the list, it used to have "--" between the words. Now, is there any way I can split it into two words, like "morning", "the"?
I would suggest a regex-based solution:
import re
def to_words(text):
    return re.findall(r'\w+', text)
This looks for all words: runs of word characters, ignoring symbols, separators and whitespace.
>>> to_words("The morning-the evening")
['The', 'morning', 'the', 'evening']
Note that if you're looping over the words, using re.finditer, which returns a generator, is probably better, as you don't have to store the whole list of words at once.
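For example, a minimal sketch using re.finditer on the same sample text:

```python
import re

# re.finditer yields match objects lazily: each word is produced on
# demand instead of the whole list being built in memory at once
for match in re.finditer(r'\w+', "The morning-the evening"):
    print(match.group())
```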
Alternatively, you may also use itertools.groupby along with str.isalpha to extract alphabetic-only words from the string:
>>> from itertools import groupby
>>> sentence = 'evening, and there was morning--the first day.'
>>> [''.join(j) for i, j in groupby(sentence, str.isalpha) if i]
['evening', 'and', 'there', 'was', 'morning', 'the', 'first', 'day']
PS: The regex-based solution is much cleaner. I have mentioned this as a possible alternative to achieve the same result.
Specific to the OP: if all you want is to also split on -- in the resultant list, then you may first replace the hyphens '-' with a space ' ' before performing the split. Hence, your code should be:
words = content.lower().replace('-', ' ').split()
where words will hold the value you desire.
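For example, with the sentence from the question (note that punctuation other than the hyphens still sticks to the words):

```python
content = 'Evening, and there was morning--the first day.'
# replacing hyphens with spaces before splitting breaks "morning--the"
# apart; other punctuation (',' and '.') remains attached to the words
words = content.lower().replace('-', ' ').split()
print(words)
```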
Trying to do this with regexes will send you crazy, e.g.
>>> re.findall(r'\w+', "Don't read O'Rourke's books!")
['Don', 't', 'read', 'O', 'Rourke', 's', 'books']
Definitely look at the nltk package.
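Short of reaching for nltk, one sketch of a workaround (assuming contractions and possessives should stay in one piece) is a pattern that allows internal apostrophes:

```python
import re

# \w+ matches a run of word characters; (?:'\w+)* optionally extends
# the match across internal apostrophes, so "Don't" stays one token
def to_words(text):
    return re.findall(r"\w+(?:'\w+)*", text)

print(to_words("Don't read O'Rourke's books!"))
```

This is only a partial fix; quotes at word boundaries and other edge cases are exactly why nltk's tokenizers exist.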
Besides the solutions already given, you could also improve your clean_up_list function to do a better job:
def clean_up_list(word_list):
    clean_word_list = []
    # Move the symbols string out of the loop so that it doesn't
    # have to be initiated every time.
    symbols = "~!@#$%^&*()_+`{}|\"?><`-=\][';/.,']"
    for word in word_list:
        current_word = ''
        for index in range(len(word)):
            if word[index] in symbols:
                if current_word:
                    clean_word_list.append(current_word)
                    current_word = ''
            else:
                current_word += word[index]
        if current_word:
            # Append the possible last current_word
            clean_word_list.append(current_word)
    return clean_word_list
Actually, you could apply the body of for word in word_list: to the whole sentence and get the same result.
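A minimal sketch of that idea, scanning the whole sentence character by character and treating whitespace like just another separator:

```python
# separators: the symbol set plus the space character
symbols = set("~!@#$%^&*()_+`{}|\"?><-=][';/., ")

def clean_up(sentence):
    words, current = [], ''
    for ch in sentence:
        if ch in symbols:
            # a separator ends the word being built, if any
            if current:
                words.append(current)
            current = ''
        else:
            current += ch
    if current:
        # append the possible last word
        words.append(current)
    return words

print(clean_up('evening, and there was morning--the first day.'))
```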
You could also do this:
import re
def word_list(text):
    return list(filter(None, re.split(r'\W+', text)))
print(word_list("Here we go round the mulberry-bush! And even---this and!!!this."))
Returns:
['Here', 'we', 'go', 'round', 'the', 'mulberry', 'bush', 'And', 'even', 'this', 'and', 'this']