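How to remove single characters (letters) from a corpus using Python
I want to remove every single-character token from each document in my corpus. These are typically typos or stray non-English letters. For example: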
corpus=['I like this d hotel room because it was clean.','This hotel is very y close to downtown area.']
This is what I tried:
corpus=' '.join( [w for w in corpus.split() if len(w)>1] )
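But it didn't work. Can someone help me?
Try the following. The original attempt fails because corpus is a list, and a list has no split() method, so each entry has to be split separately: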
corpus = ['I like this d hotel room because it was clean.', 'This hotel is very y close to downtown area.']
corpus1 = []
for entry in corpus:
    corpus1.append(' '.join(x for x in entry.split() if len(x) > 1))
print(corpus1)
Output:
['like this hotel room because it was clean.', 'This hotel is very close to downtown area.']
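The same loop can be wrapped in a reusable function. This is only a minimal sketch; the function name drop_single_char_tokens is made up for illustration and is not part of the answer above. Note that, like the loop, it also drops the pronoun "I".

```python
def drop_single_char_tokens(documents):
    """Return a new list in which every one-character token is removed."""
    return [
        # Split on whitespace and keep only tokens longer than one character.
        " ".join(token for token in doc.split() if len(token) > 1)
        for doc in documents
    ]


corpus = ['I like this d hotel room because it was clean.',
          'This hotel is very y close to downtown area.']
cleaned = drop_single_char_tokens(corpus)
print(cleaned)
```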
This should work for you:
corpus=['I like this d hotel room because it was clean.','This hotel is very y close to downtown area.']
clean_corpus=[]
for sentence in corpus:
    clean_sentence=[]
    parts=sentence.split(" ")
    for part in parts:
        invalid=False
        if (len(part)==1) and (part.lower()!="a") and (part.lower()!="i") and (not part.isdigit()):
            invalid=True
        if not invalid:
            clean_sentence.append(part)
    clean_corpus.append(" ".join(clean_sentence))
print(clean_corpus)
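This removes every single-character word that is not "a", "A", "i", "I", or a digit (1, 2, 3, ...).
Try it yourself and let me know in the comments whether it works or what could be improved!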
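One possible improvement: a stray letter followed by punctuation, such as "x,", has length 2 and slips through the length check. The sketch below strips surrounding punctuation before testing the length. The names KEEP, is_stray_token and clean_document are hypothetical, and keeping only "a" and "i" is an assumption carried over from the answer above.

```python
import string

# Assumption: these are the only single-letter words worth keeping.
KEEP = {"a", "i"}


def is_stray_token(token):
    """True if the token is one character (ignoring surrounding punctuation)
    and is neither a kept word nor a digit."""
    core = token.strip(string.punctuation)
    if len(core) != 1:
        return False
    return core.lower() not in KEEP and not core.isdigit()


def clean_document(doc):
    """Drop stray single-character tokens from one document."""
    return " ".join(t for t in doc.split() if not is_stray_token(t))


corpus = ['I like this d hotel room because it was clean.',
          'This hotel is very y close to downtown area.']
cleaned = [clean_document(doc) for doc in corpus]
print(cleaned)
```

Unlike the answer above, this splits on any whitespace rather than a single space, so repeated spaces are collapsed in the result.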