
How to remove single characters (letters) from a corpus using Python

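I want to remove every single-character token from each document in a corpus. Such tokens are usually typos or stray non-English letters. For example: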

corpus=['I like this d hotel room because it was clean.','This hotel is very y close to downtown area.']

What I tried is:

corpus=' '.join( [w for w in corpus.split() if len(w)>1] )

But it did not work. Can anyone help me?

Try the following:

corpus = ['I like this d hotel room because it was clean.', 'This hotel is very y close to downtown area.']
corpus1 = []
for entry in corpus:
    corpus1.append(' '.join(x for x in entry.split() if len(x) > 1))
print(corpus1)

Output:

['like this hotel room because it was clean.', 'This hotel is very close to downtown area.']
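The attempt in the question most likely fails because corpus is a list, and a Python list has no split() method; the split has to be applied to each document separately. A minimal sketch of the same idea as a reusable function follows. The name drop_single_chars is hypothetical and is not part of the original answer.

```python
def drop_single_chars(docs):
    """Return a new list in which every one-character token is removed from each document."""
    # split() with no argument splits on any run of whitespace,
    # so the rejoined text also has its spacing normalized.
    return [' '.join(w for w in doc.split() if len(w) > 1) for doc in docs]


corpus = ['I like this d hotel room because it was clean.',
          'This hotel is very y close to downtown area.']
cleaned = drop_single_chars(corpus)
print(cleaned)
```

Note that this also drops legitimate one-letter words such as "I" and "a", exactly as the loop above does.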

This should work for you:

corpus = ['I like this d hotel room because it was clean.', 'This hotel is very y close to downtown area.']
clean_corpus = []
for sentence in corpus:
    clean_sentence = []
    for part in sentence.split(" "):
        # A token is invalid if it is one character long and is
        # neither "a"/"i" (in any case) nor a digit.
        invalid = len(part) == 1 and part.lower() not in ("a", "i") and not part.isdigit()
        if not invalid:
            clean_sentence.append(part)
    clean_corpus.append(" ".join(clean_sentence))
print(clean_corpus)

This removes every one-character word that is not "a", "A", "i", "I", or a digit (1, 2, 3, ...).
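The rule described above can also be sketched as a function with a configurable allow-list. The function name drop_stray_chars and the keep parameter are hypothetical and are introduced only for illustration. Unlike the loop above, this sketch uses split() without an argument, so runs of whitespace are collapsed.

```python
def drop_stray_chars(docs, keep=("a", "i")):
    """Remove one-character tokens unless they are in `keep` (case-insensitive) or are digits."""
    cleaned = []
    for doc in docs:
        tokens = [
            t for t in doc.split()
            if len(t) > 1 or t.lower() in keep or t.isdigit()
        ]
        cleaned.append(" ".join(tokens))
    return cleaned


corpus = ['I like this d hotel room because it was clean.',
          'This hotel is very y close to downtown area.']
result = drop_stray_chars(corpus)
print(result)
```

On the sample corpus this keeps the leading "I" in the first sentence and drops the stray "d" and "y". Passing keep=() makes the function drop every one-character token except digits.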

Try it yourself and let me know in the comments whether it works or what could be improved!

