How to remove single characters (letters) from a corpus using Python

Question

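I want to remove every single-character token from each document in a corpus. For example, suppose the text contains stray letters left over from typos, or non-English letters.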

corpus=['I like this d hotel room because it was clean.','This hotel is very y close to downtown area.']

This is what I tried:

corpus=' '.join( [w for w in corpus.split() if len(w)>1] )

But it didn't work. Can someone help me?

Answer 1

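Try the following. Your attempt fails because corpus is a list, and a list has no split() method, so each string in the list has to be split on its own: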

corpus = ['I like this d hotel room because it was clean.', 'This hotel is very y close to downtown area.']
corpus1 = []
for entry in corpus:
    corpus1.append(' '.join(x for x in entry.split() if len(x) > 1))
print(corpus1)

Output

['like this hotel room because it was clean.', 'This hotel is very close to downtown area.']
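The same idea can be sketched more compactly with a list comprehension. This is only an illustration under the assumptions of the question (whitespace-separated tokens, the same two-sentence corpus); the names error_message and cleaned are made up for this example. It also reproduces the error from the original attempt, and, as the output above shows, a plain length filter drops legitimate one-letter words such as "I" as well.

```python
corpus = [
    'I like this d hotel room because it was clean.',
    'This hotel is very y close to downtown area.',
]

# The original attempt calls .split() on the list itself, which raises
# AttributeError because only the strings inside the list can be split.
try:
    corpus.split()
except AttributeError as exc:
    error_message = str(exc)

# Split each document separately and keep only tokens longer than one character.
cleaned = [
    ' '.join(word for word in document.split() if len(word) > 1)
    for document in corpus
]

print(error_message)
print(cleaned)
```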

Answer 2

This should work for you:

corpus=['I like this d hotel room because it was clean.','This hotel is very y close to downtown area.']
clean_corpus=[]
for sentence in corpus:
    clean_sentence=[]
    parts=sentence.split(" ")
    for part in parts:
        invalid=False
        if (len(part)==1) and (part.lower()!="a") and (part.lower()!="i") and (not part.isdigit()):
            invalid=True
        if not invalid:
            clean_sentence.append(part)
    clean_corpus.append(" ".join(clean_sentence))
print(clean_corpus)

This removes every single-character word that is not "a", "A", "i", "I", or a digit (1, 2, 3, ...).
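The rule described above can also be wrapped in a small reusable function. The following is a minimal sketch, not the answerer's code: the function name drop_single_chars and the keep parameter are hypothetical, and it assumes tokens are separated by whitespace.

```python
def drop_single_chars(corpus, keep=frozenset({"a", "i"})):
    """Remove one-character tokens, except digits and the words listed in keep.

    keep holds lowercase one-letter words that should survive (an assumption
    made for this sketch; adjust it for other languages).
    """
    cleaned = []
    for sentence in corpus:
        tokens = [
            token
            for token in sentence.split()
            if len(token) > 1 or token.lower() in keep or token.isdigit()
        ]
        cleaned.append(" ".join(tokens))
    return cleaned


corpus = [
    'I like this d hotel room because it was clean.',
    'This hotel is very y close to downtown area.',
]
result = drop_single_chars(corpus)
print(result)
```

Passing keep=frozenset() makes the function behave like a plain length filter, except that single digits are still kept.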

Try it yourself and let me know in the comments whether it works or what could be improved!

2 answers

Answer 1: 0, 2020-11-07 19:54:32
Answer 2: 0, 2020-11-07 21:31:28
