How to remove single characters (letters) from a corpus using Python

Question

I would like to remove every single-character token from each document in a corpus. For example, let's say there are some typos or stray non-English letters.

corpus=['I like this d hotel room because it was clean.','This hotel is very y close to downtown area.']

What I tried was

corpus=' '.join( [w for w in corpus.split() if len(w)>1] )

but it didn't work. Could anyone help me out?

Answer 1

Try the following:

corpus = ['I like this d hotel room because it was clean.', 'This hotel is very y close to downtown area.']
corpus1 = []
for entry in corpus:
    corpus1.append(' '.join(x for x in entry.split() if len(x) > 1))
print(corpus1)

Output:

['like this hotel room because it was clean.', 'This hotel is very close to downtown area.']
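
For context, the attempt in the question most likely fails because `corpus` is a list, and lists have no `split` method, so `corpus.split()` raises `AttributeError`. The loop above can also be written as a list comprehension. This is only a sketch of the same idea; the name `cleaned` is made up for illustration. Note that it drops every one-character token, including the pronoun "I".

```python
corpus = [
    'I like this d hotel room because it was clean.',
    'This hotel is very y close to downtown area.',
]

# Split each document on its own: corpus is a list, so corpus.split() would fail.
# Keep only the whitespace-separated tokens that are longer than one character.
cleaned = [' '.join(w for w in doc.split() if len(w) > 1) for doc in corpus]
print(cleaned)
```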

Answer 2

This should work for you:

corpus = ['I like this d hotel room because it was clean.', 'This hotel is very y close to downtown area.']
clean_corpus = []
for sentence in corpus:
    clean_sentence = []
    for part in sentence.split(" "):
        # A token is stray if it is one character long and is not "a", "i" (any case) or a digit.
        is_stray = len(part) == 1 and part.lower() not in ("a", "i") and not part.isdigit()
        if not is_stray:
            clean_sentence.append(part)
    clean_corpus.append(" ".join(clean_sentence))
print(clean_corpus)

This removes every single-character token except "a", "A", "i", "I", and digits (1, 2, 3, ...).
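
If you want to reuse this, the same rule could be wrapped in a small function. This is a rough sketch, not a definitive implementation: the function name `remove_single_chars` and the `keep` parameter are made up for illustration. It uses `split()` without an argument, so repeated spaces are collapsed, which differs slightly from `split(" ")`.

```python
def remove_single_chars(docs, keep=frozenset({"a", "i"})):
    """Drop one-character tokens unless they are digits or listed in keep.

    docs: list of strings. keep: lowercase one-character words to preserve
    (hypothetical parameter, defaulting to "a" and "i").
    """
    cleaned = []
    for doc in docs:
        tokens = [
            tok for tok in doc.split()
            if len(tok) > 1 or tok.isdigit() or tok.lower() in keep
        ]
        cleaned.append(" ".join(tokens))
    return cleaned


corpus = [
    'I like this d hotel room because it was clean.',
    'This hotel is very y close to downtown area.',
]
print(remove_single_chars(corpus))
```

Passing an empty set, as in `remove_single_chars(corpus, keep=frozenset())`, would also remove "a" and "I" while still keeping digits.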

Try it yourself and tell me in the comments if it worked or what could be improved!
