
How to remove single-character words from a corpus using Python

I would like to remove every single-character word from each document in a corpus. These are typically typos or stray non-English letters. For example:

corpus=['I like this d hotel room because it was clean.','This hotel is very y close to downtown area.']

Here is what I tried:

corpus=' '.join( [w for w in corpus.split() if len(w)>1] )

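but it didn't work. Could anyone help me out?

Try the code below. Your attempt fails because corpus is a list, and a list has no split() method, so each string has to be split and filtered on its own: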

corpus = ['I like this d hotel room because it was clean.', 'This hotel is very y close to downtown area.']
corpus1 = []
for entry in corpus:
    corpus1.append(' '.join(x for x in entry.split() if len(x) > 1))
print(corpus1)

Output:

['like this hotel room because it was clean.', 'This hotel is very close to downtown area.']
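The same loop can be condensed into a single list comprehension, which stays close to the one-liner in the question. This is only a sketch of that idea; the name `cleaned` is chosen here for illustration and does not appear in the original answer.

```python
corpus = [
    'I like this d hotel room because it was clean.',
    'This hotel is very y close to downtown area.',
]

# Split, filter and re-join each document separately instead of
# calling split() on the list itself.
cleaned = [' '.join(w for w in doc.split() if len(w) > 1) for doc in corpus]
print(cleaned)
```

Note that this also drops the pronoun "I", exactly like the loop version above.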

This should work for you:

corpus = ['I like this d hotel room because it was clean.', 'This hotel is very y close to downtown area.']
clean_corpus = []
for sentence in corpus:
    clean_sentence = []
    for part in sentence.split(" "):
        # A stray token is one character long and is not "a", "i" (any case) or a digit.
        is_stray = len(part) == 1 and part.lower() not in ("a", "i") and not part.isdigit()
        if not is_stray:
            clean_sentence.append(part)
    clean_corpus.append(" ".join(clean_sentence))
print(clean_corpus)

This removes every single-character word except "a", "A", "i", "I", and single digits (1, 2, 3, ...), so legitimate one-letter English words and numbers are kept.
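If the set of allowed one-letter words needs to vary, the rule described above could be wrapped in a small helper. The following is a rough sketch, not part of the original answer: the function name `remove_single_chars` and its `keep` parameter are made up for illustration, and it uses split() without an argument, so runs of whitespace are collapsed (unlike split(" ")).

```python
def remove_single_chars(corpus, keep=("a", "i")):
    """Drop one-character tokens unless they are digits or listed in `keep`.

    Hypothetical helper: `keep` is compared case-insensitively.
    """
    keep = {k.lower() for k in keep}
    cleaned = []
    for doc in corpus:
        tokens = [
            tok for tok in doc.split()
            if len(tok) > 1 or tok.isdigit() or tok.lower() in keep
        ]
        cleaned.append(" ".join(tokens))
    return cleaned


corpus = [
    'I like this d hotel room because it was clean.',
    'This hotel is very y close to downtown area.',
]
print(remove_single_chars(corpus))
```

Passing keep=() should reproduce the stricter behaviour of the first answer, where "I" is removed as well.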

Try it yourself and tell me in the comments if it worked or what could be improved!


 