I would like to remove any single-character words from each document in a corpus. For example, suppose the documents contain typos or stray non-English letters:
corpus=['I like this d hotel room because it was clean.','This hotel is very y close to downtown area.']
What I tried was:
corpus=' '.join( [w for w in corpus.split() if len(w)>1] )
but it didn't work. Could anyone help me out?
Try the following:
corpus = ['I like this d hotel room because it was clean.', 'This hotel is very y close to downtown area.']
corpus1 = []
for entry in corpus:
    corpus1.append(' '.join(x for x in entry.split() if len(x) > 1))
print(corpus1)
Output:
['like this hotel room because it was clean.', 'This hotel is very close to downtown area.']
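For reference, the attempt in the question fails because corpus is a list, and a list has no split() method; only the individual strings do. A minimal sketch of the same idea as above, wrapped in a helper and applied to each string in turn (the name drop_single_chars is made up for this example, not from the original post):

```python
corpus = [
    'I like this d hotel room because it was clean.',
    'This hotel is very y close to downtown area.',
]


def drop_single_chars(text):
    """Return text with every one-character token removed."""
    # split() with no argument splits on any run of whitespace.
    return ' '.join(word for word in text.split() if len(word) > 1)


# Apply the helper to each document instead of to the list itself.
cleaned = [drop_single_chars(doc) for doc in corpus]
print(cleaned)
```

Note that this also drops legitimate one-letter words such as "I" and "a", which may or may not be what you want.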
This should work for you:
corpus = ['I like this d hotel room because it was clean.', 'This hotel is very y close to downtown area.']
clean_corpus = []
for sentence in corpus:
    clean_sentence = []
    parts = sentence.split(" ")
    for part in parts:
        invalid = False
        if (len(part) == 1) and (part.lower() != "a") and (part.lower() != "i") and (not part.isdigit()):
            invalid = True
        if not invalid:
            clean_sentence.append(part)
    clean_corpus.append(" ".join(clean_sentence))
print(clean_corpus)
This removes every single-character word except "a", "A", "i", "I", and single digits (1, 2, 3, ...), so real words like "I" and "a" are kept.
Try it yourself and tell me in the comments if it worked or what could be improved!
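One possible improvement is to package the rule as a reusable function with a configurable set of allowed one-letter words. This is only a sketch: the names remove_stray_chars and KEEP are invented for illustration, and it uses split() rather than split(" "), so runs of whitespace are collapsed to a single space, which differs slightly from the loop above.

```python
# Hypothetical default: one-letter words that count as real English words.
KEEP = frozenset({"a", "i"})


def remove_stray_chars(sentence, keep=KEEP):
    """Drop one-character tokens unless they are in `keep` or are a digit."""
    kept = []
    for token in sentence.split():
        is_single = len(token) == 1
        is_allowed = token.lower() in keep or token.isdigit()
        if is_single and not is_allowed:
            continue  # stray character, skip it
        kept.append(token)
    return " ".join(kept)


corpus = [
    'I like this d hotel room because it was clean.',
    'This hotel is very y close to downtown area.',
]
clean_corpus = [remove_stray_chars(sentence) for sentence in corpus]
print(clean_corpus)
```

Passing a different keep set, for example keep={"a", "i", "o"}, changes which one-letter words survive without touching the loop.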