I have code that is supposed to pre-process a list of text documents: given a list of documents, it returns a list with each document pre-processed. For some reason, though, it does not remove punctuation.
import string

import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

nltk.download("stopwords")
nltk.download('punkt')
nltk.download('wordnet')

def preprocess(docs):
    """
    Given a list of documents, return each document as a string of tokens,
    stripping out punctuation
    """
    clean_docs = [clean_text(i) for i in docs]
    tokenized_docs = [tokenize(i) for i in clean_docs]
    return tokenized_docs

def tokenize(text):
    """
    Tokenizes text -- returning the tokens as a string
    """
    stop_words = stopwords.words("english")
    nltk_tokenizer = nltk.WordPunctTokenizer().tokenize
    tokens = nltk_tokenizer(text)
    result = " ".join([i for i in tokens if i not in stop_words])
    return result

def clean_text(text):
    """
    Cleans text by removing case
    and stripping out punctuation.
    """
    new_text = make_lowercase(text)
    new_text = remove_punct(new_text)
    return new_text

def make_lowercase(text):
    new_text = text.lower()
    return new_text

def remove_punct(text):
    text = text.split()
    punct = string.punctuation
    new_text = " ".join(word for word in text if word not in string.punctuation)
    return new_text

# Get a list of titles
s1 = "[UPDATE] I am tired"
s2 = "I am cold."
clean_docs = preprocess([s1, s2])
print(clean_docs)
This prints out:
['[ update ] tired', 'cold .']
In other words, punctuation is not stripped out: "[", "]", and "." all appear in the final output.
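Your remove_punct splits the text on whitespace and then tests whether each whole word is in string.punctuation. For two strings, in is a substring test, so a word such as [update] or cold. is never found there and is kept, punctuation included.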
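A quick check can make this concrete. The snippet below is only an illustrative sketch (the variable names are made up for the demonstration); it compares the whole-word test used in the question with a character-by-character test.

```python
import string

word = "[update]"

# `in` between two strings is a substring test: it asks whether the whole
# word occurs somewhere inside string.punctuation, which it does not.
word_is_punct = word in string.punctuation

# Only a token made up solely of punctuation, such as ".", is caught.
dot_is_punct = "." in string.punctuation

# Checking character by character is what actually finds the brackets.
stripped = "".join(ch for ch in word if ch not in string.punctuation)

print(word_is_punct, dot_is_punct, stripped)
```

So the original filter only drops tokens that are punctuation on their own and separated by whitespace, which none of the tokens in the example inputs are.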
Instead, go through the punctuation characters and remove each one from the text:
import string

def remove_punctuation(text: str) -> str:
    for p in string.punctuation:
        text = text.replace(p, '')
    return text

if __name__ == '__main__':
    text = '[UPDATE] I am tired'
    print(remove_punctuation(text))
    # output:
    # UPDATE I am tired
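As a possible alternative to the replace loop, the same removal can be done in one pass with str.translate. This is a minimal sketch, not part of the original answer; remove_punct_translate and _PUNCT_TABLE are hypothetical names chosen for the example, and it assumes only the ASCII characters in string.punctuation need to go.

```python
import string

# Translation table that maps every ASCII punctuation character to None,
# i.e. deletes it. It is built once and reused for every document.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def remove_punct_translate(text: str) -> str:
    """Delete every character listed in string.punctuation from text."""
    return text.translate(_PUNCT_TABLE)

if __name__ == "__main__":
    # Lowercase first, as clean_text does in the question.
    for doc in ["[UPDATE] I am tired", "I am cold."]:
        print(remove_punct_translate(doc.lower()))
```

Either version could take the place of remove_punct in the question's clean_text. Note that characters outside string.punctuation, such as curly quotes, would be left untouched by both.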