I am facing a problem with tagging words in sentences. I cannot understand why the output contains only the last sentence in the list; it should be a long list of tagged words for every sentence.
Here is the full code:
import pandas as pd
import numpy as np
import nltk
from nltk.corpus import stopwords
from nltk.corpus import wordnet
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.stem.wordnet import WordNetLemmatizer
#import stanfordnlp
import stanza
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
stanza.download('en')
# Load data
df = pd.read_excel('combined_file_neg.xlsx', engine='openpyxl', index_col=None)
print(df.head())
def clean_text(text):
    # Convert the text to lowercase
    text = text.lower()
    # Split into a list of words
    wordList = text.split()
    # Tokenize
    #wordList = nltk.sent_tokenize(txt)
    return " ".join(wordList)
df["clean_text"] = df["body"].apply(clean_text)
df['tokenized_sents'] = df.apply(lambda row: nltk.sent_tokenize(row['clean_text']), axis=1)
print(df["tokenized_sents"].head())
sentList = df.tokenized_sents.tolist()
print(sentList)
for doc in sentList:
    for word in doc:
        txt_list = nltk.word_tokenize(word)
        taggedList = nltk.pos_tag(txt_list)
print(taggedList)
The output is below (only the tags for the last sentence of the last document):
[('there', 'EX'), ('is', 'VBZ'), ('a', 'DT'), ('whole', 'JJ'), ('in', 'IN'), ('the', 'DT'), ('cream', 'NN'), ('as', 'IN'), ('someone', 'NN'), ('has', 'VBZ'), ('put', 'VBN'), ('his', 'PRP$'), ('finger', 'NN'), ('in', 'IN'), ('it', 'PRP'), ('and', 'CC'), ('the', 'DT'), ('pot', 'NN'), ('was', 'VBD'), ('messy', 'JJ'), ('with', 'IN'), ('cream', 'NN'), ('around', 'IN'), ('when', 'WRB'), ('it', 'PRP'), ('arrived.so', 'RB'), ('disappointing', 'VBG'), ('to', 'TO'), ('resend', 'VB'), ('returned', 'JJ'), ('items', 'NNS'), ('to', 'TO'), ('other', 'JJ'), ('buyers', 'NNS'), ('!', '.')]
The output should instead start with the tagged words of the first sentence:
I love ememis but...this is probably the worst
The link to the excel file is here: https://www.dropbox.com/scl/fi/1m4guwwwncffmilgzswin/combined_file_neg.xlsx?dl=0&rlkey=d53o7px628zknqh9cf1enee40
Well, that's because you re-assign `taggedList` on every iteration. Printing it after the loop will only print the result of the last iteration of `taggedList`.
taggedList = []
for doc in sentList:
    for word in doc:
        txt_list = nltk.word_tokenize(word)
        taggedList.append(nltk.pos_tag(txt_list))
print(taggedList)
You need to create the list `taggedList` once, before the loop, and append the POS-tagged words to it on each iteration.
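The same pattern in miniature, with a hypothetical stand-in tagger so the effect is easy to see without NLTK data (the real code would call `nltk.pos_tag` on `nltk.word_tokenize` output):

```python
# Stand-in for nltk.pos_tag: tags every word as 'NN' (hypothetical).
def fake_pos_tag(words):
    return [(w, "NN") for w in words]

docs = [["first sentence here"], ["second sentence here"]]

# Buggy version: taggedList is overwritten on each iteration,
# so after the loop it holds only the last sentence's tags.
for doc in docs:
    for sent in doc:
        taggedList = fake_pos_tag(sent.split())
print(taggedList)  # tags for "second sentence here" only

# Fixed version: create the list once and append to it,
# so every sentence's tags are kept.
all_tags = []
for doc in docs:
    for sent in doc:
        all_tags.append(fake_pos_tag(sent.split()))
print(all_tags)  # one tagged list per sentence, in order
```

If you later need one flat list of `(word, tag)` pairs rather than a list per sentence, use `extend` instead of `append`.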