nltk.pos_tag and nltk.word_tokenize - list of lists

Question

I am having a problem tagging the words in my sentences. I cannot work out why the output contains only the last sentence in the list. It should be one long list of tagged words.

Here is the full code:


import pandas as pd
import numpy as np
import nltk
from nltk.corpus import stopwords
from nltk.corpus import wordnet
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.stem.wordnet import WordNetLemmatizer
#import stanfordnlp
import stanza

nltk.download('stopwords')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
stanza.download('en')

# Load data
df = pd.read_excel('combined_file_neg.xlsx', engine='openpyxl', index_col=None)

print(df.head())

def clean_text(text):
    # Convert the text into lowercase
    text = text.lower()
    # Split into list
    wordList = text.split()
    # Tokenize
    #wordList = nltk.sent_tokenize(txt)
    return " ".join(wordList)

df["clean_text"] = df["body"].apply(clean_text)

df['tokenized_sents'] = df.apply(lambda row: nltk.sent_tokenize(row['clean_text']), axis=1)

print(df["tokenized_sents"].head())

sentList = df.tokenized_sents.tolist()

print(sentList)

for doc in sentList:
    for word in doc:
        txt_list = nltk.word_tokenize(word)
        taggedList = nltk.pos_tag(txt_list)
print(taggedList)

The output is below (the last sentence of the last document):

[('there', 'EX'), ('is', 'VBZ'), ('a', 'DT'), ('whole', 'JJ'), ('in', 'IN'), ('the', 'DT'), ('cream', 'NN'), ('as', 'IN'), ('someone', 'NN'), ('has', 'VBZ'), ('put', 'VBN'), ('his', 'PRP$'), ('finger', 'NN'), ('in', 'IN'), ('it', 'PRP'), ('and', 'CC'), ('the', 'DT'), ('pot', 'NN'), ('was', 'VBD'), ('messy', 'JJ'), ('with', 'IN'), ('cream', 'NN'), ('around', 'IN'), ('when', 'WRB'), ('it', 'PRP'), ('arrived.so', 'RB'), ('disappointing', 'VBG'), ('to', 'TO'), ('resend', 'VB'), ('returned', 'JJ'), ('items', 'NNS'), ('to', 'TO'), ('other', 'JJ'), ('buyers', 'NNS'), ('!', '.')]

The output should start with the tagged words in the sentence:

I love ememis but...this is probably the worst

The link to the excel file is here: https://www.dropbox.com/scl/fi/1m4guwwwncffmilgzswin/combined_file_neg.xlsx?dl=0&rlkey=d53o7px628zknqh9cf1enee40

Answer 1

That's because you reassign taggedList on every iteration, so printing it after the loop shows only the result of the last iteration.

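# Create the list once, before the loops, so results accumulate
taggedList = []

for doc in sentList:
    # each doc is a list of sentence strings
    for sentence in doc:
        txt_list = nltk.word_tokenize(sentence)
        taggedList.append(nltk.pos_tag(txt_list))
print(taggedList)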

You need to create the list taggedList before the loop and append the POS-tagged words of each sentence to it.
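Note that append gives one inner list per sentence (a list of lists), whereas the question asks for one long list of tagged words. If a flat list is what you want, extend may be the better fit. The sketch below shows the difference. It is only an illustration: fake_tag is a hypothetical stand-in for nltk.word_tokenize plus nltk.pos_tag so that it runs without the NLTK data, and sent_list is made-up sample data with the same shape as sentList.

```python
# Hypothetical stand-in for nltk.word_tokenize + nltk.pos_tag, used only so
# this sketch runs without NLTK data. It tags every token with "X".
def fake_tag(sentence):
    return [(token, "X") for token in sentence.split()]

# Made-up data shaped like sentList: one list of sentences per document.
sent_list = [["i love it.", "not bad"], ["so disappointing !"]]

nested = []  # one list of (word, tag) pairs per sentence
flat = []    # a single long list of (word, tag) pairs

for doc in sent_list:
    for sentence in doc:
        tagged = fake_tag(sentence)
        nested.append(tagged)  # keeps sentence boundaries
        flat.extend(tagged)    # merges all pairs into one list

print(len(nested), len(flat))
```

With the real tagger you would swap fake_tag(sentence) for nltk.pos_tag(nltk.word_tokenize(sentence)). Use the nested form if you still need to know which sentence each word came from.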

solution1
0 ACCEPTED 2021-05-16 15:25:07
