I am trying to do some text processing using NLTK and Pandas.
I have a DataFrame with a column 'text'. I want to add a column 'text_tokenized' that stores the tokens as a nested list (one list of words per sentence).
My code for tokenizing text is:
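from nltk.tokenize import sent_tokenize, word_tokenize

def sent_word_tokenize(text):
    # Python 2: decode the byte string, replacing undecodable bytes
    text = unicode(text, errors='replace')
    sents = sent_tokenize(text)
    # one list of word tokens per sentence
    tokens = map(word_tokenize, sents)
    return tokens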
Currently, I am trying to apply this function as follows:
df['text_tokenized'] = df.apply(lambda row: sent_word_tokenize(row.text), axis=1)
This gives me the following error:
ValueError: Shape of passed values is (100, 3), indices imply (100, 21)
I am not sure what is wrong here or how to fix it.
I solved my own question by applying the function to the 'text' column itself rather than to each row of the DataFrame.
Instead of:
df['text_tokenized'] = df.apply(lambda row: sent_word_tokenize(row.text), axis=1)
I used:
df['text_tokenized'] = df.text.apply(lambda text: sent_word_tokenize(text))
However, I am not sure why it works, and I would really appreciate it if somebody could explain it to me.
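A possible explanation, offered with some caution because the exact behaviour depends on the pandas version: DataFrame.apply with axis=1 passes each row to the function and may try to expand a list-like return value into several columns, which would be consistent with the shape mismatch in the error above. Series.apply, by contrast, calls the function once per value and stores whatever comes back as a single cell. The sketch below illustrates the Series.apply case only. It assumes a hypothetical stand-in tokenizer, toy_sent_word_tokenize, built on str.split so that it runs without NLTK; it is not the function from the question.

```python
import pandas as pd


def toy_sent_word_tokenize(text):
    # Hypothetical stand-in for the NLTK-based function (an assumption for
    # illustration): split on ". " into sentences, then on whitespace into words.
    sents = [s for s in text.split(". ") if s]
    return [s.split() for s in sents]


df = pd.DataFrame({"text": ["Hello world. How are you", "One sentence only"]})

# Series.apply calls the function once per value in the column and keeps
# each returned nested list as one object in the resulting Series.
df["text_tokenized"] = df["text"].apply(toy_sent_word_tokenize)

first = df["text_tokenized"].iloc[0]
second = df["text_tokenized"].iloc[1]
print(first)
print(second)
```

Each cell of 'text_tokenized' then holds one nested list, and the DataFrame keeps its original number of rows with exactly one new column.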