Storing list in a pandas DataFrame column

I am trying to do some text processing using NLTK and Pandas.

I have a DataFrame with a column 'text'. I want to add a column 'text_tokenized' in which each cell stores a nested list (one list of word tokens per sentence).

My code for tokenizing text is:

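from nltk.tokenize import sent_tokenize, word_tokenize

def sent_word_tokenize(text):
    # Python 3: decode byte strings (UTF-8 assumed), replacing invalid bytes
    if isinstance(text, bytes):
        text = text.decode('utf-8', errors='replace')
    sents = sent_tokenize(text)
    # One list of word tokens per sentence, i.e. a nested list
    tokens = [word_tokenize(sent) for sent in sents]

    return tokens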

Currently, I am trying to apply this function as follows:

df['text_tokenized'] = df.apply(lambda row: sent_word_tokenize(row.text), axis=1)

This raises the following error:

ValueError: Shape of passed values is (100, 3), indices imply (100, 21)

I am not sure what is wrong here or how to fix it.

I solved my own problem by calling apply on the 'text' column itself rather than on the whole DataFrame with axis=1.

Instead of:

df['text_tokenized'] = df.apply(lambda row: sent_word_tokenize(row.text), axis=1)

I used:

df['text_tokenized'] = df.text.apply(lambda text: sent_word_tokenize(text))

I am not sure why this works, though, and I would really appreciate it if somebody could explain it to me.
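A possible explanation, offered as a sketch rather than a definitive diagnosis: Series.apply calls the function once per cell and stores whatever it returns, so a returned list simply becomes the cell value. DataFrame.apply with axis=1, at least in older pandas versions, could instead try to spread a list-like return value across columns, which would be consistent with the shape mismatch in the error above (3 values per row against 21 columns). The example below uses a hypothetical stand-in tokenizer, simple_sent_word_tokenize, so that it runs without NLTK data; the sample sentences are made up for illustration.

```python
import re

import pandas as pd


def simple_sent_word_tokenize(text):
    # Hypothetical stand-in for sent_word_tokenize (not NLTK): split into
    # sentences after '.', '!' or '?', then split each sentence on whitespace.
    sents = [s for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s]
    return [s.split() for s in sents]


df = pd.DataFrame({'text': ['Hello world. How are you?', 'One sentence only.']})

# Series.apply: the function receives one cell at a time, and each returned
# nested list is stored as-is in an object-dtype column.
df['text_tokenized'] = df['text'].apply(simple_sent_word_tokenize)

# Row-wise alternative: result_type='reduce' (available since pandas 0.23)
# asks apply to return a Series rather than expanding list-like results.
row_wise = df.apply(
    lambda row: simple_sent_word_tokenize(row['text']),
    axis=1,
    result_type='reduce',
)

first = df.loc[0, 'text_tokenized']
second = df.loc[1, 'text_tokenized']
```

When only one column is needed as input, applying to that column is usually the simpler choice, since it avoids building a Series for every row.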
