Storing list in a pandas DataFrame column

Question

I am trying to do some text processing using NLTK and Pandas.

I have a DataFrame with a column 'text'. I want to add a column 'text_tokenized' that stores the tokenized text as a nested list (a list of sentences, each of which is a list of words).

My code for tokenizing text is:

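from nltk.tokenize import sent_tokenize, word_tokenize

def sent_word_tokenize(text):
    # Python 2: decode the byte string, replacing undecodable bytes
    text = unicode(text, errors='replace')
    # Split the text into sentences, then each sentence into words
    sents = sent_tokenize(text)
    tokens = map(word_tokenize, sents)  # a list of lists in Python 2

    return tokens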

Currently, I am trying to apply this function as follows:

df['text_tokenized'] = df.apply(lambda row: sent_word_tokenize(row.text), axis=1)

This gives me the following error:

ValueError: Shape of passed values is (100, 3), indices imply (100, 21)

I am not sure what is wrong here or how to fix it.

Answer 1

I solved this myself by calling apply on the 'text' column (a Series) rather than on the whole DataFrame with axis=1.

Instead of:

df['text_tokenized'] = df.apply(lambda row: sent_word_tokenize(row.text), axis=1)

I used:

df['text_tokenized'] = df['text'].apply(sent_word_tokenize)

I am not sure why this works, though, and I would really appreciate it if somebody could explain it to me.
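A possible explanation, which I have not checked against the pandas version from the question: Series.apply calls the function once per value and keeps whatever it returns as a single object in that cell, so a nested list fits. DataFrame.apply with axis=1 may instead try to spread a list-like return value across columns, which would explain the shape mismatch in the error. Below is a minimal sketch of the Series.apply behaviour. Note that simple_sent_word_tokenize is a hypothetical stand-in for the NLTK-based function (it splits naively on '. ' and on whitespace), used only so the example runs without NLTK data.

```python
import pandas as pd


def simple_sent_word_tokenize(text):
    # Hypothetical stand-in for sent_word_tokenize from the question:
    # naive sentence split on '. ' and word split on whitespace.
    sentences = [s for s in text.split('. ') if s]
    return [s.split() for s in sentences]


df = pd.DataFrame({'text': ['Hello world. How are you', 'One sentence only']})

# Series.apply passes each value of the column to the function and
# stores each returned nested list as one object in the new column.
df['text_tokenized'] = df['text'].apply(simple_sent_word_tokenize)

first = df.loc[0, 'text_tokenized']
print(type(first))
print(first)
```

The new column has dtype object, and each cell holds one nested list, which is the layout asked for in the question.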

1 answer

solution1
2 ACCEPTED 2016-08-02 03:38:45
