
nltk word_tokenize in Pandas DataFrame only returns tokens for the first 101 words/tokens

I'm trying to apply word tokenization to a Pandas DataFrame column as the step before POS tagging. The source/raw column is 'sent' (already sentence-tokenized) and the destination column is 'word'. Here's the code, including the max column width setting:

import nltk
import pandas as pd

pd.set_option('display.max_colwidth', None)

LC_HD_df['word'] = LC_HD_df['sent'].apply(lambda x: nltk.tokenize.word_tokenize(str(x)))

This appears to work... except each cell in 'word' only has the first 101 tokens from the corresponding 'sent' cell. Why is it truncating at 101 tokens? How do I fix this?

The list of 101 tokens ends with "...". Does that suggest they have been tokenized but do not appear for some reason? (That doesn't make sense.)

Attached is a picture of the first row: one row, two columns, one with the source words and one with the first 101 word tokens.

I searched for related questions to no avail; many were generally related, but none addressed the truncation problem. This should be an easy fix that I just don't know, but once I know the solution, I will never forget it.

Thanks in advance for your assistance.

I don't think your 'word' cells only have 101 tokens in them; it's just that only that many are being printed.
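You can confirm this by pulling a single cell out of the DataFrame and printing it directly, which bypasses pandas' column-width display logic entirely. A minimal sketch, assuming your first row is at index 0:

first_cell = LC_HD_df.loc[0, 'word']  # the full Python list stored in the cell
print(len(first_cell))                # the true token count
print(first_cell)                     # prints every token, with no "..." cut-off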

I assume nltk.tokenize.word_tokenize(str(x)) is a more elaborate version of x.split(): it takes a string and returns a list of strings.
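For example (a small illustration; it assumes the standard NLTK tokenizer models are already downloaded):

from nltk.tokenize import word_tokenize

print("Don't stop!".split())         # ["Don't", 'stop!']
print(word_tokenize("Don't stop!"))  # ['Do', "n't", 'stop', '!']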

To check the length of the list in each of the cells you could use any of the methods mentioned in this post: How to determine the length of lists in a pandas dataframe column, e.g.:

LC_HD_df['word_count'] = LC_HD_df['word'].str.len()

I don't think you will get 101 with this method.
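For completeness, here is a minimal self-contained sketch of the whole effect, using made-up data rather than your 'sent' column: the list stored in the cell keeps every token, even though the printed DataFrame shows an abbreviated version of it.

import nltk
import pandas as pd

nltk.download('punkt')  # tokenizer models; newer NLTK releases may ask for 'punkt_tab' instead

# One long "sentence" built from 50 repetitions -> 350 tokens after tokenization
df = pd.DataFrame({'sent': ['This is a short example sentence. ' * 50]})
df['word'] = df['sent'].apply(lambda x: nltk.tokenize.word_tokenize(str(x)))

df['word_count'] = df['word'].str.len()
print(df['word_count'].iloc[0])   # 350 -- far more than any display cut-off
print(len(df.loc[0, 'word']))     # 350 again: nothing was truncated in storage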
