Python Pandas: NLTK Part of Speech Tagging for Entire Column in Dataframe

I have the following sample data frame shown below. It has been tokenized already.

No  category    problem_definition_stopwords
175 2521       ['coffee', 'maker', 'brewing', 'properly', '2', '420', '420', '420']
211 1438       ['galley', 'work', 'table', 'stuck']
912 2698       ['cloth', 'stuck']
572 2521       ['stuck', 'coffee']

I want to do part-of-speech tagging on this data frame. Below is the beginning of my code, which raises an error:

from nltk.corpus import state_union
from nltk.tokenize import PunktSentenceTokenizer 

train_text = state_union.raw(df['problem_definition_stopwords'])

Error

TypeError: join() argument must be str or bytes, not 'list'

My desired result is below, where 'XXX' is a tokenized word followed by its part-of-speech tag (e.g. NNP):

[('XXX', 'NNP'), ('XXX', 'VBD'), ('XXX', 'POS')]

If you want to tokenize the text and get the POS tags with pos_tag, convert problem_definition_stopwords to a string and pass it to nltk.sent_tokenize.
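For context, state_union.raw() reads files from the State of the Union corpus by file id, so passing it a DataFrame column cannot work. Since the column in the question already holds token lists, one possible shortcut is to hand each list straight to nltk.pos_tag rather than rebuilding a string first. The following is a minimal sketch under that assumption; the helper names to_token_list and tag_column are made up for illustration, and the NLTK tagger model must already be downloaded (the resource name passed to nltk.download varies between NLTK versions).

```python
import ast

import pandas as pd


def to_token_list(cell):
    """Return the cell as a list of tokens.

    If the column was read from CSV, each cell may be a string such as
    "['stuck', 'coffee']"; in that case it is parsed back into a list.
    """
    if isinstance(cell, str):
        return ast.literal_eval(cell)
    return list(cell)


def tag_column(df, column, tagger=None):
    """Apply a POS tagger to every token list in df[column].

    tagger is any callable mapping a list of tokens to (token, tag) pairs.
    When it is omitted, nltk.pos_tag is used (assumes NLTK and its tagger
    model are installed).
    """
    if tagger is None:
        import nltk  # imported lazily so the helpers work without NLTK

        tagger = nltk.pos_tag
    return df[column].apply(lambda cell: tagger(to_token_list(cell)))


# Sample data copied from the question.
df = pd.DataFrame(
    {
        "No": [175, 211, 912, 572],
        "category": [2521, 1438, 2698, 2521],
        "problem_definition_stopwords": [
            ["coffee", "maker", "brewing", "properly", "2", "420", "420", "420"],
            ["galley", "work", "table", "stuck"],
            ["cloth", "stuck"],
            ["stuck", "coffee"],
        ],
    }
)

# With NLTK installed, this adds a column of (token, tag) lists:
# df["pos_tags"] = tag_column(df, "problem_definition_stopwords")
```

Each row of the new column would then be a list of (word, tag) tuples in the shape requested in the question. Note that tagging pre-filtered tokens (stop words removed) gives the tagger less context than full sentences would, so the tags may be less accurate.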
