
Using NLTK to tokenize sentences into words with pandas

I'm trying to tokenize the sentences from a CSV file into words, but my loop isn't moving on to the next sentence; it only processes the first column. Any idea where the mistake is? This is what my CSV file looks like: [screenshot of the CSV file]

import re
import string
import pandas as pd
from nltk.tokenize import word_tokenize

text = pd.read_csv("data.csv")
tokenized_docs = [word_tokenize(doc) for doc in text]
x = re.compile('[%s]' % re.escape(string.punctuation))
tokenized_docs_no_punctuation = []

The output I'm getting looks like this: [screenshot of the output, showing only one tokenized item]

I expected the loop to tokenize every sentence, not just one.

Iterating over a DataFrame yields its column labels, not its rows, so `word_tokenize` was only ever handed the column name. You just need to change the code to index the column and grab the sentences themselves:

import re
import string
import pandas as pd
from nltk.tokenize import word_tokenize

text = pd.read_csv("out157.txt", sep="|")
# Indexing the column gives a Series; iterating over it yields each sentence.
tokenized_docs = [word_tokenize(doc) for doc in text['SENTENCES']]
x = re.compile('[%s]' % re.escape(string.punctuation))
tokenized_docs_no_punctuation = []
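To see the difference in a self-contained way, here is a minimal sketch. It uses a small hypothetical DataFrame in place of the CSV, and plain `str.split` as a stand-in for `word_tokenize` (so it runs without NLTK's `punkt` data download); it also completes the punctuation-stripping step that both snippets set up but leave unfinished:

```python
import re
import string
import pandas as pd

# Hypothetical stand-in for the question's CSV file.
text = pd.DataFrame({"SENTENCES": ["Hello, world!", "Pandas is great."]})

# Iterating over the DataFrame itself yields column labels, not rows:
print([doc for doc in text])  # ['SENTENCES']

# Iterating over the column (a Series) yields the actual sentences.
# str.split stands in for nltk's word_tokenize here.
tokenized_docs = [doc.split() for doc in text["SENTENCES"]]

# Completing the punctuation-removal step: strip punctuation from each
# token and drop any tokens that become empty.
x = re.compile("[%s]" % re.escape(string.punctuation))
tokenized_docs_no_punctuation = [
    [t for t in (x.sub("", token) for token in doc) if t]
    for doc in tokenized_docs
]
print(tokenized_docs_no_punctuation)
# [['Hello', 'world'], ['Pandas', 'is', 'great']]
```

The same list comprehensions work unchanged with `word_tokenize` once the NLTK data is installed; only the tokenizer call differs.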
