
How to convert a numpy array to a regular Python list?

I am reading input from a csv file with pandas and performing tokenization on it with nltk. However, I get the following error:

Traceback (most recent call last):
  File "test.py", line 20, in <module>
    word = nltk.word_tokenize(words)
  File "/home/codelife/.local/lib/python3.5/site-packages/nltk/tokenize/__init__.py", line 109, in word_tokenize
    return [token for sent in sent_tokenize(text, language)
  File "/home/codelife/.local/lib/python3.5/site-packages/nltk/tokenize/__init__.py", line 94, in sent_tokenize
    return tokenizer.tokenize(text)
  File "/home/codelife/.local/lib/python3.5/site-packages/nltk/tokenize/punkt.py", line 1237, in tokenize
    return list(self.sentences_from_text(text, realign_boundaries))
  File "/home/codelife/.local/lib/python3.5/site-packages/nltk/tokenize/punkt.py", line 1285, in sentences_from_text
    return [text[s:e] for s, e in self.span_tokenize(text, realign_boundaries)]
  File "/home/codelife/.local/lib/python3.5/site-packages/nltk/tokenize/punkt.py", line 1276, in span_tokenize
    return [(sl.start, sl.stop) for sl in slices]
  File "/home/codelife/.local/lib/python3.5/site-packages/nltk/tokenize/punkt.py", line 1276, in <listcomp>
    return [(sl.start, sl.stop) for sl in slices]
  File "/home/codelife/.local/lib/python3.5/site-packages/nltk/tokenize/punkt.py", line 1316, in _realign_boundaries
    for sl1, sl2 in _pair_iter(slices):
  File "/home/codelife/.local/lib/python3.5/site-packages/nltk/tokenize/punkt.py", line 310, in _pair_iter
    prev = next(it)
  File "/home/codelife/.local/lib/python3.5/site-packages/nltk/tokenize/punkt.py", line 1289, in _slices_from_text
    for match in self._lang_vars.period_context_re().finditer(text):
TypeError: expected string or bytes-like object

Here is the code:

from textblob import TextBlob
import nltk     #for cleaning and stop words removal
import pandas as pd     #csv
import numpy

data = pd.read_csv("Sample.csv", usecols=[0])   #reading from csv file
num_rows = data.shape[0]
#print(questions)

# cleaning the data 

count_nan = data.isnull().sum() #number of null entries per column
count_without_nan = count_nan[count_nan==0] #columns with no null entries
data = data[count_without_nan.keys()]   # keep only columns without nulls

data_mat = data.as_matrix(columns=None) #converting to a numpy matrix
print(data_mat)
for question in data_mat:
    words = question.tolist()
    word = nltk.word_tokenize(words)
    print(word)

I figured it was because I am using a numpy array. How do I convert it to a regular Python list?

nltk's word_tokenize() function expects to receive a single string. It returns a list of the tokens that string contains. To apply it to an entire Python list, numpy array, or pandas dataframe, you need to iterate in Python (with a loop or comprehension) or use the numpy or pandas apply* methods. For example, if words is an np.array, you can iterate over it with the following comprehension.

sentences = [nltk.word_tokenize(string) for string in words]

If you have something else, you will need to adapt this code or show us what you have in your question.
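
Applied to the code in the question, a minimal sketch might look like the following. This assumes the first column of Sample.csv contains one string of text per row and that the punkt tokenizer models have been downloaded; both are assumptions, not something the question confirms.

import nltk
import pandas as pd

# nltk.download('punkt')  # needed once before word_tokenize will work

data = pd.read_csv("Sample.csv", usecols=[0])
column = data.iloc[:, 0].dropna()   # a Series of strings, one per row

# Plain Python comprehension, as in the answer above
token_lists = [nltk.word_tokenize(text) for text in column]

# Equivalent pandas form, using the apply method mentioned above
token_series = column.apply(nltk.word_tokenize)

print(token_lists[:3])

Either form passes each string to word_tokenize() individually, which avoids the TypeError: the original loop passed a one-element list (the result of question.tolist()) rather than the string inside it.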
