Error when converting a corpus of saved tokens in a dataframe column into a gensim dictionary

Question

I saved a list of data tokenized with NLTK to a csv file with a single column. Later, I have to retrieve the tokens and create a keyword dictionary using:

       dictionary = gensim.corpora.Dictionary(column)

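The problem is that when I save the tokenized data to csv, the tokens are stored in single quotes. When I read them back and pass the dataframe column to the gensim method to create the dictionary, it raises an error saying that it expects an array of tokens, not a string. I followed the steps below.

Steps:

The single-column csv file looks like this: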

        Description                  
 0      Key moments included the DOJ description of the FBI affidavit used for the 
        search, warnings about chilling witnesses and inaction from Trump's attorney.
 1      Russian vehicles seen inside turbine hall at Ukraine nuclear plant.
 2      Finnish PM says videos of her partying shouldn't have been made public.
        .......
        and so on

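I read the csv file and then tokenize the data with the following function:

           import nltk
           import pandas as pd

           df = pd.read_csv('news_csv.csv', encoding='latin-1')
           def tokenize(text):
             # Split one description string into a list of NLTK tokens.
             return nltk.word_tokenize(text)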

Now I tokenize the dataframe column and save it to csv again.

      processedData = df['Description'].astype(str).map(tokenize)
      processedData.to_csv('news_tokenized.csv', header=True, index=False)

Now I read the column back and pass it to the gensim method to create the keyword dictionary:

      df1 = pd.read_csv('news_tokenized.csv', encoding='latin-1')
      dictionary = gensim.corpora.Dictionary(df1)

When I run it, it gives the following error:

  TypeError: doc2bow expects an array of unicode tokens on input, not a single string

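Why am I getting this error? I believe the data is saved in the column as comma-separated tokens, and the tokens are also in single quotes. Are the single quotes the problem? If I do not save the tokenized data and pass processedData directly to the gensim method, it creates the dictionary, but not when I save it to csv and read it back.

Important: I have to save it to a csv file because the dataset is very large. The Colab session crashes because it runs out of resources: keyword removal and lemmatization take almost all of the memory, so I cannot continue in the same session. That is why I have to save the data to a csv file and then start a new session to complete the task.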

Answer 1

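You get this error because saving the token lists to a .csv file and then reading them back turns each list into a string.

For example, the first token list in your processed data looks like this: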

['Key','moments','included','the','DOJ','description','of','the','FBI','affidavit','used','for','the','search',',','warnings','about','chilling','witnesses','and','inaction','from','Trump',"'s",'attorney','.']

After storing it in a .csv file and reading it back, however, it has changed:

array(['[\'Key\', \'moments\', \'included\', \'the\', \'DOJ\', \'description\', \'of\', \'the\', \'FBI\', \'affidavit\', \'used\', \'for\', \'the\', \'search\', \',\', \'warnings\', \'about\', \'chilling\', \'witnesses\', \'and\', \'inaction\', \'from\', \'Trump\', "\'s", \'attorney\', \'.\']'],
      dtype=object)

It is no longer a list of strings but an array with a single element, which is a string (containing the original list):

print(type(df1.iloc[0].values))
print(len(df1.iloc[0].values))
print(type(df1.iloc[0].values[0]))

Output:

<class 'numpy.ndarray'>
1
<class 'str'>
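
The cause can be reproduced without pandas. The sketch below is an illustration using only the standard library, not the asker's code: a CSV cell can hold only text, so the writer stores `str(tokens)`, and reading returns that text unchanged. The token list is a shortened, made-up sample; pandas' `to_csv` and `read_csv` behave analogously for a column of Python lists.

```python
import csv
import io

tokens = ["Key", "moments", "Trump", "'s", "attorney", "."]

# Write the list into a single CSV cell; the csv module stringifies it with str().
buffer = io.StringIO()
csv.writer(buffer).writerow([tokens])

# Read the cell back: it is the text of the list, not the list itself.
buffer.seek(0)
cell = next(csv.reader(buffer))[0]

print(type(cell).__name__)   # str
print(cell == str(tokens))   # True
```

So the single quotes are not a formatting glitch; they are part of the list's textual representation, which has to be parsed before gensim can use it.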

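The simplest solution would be not to store the data in a .csv file in the first place and to use it directly via dictionary = gensim.corpora.Dictionary(processedData). But because of the Colab session problem you have to store it, so you must parse each string in the column back into a list: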

import ast

list_of_rows = []

for row in df1["Description"]:
  list_of_rows.append(ast.literal_eval(row))

# Put it into a pandas DataFrame only for visualization on Stack Overflow:
pd.DataFrame(list_of_rows)

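Output:

index	0	1	2	3	4	5	6	7	8	9	10	11	12	13	14	15	16	17	18	19
0	Key	moments	included	the	DOJ	description	of	the	FBI	affidavit	used	for	the	search	,	warnings	about	chilling	witnesses	and
1	Russian	vehicles	seen	inside	turbine	hall	at	Ukraine	nuclear	plant	.
2	Finnish	PM	says	videos	of	her	partying	should	n't	have	been	made	public	.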
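
The parsing can also be done while the file is being read, through the `converters` parameter of `pd.read_csv`. This is a minimal sketch under assumptions: the two token lists are made-up sample data, and an in-memory buffer stands in for news_tokenized.csv so that the snippet is self-contained.

```python
import ast
import io

import pandas as pd

# Made-up stand-in for the tokenized column saved by the asker.
tokenized = pd.Series(
    [["Key", "moments", "'s"], ["Russian", "vehicles", "."]],
    name="Description",
)

# Save to an in-memory CSV instead of a file on disk.
buffer = io.StringIO()
tokenized.to_csv(buffer, header=True, index=False)
buffer.seek(0)

# ast.literal_eval is applied to every cell of the Description column on load.
df1 = pd.read_csv(buffer, converters={"Description": ast.literal_eval})
list_of_rows = df1["Description"].tolist()

print(type(list_of_rows[0]).__name__)  # list
```

`ast.literal_eval` only evaluates Python literals, which makes it a safer choice than `eval` for text read from a file.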

Now each row is a list of tokens again, and you can build your gensim dictionary:

dictionary = gensim.corpora.Dictionary(list_of_rows)

Test:

print(dictionary)

Output:

Dictionary(46 unique tokens: ["'s", ',', '.', 'DOJ', 'FBI']...)
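
Since memory is the constraint in the question, it may also help to avoid building the full list of rows. The sketch below is one possible approach, not part of the original answer: `iter_token_lists` is a hypothetical helper that yields one parsed token list at a time using only the standard library, and the small temporary file stands in for news_tokenized.csv. As far as I know, `gensim.corpora.Dictionary` accepts any iterable of token lists, so the generator could be passed to it directly; that call is left as a comment because it is not exercised here.

```python
import ast
import csv
import os
import tempfile


def iter_token_lists(path, column="Description", encoding="latin-1"):
    """Yield one token list per CSV row without loading the whole file."""
    with open(path, newline="", encoding=encoding) as handle:
        for row in csv.DictReader(handle):
            yield ast.literal_eval(row[column])


# Demo on a tiny made-up file with the same layout as the tokenized CSV.
sample = [["Key", "moments", "'s"], ["Russian", "vehicles", "."]]
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "tokens_demo.csv")
    with open(path, "w", newline="", encoding="latin-1") as handle:
        writer = csv.writer(handle)
        writer.writerow(["Description"])
        for tokens in sample:
            writer.writerow([str(tokens)])
    rows = list(iter_token_lists(path))

print(rows)

# Assumed usage with the real file (not run here):
# dictionary = gensim.corpora.Dictionary(iter_token_lists("news_tokenized.csv"))
```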
