
NLTK: How to create a corpus from csv file

I have a csv file as

col1         col2      col3

some text    someID    some value
some text    someID    some value

In each row, col1 corresponds to the text of an entire document. I would like to create a corpus from this csv. My aim is to use sklearn's TfidfVectorizer to compute document similarity and do keyword extraction. So consider

tfidf = TfidfVectorizer(tokenizer=tokenize, stop_words='english')
tfs = tfidf.fit_transform(<my corpus here>)

so that I can then use

query = 'here is some text from a new document'  # renamed from str to avoid shadowing the builtin
response = tfidf.transform([query])
feature_names = tfidf.get_feature_names()
for col in response.nonzero()[1]:
    print(feature_names[col], '-', response[0, col])

How do I create a corpus using nltk? What form/data structure should the corpus be so that it can be supplied to the transform function?

Check out read_csv from the pandas library. Here is the documentation: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html

You can install pandas by running pip install pandas at the command line. Then loading the csv and selecting that column should be as easy as the below:

import pandas as pd

data = pd.read_csv(path_to_csv)
docs = data['col1']

tfs = tfidf.fit_transform(docs)
