
NLTK: How to create a corpus from csv file

I have a csv file as

col1         col2      col3

some text    someID    some value
some text    someID    some value

In each row, col1 corresponds to the text of an entire document. I would like to create a corpus from this csv. My aim is to use sklearn's TfidfVectorizer to compute document similarity and do keyword extraction. So consider

tfidf = TfidfVectorizer(tokenizer=tokenize, stop_words='english')
tfs = tfidf.fit_transform(<my corpus here>)

so that I can then use

query = 'here is some text from a new document'  # renamed from str to avoid shadowing the builtin
response = tfidf.transform([query])
feature_names = tfidf.get_feature_names()
for col in response.nonzero()[1]:
    print(feature_names[col], '-', response[0, col])

How do I create a corpus using nltk? What form/data structure should the corpus be so that it can be supplied to the transform function?

Check out read_csv from the pandas library. Here is the documentation: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html

You can install pandas by running pip install pandas at the command line. Then loading the csv and selecting that column should be as easy as the below:

import pandas as pd

data = pd.read_csv(path_to_csv)
docs = data['col1']

tfs = tfidf.fit_transform(docs)
