NLTK: How to create a corpus from csv file
I have a csv file like this:
col1 col2 col3
some text someID some value
some text someID some value
In each row, col1 corresponds to the text of an entire document. I would like to create a corpus from this csv. My aim is to use sklearn's TfidfVectorizer to compute document similarity and extract keywords. So consider:
tfidf = TfidfVectorizer(tokenizer=tokenize, stop_words='english')
tfs = tfidf.fit_transform(<my corpus here>)
so that I can then use:
new_doc = 'here is some text from a new document'
response = tfidf.transform([new_doc])
feature_names = tfidf.get_feature_names()
for col in response.nonzero()[1]:
    print(feature_names[col], ' - ', response[0, col])
How do I create a corpus using nltk? What form/data structure should the corpus take so that it can be supplied to the transform function?
Check out read_csv from the pandas library. Here is the documentation: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html

You can install pandas by running pip install pandas at the command line. Then loading the csv and selecting that column should be as easy as the below:
import pandas as pd

data = pd.read_csv(path_to_csv)
docs = data['col1']  # an iterable of strings is all fit_transform needs
tfs = tfidf.fit_transform(docs)