How to build my training data to train an SVM classifier in scikit-learn?
I have sentences, coming from research studies, and their manually extracted word phrases, which are the keywords of the sentences that I want. Now, to build the training data for an SVM classifier, I would like to vectorize each sentence together with its keyword. See the code below.
I was thinking about building a dictionary and then applying DictVectorizer from the sklearn library.
Code:
from sklearn.feature_extraction import DictVectorizer
v = DictVectorizer()
D = [{"sentence":"the laboratory information system was evaluated",
"keyword":"laboratory information system"},
{"sentence":"the electronic health record system was evaluated",
"keyword":"electronic health record system"}]
X = v.fit_transform(D)
print(X)
content = X.toarray()
print(content)
print(v.get_feature_names())
Results:
(0, 1) 1.0
(0, 3) 1.0
(1, 0) 1.0
(1, 2) 1.0
[[0. 1. 0. 1.]
[1. 0. 1. 0.]]
['keyword=electronic health record system', 'keyword=laboratory information system', 'sentence=the electronic health record system was evaluated', 'sentence=the laboratory information system was evaluated']
Is this methodologically correct, or how else can I bring together each sentence with the corresponding manually extracted keyword for vectorizing, in order to obtain the training data? Thanks a lot.
I think it's not ideal to do it this way, as you are using the whole sentence as a single feature. This becomes problematic for a large dataset, because every distinct sentence gets its own one-hot column.
For example,
D = [{"sentence":"This is sentence one",
      "keyword":"key 1"},
     {"sentence":"This is sentence two",
      "keyword":"key 2"},
     {"sentence":"This is sentence three",
      "keyword":"key 3"},
     {"sentence":"This is sentence four",
      "keyword":"key 2"},
     {"sentence":"This is sentence five",
      "keyword":"key 1"}]
X will be
[[1. 0. 0. 0. 0. 1. 0. 0.]
[0. 1. 0. 0. 0. 0. 0. 1.]
[0. 0. 1. 0. 0. 0. 1. 0.]
[0. 1. 0. 0. 1. 0. 0. 0.]
[1. 0. 0. 1. 0. 0. 0. 0.]]
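For reference, here is a minimal sketch that reproduces the matrix above: five illustrative records with distinct sentences and some repeated keywords, run through the same DictVectorizer. The first three columns encode the keywords and the remaining five one-hot encode each distinct sentence, so the matrix grows by one column per unique sentence.

```python
from sklearn.feature_extraction import DictVectorizer

# Five illustrative records: distinct sentences, some keywords repeated.
D = [{"sentence": "This is sentence one", "keyword": "key 1"},
     {"sentence": "This is sentence two", "keyword": "key 2"},
     {"sentence": "This is sentence three", "keyword": "key 3"},
     {"sentence": "This is sentence four", "keyword": "key 2"},
     {"sentence": "This is sentence five", "keyword": "key 1"}]

v = DictVectorizer()
X = v.fit_transform(D)

# Columns 0-2 are keyword=key 1..3; columns 3-7 one-hot encode the
# sentences (feature names sorted alphabetically).
print(X.toarray())
```

With real data, every new sentence in the corpus adds another column, which is exactly the scaling problem the answer points out.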
You could probably just apply TfidfVectorizer from scikit-learn, which will likely pick up the important words in a sentence.
Code:
from sklearn.feature_extraction.text import TfidfVectorizer
sentences = [d['sentence'] for d in D]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(sentences)
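Building on this, here is a sketch of one way the TF-IDF features could feed an SVM, using the keywords as class labels. LinearSVC and the variable names below are my additions for illustration, not part of the original post:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# The two (sentence, keyword) pairs from the question.
D = [{"sentence": "the laboratory information system was evaluated",
      "keyword": "laboratory information system"},
     {"sentence": "the electronic health record system was evaluated",
      "keyword": "electronic health record system"}]

sentences = [d["sentence"] for d in D]
labels = [d["keyword"] for d in D]  # keywords serve as class labels

# TF-IDF features over the sentence words, not over whole sentences.
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(sentences)

# Train a linear SVM on the vectorized sentences.
clf = LinearSVC()
clf.fit(X, labels)

# Classify an unseen sentence; note it must be transformed with the
# already-fitted vectorizer.
pred = clf.predict(vectorizer.transform(
    ["the laboratory information system was assessed"]))
print(pred[0])
```

With only two training sentences this is just a toy, but the pattern (fit the vectorizer once, reuse its `transform` for new text, keywords as labels) is the usual way to assemble SVM training data in scikit-learn.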