scikit-learn: How to use RandomizedSearchCV on a nested list consisting of one list?

I have built a sentence boundary detection classifier. For the sequence labeling I used a conditional random field. For hyperparameter optimization I would like to use RandomizedSearchCV. My training data consists of 6 annotated texts, which I merge into a single token list. For the implementation I followed an example from the documentation. Here is my simplified code:

import sklearn_crfsuite
from sklearn_crfsuite import metrics
from sklearn.metrics import make_scorer
from sklearn.model_selection import RandomizedSearchCV
from sklearn.model_selection import train_test_split
import scipy.stats

#my tokenlist has the length n
X_train = [feature_dict_token_1, ... , feature_dict_token_n]
# 3 types of tags, B-SEN for begin of sentence; E-SEN for end of sentence; O-Others
y_train = [tag_token_1, ..., tag_token_n]

# define fixed parameters and parameters to search
crf = sklearn_crfsuite.CRF(
    algorithm='lbfgs',
    max_iterations=100,
    all_possible_transitions=True
)
params_space = {
    'c1': scipy.stats.expon(scale=0.5),
    'c2': scipy.stats.expon(scale=0.05),
}

labels = ['B-SEN', 'E-SEN', 'O']

# use F1-score for evaluation
f1_scorer = make_scorer(metrics.flat_f1_score,
                        average='weighted', labels=labels)

# search
rs = RandomizedSearchCV(crf, params_space,
                        cv=3,
                        verbose=1,
                        n_jobs=-1,
                        n_iter=50,
                        scoring=f1_scorer)
rs.fit([X_train], [y_train])

I used rs.fit([X_train], [y_train]) instead of rs.fit(X_train, y_train) because the documentation of crf.fit says that it needs a list of lists:

fit(X, y, X_dev=None, y_dev=None)

Parameters:
- X (list of lists of dicts) – Feature dicts for several documents (in a python-crfsuite format).
- y (list of lists of strings) – Labels for several documents.
- X_dev ((optional) list of lists of dicts) – Feature dicts used for testing.
- y_dev ((optional) list of lists of strings) – Labels corresponding to X_dev.

But when I pass a list of lists, I get this error:

ValueError: Cannot have number of splits n_splits=5 greater than the number of samples: n_samples=1

I understand that this happens because I pass [X_train] and [y_train], and cross-validation cannot be applied to a list consisting of a single list. But with plain X_train and y_train, crf.fit cannot handle the input. How can I fix this?

According to the official tutorial, your train/test sets (i.e., X_train, X_test) should be a list of lists of dictionaries. For example:

[[{'bias': 1.0,
   'word.lower()': 'melbourne',
   'word[-3:]': 'rne',
   'word[-2:]': 'ne',
   'word.isupper()': False,
   'word.istitle()': True,
   'word.isdigit()': False,
   'postag': 'NP'},
  {'bias': 1.0,
   'word.lower()': '(',
   'word[-3:]': '(',
   'word[-2:]': '(',
   'word.isupper()': False,
   'word.istitle()': False,
   'word.isdigit()': False,
   'postag': 'Fpa'},
  ...],
 [{'bias': 1.0,
   'word.lower()': '-',
   'word[-3:]': '-',
   'word[-2:]': '-',
   'word.isupper()': False,
   'word.istitle()': False,
   'word.isdigit()': False,
   'postag': 'Fg',
   'postag[:2]': 'Fg'},
  {'bias': 1.0,
   'word.lower()': '25',
   'word[-3:]': '25',
   'word[-2:]': '25',
   'word.isupper()': False,
   'word.istitle()': False,
   'word.isdigit()': True,
   'postag': 'Z'}]]

The label sets (i.e., y_train and y_test) should be a list of lists of strings. For instance:

[['B-LOC', 'I-LOC'], ['B-ORG', 'O']]
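
To see why wrapping the flat token list in an extra list leads to the n_samples=1 error, it can help to count what cross-validation treats as a sample: each inner list (one sequence) counts as one sample. The sketch below is plain Python; count_samples is a hypothetical helper introduced here for illustration, not part of scikit-learn or sklearn-crfsuite, and the feature dicts are toy placeholders.

```python
def count_samples(X, y):
    """Return the number of sequences, checking that X and y are aligned.

    Hypothetical helper: X is a list of sequences (lists of feature dicts),
    y is a list of sequences (lists of tag strings).
    """
    if len(X) != len(y):
        raise ValueError("X and y must contain the same number of sequences")
    for x_seq, y_seq in zip(X, y):
        if len(x_seq) != len(y_seq):
            raise ValueError("each sequence needs one label per token")
    return len(X)


# Toy flat token list: four tokens with made-up features and tags.
flat_X = [{"word.lower()": w} for w in ["hello", ".", "bye", "."]]
flat_y = ["B-SEN", "E-SEN", "B-SEN", "E-SEN"]

# Wrapping the flat lists yields exactly one sequence, so k-fold CV has nothing to split.
wrapped = count_samples([flat_X], [flat_y])

# Two separate sequences yield two samples.
split = count_samples([flat_X[:2], flat_X[2:]], [flat_y[:2], flat_y[2:]])

print(wrapped, split)
```

The number of sequences has to be at least the cv value passed to RandomizedSearchCV, which is why a single wrapped list cannot work.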

Then fit the model as usual:

rs.fit(X_train, y_train)

See the tutorial mentioned above for how this works.
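
If the six texts have already been merged into one flat token list, one possible way back to a list of sequences is to cut the flat lists into fixed-length chunks. This is only a sketch under assumptions: chunk_sequence and the chunk size are hypothetical choices, not something the tutorial prescribes, and keeping each of the six annotated texts as its own sequence may be the more natural option.

```python
def chunk_sequence(items, size):
    """Split a flat list into consecutive chunks of at most `size` items."""
    if size < 1:
        raise ValueError("size must be a positive integer")
    return [items[i:i + size] for i in range(0, len(items), size)]


# Toy stand-ins for the flat feature dicts and tags (assumed data).
flat_X = [{"bias": 1.0, "token_index": i} for i in range(10)]
flat_y = ["B-SEN" if i % 5 == 0 else "E-SEN" if i % 5 == 4 else "O"
          for i in range(10)]

# Use the same chunk size for X and y so tokens and tags stay aligned.
X_seqs = chunk_sequence(flat_X, 4)
y_seqs = chunk_sequence(flat_y, 4)

print([len(seq) for seq in X_seqs])
# X_seqs and y_seqs could then be passed to the search as rs.fit(X_seqs, y_seqs).
```

Fixed-length chunks can cut through a sentence, so the chunk size is a trade-off between having enough sequences for the chosen cv value and keeping enough context in each one.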
