Preparing data for scikit-learn

Question

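I'm working on a small NLP project on authorship attribution: I have texts from two authors, and I want to tell which of them wrote each one.

I have some preprocessed text (tokenized, POS-tagged, etc.) that I want to load into scikit-learn.

The documents have the following shape: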

Testo   -   SPN Testo   testare+v+indic+pres+nil+1+sing testo+n+m+sing  O
:   -   XPS colon   colon+punc  O
"   -   XPO "   quotation_mark+punc O
Buongiorno  -   I   buongiorno  buongiorno+inter buongiorno+n+m+_   O
a   -   E   a   a+prep  O
tutti   -   PP  tutto   tutto+adj+m+plur+pst+ind tutto+pron+_+m+_+plur+ind  O
.   <eos>   XPS full_stop   full_stop+punc  O
Ci  -   PP  pro loc+pron+loc+_+3+_+clit pro+pron+accdat+_+1+plur+clit   O
sarebbe -   VI  essere  essere+v+cond+pres+nil+2+sing   O
molto   -   B   molto   molto+adj+m+sing+pst+ind

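So it is a tab-separated text file with six columns (word, end-of-sentence marker, part of speech, lemma, morphological information, and named-entity recognition tag).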
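As a rough sketch of how one column of such a file could be turned into a plain string for a text vectorizer, something like the following may work. The helper name `read_column` and the two-line sample are made up for illustration, and the choice of the lemma column (index 3) is only an assumption about which feature is of interest.

```python
import csv
import io


def read_column(lines, column=3):
    """Collect one column (0-based index) from tab-separated token lines.

    `read_column` is a hypothetical helper, not part of scikit-learn.
    Rows that are too short to contain the column are skipped.
    """
    values = []
    # QUOTE_NONE keeps a literal '"' token from being treated as a quote char.
    for row in csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE):
        if len(row) > column:
            values.append(row[column])
    return values


# Two made-up rows in the six-column layout described above.
sample = (
    "Buongiorno\t-\tI\tbuongiorno\tbuongiorno+inter buongiorno+n+m+_\tO\n"
    "a\t-\tE\ta\ta+prep\tO\n"
)

lemmas = read_column(io.StringIO(sample), column=3)
document = " ".join(lemmas)  # one whitespace-joined string per file
print(document)
```

For a real file, an open file handle (opened with `newline=""` and the right encoding) could be passed in place of the `io.StringIO` object. Changing `column` to 2 would select the POS tags instead.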

Each file represents one document to classify.

What is the best way to shape them for scikit-learn?

Answer 1

The structure they use in the scikit-learn examples (https://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html#) is described here: http://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_files.html

Replace this

# Load some categories from the training set
if opts.all_categories:
    categories = None
else:
    categories = [
        'alt.atheism',
        'talk.religion.misc',
        'comp.graphics',
        'sci.space',
    ]

if opts.filtered:
    remove = ('headers', 'footers', 'quotes')
else:
    remove = ()

print("Loading 20 newsgroups dataset for categories:")
print(categories if categories else "all")

data_train = fetch_20newsgroups(subset='train', categories=categories,
                                shuffle=True, random_state=42,
                                remove=remove)

data_test = fetch_20newsgroups(subset='test', categories=categories,
                               shuffle=True, random_state=42,
                               remove=remove)

with data loading statements such as:

from sklearn.datasets import load_files

# Category names; each must match a subdirectory name under train/ and test/
categories = [
    'high',
    'low',
]

print("loading dataset for categories:")
print(categories if categories else "all")

# load_files takes its labels from the subdirectory names
train_path = 'c:/Users/username/Documents/SciKit/train'
data_train = load_files(train_path, categories=categories, encoding='latin1')

test_path = 'c:/Users/username/Documents/SciKit/test'
data_test = load_files(test_path, categories=categories, encoding='latin1')

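and, inside both the train directory and the test directory, create "high" and "low" subdirectories that hold the files for each category.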
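To make the directory layout concrete, here is a minimal sketch that builds it in a temporary folder and loads it back. The toy texts, the file names, and the choice of `TfidfVectorizer` with `LogisticRegression` are assumptions for illustration only; they are not part of the original answer, and any other vectorizer or classifier could take their place.

```python
import tempfile
from pathlib import Path

from sklearn.datasets import load_files
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical toy documents, one list per category; real files would go here.
toy_docs = {
    "high": ["buongiorno a tutti", "ci sarebbe molto da dire"],
    "low": ["testo di prova", "un altro testo breve"],
}

with tempfile.TemporaryDirectory() as root:
    # Build train/<category>/<file>.txt, the layout load_files reads.
    for category, texts in toy_docs.items():
        folder = Path(root) / "train" / category
        folder.mkdir(parents=True)
        for i, text in enumerate(texts):
            (folder / f"doc{i}.txt").write_text(text, encoding="latin1")

    # File contents are read here, so the folder can be removed afterwards.
    data_train = load_files(str(Path(root) / "train"), encoding="latin1")

# Subdirectory names become the class names; target holds their indices.
print(data_train.target_names)

# Turn the raw strings into a sparse feature matrix and fit a classifier.
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(data_train.data)
clf = LogisticRegression().fit(X_train, data_train.target)
```

The test directory would be loaded the same way, then passed through `vectorizer.transform` (not `fit_transform`) before calling `clf.predict`.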
