
NLTK Corpus categories from lists

I am trying to build an NLTK corpus using information from PubMed.

In my first attempt, I successfully built a small function to retrieve the data using the Entrez package, wrote the retrieved article titles (a list of strings) out as a corpus of files (each title as a new file), and created a corpus using each fileid (i.e. the filename) as the category of the document.

Now I have to step up the game: each document of the corpus needs to have a title, an abstract, and the respective MeSH terms (these last ones need to define the categories of the corpus, instead of the categories being derived from the document's filename).

So now I have a few problems that I don't really see how to resolve. I will start backwards, as it may be easier to understand:

1) My corpus reader goes as follows:

from nltk.corpus.reader import CategorizedPlaintextCorpusReader

corpus = CategorizedPlaintextCorpusReader(corpus_root, file_pattern,
                                          cat_pattern=r'(\w+)_.*\.txt')

where cat_pattern is a regular expression for extracting the category names from the fileids arguments, i.e. the names of the files. But now I need to get these categories from the MeSH terms within the file, which leads to the next problem:
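To illustrate what cat_pattern does, here is a minimal sketch with a hypothetical filename that follows the old naming scheme (the filename and category are invented for the example):

```python
import re

# cat_pattern pulls the category out of the filename itself
cat_pattern = r'(\w+)_.*\.txt'

# hypothetical fileid following the old "category_number.txt" scheme
fileid = 'cancer_001.txt'
category = re.match(cat_pattern, fileid).group(1)
print(category)  # -> cancer
```

This works only as long as the category is encoded in the filename, which is exactly what stops working once the categories have to come from the MeSH terms inside the document.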

2) The PubMed query retrieves a batch of information, from which I first took only the titles (the ones I used to generate the corpus), but now I need to retrieve the titles, the abstracts, and the MeSH terms.

The pseudo-code would be something as follows:

papers = []

papers is a list containing all the retrieved articles, together with all the information related to each article. Let's say I then have:

out = []
for paper in papers:
    out.append(paper['TI'])
    out.append(paper['AB'])
    out.append(paper['MH'])

That last part, the ['MH'] entry (the list of MeSH terms), is what I need to use to define the categories of the corpus.

3) After I build the corpus with these three pieces of information, to be able to use my classifier I also need to somehow transform this whole batch of information into this:

# X: a list or iterable of raw strings, each representing a document.
X = [corpus.raw(fileid) for fileid in corpus.fileids()]

Remember that fileid refers to each of the documents of the corpus. This is the code from the first prototype, where each document was composed of a single string (the title); now each "document" must have the title (['TI']), the abstract (['AB']), and the MeSH terms (['MH'] - this one I'm not sure about, because of the next code:)

# y: a list or iterable of labels, which will be label encoded.
y = [corpus.categories(fileid)[0] for fileid in corpus.fileids()]

Here, y represents the labels, which were the filenames; now I need the labels to be the MeSH terms.

I don't know how to make this happen, or even whether it is possible as far as my knowledge goes. Yes, I did search and read the NLTK book tutorials, many pages on how to build NLTK corpora, etc., but nothing seems to fit what I intend to do.

This may be very confusing, but let me know if you need me to rephrase anything. Any help would be appreciated :)

The cat_pattern argument is convenient when the category can be determined from the filename, but in your case it is not enough. Fortunately there are other ways to specify file categories. Write an ad hoc program to figure out the categories of each file in your corpus, and store the results in a file corpus_categories (or whatever; just make sure the name doesn't match the corpus filename pattern, so that you can place it in the corpus folder). Then initialize your reader with cat_file="corpus_categories" instead of cat_pattern:

corpus = CategorizedPlaintextCorpusReader(
                           corpus_root, 
                           file_pattern,
                           cat_file="corpus_categories")

Each line in the category file should have a filename and its category or categories, separated by spaces. Here's a snippet from cats.txt for the reuters corpus:

training/196 earn
training/197 oat corn grain
training/198 money-supply
training/199 acq
training/200 soy-meal soy-oil soybean meal-feed oilseed veg-oil
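The "ad hoc program" can be a few lines that dump the MeSH terms into a file in exactly that format. Here is a sketch under some assumptions: mesh_by_file is a hypothetical dict mapping each filename to its list of MeSH terms (however you collected them from the PubMed records), and since the category file is space-separated, MeSH terms that contain spaces or commas (e.g. "Neoplasms, Experimental") are normalized with underscores:

```python
# hypothetical mapping from filename to its list of MeSH terms
mesh_by_file = {
    'doc_0000.txt': ['Humans', 'Neoplasms, Experimental'],
    'doc_0001.txt': ['Mice'],
}

# write the category file (place it in the corpus folder)
with open('corpus_categories', 'w') as f:
    for fileid, terms in sorted(mesh_by_file.items()):
        # the format is space-separated, so spaces inside a MeSH term
        # must be normalized (here: commas dropped, spaces -> underscores)
        cats = ' '.join(t.replace(',', '').replace(' ', '_') for t in terms)
        f.write('{} {}\n'.format(fileid, cats))
```

After that, corpus.categories(fileid) on a reader initialized with cat_file="corpus_categories" should return the (normalized) MeSH terms for each document.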

I've no idea what you're trying to accomplish in your question 3, but it seems pretty clear that it's unrelated to creating the categorized corpus (and hence you should ask it as a separate question).
