
Splitting / tokenizing a corpus with R and Quanteda

I am working on a project for NLP. I need to take some blogs, news and tweets (you have probably heard of this capstone already) in .txt files and create n-gram frequencies.

I experimented with the steps needed to take the .txt files to a frequency data frame for analysis:

Read > Convert to corpus > Clean corpus > Tokenize > Convert to dfm > Convert to df

The bottlenecks in the process were the tokenize and convert-to-dfm steps (they took over 5x longer than the rest).
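For reference, here is a minimal sketch of the pipeline I am timing (the file name and cleaning options are just what I happen to use, not a recommendation):

    library(quanteda)

    raw  <- readLines("en_US.blogs.txt", encoding = "UTF-8", skipNul = TRUE)
    corp <- corpus(raw)                                   # convert to corpus
    toks <- tokens(corp, remove_punct = TRUE,             # clean + tokenize (slow step)
                   remove_numbers = TRUE, remove_symbols = TRUE)
    mat  <- dfm(toks)                                     # convert to dfm (slow step)
    freq <- data.frame(feature = featnames(mat),          # convert to a frequency data frame
                       count   = colSums(mat))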

I had two choices:

1. Split the cleaned corpus and tokenize it piece by piece
2. Split-read the .txt files from the beginning

Option 1 seemed best, but so far I have not found a function or package that can do this the way I want. So I will have to write a long piece of code that split-reads from the beginning in 20 chunks (due to my computing constraints).

Is there a way to split a corpus ("corpus" "list") created with the quanteda package into chunks (of lines defined by me), so that I can tokenize and convert to a dfm in a "streaming" kind of way?

I think the package you will find most useful currently is the tm package. It is a pretty complex but thorough package, even though it is still in an experimental state at version 0.7.1. Without more detail I can't give you more exact usage info, because it all depends on your sources, how you want to process the corpus, and other factors. The gist of what you'll need to do is first create a reader object dependent on your source material. It can handle web input, plain text, PDF and other formats. Then you can use one of the Corpus creation functions, depending on whether you want to keep the whole thing in memory, etc. You can then use the various 'tidying' functions to operate on the entire corpus as though each document were an element in a vector. You can do the same with tokenizing. With a few more specifics we can give you more specific answers.
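For example, a rough sketch with tm (the directory path is a placeholder and the cleaning steps are only illustrative; adjust them to your sources):

    library(tm)

    # Build a corpus from a directory of plain-text files
    src  <- DirSource("path/to/txt", encoding = "UTF-8")
    corp <- VCorpus(src, readerControl = list(reader = readPlain, language = "en"))

    # 'Tidying' functions operate on every document in the corpus
    corp <- tm_map(corp, content_transformer(tolower))
    corp <- tm_map(corp, removePunctuation)
    corp <- tm_map(corp, removeNumbers)
    corp <- tm_map(corp, stripWhitespace)

    # Tokenizing works the same element-wise way
    toks <- lapply(corp, function(doc) scan_tokenizer(content(doc)))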

Since this question hasn't been directly answered, I am reposting related content from the article I wrote in 2016 as a Community Mentor for the JHU Capstone, Capstone n-grams: how much processing power is required?

Overview

Students in the Johns Hopkins University Data Science Specialization Capstone course typically struggle with the course project because of the amount of memory consumed by the objects needed to analyze text. The question asks about the best approach for processing the 4+ million documents that are in the raw data files. The short answer to this question is that it depends on the amount of RAM on one's machine. Since R objects must reside in RAM, one must understand the amount of RAM consumed by the objects being processed.

A machine with 16 Gb of RAM is required to process all of the data from the three files without processing it in smaller chunks or working from a random sample of the data. My testing indicates that the working memory needed to process the files is approximately 1.5 - 3 times the size of the object output by the quanteda::tokens_ngrams() function from quanteda version 0.99.22; for example, a 1 Gb tokenized corpus consumes about 9 Gb of RAM to generate a 4 Gb n-gram object. Note that quanteda automatically uses multiple threads if your computer has multiple cores / threads.
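As a rough way to check these figures on your own machine, object.size() reports the RAM an object occupies. The snippet below is only a sketch using quanteda's small built-in inaugural-address corpus, so the numbers will be far smaller than those quoted above:

    library(quanteda)

    toks   <- tokens(data_corpus_inaugural)        # small built-in corpus, for illustration
    ngrams <- tokens_ngrams(toks, n = 2)

    # Report how much RAM each object occupies
    format(object.size(toks),   units = "Mb")
    format(object.size(ngrams), units = "Mb")

    # quanteda picks up multiple cores automatically; the thread count can also be set
    quanteda_options(threads = 2)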

To help reduce the guesswork in memory utilization, here is a summary of the amount of RAM consumed by the objects required to analyze the files for the Swift Key sponsored capstone: predicting text.

Raw data

There are three raw data files used in the Capstone project. Once loaded into memory using a text processing function such as readLines() or readr::read_lines(), the resulting object sizes are as follows.

  1. en_US.blogs.txt: 249 Mb
  2. en_US.news.txt: 250 Mb
  3. en_US.twitter.txt: 301 Mb

These files must be joined into a single object and converted to a corpus. Together they consume about 800 Mb of RAM.

When converted to a corpus with quanteda::corpus(), the resulting object is 1.1 Gb in size.
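A sketch of the loading and conversion steps above (object.size() is used only so you can verify the figures on your own machine):

    library(readr)
    library(quanteda)

    blogs   <- read_lines("en_US.blogs.txt")
    news    <- read_lines("en_US.news.txt")
    twitter <- read_lines("en_US.twitter.txt")

    # Join the three character vectors into a single object (~800 Mb together)
    all_text <- c(blogs, news, twitter)
    rm(blogs, news, twitter); gc()

    # Convert to a quanteda corpus (~1.1 Gb)
    corp <- corpus(all_text)
    format(object.size(corp), units = "Gb")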

N-gram object sizes

To maximize the amount of RAM available for n-gram processing, once the corpus is generated, one must remove all objects from memory other than the tokenized corpus used as input to tokens_ngrams(). The object sizes for the various n-grams are as follows.

  1. 2-grams: 6.3 Gb
  2. 3-grams: 6.5 Gb
  3. 4-grams: 6.5 Gb
  4. 5-grams: 6.3 Gb
  5. 6-grams: 6.1 Gb
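A sketch of the memory discipline described above, assuming the tokenized corpus is stored in an object named toks (the name is hypothetical):

    # Drop everything except the tokens object and reclaim the memory
    # before calling tokens_ngrams()
    rm(list = setdiff(ls(), "toks"))
    gc()

    ngrams_3 <- quanteda::tokens_ngrams(toks, n = 3)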

Working with less memory

I was able to process a 25% sample of the capstone data on a MacBook Pro with 8 Gb of RAM, and a 5% sample on an HP Chromebook running Ubuntu Linux with 4 Gb of RAM. Adding to Ken Benoit's comment on the original question, one can assign a numeric group (e.g. repeating IDs of 1 - 20 to split the data into 20 groups) and then use the corpus_segment() function to split the corpus on group ID. The groups can then be processed individually via an apply() function to generate the n-grams. A general process to generate all of the required n-grams is represented in the following pseudocode.

 for each group in the corpus
     for each size n-gram 
           1. generate n-grams
           2. write to file
           3. rm() n-gram object

This pseudocode can be implemented with a couple of apply() functions.
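One possible implementation, sketched with lapply() and corpus_subset() rather than corpus_segment() (the object name corp, the group docvar, and the output file names are assumptions for illustration):

    library(quanteda)

    # Assign repeating group IDs 1-20 as a docvar on the full corpus 'corp'
    docvars(corp, "group") <- rep_len(1:20, ndoc(corp))

    invisible(lapply(1:20, function(g) {
      toks <- tokens(corpus_subset(corp, group == g), remove_punct = TRUE)
      lapply(2:6, function(n) {
        ng <- tokens_ngrams(toks, n = n)                         # 1. generate n-grams
        saveRDS(ng, sprintf("ngrams_group%02d_n%d.rds", g, n))   # 2. write to file
        rm(ng); gc()                                             # 3. free the n-gram object
      })
      rm(toks); gc()
    }))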

Object sizes: 25% sample

  1. 2-grams: 2.0 Gb
  2. 3-grams: 2.9 Gb
  3. 4-grams: 3.6 Gb
  4. 5-grams: 3.9 Gb
  5. 6-grams: 4.0 Gb

Object sizes: 5% sample

  1. 2-grams: 492 Mb
  2. 3-grams: 649 Mb
  3. 4-grams: 740 Mb
  4. 5-grams: 747 Mb
  5. 6-grams: 733 Mb
