
How to make an R tm corpus of 100 million tweets?

I want to make a text corpus of 100 million tweets using R's distributed computing tm package (called tm.plugin.dc). The tweets are stored in a large MySQL table on my laptop. My laptop is old, so I am using a Hadoop cluster that I set up on Amazon EC2.

The tm.plugin.dc documentation from CRAN says that only DirSource is currently supported. The documentation seems to suggest that DirSource allows only one document per file. I need the corpus to treat each tweet as a document. I have 100 million tweets -- does this mean I need to make 100 million files on my old laptop? That seems excessive. Is there a better way?

What I have tried so far:

  1. Make a file dump of the MySQL table as a single (massive) .sql file. Upload the file to S3. Transfer the file from S3 to the cluster. Import the file into Hive using Cloudera's Sqoop tool. Now what? I can't figure out how to make DirSource work with Hive.

  2. Make each tweet into its own XML file on my laptop. But how? My computer is old and can't handle this well (see the rough export sketch after this list). If I could get past that, then I would: upload all 100 million XML files to a folder in Amazon's S3, copy the S3 folder to the Hadoop cluster, and point DirSource to the folder.
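For what it's worth, here is a minimal sketch of the export step I have in mind, assuming the tweets live in a MySQL table called `tweets` with `id` and `text` columns (all placeholder names), fetched in chunks so the old laptop never loads the whole table at once. I use plain .txt files rather than XML for simplicity:

    library(DBI)
    library(RMySQL)

    ## Placeholder connection details and table/column names.
    con <- dbConnect(RMySQL::MySQL(), dbname = "twitter_db",
                     host = "localhost", user = "me", password = "secret")

    res <- dbSendQuery(con, "SELECT id, text FROM tweets")
    dir.create("tweet_files", showWarnings = FALSE)

    ## Fetch 10,000 rows at a time so memory use stays small.
    repeat {
      chunk <- dbFetch(res, n = 10000)
      if (nrow(chunk) == 0) break
      for (i in seq_len(nrow(chunk))) {
        ## One small plain-text file per tweet, named after the tweet id.
        writeLines(chunk$text[i],
                   file.path("tweet_files", paste0(chunk$id[i], ".txt")))
      }
    }

    dbClearResult(res)
    dbDisconnect(con)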

Wouldn't it be easier and more reasonable to make one huge HDFS file with the 100 million tweets and then process them with the standard R tm package?

This approach seems more natural to me, since HDFS was developed for big files and distributed environments, while R is a great analytical tool but has no (or only limited) parallelism. Your approach looks like using the tools for something they were not developed for...
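To illustrate the "one big file" model with plain tm, here is a minimal sketch assuming a file `tweets.txt` with one tweet per line (a small sample, not the full 100 million): `VectorSource` turns every line into its own document, so no one-file-per-tweet layout is needed.

    library(tm)

    ## "tweets.txt" is assumed to hold one tweet per line (a sample file).
    tweets <- readLines("tweets.txt", encoding = "UTF-8")

    ## VectorSource makes every element of the vector its own document,
    ## so each tweet becomes a separate document in the corpus.
    corpus <- VCorpus(VectorSource(tweets))

    ## Typical cleanup steps before building a term-document matrix.
    corpus <- tm_map(corpus, content_transformer(tolower))
    corpus <- tm_map(corpus, removePunctuation)
    corpus <- tm_map(corpus, removeWords, stopwords("english"))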

I would strongly recommend checking this URL: http://www.quora.com/How-can-R-and-Hadoop-be-used-together. It will give you the necessary insight into your problem.

The tm package basically works on a term-and-document model. It creates a term-document matrix (TDM) or document-term matrix (DTM). This matrix contains features such as each term (word) and its frequency in each document. Since you want to analyse Twitter data, you should treat each tweet as a document and then create the TDM or DTM. You can then perform various analyses such as finding associations, finding frequencies, clustering, or calculating TF-IDF measures.
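A minimal sketch of those steps on a toy in-memory sample (the tweets and object names are placeholders; with the full data set the same calls would run against the real corpus):

    library(tm)

    ## Toy corpus standing in for the real tweets.
    tweets <- c("big data with R and Hadoop",
                "the R tm package builds a term document matrix",
                "Hadoop stores big files on HDFS")
    corpus <- VCorpus(VectorSource(tweets))

    ## Term-document matrix: one row per term, one column per tweet.
    tdm <- TermDocumentMatrix(corpus)

    findFreqTerms(tdm, lowfreq = 2)   # terms occurring at least twice overall
    findAssocs(tdm, "hadoop", 0.1)    # terms associated with "hadoop"

    ## TF-IDF weighting instead of raw term frequencies.
    tdm_tfidf <- TermDocumentMatrix(corpus,
                                    control = list(weighting = weightTfIdf))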

You need to build a corpus from a directory source, so you need a base directory that contains the individual documents, which in your case are the tweets.
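For example, a sketch assuming the per-tweet files sit in a local directory called `tweet_files/` (one small .txt file per tweet; the directory name is a placeholder):

    library(tm)

    ## One document is created per file in the directory, i.e. one per tweet.
    src <- DirSource("tweet_files", encoding = "UTF-8")
    corpus <- VCorpus(src, readerControl = list(reader = readPlain))

    length(corpus)        # number of documents = number of tweet files
    inspect(corpus[[1]])  # look at the first tweet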

It depends on the OS you are using. On Windows, what I would do is create a .bat file, or simple JavaScript or Java code, to read the MySQL rows into tweet files and FTP them to a directory on the local file system of the Hadoop box.

Once the files have been FTP'd, we can copy the directory to HDFS using Hadoop's copyFromLocal command.
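A sketch of that last step, driven from R with `system()`; the HDFS paths are placeholders and this assumes the `hadoop` client is on the PATH of the machine running R:

    ## Create the target directory on HDFS and copy the local tweet files up.
    system("hadoop fs -mkdir -p /user/hadoop/tweet_files")
    system("hadoop fs -copyFromLocal tweet_files /user/hadoop/")
    system("hadoop fs -ls /user/hadoop/tweet_files")   # verify the copy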
