

Hadoop mapreduce CSV as key : word

I couldn't find the answer to my question; if there is a similar post, please refer me to it.

I have a CSV file that I am trying to perform a mapreduce on. The CSV has two columns: Book Title | Synopsis. I want to be able to perform a mapreduce on each book and have a count of the words in each book; thus, I would like the output to be: Book Title : Token.

So far, I have attempted to use the following code to achieve this:

    // For each pair of adjacent tokens, emit "previous:current" with a count of 1.
    String firstBook = null;
    while (itr.hasMoreTokens()) {
        String secondBook = itr.nextToken();
        if (firstBook != null) {
            word.set(firstBook + ":" + secondBook);
            context.write(word, one);
        }
        firstBook = secondBook;
    }

This sometimes outputs the following: word : title

In addition, it limits the analysis I can do, as this is the logic I would like to use to perform an analysis of bigrams in each synopsis.

Is there a way that I can isolate each book title and just perform the mapreduce on the 'Synopsis' column of the CSV? If so, how would I do this and obtain the desired output?

Many thanks in advance.

UPDATE

The code is modified from Hadoop's WordCount example; the only change is in the "map" section and is shown above. You can find the input data here.

Representation of the CSV file:

Book title, Synopsis
A short history of nearly everything, Bill Byrson describes himself as a reluctant traveller...
Reclaiming economic development, There is no alternative to neoliberal economics - or so it appeared...

-> Note: I have shortened the synopses.

thus, I would like the output to be: Book Title : Token.

If you copied the word count example, you're only writing every two tokens followed by the number 1. It doesn't look like you're taking the titles, only the tokens of the synopsis. But you've cut off the part where you get the tokenizer, so it's hard to tell.

Note: If a book title contains commas, you'll end up with part of the title as part of the synopsis with your current approach. If possible, you should make the title column quoted, or better, not use commas (or any other common delimiter) between the columns if that delimiter is going to appear in at least the first column.
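For illustration only (not from the original post), splitting each line with a limit of two keeps everything after the first comma together, which protects commas inside the synopsis; a comma inside the title itself would still require quoting or a different delimiter. The class name SplitOnFirstComma is made up:

    // Illustrative sketch: split on the first comma only, so the synopsis may
    // itself contain commas. A comma inside the title would still break this.
    public class SplitOnFirstComma {
        public static void main(String[] args) {
            String line = "Reclaiming economic development,"
                    + "There is no alternative to neoliberal economics - or so it appeared...";
            String[] columns = line.split(",", 2);  // at most two fields
            System.out.println("title:    " + columns[0]);
            System.out.println("synopsis: " + columns[1]);
        }
    }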

perform an analysis of bigrams in each synopsis.

If you want to do that type of analysis, I would recommend you clean up the columns first - remove capitalization and punctuation. Stemming the words might also produce better output.
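As a rough sketch of that kind of clean-up (the helper name normalize is made up), lower-casing each token and stripping punctuation before counting could look like this; stemming would need an extra library and is not shown:

    // Rough clean-up sketch: lower-case each token and drop anything that is
    // not a letter or digit, so "Traveller..." and "traveller" count as one word.
    public class Normalize {
        static String normalize(String token) {
            return token.toLowerCase().replaceAll("[^a-z0-9]", "");
        }

        public static void main(String[] args) {
            System.out.println(normalize("Traveller..."));  // traveller
            System.out.println(normalize("economics,"));    // economics
        }
    }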

Is there a way that I can isolate each book title

Sure, put an if statement on the first column targeting a specific book, and only write to the context in that condition.

Otherwise, if your mapper writes only the book title as the key, then the titles will be isolated as part of the reduce function.
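A minimal sketch of that "title as key" idea with the plain TextInputFormat, assuming the line is split on the first comma; the class name TitleTokenMapper is made up, and the optional filter from the previous suggestion is shown as a comment:

    // Hypothetical mapper: split each line on the first comma, write the title
    // as the key and each synopsis token as the value, so the reducer receives
    // all tokens grouped per book.
    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class TitleTokenMapper extends Mapper<LongWritable, Text, Text, Text> {
        private final Text title = new Text();
        private final Text token = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            String[] columns = line.toString().split(",", 2);
            if (columns.length < 2) {
                return;  // skip malformed lines (e.g. a header row)
            }
            // Optional: isolate a single book by filtering on the title, e.g.
            // if (!columns[0].equals("Reclaiming economic development")) return;
            title.set(columns[0].trim());
            StringTokenizer itr = new StringTokenizer(columns[1]);
            while (itr.hasMoreTokens()) {
                token.set(itr.nextToken());
                context.write(title, token);  // reducer sees tokens grouped by title
            }
        }
    }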

This was solved by using the "KeyValueTextInputFormat" class; there are several tutorials on here that specifically relate to this class. This allowed me to separate the CSV file, resulting in a key : value pair (in my case, book title : synopsis). You can then process the "value" as normal in the map and pass it through to the reduce stage as "key : token".
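A minimal sketch of how that could look, assuming the title and synopsis are separated by the first comma on each line; the class name BookWordCount and the reuse of the library IntSumReducer are illustrative choices rather than the original poster's exact code, and the separator property shown is the Hadoop 2.x name:

    // KeyValueTextInputFormat hands the mapper the title as the key and the
    // synopsis as the value, so no manual splitting is needed in the map.
    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.mapreduce.lib.reduce.IntSumReducer;

    public class BookWordCount {

        public static class SynopsisMapper extends Mapper<Text, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            protected void map(Text title, Text synopsis, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer itr = new StringTokenizer(synopsis.toString());
                while (itr.hasMoreTokens()) {
                    // e.g. "Reclaiming economic development : neoliberal" -> 1
                    word.set(title.toString() + " : " + itr.nextToken());
                    context.write(word, ONE);
                }
            }
        }

        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Tell the record reader to split each line on the first comma
            // (the default separator is a tab).
            conf.set("mapreduce.input.keyvaluelinerecordreader.key.value.separator", ",");
            Job job = Job.getInstance(conf, "book word count");
            job.setJarByClass(BookWordCount.class);
            job.setInputFormatClass(KeyValueTextInputFormat.class);
            job.setMapperClass(SynopsisMapper.class);
            job.setCombinerClass(IntSumReducer.class);
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }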
