
Hadoop MapReduce Word (Hashtag) count with similar word grouping not working

I've been trying to create a Twitter hashtag-count Hadoop program. I've successfully extracted the text, gotten the hashtags, and started trying to count them. One of the earliest problems I encountered is that many hashtags are extremely similar (test, tests, tests!, T-est, etc.).

I started by clearing the String of all special characters and removing all spaces inside the hashtag. But the problem persisted with cases like ("hawk", "hawk", "hawks") and so on. I implemented Dice's Coefficient algorithm in a separate class as follows:
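For example, "hawk" uppercased yields the adjacent letter pairs {HA, AW, WK} and "hawks" yields {HA, AW, WK, KS}; with 3 shared pairs out of 3 + 4 total, Dice's coefficient is 2 * 3 / 7 ≈ 0.86.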

//Using Dice's Coefficient algorithm
public class WordSimilarity {


    public static boolean isStringSimilar(String str1,String str2){
        return doComparison(str1,str2)>=Analyzer.getSimilarity();
    }

    /** @return lexical similarity value in the range [0,1] */
    private static double doComparison(String str1, String str2) {
        // If the strings are too small, do not compare them at all.
        try {
            if(str1.length()>3 && str2.length()>3) {
                ArrayList<String> pairs1 = wordLetterPairs(str1.toUpperCase());
                ArrayList<String> pairs2 = wordLetterPairs(str2.toUpperCase());
                int intersection = 0;
                int union = pairs1.size() + pairs2.size();
                for (int i = 0; i < pairs1.size(); i++) {
                    String pair1 = pairs1.get(i);
                    for (int j = 0; j < pairs2.size(); j++) {
                        String pair2 = pairs2.get(j);
                        if (pair1.equals(pair2)) {
                            intersection++;
                            pairs2.remove(j);
                            break;
                        }
                    }
                }
                return (2.0 * intersection) / union;
            }
            else{
                return 0;
            }
        }catch(NegativeArraySizeException ex){
            // letterPairs throws this when a token is empty (length 0 yields a pairs array of size -1)
            return 0;
        }
    }


    /** @return an ArrayList of 2-character Strings. */
    private static ArrayList<String> wordLetterPairs(String str){
        ArrayList<String> allPairs = new ArrayList<>();
        // Tokenize the string and put the tokens/words into an array
        String[] words = str.split("\\s");
        // For each word
        for(int w=0; w<words.length;w++){
            // Find the pairs of characters
            String[] pairsInWord = letterPairs(words[w]);
            for(int p=0;p<pairsInWord.length;p++){
                allPairs.add(pairsInWord[p]);
            }
        }
        return allPairs;
    }

    /** @return an array of adjacent letter pairs contained in the input string */
    private static String[] letterPairs(String str){
        int numPairs = str.length() -1;
        String[] pairs = new String[numPairs];
        for(int i=0; i<numPairs;i++){
            pairs[i]=str.substring(i,i+2);
        }
        return pairs;
    }

}

tl;dr Compare two words and return a number between 0 and 1 indicating how similar the two Strings are.
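As a quick sanity check of the class above (the 0.7 threshold below is an assumption; the actual value comes from Analyzer.getSimilarity()):

    // "TEST" -> {TE, ES, ST}, "TESTS" -> {TE, ES, ST, TS}: 2*3/7 ≈ 0.86, above a 0.7 threshold
    WordSimilarity.isStringSimilar("test", "tests");  // true
    // "TEST" and "HAWK" share no letter pairs, so the score is 0
    WordSimilarity.isStringSimilar("test", "hawk");   // false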

I then created a custom WritableComparable (I intended to use it as a value elsewhere in the project too, though for now it is only a key):

public class Hashtag implements WritableComparable<Hashtag> {

    private Text hashtag;

    public Hashtag(){
        this.hashtag = new Text();
    }

    public Hashtag(String hashtag) {
        this.hashtag = new Text(hashtag);
    }

    public Text getHashtag() {
        return hashtag;
    }

    public void setHashtag(String hashtag) {
        // Remove characters that add no information to the analysis, but cause problems to the result
        this.hashtag = new Text(hashtag);
    }

    public void setHashtag(Text hashtag) {
        this.hashtag = hashtag;
    }

    // compareTo uses the WordSimilarity algorithm to determine if the hashtags are similar.
    // If they are, they are considered equal.
    @Override
    public int compareTo(Hashtag o) {
        if(o.getHashtag().toString().equalsIgnoreCase(this.getHashtag().toString())){
            return 0;
        }else if(WordSimilarity.isStringSimilar(this.hashtag.toString(),o.hashtag.toString())){
            return 0;
        }else {
            return this.hashtag.toString().compareTo(o.getHashtag().toString());
        }
    }

    @Override
    public String toString() {
        return this.hashtag.toString();
    }

    @Override
    public void write(DataOutput dataOutput) throws IOException {
        this.hashtag.write(dataOutput);
    }

    @Override
    public void readFields(DataInput dataInput) throws IOException {
        this.hashtag.readFields(dataInput);
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof Hashtag)) return false;
        Hashtag hashtag1 = (Hashtag) o;
        return WordSimilarity.isStringSimilar(this.getHashtag().toString(),hashtag1.getHashtag().toString());
    }

    @Override
    public int hashCode() {
        return Objects.hash(getHashtag());
    }

}
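For illustration, here is how this comparator behaves on a near-duplicate pair (assuming the similarity threshold is below the pair's score of roughly 0.89):

    Hashtag a = new Hashtag("#hawk");
    Hashtag b = new Hashtag("#hawks");
    a.compareTo(b);  // 0 -- WordSimilarity considers them similar, so the keys count as equal
    a.equals(b);     // true, for the same reason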

And finally, the MapReduce code:

public class HashTagCounter {

    private final static IntWritable one = new IntWritable(1);

    public static class HashtagCountMapper extends Mapper<Object, Text, Hashtag, IntWritable> {

        public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
            //If the line does not start with '{', it is not a valid JSON. Ignore.
            if (value.toString().startsWith("{")) {
                Status tweet = null;
                try {
                    //Create a status object from Raw JSON
                    tweet = TwitterObjectFactory.createStatus(value.toString());
                    if (tweet!=null && tweet.getText() != null) {
                        StringTokenizer itr = new StringTokenizer(tweet.getText());
                        while (itr.hasMoreTokens()) {
                            String temp = itr.nextToken();
                            //Check only hashtags
                            if (temp.startsWith("#") && temp.length()>=3 &&  LanguageChecker.checkIfStringIsInLatin(temp)){
                                temp = purifyString(temp);
                                context.write(new Hashtag('#'+temp), one);
                            }
                        }
                    }
                } catch (TwitterException tex) {
                    System.err.println("Twitter Exception thrown: "+ tex.getErrorMessage());
                }
            }
        }
    }

    public static class HashtagCountCombiner extends Reducer<Hashtag, IntWritable, Hashtag, IntWritable> {

        private IntWritable result = new IntWritable();

        @Override
        public void reduce(Hashtag key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static class HashtagCountReducer extends Reducer<Hashtag, IntWritable, Hashtag, IntWritable> {

        private IntWritable result = new IntWritable();

        @Override
        public void reduce(Hashtag key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    private static String purifyString(String s){
        // Strip characters matched by the Analyzer.PURE_TEXT pattern and lowercase the rest
        s = s.replaceAll(Analyzer.PURE_TEXT.pattern(),"").toLowerCase();
        // Decompose accented characters (NFD) and drop the resulting non-ASCII combining marks
        s = Normalizer.normalize(s, Normalizer.Form.NFD)
                .replaceAll("[^\\p{ASCII}]", "");
        return s.trim();
    }
}
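The driver is not shown in the post. For completeness, here is a minimal sketch of how these classes would typically be wired together (the driver class name and the use of args for the input/output paths are assumptions):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class HashTagCounterDriver {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "hashtag count");
            job.setJarByClass(HashTagCounter.class);
            job.setMapperClass(HashTagCounter.HashtagCountMapper.class);
            job.setCombinerClass(HashTagCounter.HashtagCountCombiner.class);
            job.setReducerClass(HashTagCounter.HashtagCountReducer.class);
            job.setOutputKeyClass(Hashtag.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }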

Please note that all imports are in place in the code; I omitted them here to keep an already text-heavy post shorter.

The code runs normally, with no errors, and it mostly works. I say mostly because in the part-r-00000 file I get several entries like this:

  • milwauke 2
  • XXXXXXXX <---- some other strings
  • milwauke 1
  • XXXXXXXX <---- some other strings
  • milwauke 7

and so on. I tested those strings in a text editor, and they appear perfectly identical (I originally thought it could be an encoding issue; not the case, all such hashtags in the original file show as UTF-8).

It does not happen for all hashtags, but it happens for quite a few. I could theoretically run a second MapReduce job on the output and combine them properly without hassle (we are talking about a 100 KB file produced from a 10 GB input file), but I believe that is a waste of computing power.

This has led me to believe that I am missing something about how MapReduce works, and it's driving me crazy. Can anyone explain what I am doing wrong, and where the error in my logic is?

I guess the Hashtag implementation is causing the issue. Text and String differ when they encounter a double-byte character in a sequence of UTF-8 characters. Moreover, Text is mutable and String is not, so the expected behavior of a String manipulation may not be the same as the equivalent Text manipulation.
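A minimal sketch of that difference (é occupies two bytes in UTF-8, and Text, unlike String, can be reused):

    Text text = new Text("étude");
    String str = "étude";
    str.length();       // 5 -- String indexes by Java char
    text.getLength();   // 6 -- Text counts UTF-8 bytes
    text.set("hawks");  // mutable: the same Text object now holds different bytes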

So just read pages 115 to 118 (both inclusive) from the link below, which opens a PDF of Hadoop: The Definitive Guide.

http://javaarm.com/file/apache/Hadoop/books/Hadoop-The.Definitive.Guide_4.edition_a_Tom.White_April-2015.pdf

Hope this read helps you resolve the exact issue.

Thanks.
