
Hadoop MapReduce Word (Hashtag) count with similar word grouping not working

I've been trying to create a Twitter hashtag-count Hadoop program. I've successfully extracted the text, gotten the hashtags, and started trying to count them. One of the earliest problems I encountered is that many hashtags are extremely similar (test, tests, tests!, T-est, etc.).

I started by clearing the String of all special characters and removing all spaces inside the hashtag. But the problem persisted in cases like ("hawk","hawk","hawks") and so on. I implemented Dice's coefficient algorithm in a separate class as follows:

//Using Dice's Coefficient algorithm
public class WordSimilarity {


    public static boolean isStringSimilar(String str1,String str2){
        return doComparison(str1,str2)>=Analyzer.getSimilarity();
    }

    /** @return lexical similarity value in the range [0,1] */
    private static double doComparison(String str1, String str2) {
        // If the strings are too small, do not compare them at all.
        try {
            if(str1.length()>3 && str2.length()>3) {
                ArrayList pairs1 = wordLetterPairs(str1.toUpperCase());
                ArrayList pairs2 = wordLetterPairs(str2.toUpperCase());
                int intersection = 0;
                int union = pairs1.size() + pairs2.size();
                for (int i = 0; i < pairs1.size(); i++) {
                    Object pair1 = pairs1.get(i);
                    for (int j = 0; j < pairs2.size(); j++) {
                        Object pair2 = pairs2.get(j);
                        if (pair1.equals(pair2)) {
                            intersection++;
                            pairs2.remove(j);
                            break;
                        }
                    }
                }
                return (2.0 * intersection) / union;
            }
            else{
                return 0;
            }
        }catch(NegativeArraySizeException ex){

            return 0;
        }
    }


    /** @return an ArrayList of 2-character Strings. */
    private static ArrayList wordLetterPairs(String str){
        ArrayList allPairs = new ArrayList();
        // Tokenize the string and put the tokens/words into an array
        String[] words = str.split("\\s");
        // For each word
        for(int w=0; w<words.length;w++){
            // Find the pairs of characters
            String[] pairsInWord = letterPairs(words[w]);
            for(int p=0;p<pairsInWord.length;p++){
                allPairs.add(pairsInWord[p]);
            }
        }
        return allPairs;
    }

    /** @return an array of adjacent letter pairs contained in the input string */
    private static String[] letterPairs(String str){
        int numPairs = str.length() -1;
        String[] pairs = new String[numPairs];
        for(int i=0; i<numPairs;i++){
            pairs[i]=str.substring(i,i+2);
        }
        return pairs;
    }

}

tl;dr: compare two words and return a number between 0 and 1 indicating how similar the two Strings are.

I then created a custom WritableComparable (I intend to use it as a value later in the project, though for now it is only used as a key):
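To make the number concrete, here is a minimal standalone sketch of the same idea (it is not the class above: the names `DiceExample`, `bigrams` and `dice` are made up for illustration, and it leaves out the minimum-length check and the whitespace split). For "hawk" and "hawks" the letter pairs are {HA, AW, WK} and {HA, AW, WK, KS}; three pairs are shared, so the coefficient is 2 * 3 / (3 + 4), roughly 0.857.

```java
import java.util.ArrayList;
import java.util.List;

class DiceExample {
    /** Adjacent letter pairs (bigrams) of the upper-cased string. */
    static List<String> bigrams(String s) {
        String u = s.toUpperCase();
        List<String> pairs = new ArrayList<>();
        for (int i = 0; i + 1 < u.length(); i++) {
            pairs.add(u.substring(i, i + 2));
        }
        return pairs;
    }

    /** Dice's coefficient: 2 * sharedPairs / (pairs in a + pairs in b). */
    static double dice(String a, String b) {
        List<String> p1 = bigrams(a);
        List<String> p2 = bigrams(b);
        int total = p1.size() + p2.size();
        if (total == 0) {
            return 0.0; // both strings too short to contain any pair
        }
        int shared = 0;
        for (String pair : p1) {
            // remove(Object) deletes one matching pair, so each pair is matched at most once
            if (p2.remove(pair)) {
                shared++;
            }
        }
        return 2.0 * shared / total;
    }

    public static void main(String[] args) {
        System.out.println(dice("hawk", "hawks")); // 6/7, about 0.857
        System.out.println(dice("hawk", "tests")); // no shared pairs
    }
}
```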

public class Hashtag implements WritableComparable<Hashtag> {

    private Text hashtag;

    public Hashtag(){
        this.hashtag = new Text();
    }

    public Hashtag(String hashtag) {
        this.hashtag = new Text(hashtag);
    }

    public Text getHashtag() {
        return hashtag;
    }

    public void setHashtag(String hashtag) {
        // Remove characters that add no information to the analysis, but cause problems to the result
        this.hashtag = new Text(hashtag);
    }

    public void setHashtag(Text hashtag) {
        this.hashtag = hashtag;
    }

    // Compare To uses the WordSimilarity algorithm to determine if the hashtags are similar. If they are,
    // they are considered equal
    @Override
    public int compareTo(Hashtag o) {
        if(o.getHashtag().toString().equalsIgnoreCase(this.getHashtag().toString())){
            return 0;
        }else if(WordSimilarity.isStringSimilar(this.hashtag.toString(),o.hashtag.toString())){
            return 0;
        }else {
            return this.hashtag.toString().compareTo(o.getHashtag().toString());
        }
    }

    @Override
    public String toString() {
        return this.hashtag.toString();
    }

    @Override
    public void write(DataOutput dataOutput) throws IOException {
        this.hashtag.write(dataOutput);
    }

    @Override
    public void readFields(DataInput dataInput) throws IOException {
        this.hashtag.readFields(dataInput);
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof Hashtag)) return false;
        Hashtag hashtag1 = (Hashtag) o;
        return WordSimilarity.isStringSimilar(this.getHashtag().toString(),hashtag1.getHashtag().toString());
    }

    @Override
    public int hashCode() {
        return Objects.hash(getHashtag());
    }

}

And finally, I wrote the MapReduce code:

public class HashTagCounter {

    private final static IntWritable one = new IntWritable(1);

    public static class HashtagCountMapper extends Mapper<Object, Text, Hashtag, IntWritable> {

        public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
            //If the line does not start with '{', it is not a valid JSON. Ignore.
            if (value.toString().startsWith("{")) {
                Status tweet = null;
                try {
                    //Create a status object from Raw JSON
                    tweet = TwitterObjectFactory.createStatus(value.toString());
                    if (tweet!=null && tweet.getText() != null) {
                        StringTokenizer itr = new StringTokenizer(tweet.getText());
                        while (itr.hasMoreTokens()) {
                            String temp = itr.nextToken();
                            //Check only hashtags
                            if (temp.startsWith("#") && temp.length()>=3 &&  LanguageChecker.checkIfStringIsInLatin(temp)){
                                temp = purifyString(temp);
                                context.write(new Hashtag('#'+temp), one);
                            }
                        }
                    }
                } catch (TwitterException tex) {
                    System.err.println("Twitter Exception thrown: "+ tex.getErrorMessage());
                }
            }
        }
    }

    public static class HashtagCountCombiner extends Reducer<Hashtag, IntWritable, Hashtag, IntWritable> {

        private IntWritable result = new IntWritable();

        @Override
        public void reduce(Hashtag key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static class HashtagCountReducer extends Reducer<Hashtag, IntWritable, Hashtag, IntWritable> {

        private IntWritable result = new IntWritable();

        @Override
        public void reduce(Hashtag key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    private static String purifyString(String s){
        s = s.replaceAll(Analyzer.PURE_TEXT.pattern(),"").toLowerCase();
        s  = Normalizer.normalize(s, Normalizer.Form.NFD)
                .replaceAll("[^\\p{ASCII}]", "");
        return s.trim();
    }
}

Please note that all imports are in place in the code; I just omitted them here to shorten an already text-heavy post.

The code runs normally, with no errors, and it mostly works. I say mostly, because in the part-r-0000 file I get several entries like this:

  • milwauke 2
  • XXXXXXXX <----some other strings
  • milwauke 1
  • XXXXXXXX <----some other strings
  • milwauke 7

and so on. I compared those strings in Notepad, and they appear perfectly identical (I originally thought it could be an encoding issue. That is not the case; all such hashtags in the original file show as UTF-8).

It does not happen for all hashtags, but it happens for quite a few. I could theoretically run a second MapReduce job on the output and combine them properly without hassle (we are talking about a 100 KB file produced from a 10 GB input file), but I believe this is a waste of computing power.

This has led me to believe that I am missing something about how MapReduce works. It's driving me crazy. Can anyone explain what I am doing wrong and where the error in my logic is?

I suspect the Hashtag implementation is causing the issue. Text and String behave differently when a UTF-8 sequence contains multi-byte characters. Moreover, Text is mutable and String is not, so operations that behave one way on a String may not behave the same way on a Text.
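Beyond the Text/String difference, it may also be worth checking the comparison logic itself. MapReduce sorts keys with compareTo and then groups neighbouring keys that compare as equal, so a compareTo that treats "similar" as "equal" needs to be transitive for identical hashtags to end up next to each other. The sketch below is plain Java with no Hadoop dependency and only mirrors the shape of Hashtag.compareTo; the threshold of 0.7 is an assumption (the value of Analyzer.getSimilarity() is not shown in the question), and the class and method names are made up for illustration. It shows three strings where a equals b and b equals c, yet a does not equal c, and also that two "equal" strings have different hash codes.

```java
import java.util.ArrayList;
import java.util.List;

class SimilarityContractExample {
    // Assumed threshold; the real Analyzer.getSimilarity() value is unknown.
    static final double THRESHOLD = 0.7;

    /** Adjacent letter pairs of the upper-cased string. */
    static List<String> bigrams(String s) {
        String u = s.toUpperCase();
        List<String> pairs = new ArrayList<>();
        for (int i = 0; i + 1 < u.length(); i++) {
            pairs.add(u.substring(i, i + 2));
        }
        return pairs;
    }

    /** Dice's coefficient over letter pairs. */
    static double dice(String a, String b) {
        List<String> p1 = bigrams(a);
        List<String> p2 = bigrams(b);
        int total = p1.size() + p2.size();
        if (total == 0) {
            return 0.0;
        }
        int shared = 0;
        for (String pair : p1) {
            if (p2.remove(pair)) {
                shared++;
            }
        }
        return 2.0 * shared / total;
    }

    /** Same shape as Hashtag.compareTo: similar strings compare as equal. */
    static int compare(String a, String b) {
        if (a.equalsIgnoreCase(b) || dice(a, b) >= THRESHOLD) {
            return 0;
        }
        return a.compareTo(b);
    }

    public static void main(String[] args) {
        String a = "hawk", b = "hawks", c = "awks";
        System.out.println(compare(a, b)); // 0: similar (6/7)
        System.out.println(compare(b, c)); // 0: similar (6/7)
        System.out.println(compare(a, c)); // non-zero: 4/6 is below the assumed threshold
        System.out.println(a.hashCode() == b.hashCode()); // false
    }
}
```

If the real threshold behaves like this, the sort order is not well defined, so identical keys such as "milwauke" can be separated by other keys and emitted as several groups. The differing hash codes matter as well: with more than one reducer, the default hash partitioner can send keys that compare as equal to different reducers.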

For background, read the four pages 115 to 118 (inclusive) of Hadoop: The Definitive Guide (4th edition); the link below opens the PDF:

http://javaarm.com/file/apache/Hadoop/books/Hadoop-The.Definitive.Guide_4.edition_a_Tom.White_April-2015.pdf

I hope this helps you pin down the exact issue.

Thanks.
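One way to sidestep similarity-based key comparison altogether is to map each hashtag to a canonical form in the mapper and use that exact string as the key, so that sorting and grouping rely only on ordinary equality. The following is a minimal sketch under stated assumptions, not a drop-in replacement: the normalization rule (lower-case, keep only ASCII letters and digits, drop one trailing "s") is a crude, hypothetical stand-in for whatever grouping rule you actually want, and `canonical` and `count` are illustrative names. The map here only simulates what grouping by exact key would produce.

```java
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

class CanonicalKeyExample {
    /** Hypothetical normalization: lower-case, keep [a-z0-9], drop one trailing 's'. */
    static String canonical(String hashtag) {
        String s = hashtag.toLowerCase(Locale.ROOT).replaceAll("[^a-z0-9]", "");
        if (s.length() > 3 && s.endsWith("s")) {
            s = s.substring(0, s.length() - 1);
        }
        return s;
    }

    /** Simulates grouping by exact key equality, as a reducer would see it. */
    static Map<String, Integer> count(String... hashtags) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String h : hashtags) {
            counts.merge(canonical(h), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(count("#hawk", "#Hawks", "#hawks!", "#milwauke", "#milwauke"));
    }
}
```

In the job itself, the mapper would emit the canonical string as a plain Text key, which already has consistent compareTo, equals and hashCode implementations.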
