How to remove line breaks and empty lines from a String

I am trying to run a MapReduce job on Hadoop that reads the fifth field of a tab-delimited file (the fifth field holds user reviews) and then runs sentiment analysis and a word count on it.

However, user reviews often contain line breaks and empty lines. My code iterates over the words of each review to find keywords and, when a keyword is found, checks the sentiment of the words around it.

The problem is that, as the code iterates over a review, it throws an ArrayIndexOutOfBoundsException because of the line breaks and empty lines inside a single review.

I have tried replaceAll("\\r", " ") and replaceAll("\\n", " "), to no avail.

I have also tried if (tokenizer.countTokens() == 2) { word.set(tokenizer.nextToken()); } else { }

also to no avail. Below is my code:

import java.io.IOException;
import java.util.ArrayList;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class KWSentiment_Mapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    ArrayList<String> keywordsList = new ArrayList<String>();
    ArrayList<String> posWordsList = new ArrayList<String>();
    ArrayList<String> tokensList = new ArrayList<String>();
    // word and one are used below but were not declared in the posted snippet;
    // these are the conventional declarations.
    Text word = new Text();
    IntWritable one = new IntWritable(1);

    @Override
    public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {

        String[] line = value.toString().split("\t");
        String Review = line[4].replaceAll("[\\-\\+\\\\)\\.\\(\"\\{\\$\\^:,]", "").toLowerCase();

        StringTokenizer tokenizer = new StringTokenizer(Review);

        // 1- read the review line and store the tokens in an ArrayList
        // 2- iterate through the review to check for a keyword
        // 3- if found, check whether there is a positive word near it (up to +3 and -2)
        // 4- set word and call context.write
        // 5- clear the review token list
        while (tokenizer.hasMoreTokens()) {
            String CompareString = tokenizer.nextToken();
            tokensList.add(CompareString);
        }

        for (int i = 0; i < tokensList.size(); i++) {
            for (int j = 0; j < keywordsList.size(); j++) {
                boolean flag = false;

                if (tokensList.get(i).startsWith(keywordsList.get(j))) {
                    for (int e = Math.max(0, i - 2); e < Math.min(tokensList.size(), i + 4); e++) {
                        if (posWordsList.contains(tokensList.get(e))) {
                            word.set(keywordsList.get(j));
                            context.write(word, one);
                            flag = true;
                            break; // breaks out of the e loop
                        }
                    }
                }
                if (flag)
                    break; // breaks out of the j loop
            }
        }
        tokensList.clear();
    }
}

Expected results: take these two reviews, where the error occurs:

Case 1: "Beautiful and spacious!
I highly recommend this place and great host."

Case 2: "The place in general was really silent but we didn't feel stayed.

Aside from this, the bathroom is big and the shower is really nice but there problem. "

The system should read the whole review as one line and iterate over the words in it. However, it stops as soon as it reaches a line break or an empty line, as in case 2.

Case 1 should be read as: "Beautiful and spacious! I highly recommend this place and great host."

Case 2 should be: "The place in general was really silent but we didn't feel stayed. Aside from this, the bathroom is big and the shower is really nice but there problem. "
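
For reference, the flattening described in the two cases can be sketched on a plain string as below. ReviewNormalizer and flatten are hypothetical names, not part of the posted mapper. Note the assumption: this only helps when the whole review arrives in one String; it cannot rejoin a review that the input format has already split into separate records.

```java
final class ReviewNormalizer {

    // Replaces each run of carriage returns and line feeds with one space,
    // so a line break followed by an empty line collapses to a single space.
    static String flatten(String review) {
        return review.replaceAll("[\\r\\n]+", " ");
    }

    public static void main(String[] args) {
        String case1 = "Beautiful and spacious!\nI highly recommend this place and great host.";
        String case2 = "The place in general was really silent but we didn't feel stayed.\n\n"
                + "Aside from this, the bathroom is big and the shower is really nice but there problem. ";
        System.out.println(flatten(case1));
        System.out.println(flatten(case2));
    }
}
```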

I am running out of time and would really appreciate help here.

Thanks!

I hope I understand what you are trying to do. If I am reading the above correctly, the 'value' passed into your map function contains the delimited record from which you want to parse the user review. If that is the case, I believe you can use the quoting and escaping support in the opencsv library, with tab as the delimiter instead of comma, to populate the user review field correctly: http://opencsv.sourceforge.net

In this example we read one record from the input that is passed in, parse it into columns on the tab character, and place the results in the 'nextLine' array. This lets us use the escaping support of CSVReader without reading an actual file, using the text passed into your map function instead.

        StringReader reader = new StringReader(value.toString());
        CSVReader csvReader = new CSVReader(reader, '\t', '\"', '\\', 0);

        String [] nextLine = csvReader.readNext();
        if(nextLine != null && nextLine.length >= 5) {
           // Do some stuff
        }

In the example you pasted above, I think even the split("\t") will be problematic: a tab inside a user review splits the review into two fields, in addition to new lines being treated as new records. Both of these characters are legal as long as they are inside a quoted value (as they should be in a properly escaped file, and as they are in your example). CSVReader should handle all of these.
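
To illustrate why quote-aware parsing matters, here is a JDK-only sketch that contrasts a plain split("\t") with a small hand-written parser. It is a stand-in for what a CSV library does, not opencsv's implementation. QuotedTsv and parseRecord are hypothetical names, and the sketch assumes fields are quoted with " and that a literal quote inside a field is written as "".

```java
import java.util.ArrayList;
import java.util.List;

final class QuotedTsv {

    // Splits one record on tabs, but keeps tabs and line breaks that
    // appear inside a double-quoted field. The quotes themselves are dropped.
    static List<String> parseRecord(String record) {
        List<String> fields = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;
        for (int i = 0; i < record.length(); i++) {
            char c = record.charAt(i);
            if (c == '"') {
                if (inQuotes && i + 1 < record.length() && record.charAt(i + 1) == '"') {
                    current.append('"'); // doubled quote inside a quoted field
                    i++;
                } else {
                    inQuotes = !inQuotes; // opening or closing quote
                }
            } else if (c == '\t' && !inQuotes) {
                fields.add(current.toString()); // field boundary
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        fields.add(current.toString()); // last field
        return fields;
    }

    public static void main(String[] args) {
        String record = "1\t2\t3\t4\t\"Nice\tplace.\n\nGreat host.\"";
        System.out.println("plain split: " + record.split("\t").length + " fields");
        System.out.println("quote-aware: " + parseRecord(record).size() + " fields");
    }
}
```

With the sample record in main, the plain split sees six fields because of the tab inside the review, while the quote-aware parser sees five and keeps the review intact.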

Validate each line at the start of the map method, so that you know line[4] exists and isn't null.

if (value == null || value.toString() == null) {
    return;
}

String[] line = value.toString().split("\t");
if (line == null || line.length < 5 || line[4] == null) {
    return;
}
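
The same check can be pulled into a small helper so it can be tried outside Hadoop. This is only a sketch: ReviewFields, REVIEW_INDEX and review are hypothetical names. String.split never returns null or null elements, so the length check is the one that matters. Passing -1 as the limit keeps trailing empty fields, which split("\t") would otherwise drop.

```java
import java.util.Optional;

final class ReviewFields {

    static final int REVIEW_INDEX = 4; // the review is the fifth tab-separated field

    // Returns the review field, or an empty Optional when the record is
    // null or has fewer than five fields (for example, a continuation line
    // of a review that was broken across lines).
    static Optional<String> review(String record) {
        if (record == null) {
            return Optional.empty();
        }
        String[] fields = record.split("\t", -1);
        if (fields.length <= REVIEW_INDEX) {
            return Optional.empty();
        }
        return Optional.of(fields[REVIEW_INDEX]);
    }

    public static void main(String[] args) {
        System.out.println(review("1\ta\tb\tc\tBeautiful and spacious!").isPresent());
        System.out.println(review("I highly recommend this place and great host.").isPresent());
    }
}
```

In the mapper, a record for which review(...) is empty would simply be skipped with return instead of indexing line[4].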

As for line breaks, you'll need to show some sample input. By default MapReduce passes each line into the map method independently, so if you do want to read multiple lines as one record, you'll have to write a custom InputFormat (with its own RecordReader), or pre-format your data so that all data for each review is on the same line.
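
A rough sketch of the pre-formatting option, run as a separate step before the job: it joins physical lines into one logical record per review. RecordJoiner and joinRecords are hypothetical names. It assumes each review is wrapped in double quotes (as in the sample cases) and that any literal quote inside a review is doubled, so a record is complete once its quotes are balanced.

```java
import java.util.ArrayList;
import java.util.List;

final class RecordJoiner {

    // Joins lines until the number of double quotes seen is even, replacing
    // the line breaks inside a review with a single space and dropping blank lines.
    static List<String> joinRecords(List<String> physicalLines) {
        List<String> records = new ArrayList<>();
        StringBuilder pending = new StringBuilder();
        boolean open = false; // true while inside an unterminated quoted field
        for (String line : physicalLines) {
            if (line.trim().isEmpty()) {
                continue; // empty line, inside or between records
            }
            if (open) {
                pending.append(' '); // stands in for the removed line break
            }
            pending.append(line);
            long quotes = line.chars().filter(ch -> ch == '"').count();
            if (quotes % 2 == 1) {
                open = !open; // an odd number of quotes opens or closes a field
            }
            if (!open) {
                records.add(pending.toString());
                pending.setLength(0);
            }
        }
        if (pending.length() > 0) {
            records.add(pending.toString()); // unterminated trailing record
        }
        return records;
    }

    public static void main(String[] args) {
        List<String> lines = List.of(
                "1\ta\tb\tc\t\"The place was silent.",
                "",
                "Aside from this, the shower is nice.\"");
        joinRecords(lines).forEach(System.out::println);
    }
}
```

After this step every review sits on one line, so the default line-by-line input handling passes a complete record to each map call.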
