
Read in multiple lines from a file and combine them into one line, based on a start and ending pattern?

I'm writing a program to clean data from a text file. The file contains text messages between me and my friends, in this format:

06/07/2016, 21:44 - Friend 1: Sure. 

So there's usually a date set by the Commissioners which serves as a deadline. If you haven't applied for tax back before that date, you won't be eligible for a refund.
06/07/2016, 21:44 - Friend 1: Any further questions?
06/07/2016, 21:45 - Friend 1: Just to clarify, one must apply before, not after, said date.
06/07/2016, 21:42 - Friend 2: Still getting my head around this. Could you explain the deadline thing once more
06/07/2016, 21:46 - Friend 3: All I can say is that I've some fantastic friends that will always endeavour me!
06/07/2016, 21:47 - Friend 3: I truly appreciate this
28/12/2016, 19:04 - Friend 4: Woo party not in mine and eds 🥂🎉🎉
28/12/2016, 19:14 - Friend 1: You going?
Steve?
28/12/2016, 19:15 - Friend 5: got ppl renting in house til end of January

This is all stored in a .txt file, and I want to clean the data and convert it to a .csv file with the columns Date, Time, Name, Text.

I was trying to loop through the file, split each line, and write it to a new CSV file. For example, these lines in the file:

06/07/2016, 21:44 - Friend 1: Sure. 

So there's usually a date set by the Commissioners which serves as a deadline. If you haven't applied for tax back before that date, you won't be eligible for a refund.

would be combined into one line like this:

06/07/2016, 21:44 - Friend 1: Sure. So there's usually a date set by the Commissioners which serves as a deadline. If you haven't applied for tax back before that date, you won't be eligible for a refund.

I know that every new message starts with the same pattern: a date in the format dd/mm/yyyy. So I'm using that to detect when a new message begins.

Right now I'm not working on writing it to CSV file, just reformatting the text into the correct format before doing further processing on it. But for the example input I've given above, it outputs:

06/07/2016, 21:44 - Friend 1: Sure.   So there's usually a date set by the Commissioners which serves as a deadline. If you haven't applied for tax back before that date, you won't be eligible for a refund.
06/07/2016, 21:44 - Friend 1: Any further questions?
06/07/2016, 21:45 - Friend 1: Just to clarify, one must apply before, not after, said date.
06/07/2016, 21:42 - Friend 2: Still getting my head around this. Could you explain the deadline thing once more
06/07/2016, 21:46 - Friend 3: All I can say is that I've some fantastic friends that will always endeavour me!
06/07/2016, 21:47 - Friend 3: I truly appreciate this
28/12/2016, 19:04 - Friend 4: Woo party not in mine and eds 🥂🎉🎉
28/12/2016, 19:14 - Friend 1: You going?

Steve?
28/12/2016, 19:15 - Friend 5: got ppl renting in house til end of January

As you can see, it worked for the first case but not the second, and I'm having trouble coming up with a fix. My code is below. Can anyone offer some advice on how to solve this?

Code

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class App {

    private static String line;
    private static final String regex = "^\\d{2}\\/\\d{2}\\/\\d{4}";
    private static Pattern pattern;

    public static void main(String[] args) {

        pattern = Pattern.compile(regex);

        try {
            BufferedReader reader = new BufferedReader(new FileReader("src/main/resources/WhatsAppChat2.txt"));
            while ((line = reader.readLine()) != null) {
                StringBuilder sb = new StringBuilder();
                boolean isNewMessage = identifyNewMessage();

                //If message is split over multiple lines, it is combined into one line
                if(isNewMessage) {
                    sb.append(line);    
                    while ((line = reader.readLine()) != null) {
                        String text = line;
                        isNewMessage = identifyNewMessage();
                        if(!isNewMessage) {
                            sb.append(" " + line);
                        }
                        else {
                            break;
                        }
                    }
                }

                System.out.println(sb.toString());
                System.out.println(line);
                //formatText(sb.toString());
                //formatText(line);
            }
            reader.close();
        } 
        catch (IOException e) {
            e.printStackTrace();
        }
    }

    /**
     * Checks if file line is a new message or not
     * @return      - True if it is a new message, False if not
     */
    private static boolean identifyNewMessage() {

        Matcher m = pattern.matcher(line);
        if(m.find()) {
            return true;
        }
        else {
            return false;
        }
    }
}

With this pattern:

^(\d{2}\/\d{2}\/\d{4}), (\d{2}:\d{2}) - (.*):(.*)$

You should be able to pick up 4 capturing groups:

1- The date, as 99/99/9999.
2- The time, as 99:99.
3- The friend's name (anything after the hyphen and its following space, up to the ':' character).
4- The message text: whatever comes after the ':' character, up to the end of the line.

By reading each capturing group you can format the output of the csv file.

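Keep in mind that the pattern assumes the whitespace exactly as you wrote it in the example. Also note that (.*) is greedy, so if the message text itself contains a ':' (a time such as 10:30, say), the name group extends up to the last ':'. Using ([^:]*) for the name group avoids this.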
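As a rough sketch of reading the four groups and turning them into a CSV row, something like the following could work. The class name MessageToCsv and the helpers quote and toCsvRow are made up for illustration, and the name group is written as [^:]+ instead of .* so that a ':' inside the message text does not end up in the name. It assumes each message has already been merged onto a single line.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MessageToCsv {

    // Groups: 1 = date, 2 = time, 3 = name, 4 = message text.
    // Assumption: the name never contains ':' and the message is on one line.
    private static final Pattern MESSAGE = Pattern.compile(
            "^(\\d{2}/\\d{2}/\\d{4}), (\\d{2}:\\d{2}) - ([^:]+):(.*)$");

    // Wraps a field in double quotes and doubles any embedded quotes,
    // so commas inside the message do not split the CSV column.
    static String quote(String field) {
        return "\"" + field.replace("\"", "\"\"") + "\"";
    }

    // Returns a CSV row (Date,Time,Name,Text), or null if the line
    // does not look like a message.
    static String toCsvRow(String message) {
        Matcher m = MESSAGE.matcher(message);
        if (!m.matches()) {
            return null;
        }
        return String.join(",",
                quote(m.group(1)),
                quote(m.group(2)),
                quote(m.group(3).trim()),
                quote(m.group(4).trim()));
    }

    public static void main(String[] args) {
        System.out.println(toCsvRow("06/07/2016, 21:44 - Friend 1: Any further questions?"));
    }
}
```

Returning null for non-matching lines is just one option; throwing or logging would work equally well depending on how strict the cleaning step should be.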

If memory and speed are not an issue (I doubt they are with a discussion log), I would do it this way:

Deque<String> mergedLines = new LinkedList<> ();

while ((line = reader.readLine()) != null) {
  if (!identifyNewMessage()) {
    String currentLine = mergedLines.removeLast();
    line = currentLine + " " + line;
  }
  mergedLines.add(line);
}

Now you can iterate over the list and do whatever you need to do with the lines.

Note that the code will throw an exception if the first line is not a new message.
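A self-contained variant of the same idea, guarded against that first-line case, might look like the sketch below. The class name LineMerger and the method merge are hypothetical; it takes the lines as a list instead of a reader so it can be run on its own, and it skips blank lines, which is an assumption about what the cleaned output should look like.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

class LineMerger {

    // A new message starts with a dd/mm/yyyy date, as in the question.
    private static final Pattern NEW_MESSAGE = Pattern.compile("^\\d{2}/\\d{2}/\\d{4}");

    // Appends every continuation line to the message before it.
    static List<String> merge(List<String> lines) {
        List<String> merged = new ArrayList<>();
        for (String raw : lines) {
            String line = raw.trim();
            if (line.isEmpty()) {
                continue; // blank lines carry no text, so drop them
            }
            if (merged.isEmpty() || NEW_MESSAGE.matcher(line).find()) {
                // Start a new entry; the isEmpty() check keeps a leading
                // continuation line instead of throwing.
                merged.add(line);
            } else {
                int last = merged.size() - 1;
                merged.set(last, merged.get(last) + " " + line);
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList(
                "28/12/2016, 19:14 - Friend 1: You going?",
                "",
                "Steve?",
                "28/12/2016, 19:15 - Friend 5: got ppl renting in house til end of January");
        merge(input).forEach(System.out::println);
    }
}
```

With a BufferedReader, the same loop body can be fed from reader.readLine() in place of the list.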

You could use

^
(?P<date>\d{2}[^-]+)\s+-\s+
(?P<friend>[^:]+):
(?P<msg>[\s\S]+?(?=^\d{2}|\Z))


Broken down:

^                                 # start of the line
(?P<date>\d{2}[^-]+)\s+-\s+       # two digits, followed by anything not a -
(?P<friend>[^:]+):                # the friendly neighborhood group
(?P<msg>[\s\S]+?(?=^\d{2}|\Z))    # match anything up to either
                                  # a new date or the very end of the string

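See a demo on regex101.com (and mind the modifiers: the pattern needs multiline mode, and the line breaks in it are only for readability). Additionally, in Java the backslashes need to be escaped, and named groups are written (?<name>...) rather than (?P<name>...).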


As @assylias points out, one needs to read the whole file into a string first.
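Translated to Java, that approach could be sketched roughly as follows. The class name WholeFileParser and the method parse are invented for the example, the chat text is passed in as a string (for a real file it could come from Files.readString on Java 11+), and collapsing whitespace inside the message is an extra step assumed here to get one line per message.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WholeFileParser {

    // Java spelling of the pattern: (?<name>...) groups, escaped backslashes,
    // and MULTILINE so that ^ matches at the start of every line.
    private static final Pattern MESSAGE = Pattern.compile(
            "^(?<date>\\d{2}[^-]+)\\s+-\\s+"
          + "(?<friend>[^:]+):"
          + "(?<msg>[\\s\\S]+?(?=^\\d{2}|\\Z))",
            Pattern.MULTILINE);

    // Returns one {date, friend, msg} triple per message found in the text.
    static List<String[]> parse(String chat) {
        List<String[]> rows = new ArrayList<>();
        Matcher m = MESSAGE.matcher(chat);
        while (m.find()) {
            // Collapse line breaks and repeated spaces inside the message.
            String msg = m.group("msg").trim().replaceAll("\\s+", " ");
            rows.add(new String[] { m.group("date"), m.group("friend"), msg });
        }
        return rows;
    }

    public static void main(String[] args) {
        String chat = "28/12/2016, 19:14 - Friend 1: You going?\n\nSteve?\n"
                    + "28/12/2016, 19:15 - Friend 5: got ppl renting in house til end of January\n";
        for (String[] row : parse(chat)) {
            System.out.println(String.join(" | ", row));
        }
    }
}
```

One caveat with the lookahead as written: a continuation line that happens to start with two digits is treated as the start of a new message, so using the full date in the lookahead may be safer on real data.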
