
How to split records on white space when different lines have spaces at different positions

I have a file with records as below, and I am trying to split each record on white space and replace the white space with commas.

file:

a 3w 12 98 header P6124
e 4t 2  100 header I803
c 12L 11 437       M12


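// requires: import java.io.BufferedReader; import java.io.FileReader;
try (BufferedReader reader = new BufferedReader(new FileReader("/myfile.txt"))) {
    String line;
    while ((line = reader.readLine()) != null) {
        System.out.println(line);
        // Splits on runs of white space, so an empty column disappears
        String[] splitLine = line.split("\\s+");
    }
}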

If the data is separated by multiple white spaces, I usually split with a regex: split("\\s+") or split(" +") . But in the above case, record c has no value in the header column. The regex "\\s+" or " +" therefore treats the whole gap as a single separator and drops the empty field, so I get c,12L,11,437,M12 instead of c,12L,11,437,,M12
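To make the problem concrete, here is a minimal sketch that counts the fields a plain "\\s+" split produces. The class name SplitProblemDemo and the helper naiveSplit are hypothetical names used only for illustration.

```java
class SplitProblemDemo {
    // Hypothetical helper: the plain split that loses the empty column.
    static String[] naiveSplit(String line) {
        return line.split("\\s+");
    }

    public static void main(String[] args) {
        String full = "a 3w 12 98 header P6124";
        String missing = "c 12L 11 437       M12";

        // The complete record yields six fields.
        System.out.println(naiveSplit(full).length + ": " + String.join(",", naiveSplit(full)));
        // The wide gap is consumed as one separator, so only five fields remain.
        System.out.println(naiveSplit(missing).length + ": " + String.join(",", naiveSplit(missing)));
    }
}
```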

How do I properly split the lines based on any delimiter in this case so that I get data in the below format:

a,3w,12,98,header,P6124
e,4t,2,100,header,I803
c,12L,11,437,,M12

Could anyone let me know how I can achieve this ?

Maybe you can try a more involved approach: use a regex that matches exactly six fields on each line and explicitly handles a missing value in the fifth one. I rewrote your example and added some console logging to clarify my suggestion:

import java.io.BufferedReader;
import java.io.StringReader;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexTest {
    private static final String INPUT = "a 3w 12 98 header P6124\n" +
            "e 4t 2  100 header I803\n" +
            "c 12L 11 437       M12";

    public static void main(String[] args) throws Exception {
        BufferedReader reader = new BufferedReader(new StringReader(INPUT));
        // Six fields separated by spaces; the fifth group is optional
        Pattern pattern = Pattern.compile("^([^ ]+) +([^ ]+) +([^ ]+) +([^ ]+) +([^ ]+)? +([^ ]+)$");

        String line;
        while ((line = reader.readLine()) != null) {
            // Plain split for comparison: a missing field gives 5 parts instead of 6
            String[] splitLine = line.split("\\s+");
            System.out.println(splitLine.length);

            System.out.println("Line: " + line);
            Matcher matcher = pattern.matcher(line);
            boolean matches = matcher.matches();
            System.out.println("matches: " + matches);
            System.out.println("groups: " + matcher.groupCount());
            if (!matches) {
                continue; // group(i) throws IllegalStateException without a successful match
            }
            for (int i = 1; i <= matcher.groupCount(); i++) {
                // An optional group that did not participate is reported as null
                System.out.printf("   Group %d has value '%s'%n", i, matcher.group(i));
            }
        }
    }
}

The key is that the pattern used to match each line requires a sequence of six fields:

  • for each field, the value is described as [^ ]+
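  • separators between fields are described as " +" (a space followed by + , meaning one or more spaces)
  • the value of the fifth ( nullable ) field is described as ([^ ]+)? , an optional group that is null when the field is missing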
  • each value is captured as a group using parentheses: ( ... )
  • start ( ^ ) and end ( $ ) of each line are marked explicitly

Then, each line is matched against the given pattern, obtaining six groups: you can access each group using matcher.group(index) , where index is 1-based because group(0) returns the full match.

This is a more complex approach but I think it can help you to solve your problem.
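As a minimal sketch of how the captured groups could be turned into the comma-separated output asked for in the question: the class name MissingFieldToCsv and the helper toCsv are hypothetical, and the code assumes that only the fifth field can be absent and that its gap is at least two spaces wide. An optional group that did not match is returned as null by matcher.group(i), so it is written out as an empty string.

```java
import java.util.StringJoiner;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MissingFieldToCsv {
    // Same pattern as in the example above: six fields, the fifth one optional.
    private static final Pattern LINE = Pattern.compile(
            "^([^ ]+) +([^ ]+) +([^ ]+) +([^ ]+) +([^ ]+)? +([^ ]+)$");

    // Hypothetical helper: returns the line as CSV, or null if it does not match.
    static String toCsv(String line) {
        Matcher matcher = LINE.matcher(line);
        if (!matcher.matches()) {
            return null;
        }
        StringJoiner csv = new StringJoiner(",");
        for (int i = 1; i <= matcher.groupCount(); i++) {
            String value = matcher.group(i);
            // A missing fifth field becomes an empty column.
            csv.add(value == null ? "" : value);
        }
        return csv.toString();
    }

    public static void main(String[] args) {
        System.out.println(toCsv("a 3w 12 98 header P6124"));
        System.out.println(toCsv("e 4t 2  100 header I803"));
        System.out.println(toCsv("c 12L 11 437       M12"));
    }
}
```

Returning null for a non-matching line is only one possible choice; throwing an exception or logging the line would work just as well, depending on how strict the input format is.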

Put a limit on the number of whitespace chars that may be used to split the input.

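In the case of your example data, a maximum of 5 works: the seven-space gap in record c is then consumed as two separators (5 + 2) with an empty field between them, while the shorter gaps elsewhere still count as one separator each.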

String[] splitLine = line.split("\\s{1,5}");

See live demo (of this code working as desired).
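A minimal sketch of the full conversion with this split, using the hypothetical names LimitedSplitToCsv and toCsv. It relies on the assumption that ordinary separators are never wider than five white-space characters and that the gap left by a missing field is six to ten characters wide; other widths would need a different limit.

```java
class LimitedSplitToCsv {
    // Hypothetical helper: split on runs of 1 to 5 white-space characters, then rejoin with commas.
    static String toCsv(String line) {
        // A gap wider than 5 is matched as two adjacent separators,
        // which leaves an empty string between them in the result array.
        return String.join(",", line.split("\\s{1,5}"));
    }

    public static void main(String[] args) {
        System.out.println(toCsv("a 3w 12 98 header P6124"));
        System.out.println(toCsv("e 4t 2  100 header I803"));
        System.out.println(toCsv("c 12L 11 437       M12"));
    }
}
```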

Are you just trying to switch your delimiters from spaces to commas?

In that case: cat myFile.txt | sed 's/   */  /g' | sed 's/ /,/g'

*edit: added a stage to strip out lists of more than two spaces, replacing them with just the two spaces needed to retain the double comma.
