Using regex to parse a string from text that includes a newline

Given the following text, I'm trying to parse out the string "TestFile" that follows "Address:":

File: TestFile
Branch


        OFFICE INFORMATION
            Address: TestFile
            City: L.A.
            District.: 43
            State: California
            Zip Code: 90210

        DISTRICT INFORMATION
            Address: TestFile2
            ....

I understand that lookbehinds must be fixed-width in many regex engines, so quantifiers such as \s* are not allowed inside them, meaning this won't work:

(?<=OFFICE INFORMATION\n\s*Address:).*(?=\n)

I could use this

(?<=OFFICE INFORMATION\n            Address:).* 

but it depends on consistent spacing, which isn't dynamic and thus not ideal.

How do I reliably parse out "TestFile" and not "TestFile2", as shown in my example above? Note that Address appears twice, but I only need the first value.

Thank you

You don't really need a lookbehind here. Get the matched text using a capturing group:

(?:\bOFFICE INFORMATION\s+Address:\s*)(\S+)

Capture group #1 will hold the value TestFile.

JS Code:

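// `input` is assumed to hold the text from the question.
var re = /(?:\bOFFICE INFORMATION\s+Address:\s*)(\S+)/;
var matches = [];
// Without the g flag, exec() always searches from the start,
// so no lastIndex bookkeeping is needed.
var m = re.exec(input);
if (m !== null) {
    // m[1] is capture group #1: the first run of non-space characters after "Address:"
    matches.push(m[1]);
}
console.log(matches); // logs ["TestFile"] for the sample text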
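
For a self-contained check, the same pattern can be run against the sample text from the question. This is only a sketch: the `sample` string and the `firstOfficeAddress` helper name are made up for illustration, and it assumes the address value is non-empty and contains no whitespace, since `\S+` stops at the first space.

```javascript
// Sample text from the question (indentation kept as in the original).
const sample = [
  "File: TestFile",
  "Branch",
  "",
  "",
  "        OFFICE INFORMATION",
  "            Address: TestFile",
  "            City: L.A.",
  "            District.: 43",
  "            State: California",
  "            Zip Code: 90210",
  "",
  "        DISTRICT INFORMATION",
  "            Address: TestFile2",
].join("\n");

// Hypothetical helper: returns the first token after "Address:" in the
// OFFICE INFORMATION block, or null when that block is not found.
function firstOfficeAddress(text) {
  // \s+ and \s* also match newlines, so the amount of indentation does not matter.
  const m = /\bOFFICE INFORMATION\s+Address:\s*(\S+)/.exec(text);
  return m === null ? null : m[1];
}

console.log(firstOfficeAddress(sample)); // TestFile
```

Because the pattern is anchored on OFFICE INFORMATION, the Address line under DISTRICT INFORMATION is never considered.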

Working with an array:

import java.util.ArrayList;
import java.util.List;

// A sample string
String questions = "File: TestFile Branch OFFICE INFORMATION Address: TestFile  City: L.A.   District.: 43       State: California     Zip Code: 90210       DISTRICT INFORMATION           Address: TestFile2";

// A typed list to store the split elements, so get() returns a String
List<String> arr = new ArrayList<>();

// Split on colons and whitespace characters.
// Splitting on whitespace also takes care of newlines.
for (String x : questions.split(":|\\s")) {
    // Ignore blank elements, so we get a clean list
    if (!x.trim().isEmpty()) {
        arr.add(x);
    }
}

This gives you a list containing:

[File, TestFile, Branch, OFFICE, INFORMATION, Address, TestFile, City, L.A., District., 43, State, California, Zip, Code, 90210, DISTRICT, INFORMATION, Address, TestFile2]

Now let's analyze. Suppose you want the information corresponding to the element Address. This element is at index 5 in the list, which means the element at index 6 is what you want.

So you would do this:

String address = arr.get(6);

This returns TestFile.

Similarly, for City, element 8 is what you want. The count starts from 0. You can of course modify my matching pattern, or even write a loop, to find better ways of doing this task. This is just a hint.

Here is one such example loop:

// From index 6 onward, the values sit at every second index (each one follows its label).
// Skip the first 6 elements because they are of no real use to us
for(int i = 6; i<(arr.size()/2)+6; i+=2)
    System.out.println(arr.get(i));

This gives the following output:

TestFile
L.A.
43
California
Code

Of course this loop is unrefined: Zip Code is split into two tokens, so the loop prints Code instead of 90210. Refine it a little and you will get every element correctly, even the last one. Or better yet, use ZipCode instead of Zip Code (no space in the label) and the loop works with very little extra effort.

The advantage over using a direct regex: you won't have to specify a regex for every single element. Iteration is handy for getting things done automatically.
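
One way to refine the iteration idea is to split on lines instead of on every space, so that multi-word labels such as Zip Code stay intact. The sketch below is written in JavaScript rather than Java and is not the code from this answer; the `parseSections` name, the `text` sample, and the rule that a section header is a line made only of capital letters and spaces are assumptions based on the sample text.

```javascript
// Hypothetical helper: groups "Label: value" lines under the most recent
// section header. Assumes a header is a line made only of capital letters
// and spaces, as in the sample text.
function parseSections(text) {
  const sections = {};
  let current = null;
  for (const rawLine of text.split(/\r?\n/)) {
    const line = rawLine.trim();
    if (/^[A-Z][A-Z ]*$/.test(line)) {
      current = {};
      sections[line] = current;
      continue;
    }
    // Split at the first colon only, so values may themselves contain colons.
    const colon = line.indexOf(":");
    if (current !== null && colon !== -1) {
      const label = line.slice(0, colon).trim();
      current[label] = line.slice(colon + 1).trim();
    }
  }
  return sections;
}

const text = [
  "File: TestFile",
  "Branch",
  "",
  "        OFFICE INFORMATION",
  "            Address: TestFile",
  "            City: L.A.",
  "            District.: 43",
  "            State: California",
  "            Zip Code: 90210",
  "",
  "        DISTRICT INFORMATION",
  "            Address: TestFile2",
].join("\n");

const parsed = parseSections(text);
console.log(parsed["OFFICE INFORMATION"]["Address"]);   // TestFile
console.log(parsed["OFFICE INFORMATION"]["Zip Code"]);  // 90210
console.log(parsed["DISTRICT INFORMATION"]["Address"]); // TestFile2
```

Lines that appear before the first header, such as File: TestFile, are skipped in this sketch.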

See this

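import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.InputStreamReader;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Read the input from a file; try-with-resources closes the reader afterwards
StringBuilder string = new StringBuilder();
try (BufferedReader reader = new BufferedReader(new InputStreamReader(new FileInputStream("D:/tests/sample.txt")))) {
    String line;
    while ((line = reader.readLine()) != null) {
        string.append(line);
        string.append("\n");
    }
}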
//now string will contain the input as
/*File: TestFile
Branch


        OFFICE INFORMATION
            Address: TestFile
            City: L.A.
            District.: 43
            State: California
            Zip Code: 90210

        DISTRICT INFORMATION
            Address: TestFile2
            ....*/
// [ \\t]* skips the spaces after "Address:" so the group holds only the value
Pattern regex = Pattern.compile("OFFICE INFORMATION.*\\r?\\n.*Address:[ \\t]*(?<officeAddress>.*)\\r?\\n");
Matcher regexMatcher = regex.matcher(string.toString());
while (regexMatcher.find()) {
    System.out.println(regexMatcher.group("officeAddress")); // prints TestFile
}

The named group officeAddress in the pattern captures the value to be extracted.
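
The named-group idea also carries over to JavaScript engines that support ES2018 named capture groups. The sketch below is an illustration in JavaScript, not a translation of the Java code in this answer; the `officeAddress` function name and the test strings are assumptions. It uses `[ \t]*` rather than `\s*` after the colon so that an empty address does not spill over onto the next line.

```javascript
// Hypothetical helper using an ES2018 named capture group.
// Returns the trimmed address, or null when no OFFICE INFORMATION block
// is directly followed by an Address line.
function officeAddress(text) {
  const re = /OFFICE INFORMATION[ \t]*\r?\n[ \t]*Address:[ \t]*(?<officeAddress>[^\r\n]*)/;
  const m = re.exec(text);
  return m === null ? null : m.groups.officeAddress.trim();
}

const block = "        OFFICE INFORMATION\n            Address: TestFile\n            City: L.A.\n";
console.log(officeAddress(block)); // TestFile
```

Reading the value through `m.groups.officeAddress` keeps the code independent of the group's position in the pattern.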
