
Extract information from text

I have the following text:

Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book.              

Name                                 Group                       12345678        
ALEX A ALEX                                                                   
ID#                                  PUBLIC NETWORK                  
XYZ123456789                                                                  


Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book.

I want to extract the ID value, which is located directly below the ID# keyword in the text.

The issue is that the ID can appear at a different location in different text files, for example in the middle of other text, like this:

Lorem Ipsum is simply dummy text of                                          ID#             the printing and typesetting industry. Lorem Ipsum has been the industry's          
standard dummy text ever since the 1500s, when an unknown printer took a     XYZ123456789    galley of type and scrambled it to make a type specimen book.       

There can also be extra lines between ID# and the value:

Lorem Ipsum is simply dummy text of                                          ID#             the printing and typesetting industry. Lorem Ipsum has been the industry's      
printing and typesetting industry. Lorem Ipsum has been the                                  printing and typesetting industry. Lorem Ipsum has been the 
standard dummy text ever since the 1500s, when an unknown printer took a     XYZ123456789    galley of type and scrambled it to make a type specimen book.

Could you please show an approach for extracting the ID# value? Is there a standard technique that can be applied here, for example RegEx or some approach built on top of RegEx? Is it possible to apply NLP here?

The below is a suggestion off the top of my head. The general idea is to convert your source text into an array of lines (or List), then iterate through them until you find that "ID#" token. Once you know where ID# is in that line, then iterate through the remaining lines to find some text at that position. This example should work with the examples you gave, although anything different will probably cause it to return the wrong value.

String s = ""; //replace with your source text (a null value would throw a NullPointerException at split)
String idValue = null; //what we'll assign the ID value to

//format the string into lines
String[] lines = s.split("\\r?\\n"); //this handles both Windows and Unix-style line termination

//go through the lines looking for the ID# token and storing its horizontal position in the line
for (int i=0; i<lines.length; i++) {
    String line = lines[i];
    int startIndex = line.indexOf("ID#");

    //if we found the ID token, then go through the remaining lines starting from the next one down
    if (startIndex > -1) {
        for (int j=i+1; j<lines.length; j++) {
            line = lines[j];

            //check if this line is long enough
            if (line.length() > startIndex) {

                //remove everything prior to the index where the ID# token was
                line = line.substring(startIndex);

                //if the line starts with a space then it's not an ID
                if (!line.startsWith(" ")) {

                    //look for the first whitespace after the ID value we've found
                    int endIndex = line.indexOf(" ");

                    //if there's no end index, then the ID is at the end of the line
                    if (endIndex == -1) {
                        idValue = line;
                    } else {
                        //if there is an end index, then remove everything to just leave the ID value
                        idValue = line.substring(0, endIndex);
                    }

                    break;
                }
            }
        }

        break;
    }

}
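To try the column scan on a small input, here is a minimal sketch of the same idea wrapped in a method. The class name ColumnScan, the method findId and the sample text are made up for illustration. It has the same limitation as the code above: any non-space character that happens to sit in the ID# column is taken as the start of the ID.

```java
import java.util.Optional;

class ColumnScan {
    // Hypothetical helper: returns the first token that starts in the
    // column where "ID#" starts, searching the lines below the ID# line.
    static Optional<String> findId(String text) {
        String[] lines = text.split("\\r?\\n");
        for (int i = 0; i < lines.length; i++) {
            int col = lines[i].indexOf("ID#");
            if (col < 0) {
                continue; // no ID# on this line
            }
            for (int j = i + 1; j < lines.length; j++) {
                String line = lines[j];
                // skip lines that are too short or blank at that column
                if (line.length() <= col || Character.isWhitespace(line.charAt(col))) {
                    continue;
                }
                int end = col;
                while (end < line.length() && !Character.isWhitespace(line.charAt(end))) {
                    end++;
                }
                return Optional.of(line.substring(col, end));
            }
            break; // ID# was found but nothing sits below it
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Made-up sample: ID# starts at column 12, with one filler line in between
        String text = String.join("\n",
            "some text   ID#          more text",
            "filler line              filler",
            "other text  XYZ123456789 tail");
        System.out.println(findId(text).orElse("not found")); // prints XYZ123456789
    }
}
```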

There seems to be no clear format for the ID value, so a single regular expression will not help: there is almost nothing regular to match on.

You can use two regular expressions to get the expected output. The first one is:

(?m)^(.*)ID#.*([\s\S]*)

It looks for ID# one line at a time and captures two chunks: group 1 is everything from the beginning of that line up to ID#, and group 2 is everything that follows the line containing ID#.

Then we calculate the length of the first capturing group. It gives us the column at which to start searching for an ID in the following lines:
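To see what the two groups hold, here is a small sketch. The class name FirstPassDemo, the method split and the sample text are invented for this illustration; only the pattern comes from the answer.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FirstPassDemo {
    // Same pattern as above, written as a Java string literal
    static final Pattern ID_LINE = Pattern.compile("(?m)^(.*)ID#.*([\\s\\S]*)");

    // Hypothetical helper: returns {text before ID# on its line, everything
    // after that line}, or null when the text contains no ID#.
    static String[] split(String text) {
        Matcher m = ID_LINE.matcher(text);
        return m.find() ? new String[] { m.group(1), m.group(2) } : null;
    }

    public static void main(String[] args) {
        String[] parts = split("intro line\nabc  ID#   def\nxyz  XYZ123\nlast");
        System.out.println(parts[0].length()); // prints 5, the column of ID#
        System.out.println(parts[1]);          // the lines below, starting with a line break
    }
}
```

Note that group 2 begins with the line break that ends the ID# line, so the lines below keep their original columns.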

m.group(1).length();

Then we build our second regex that uses this length:

(?m)^.{X}(?<!\S)\h{0,3}(\S+)

Breakdown:

  • (?m) Enable multiline mode
  • ^ Match beginning of line
  • .{X} Match the first X character(s) of the line (X is m.group(1).length() )
  • (?<!\S) Assert that the preceding character is not a non-whitespace character, i.e. the position follows whitespace or is the start of the line
  • \h{0,3} Optionally match up to 3 horizontal whitespace characters (in case the value is shifted to the right)
  • (\S+) Capture the following non-whitespace characters
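The breakdown can be checked in isolation with a fixed column. This is only a sketch: the class name SecondPassDemo, both method names and the sample text are assumptions made for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SecondPassDemo {
    // Builds the second regex for a given column (the X in the breakdown)
    static Pattern columnPattern(int column) {
        return Pattern.compile("(?m)^.{" + column + "}(?<!\\S)\\h{0,3}(\\S+)");
    }

    // Hypothetical helper: first token found at (or up to 3 spaces after)
    // the column, or null when no line qualifies.
    static String firstTokenAt(int column, String rest) {
        Matcher m = columnPattern(column).matcher(rest);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Column 5 falls inside the word "filler", so the lookbehind rejects that line.
        // In the last line the value is shifted one space to the right of column 5.
        String rest = "\nfiller words here\nxyz   XYZ123 tail";
        System.out.println(firstTokenAt(5, rest)); // prints XYZ123
    }
}
```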

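We then run this regex over the second capturing group of the previous regex:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

// 'string' holds the source text
Matcher m = Pattern.compile("(?m)^(.*)ID#.*([\\s\\S]*)").matcher(string);
if (m.find()) {
    // the length of group 1 is the column where ID# starts
    Matcher m1 = Pattern.compile("(?m)^.{" + m.group(1).length() + "}(?<!\\S)\\h{0,3}(\\S+)").matcher(m.group(2));
    if (m1.find())
        System.out.println(m1.group(1));
}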
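For reuse, the two passes could be wrapped in one method. The sketch below is an assumption-laden variant, not part of the original answer: the class name LabelExtractor, the method valueBelow and the label parameter (quoted with Pattern.quote so it is matched literally) are all introduced here for illustration.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LabelExtractor {
    // Hypothetical helper: finds the label, then returns the first token
    // that starts in the label's column on one of the following lines.
    static Optional<String> valueBelow(String text, String label) {
        Matcher head = Pattern
            .compile("(?m)^(.*)" + Pattern.quote(label) + ".*([\\s\\S]*)")
            .matcher(text);
        if (!head.find()) {
            return Optional.empty();
        }
        int column = head.group(1).length(); // where the label starts
        Matcher value = Pattern
            .compile("(?m)^.{" + column + "}(?<!\\S)\\h{0,3}(\\S+)")
            .matcher(head.group(2));
        return value.find() ? Optional.of(value.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        // Made-up sample modelled on the first layout in the question
        String text = String.join("\n",
            "Name        Group",
            "ALEX A ALEX",
            "ID#         PUBLIC NETWORK",
            "XYZ123456789",
            "trailing text");
        System.out.println(valueBelow(text, "ID#").orElse("not found")); // prints XYZ123456789
    }
}
```

Like the regexes it is built from, this can return a wrong token when a word in an intermediate line of running text happens to start at, or within 3 spaces after, the label's column.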

