Extract information from text

I have the following text:

Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book.              

Name                                 Group                       12345678        
ALEX A ALEX                                                                   
ID#                                  PUBLIC NETWORK                  
XYZ123456789                                                                  


Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book.

I want to extract the ID value that sits beneath the ID# keyword in the text.

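The problem is that the ID can be at a different position in different text files. For example, it can appear in the middle of other text, like this: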

Lorem Ipsum is simply dummy text of                                          ID#             the printing and typesetting industry. Lorem Ipsum has been the industry's          
standard dummy text ever since the 1500s, when an unknown printer took a     XYZ123456789    galley of type and scrambled it to make a type specimen book.       

There can also be extra lines between ID# and the value:

Lorem Ipsum is simply dummy text of                                          ID#             the printing and typesetting industry. Lorem Ipsum has been the industry's      
printing and typesetting industry. Lorem Ipsum has been the                                  printing and typesetting industry. Lorem Ipsum has been the 
standard dummy text ever since the 1500s, when an unknown printer took a     XYZ123456789    galley of type and scrambled it to make a type specimen book.

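Could you show a way to extract the ID# value described above? Is there a standard technique that can be applied here, for example RegEx or some approach built on top of RegEx? Could NLP be applied here?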

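Here is a suggestion off the top of my head. The general idea is to split the source text into an array (or list) of lines, then iterate over them until you find the "ID#" token. Once you know the position of ID# within that line, iterate over the remaining lines to find some text at that position. This should work with the examples you provided, although any variation in the layout could cause it to return the wrong value.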

String s = ""; //your source text; must not be null, or split() below throws a NullPointerException
String idValue = null; //what we'll assign the ID value to

//format the string into lines
String[] lines = s.split("\\r?\\n"); //this handles both Windows and Unix-style line termination

//go through the lines looking for the ID# token and storing its horizontal position in the line
for (int i=0; i<lines.length; i++) {
    String line = lines[i];
    int startIndex = line.indexOf("ID#");

    //if we found the ID token, then go through the remaining lines starting from the next one down
    if (startIndex > -1) {
        for (int j=i+1; j<lines.length; j++) {
            line = lines[j];

            //check if this line is long enough
            if (line.length() > startIndex) {

                //remove everything prior to the index where the ID# token was
                line = line.substring(startIndex);

                //if the line starts with a space then it's not an ID
                if (!line.startsWith(" ")) {

                    //look for the first whitespace after the ID value we've found
                    int endIndex = line.indexOf(" ");

                    //if there's no end index, then the ID is at the end of the line
                    if (endIndex == -1) {
                        idValue = line;
                    } else {
                        //if there is an end index, then remove everything to just leave the ID value
                        idValue = line.substring(0, endIndex);
                    }

                    break;
                }
            }
        }

        break;
    }

}
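
The loop above can also be packaged as a reusable method. The following is a minimal sketch under stated assumptions, not the answer's exact code: the class name `IdColumnScan`, the method `extractId`, and the shortened sample text are made up for illustration, and the check that the character before the column is whitespace is an addition, so that a line where the column lands in the middle of a word is skipped instead of yielding a partial word.

```java
import java.util.Optional;

class IdColumnScan {

    // Hypothetical helper: returns the first token that starts at the column
    // of "ID#" on any later line, or empty if there is none.
    static Optional<String> extractId(String text) {
        String[] lines = text.split("\\r?\\n");
        for (int i = 0; i < lines.length; i++) {
            int column = lines[i].indexOf("ID#");
            if (column < 0) {
                continue; // no ID# token on this line
            }
            for (int j = i + 1; j < lines.length; j++) {
                String line = lines[j];
                // skip lines that are too short or blank at that column
                if (line.length() <= column || Character.isWhitespace(line.charAt(column))) {
                    continue;
                }
                // assumption: skip lines where the column falls inside a word
                if (column > 0 && !Character.isWhitespace(line.charAt(column - 1))) {
                    continue;
                }
                // the value runs up to the next whitespace or the end of the line
                int end = column;
                while (end < line.length() && !Character.isWhitespace(line.charAt(end))) {
                    end++;
                }
                return Optional.of(line.substring(column, end));
            }
            return Optional.empty(); // only the first ID# token is considered
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        String text = String.join("\n",
                "Lorem Ipsum  ID#           dolor sit amet",
                "consectetur                adipiscing",
                "sed do eius  XYZ123456789  tempor");
        System.out.println(extractId(text).orElse("(not found)")); // prints XYZ123456789
    }
}
```

Returning an `Optional` rather than `null` makes the "no ID found" case explicit for the caller.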

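There doesn't seem to be a definite format for the ID value, so a single one-line regex won't help; there is almost nothing regular here.

You have to use two regular expressions to get the expected output. The first one is: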

(?m)^(.*)ID#.*([\s\S]*)

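It looks for ID# line by line and captures two chunks of the string. The first chunk is everything from the start of that line up to ID#; the second is everything that comes after the line containing ID#.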

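We then compute the length of the first capturing group. It gives us the column at which we should start looking for the ID in the following lines: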

m.group(1).length();
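
As a quick illustration of this first pass, here is a small sketch that only computes the column. The class name `IdColumnFinder`, the helper `idColumn`, and the sample text are assumptions made for the example. Note that because `(.*)` is greedy, a line containing ID# more than once would yield the column of the last occurrence.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class IdColumnFinder {

    // Same first pattern as above: group 1 is the text before ID# on its line,
    // group 2 is everything after that line.
    static final Pattern HEADER = Pattern.compile("(?m)^(.*)ID#.*([\\s\\S]*)");

    // Hypothetical helper: zero-based column where ID# starts, or -1 if absent.
    static int idColumn(String text) {
        Matcher m = HEADER.matcher(text);
        return m.find() ? m.group(1).length() : -1;
    }

    public static void main(String[] args) {
        String text = "Lorem Ipsum  ID#           dolor sit amet\n"
                    + "sed do eius  XYZ123456789  tempor";
        System.out.println(idColumn(text)); // prints 13
    }
}
```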

Then we build the second regex using that length:

(?m)^.{X}(?<!\S)\h{0,3}(\S+)

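Breakdown:

  • (?m) enables multiline mode
  • ^ matches the start of a line
  • .{X} matches the first X characters (X being m.group(1).length())
  • (?<!\S) checks that the current position is not preceded by a non-whitespace character, i.e. it comes after whitespace or sits at the start of the line
  • \h{0,3} optionally matches up to 3 horizontal whitespace characters (in case the value is shifted to the right)
  • (\S+) captures the non-whitespace characters that follow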
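
To see how these pieces behave, the sketch below runs the second pattern on its own with a fixed column. The class name `ColumnValueMatcher`, the helper `valueAtColumn`, and the sample lines are hypothetical and chosen only to exercise each part of the pattern.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ColumnValueMatcher {

    // Hypothetical helper: first token starting at `column`, or up to
    // 3 columns to its right; null when no line qualifies.
    static String valueAtColumn(String body, int column) {
        Pattern p = Pattern.compile("(?m)^.{" + column + "}(?<!\\S)\\h{0,3}(\\S+)");
        Matcher m = p.matcher(body);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        int column = 13;
        // value exactly at the column
        System.out.println(valueAtColumn("sed do eius  XYZ123456789  tempor", column));   // prints XYZ123456789
        // value shifted two columns to the right: still accepted by \h{0,3}
        System.out.println(valueAtColumn("sed do eius    XYZ123456789  tempor", column)); // prints XYZ123456789
        // column 13 falls inside a word: rejected by (?<!\S)
        System.out.println(valueAtColumn("consecteturadipiscing elit", column));          // prints null
    }
}
```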

Finally, we run this regex against the second capturing group of the previous regex:

Matcher m = Pattern.compile("(?m)^(.*)ID#.*([\\s\\S]*)").matcher(string);                  
if (m.find()) {
    Matcher m1 = Pattern.compile("(?m)^.{" + m.group(1).length() + "}(?<!\\S)\\h{0,3}(\\S+)").matcher(m.group(2));
    if (m1.find())
        System.out.println(m1.group(1));
}
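
For completeness, here is a self-contained sketch of the two passes combined and run on a shortened version of the sample with an extra line between ID# and the value. The class name `TwoPassIdExtractor`, the method `extractId`, and the sample text are assumptions for illustration; the two patterns are the ones given above.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TwoPassIdExtractor {

    private static final Pattern HEADER = Pattern.compile("(?m)^(.*)ID#.*([\\s\\S]*)");

    // Hypothetical wrapper around the two-pass approach.
    static Optional<String> extractId(String text) {
        // pass 1: find the line with ID# and the text that follows it
        Matcher header = HEADER.matcher(text);
        if (!header.find()) {
            return Optional.empty();
        }
        int column = header.group(1).length();
        // pass 2: look for a token at that column in the following lines
        Matcher value = Pattern
                .compile("(?m)^.{" + column + "}(?<!\\S)\\h{0,3}(\\S+)")
                .matcher(header.group(2));
        return value.find() ? Optional.of(value.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        String text = String.join("\n",
                "Lorem Ipsum  ID#           dolor sit amet",
                "consectetur                adipiscing",
                "sed do eius  XYZ123456789  tempor");
        System.out.println(extractId(text).orElse("(not found)")); // prints XYZ123456789
    }
}
```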
