
Extract information from text

I have the following text:

Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book.              

Name                                 Group                       12345678        
ALEX A ALEX                                                                   
ID#                                  PUBLIC NETWORK                  
XYZ123456789                                                                  


Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book.

I want to extract the ID value that sits below the ID# keyword in the text.

The problem is that the ID can be in a different position in different text files, for example in the middle of other text, like this:

Lorem Ipsum is simply dummy text of                                          ID#             the printing and typesetting industry. Lorem Ipsum has been the industry's          
standard dummy text ever since the 1500s, when an unknown printer took a     XYZ123456789    galley of type and scrambled it to make a type specimen book.       

Also, there can be extra lines between ID# and the value:

Lorem Ipsum is simply dummy text of                                          ID#             the printing and typesetting industry. Lorem Ipsum has been the industry's      
printing and typesetting industry. Lorem Ipsum has been the                                  printing and typesetting industry. Lorem Ipsum has been the 
standard dummy text ever since the 1500s, when an unknown printer took a     XYZ123456789    galley of type and scrambled it to make a type specimen book.

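Could you show a way to extract the ID# value in the cases above? Is there a standard technique that can be applied here to extract this information, for example RegEx or some approach built on top of RegEx? Could NLP be applied here?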

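Here is a suggestion off the top of my head. The general idea is to split the source text into an array (or list) of lines, then iterate over them until you find the "ID#" token. Once you know the position of ID# within that line, iterate over the remaining lines looking for text at that same position. This example should work with the samples you provided, although any deviation from them may cause it to return a wrong value.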

String s = null; //placeholder: assign your source text here, otherwise split() below throws a NullPointerException
String idValue = null; //what we'll assign the ID value to

//format the string into lines
String[] lines = s.split("\\r?\\n"); //this handles both Windows and Unix-style line termination

//go through the lines looking for the ID# token and storing its horizontal position in the line
for (int i=0; i<lines.length; i++) {
    String line = lines[i];
    int startIndex = line.indexOf("ID#");

    //if we found the ID token, then go through the remaining lines starting from the next one down
    if (startIndex > -1) {
        for (int j=i+1; j<lines.length; j++) {
            line = lines[j];

            //check if this line is long enough
            if (line.length() > startIndex) {

                //remove everything prior to the index where the ID# token was
                line = line.substring(startIndex);

                //if the line starts with a space then it's not an ID
                if (!line.startsWith(" ")) {

                    //look for the first whitespace after the ID value we've found
                    int endIndex = line.indexOf(" ");

                    //if there's no end index, then the ID is at the end of the line
                    if (endIndex == -1) {
                        idValue = line;
                    } else {
                        //if there is an end index, then remove everything to just leave the ID value
                        idValue = line.substring(0, endIndex);
                    }

                    break;
                }
            }
        }

        break;
    }

}
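
To try the line-scanning idea on its own, here is a minimal self-contained sketch of the same logic wrapped in a method. The names ColumnScan and findIdBelow are made up for illustration, and using Character.isWhitespace instead of the startsWith(" ") check is my own assumption about what counts as a blank column. It also assumes the text is aligned with spaces, not tabs.

```java
import java.util.Optional;

class ColumnScan {

    // Returns the first token that starts exactly in the column of the "ID#" label,
    // on any line below the label. Hypothetical helper, not part of any library.
    static Optional<String> findIdBelow(String text) {
        String[] lines = text.split("\\r?\\n"); // Windows and Unix line endings
        for (int i = 0; i < lines.length; i++) {
            int column = lines[i].indexOf("ID#");
            if (column < 0) {
                continue; // label is not on this line
            }
            for (int j = i + 1; j < lines.length; j++) {
                String line = lines[j];
                // skip lines that are too short or blank at the label's column
                if (line.length() <= column || Character.isWhitespace(line.charAt(column))) {
                    continue;
                }
                // the value runs until the next whitespace or the end of the line
                int end = column;
                while (end < line.length() && !Character.isWhitespace(line.charAt(end))) {
                    end++;
                }
                return Optional.of(line.substring(column, end));
            }
            return Optional.empty(); // label found, but nothing beneath it
        }
        return Optional.empty(); // no label at all
    }

    public static void main(String[] args) {
        // Label in the middle of other text, with one filler line before the value.
        String text = "aaaa  ID#   bbb\n"
                    + "cc          dd\n"
                    + "cccc  XYZ1  ddd\n";
        System.out.println(findIdBelow(text).orElse("not found"));
    }
}
```

Like the snippet above, this only checks that the column under the label is non-blank, so a filler line whose text happens to cross that column would be picked up as the ID.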

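There doesn't seem to be a well-defined format for the ID value, so a one-line regex won't help, since there is hardly anything regular here.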

You have to use two regular expressions to get the expected output. The first one is:

(?m)^(.*)ID#.*([\s\S]*)

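It looks for ID# line by line and captures two chunks of the string. The first chunk is everything from the beginning of that line up to ID#; the second is everything that comes after the line containing ID#.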

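Then we take the length of the first capturing group. It gives us the column at which we should start searching for the ID in the following lines: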

m.group(1).length();
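
As a quick check of what that length means, here is a small sketch that applies the first regex to a two-line sample and returns the column of the label. The class and method names (LabelColumn, labelColumn) and the sample text are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LabelColumn {

    // Returns the 0-based column of "ID#" on the first line that contains it, or -1.
    static int labelColumn(String text) {
        Matcher m = Pattern.compile("(?m)^(.*)ID#.*([\\s\\S]*)").matcher(text);
        return m.find() ? m.group(1).length() : -1;
    }

    public static void main(String[] args) {
        // "abcd" plus two spaces precede the label, so the column is 6.
        String text = "abcd  ID#  rest\n"
                    + "wxyz  XYZ1 tail\n";
        System.out.println(labelColumn(text)); // 6
    }
}
```

Two details to keep in mind: because `(.*)` is greedy, a line that contains ID# more than once yields the column of the last occurrence, and group 2 starts with the line terminator that ends the ID# line.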

Then we build the second regular expression using that length:

(?m)^.{X}(?<!\S)\h{0,3}(\S+)

Breakdown:

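  • (?m) enables multiline mode
  • ^ matches the start of a line
  • .{X} matches the first X characters of the line (X being m.group(1).length())
  • (?<!\S) asserts that the current position is not preceded by a non-whitespace character, so the match cannot start in the middle of a word
  • \h{0,3} optionally matches horizontal whitespace, up to 3 characters (in case the value is shifted to the right)
  • (\S+) captures the non-whitespace characters that follow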
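
To see how the pieces of the second regex behave, here is a minimal sketch that builds it for a given column and runs it against a few sample lines. ValueAtColumn and valueAt are hypothetical names, and the sample lines are invented; the pattern itself is the one described above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ValueAtColumn {

    // Returns the first token that starts at `column`, or up to 3 columns to its right,
    // on any line of `remainder`; null when no line qualifies.
    static String valueAt(String remainder, int column) {
        Pattern p = Pattern.compile("(?m)^.{" + column + "}(?<!\\S)\\h{0,3}(\\S+)");
        Matcher m = p.matcher(remainder);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Value exactly under column 6.
        System.out.println(valueAt("\nwxyz  XYZ1 tail\n", 6));
        // Value shifted two columns to the right: still accepted by \h{0,3}.
        System.out.println(valueAt("\nwxyz    XYZ1 tail\n", 6));
        // Column 6 falls inside a word: rejected by the lookbehind.
        System.out.println(valueAt("\nabcdefghij\n", 6));
    }
}
```

The 3-column tolerance cuts both ways: a filler line whose text starts within 3 columns to the right of the label would also match, so the limit may need tuning for real files.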

Then we run this regex on the second capturing group of the previous regex:

Matcher m = Pattern.compile("(?m)^(.*)ID#.*([\\s\\S]*)").matcher(string);                  
if (m.find()) {
    Matcher m1 = Pattern.compile("(?m)^.{" + m.group(1).length() + "}(?<!\\S)\\h{0,3}(\\S+)").matcher(m.group(2));
    if (m1.find())
        System.out.println(m1.group(1));
}


Notice: technical posts on this site are published under the CC BY-SA 4.0 license. If you republish this content, please credit this site's URL or the original post's address.

 