How to capture text between incomplete lines in Java

I have this text (the numerical values may change):

.START_SEQUENCE RANDOM SENTENCE
3.40000
1 2 3 4 some text or not
4 3 8 9
.END_SEQUENCE

I want to get the following text: everything between .START_SEQUENCE and .END_SEQUENCE, excluding both the rest of the .START_SEQUENCE line and the line that follows it.

1 2 3 4 some text or not
4 3 8 9

I have played with Pattern.DOTALL and Pattern.MULTILINE and managed to get rid of some of the unwanted text, but I never ended up with the exact selection I want. I have no clue how to move on.

Here is my last attempt.

final String START_SEQUENCE = "\\.START_SEQUENCE[^\n^\r]*";
final String END_SEQUENCE = "\\.END_SEQUENCE";
Pattern regex = Pattern.compile(START_SEQUENCE+"(.*)"+END_SEQUENCE, Pattern.DOTALL);
Matcher regexMatcher = regex.matcher(emn);
if (regexMatcher.find()) {
    String ResultString = regexMatcher.group(1);
}

Which results in:

3.40000
1 2 3 4 some text or not
4 3 8 9

Many thanks in advance!

Use this regex with the Pattern.UNIX_LINES flag:

"\\.START_SEQUENCE.*\n.*\n((?:(?!\\.END_SEQUENCE).*\n)*+)\\.END_SEQUENCE"

Explanation

Pattern.UNIX_LINES makes . equivalent to [^\n]. Normally, it is [^\n\r\u0085\u2028\u2029].
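To see the effect of the flag on the dot, here is a minimal sketch. The class name UnixLinesDemo, the helper firstMatch, and the sample input are made up for illustration; only Pattern.UNIX_LINES and the standard java.util.regex calls come from the JDK.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class UnixLinesDemo {
    /** Returns the first match of the pattern in the input, or null if there is none. */
    static String firstMatch(Pattern pattern, String input) {
        Matcher m = pattern.matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // Hypothetical input: a carriage return followed later by a line feed.
        String input = "ab\rcd\nef";
        Pattern defaultMode = Pattern.compile(".+");
        Pattern unixLines = Pattern.compile(".+", Pattern.UNIX_LINES);

        // Default mode: '.' stops at the carriage return, so the match is "ab".
        System.out.println(firstMatch(defaultMode, input).length());  // prints 2
        // UNIX_LINES: '.' also matches '\r' and stops only at '\n'.
        System.out.println(firstMatch(unixLines, input).length());    // prints 5
    }
}
```

The lengths are printed instead of the matches themselves so that the carriage return does not disturb the console output.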

Let us break down the regex (to make it easier to read, the Java string escape sequences are resolved):

\.START_SEQUENCE.*\n             # Match the .START_SEQUENCE ... line
.*\n                             # Match (and ignore) the next line
((?:(?!\.END_SEQUENCE).*\n)*+)   # Capture every following line that does not start with .END_SEQUENCE
\.END_SEQUENCE                   # Match the .END_SEQUENCE line

((?:(?!\.END_SEQUENCE).*\n)*+) matches the remaining lines in between and puts the result into capturing group 1. Normally, ((?:.*\n)*?) would suffice, but to prevent a StackOverflowError on large inputs, I switched to the possessive quantifier *+. The lookahead check (?!\.END_SEQUENCE) is then needed so that the repetition stops at the .END_SEQUENCE line without backtracking.
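For reference, a minimal runnable sketch of how this pattern could be used. The class name SequenceRegexDemo and the helper extract are hypothetical, the variable name emn is borrowed from the question, and the sample input is assumed to use \n line endings.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class SequenceRegexDemo {
    // The pattern from this answer; UNIX_LINES makes '.' stop only at '\n'.
    private static final Pattern SEQUENCE = Pattern.compile(
        "\\.START_SEQUENCE.*\n.*\n((?:(?!\\.END_SEQUENCE).*\n)*+)\\.END_SEQUENCE",
        Pattern.UNIX_LINES);

    /** Returns capturing group 1 of every sequence found in the text. */
    static List<String> extract(String text) {
        List<String> bodies = new ArrayList<>();
        Matcher m = SEQUENCE.matcher(text);
        while (m.find()) {
            bodies.add(m.group(1));
        }
        return bodies;
    }

    public static void main(String[] args) {
        String emn = ".START_SEQUENCE RANDOM SENTENCE\n"
                   + "3.40000\n"
                   + "1 2 3 4 some text or not\n"
                   + "4 3 8 9\n"
                   + ".END_SEQUENCE\n";
        // Prints the two data lines, without the header line or the 3.40000 line.
        for (String body : extract(emn)) {
            System.out.print(body);
        }
    }
}
```

Because . also matches \r under UNIX_LINES, the pattern still matches input with \r\n line endings, but the captured lines then keep their trailing \r.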

Not much to it, but something like this; capture group 1 contains the data of interest.

(?-s)\.START_SEQUENCE.*\n.*\n([\S\s]*?)\.END_SEQUENCE
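A minimal sketch of this pattern in Java, assuming the input uses \n line endings (without UNIX_LINES, the dot does not match \r, so .*\n would fail on \r\n input). The class name LazySequenceDemo, the helper extract, and the sample data are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LazySequenceDemo {
    // Same regex as above, written as a Java string literal.
    private static final Pattern SEQUENCE = Pattern.compile(
        "(?-s)\\.START_SEQUENCE.*\n.*\n([\\S\\s]*?)\\.END_SEQUENCE");

    /** Returns capturing group 1 of every sequence found in the text. */
    static List<String> extract(String text) {
        List<String> bodies = new ArrayList<>();
        Matcher m = SEQUENCE.matcher(text);
        while (m.find()) {
            bodies.add(m.group(1));
        }
        return bodies;
    }

    public static void main(String[] args) {
        String text = ".START_SEQUENCE RANDOM SENTENCE\n3.40000\n1 2 3 4\n4 3 8 9\n.END_SEQUENCE\n"
                    + ".START_SEQUENCE RANDOM SENTENCE\n3.40000\n2 3 4 5\n3 8 9 10\n.END_SEQUENCE\n";
        // The lazy quantifier stops at the first .END_SEQUENCE, so each block is captured separately.
        for (String body : extract(text)) {
            System.out.println("----------");
            System.out.print(body);
        }
    }
}
```

Note that the lazy [\S\s]*? stops at the first occurrence of .END_SEQUENCE anywhere in the text, including in the middle of a line.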

A non-regex solution:

import  java.util.ArrayList;
import  java.io.File;
import  java.io.IOException;
import  org.apache.commons.io.FileUtils;
import  org.apache.commons.io.LineIterator;

/**
   <P>{@code java BetweenLineMarkersButSkipFirstXmpl C:\java_code\\xbn\z\xmpl\text\regex\BetweenLineMarkersButSkipFirstXmpl_data.txt}</P>
**/
public class BetweenLineMarkersButSkipFirstXmpl  {
   public static final void main(String[] as_1RqdTxtFilePath)  {
      LineIterator li = null;
      try  {
         li = FileUtils.lineIterator(new File(as_1RqdTxtFilePath[0])); //Throws npx if null
      }  catch(IOException iox)  {
         throw  new RuntimeException("Attempting to open \"" + as_1RqdTxtFilePath[0] + "\"", iox);
      }  catch(RuntimeException rtx)  {
         throw  new RuntimeException("One required parameter: The path to the text file.", rtx);
      }

      String sLS = System.getProperty("line.separator", "\n");

      ArrayList<String> alsItems = new ArrayList<String>();
      boolean bStartMark = false;
      boolean bLine1Skipped = false;
      StringBuilder sdCurrentItem = new StringBuilder();
      while(li.hasNext())  {
         String sLine = li.next().trim();
         if(!bStartMark)  {
            if(sLine.startsWith(".START_SEQUENCE"))  {
               bStartMark = true;
               continue;
            }
            throw  new IllegalStateException("Start mark not found.");
         }
         if(!bLine1Skipped)  {
            bLine1Skipped = true;
            continue;
         }  else if(!sLine.equals(".END_SEQUENCE"))  {
            sdCurrentItem.append(sLine).append(sLS);
         }  else  {
            alsItems.add(sdCurrentItem.toString());
            sdCurrentItem.setLength(0);
            bStartMark = false;
            bLine1Skipped = false;
            continue;
         }
      }

      for(String s : alsItems)  {
         System.out.println("----------");
         System.out.print(s);
      }
   }
}

Using this input:

.START_SEQUENCE RANDOM SENTENCE
3.40000
1 2 3 4
4 3 8 9
.END_SEQUENCE
.START_SEQUENCE RANDOM SENTENCE
3.40000
2 3 4 5
3 8 9 10
.END_SEQUENCE

Output:

[C:\java_code\]java BetweenLineMarkersButSkipFirstXmpl C:\java_code\BetweenLineMarkersButSkipFirstXmpl_data.txt
----------
1 2 3 4
4 3 8 9
----------
2 3 4 5
3 8 9 10
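If Apache Commons IO is not available, the same line-by-line idea can be sketched with the JDK alone. This is an illustrative variant, not the code above: the class name SequenceLineScanner and the method extract are hypothetical, the lines are passed in as a list instead of being read from a file, lines outside a sequence are ignored rather than raising an IllegalStateException, and \n is used instead of the platform line separator.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class SequenceLineScanner {
    /** Collects the lines of each sequence, skipping the marker lines and the first line after the start marker. */
    static List<String> extract(List<String> lines) {
        List<String> items = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inSequence = false;
        boolean firstLineSkipped = false;
        for (String raw : lines) {
            String line = raw.trim();
            if (!inSequence) {
                // Assumption: anything outside a sequence is silently ignored.
                if (line.startsWith(".START_SEQUENCE")) {
                    inSequence = true;
                    firstLineSkipped = false;
                }
                continue;
            }
            if (!firstLineSkipped) {
                firstLineSkipped = true;           // drop the line right after the start marker
            } else if (line.equals(".END_SEQUENCE")) {
                items.add(current.toString());     // sequence complete
                current.setLength(0);
                inSequence = false;
            } else {
                current.append(line).append('\n');
            }
        }
        return items;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
            ".START_SEQUENCE RANDOM SENTENCE", "3.40000", "1 2 3 4", "4 3 8 9", ".END_SEQUENCE",
            ".START_SEQUENCE RANDOM SENTENCE", "3.40000", "2 3 4 5", "3 8 9 10", ".END_SEQUENCE");
        for (String item : extract(lines)) {
            System.out.println("----------");
            System.out.print(item);
        }
    }
}
```

To read from a file, the list could come from java.nio.file.Files.readAllLines, which loads the whole file into memory, unlike the LineIterator used above.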
