
Need regex to match first instance until next instance (excluding next “look-ahead”) in Java

I'm new to programming and regular expressions, so that's my disclaimer.

I'm trying to parse my way through a Wireshark log that I've transferred over to a txt file using tshark.

The point of my program is to start at the top of the txt file and match all the text between packet headers.

All packets begin with Frame\\s+\\d. I want to match from one packet header up to, but excluding, the next packet header, and drop that text in a String.

I'm instantiating an object ( Packets ) for each match and then adding them to an ArrayList for later processing.

I need to gather all text from packet header 1 to end of packet 1 / beginning of packet header 2, without including packet header 2.

Frame 1 (186 bytes on wire, 186 bytes captured)
    Arrival Time: Sep 19, 2013 13:25:19.937150000
    [Time delta from previous captured frame: 0.000000000 seconds]
    [Time delta from previous displayed frame: 0.000000000 seconds]
    [Time since reference or first frame: 0.000000000 seconds]
    Frame Number: 1
    Frame Length: 186 bytes
    Capture Length: 186 bytes
    [Frame is marked: False]
    [Protocols in frame
............................A bunch of more packet data...............
    Encrypted Packet: 88FE0AFA38B3E1994B907F778FC42CD4FBD967F3D9101679...

Frame 2 (60 bytes on wire, 60 bytes captured)
    Arrival Time: Sep 19, 2013 13:25:19.938495000
    [Time delta from previous captured frame: 0.001345000 seconds]
    [Time delta from previous displayed frame: 0.001345000 seconds]

I've tried:

(Frame\s\d)*.?Frame\s\d

But no dice.

I've been plugging away on rubular.com to see if I can hit paydirt on this, but I can't seem to match what I need.

Thoughts?

Considering a file packets.txt in /your/path , containing the example you posted...

Here's a solution.

import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Main {
    public static void main(String[] args) {
        String path = "/your/path/packets.txt";
        StringBuilder contents = new StringBuilder();
        // trivial file operations; try-with-resources closes the reader
        try (BufferedReader br = new BufferedReader(
                new InputStreamReader(new FileInputStream(new File(path))))) {
            String line;
            while ((line = br.readLine()) != null) {
                // readLine() strips the line terminator, so lines are joined without breaks
                contents.append(line);
            }
        } catch (IOException e) {
            // "handling" FileNotFoundException and other I/O errors
            e.printStackTrace();
            return;
        }
        // the Pattern
        Pattern p = Pattern.compile("Frame\\s\\d\\s(.+?(?=Frame|$))", Pattern.MULTILINE);
        // If you actually need the "Frame etc." header matched as well, here's
        // an alternate Pattern:
        // Pattern p = Pattern.compile("(Frame\\s\\d\\s.+?(?=Frame|$))", Pattern.MULTILINE);
        // matching...
        Matcher m = p.matcher(contents);
        // iterating over matches and printing out group 1
        while (m.find()) {
            System.out.println("Found: " + m.group(1));
        }
    }
}

Output:

Found: (186 bytes on wire, 186 bytes captured)    Arrival Time: Sep 19, 2013 13:25:19.937150000    [Time delta from previous captured frame: 0.000000000 seconds]    [Time delta from previous displayed frame: 0.000000000 seconds]    [Time since reference or first frame: 0.000000000 seconds]    
Found: (60 bytes on wire, 60 bytes captured)    Arrival Time: Sep 19, 2013 13:25:19.938495000    [Time delta from previous captured frame: 0.001345000 seconds]    [Time delta from previous displayed frame: 0.001345000 seconds]

Explanation of the Pattern:

  • It looks for text starting with, more or less, your original pattern ("Frame", whitespace, digit, whitespace)
  • It captures whatever comes next (the line breaks are already gone, because the file was read line by line into a single string), but stops when either a new "Frame" text appears or the end of the input text is reached
  • The text matched in point 2 is stored in a group (group 0 is the whole match, specific groups start at index 1)
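
Note that in the output above the first match ends just before "Frame Number: 1": the lookahead (?=Frame|$) stops at any occurrence of "Frame", and \\d accepts only a single-digit frame number. A minimal sketch of one way to tighten this, assuming every packet header starts at the beginning of a line as in the posted sample, is to keep the line breaks and anchor both the header and the lookahead to line starts. The names FrameSplitter and splitPackets are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FrameSplitter {
    // (?m): ^ matches at every line start; (?s): . also matches line breaks.
    // A header is "Frame", whitespace, one or more digits, at a line start.
    // The lazy .*? body ends right before the next such header, or at the
    // end of the input (\z).
    private static final Pattern PACKET = Pattern.compile(
            "(?ms)^Frame\\s+\\d+\\b.*?(?=^Frame\\s+\\d+\\b|\\z)");

    public static List<String> splitPackets(String log) {
        List<String> packets = new ArrayList<>();
        Matcher m = PACKET.matcher(log);
        while (m.find()) {
            // group() is header plus body; trim() drops trailing blank lines
            packets.add(m.group().trim());
        }
        return packets;
    }

    public static void main(String[] args) {
        // Assumed sample input, shortened from the log in the question
        String log = String.join("\n",
                "Frame 1 (186 bytes on wire, 186 bytes captured)",
                "    Frame Number: 1",
                "    Frame Length: 186 bytes",
                "",
                "Frame 2 (60 bytes on wire, 60 bytes captured)",
                "    Arrival Time: Sep 19, 2013 13:25:19.938495000");
        for (String packet : splitPackets(log)) {
            System.out.println("Found:\n" + packet + "\n");
        }
    }
}
```

Because the indented "Frame Number" and "Frame Length" lines do not begin at a line start, they stay inside the packet they belong to. To feed this from a file, the line breaks have to be preserved when reading, for example by appending '\n' after each line.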

Edit: hints on performance and memory optimization

  • Small step but obvious: declare the Pattern as a constant so it compiles only once

  • Instead of populating an ArrayList which will grow with every match, write each match to its own file in some folder. This will perform slowly, but if well implemented, should allow garbage collection to take place for the matched String at each iteration of the while (m.find()) loop

  • Once the iteration has terminated, you will have to process each small file iteratively again

  • If this is not sufficient or just doesn't work against the size of your data, you might want to implement your own custom parser, or pre-chunk the data somehow, but this is quite out of scope, considering your original question was about the Pattern itself, not the performance
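
The hints above could be combined into a small line-by-line parser that never holds the whole log in memory: the header Pattern is compiled once as a constant, and each packet is handed to a callback as soon as the next header shows up. This is only a sketch under assumptions: packet headers start at the beginning of a line, and PacketStreamer, forEachPacket and the Consumer callback are hypothetical names. The callback could just as well write each packet to its own file.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.function.Consumer;
import java.util.regex.Pattern;

public class PacketStreamer {
    // Compiled once; recognizes the start of a header line such as "Frame 12 (...)".
    private static final Pattern HEADER = Pattern.compile("Frame\\s+\\d+\\b");

    public static void forEachPacket(BufferedReader reader, Consumer<String> sink)
            throws IOException {
        StringBuilder current = null; // stays null until the first header is seen
        String line;
        while ((line = reader.readLine()) != null) {
            // lookingAt() only matches at the start of the line
            if (HEADER.matcher(line).lookingAt()) {
                if (current != null) {
                    sink.accept(current.toString().trim()); // previous packet is complete
                }
                current = new StringBuilder();
            }
            if (current != null) {
                current.append(line).append('\n');
            }
        }
        if (current != null) {
            sink.accept(current.toString().trim()); // last packet in the log
        }
    }

    public static void main(String[] args) throws IOException {
        // Assumed sample input, shortened from the log in the question
        String log = "Frame 1 (186 bytes on wire, 186 bytes captured)\n"
                + "    Frame Number: 1\n"
                + "\n"
                + "Frame 2 (60 bytes on wire, 60 bytes captured)\n"
                + "    Arrival Time: Sep 19, 2013 13:25:19.938495000\n";
        try (BufferedReader reader = new BufferedReader(new StringReader(log))) {
            forEachPacket(reader, packet -> System.out.println("Found:\n" + packet + "\n"));
        }
    }
}
```

For a real log, the StringReader would be replaced by a reader over the file, for example one obtained from Files.newBufferedReader. Only one packet is buffered at a time, so memory use depends on the largest packet rather than on the size of the log.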
