Need regex to match first instance until next instance (excluding next "look-ahead") Java
I'm new to programming and regular expressions, so that's my disclaimer.

I'm trying to parse my way through a Wireshark log that I've transferred over to a txt file using tshark. The point of my program is to start at the top of the txt file and match all text between packet headers. All packets begin with Frame\s+\d. I want to grab everything up to, but excluding, the next packet header, and drop that text in a string. I'm instantiating an object (Packets) and then adding them to an ArrayList for later processing.

I need to gather all text from packet header 1 to the end of packet 1 / beginning of packet header 2, without including packet header 2.
Frame 1 (186 bytes on wire, 186 bytes captured)
Arrival Time: Sep 19, 2013 13:25:19.937150000
[Time delta from previous captured frame: 0.000000000 seconds]
[Time delta from previous displayed frame: 0.000000000 seconds]
[Time since reference or first frame: 0.000000000 seconds]
Frame Number: 1
Frame Length: 186 bytes
Capture Length: 186 bytes
[Frame is marked: False]
[Protocols in frame
............................A bunch of more packet data...............
Encrypted Packet: 88FE0AFA38B3E1994B907F778FC42CD4FBD967F3D9101679...
Frame 2 (60 bytes on wire, 60 bytes captured)
Arrival Time: Sep 19, 2013 13:25:19.938495000
[Time delta from previous captured frame: 0.001345000 seconds]
[Time delta from previous displayed frame: 0.001345000 seconds]
I've tried:

(Frame\s\d)*.?Frame\s\d

But no dice. I've been plugging away on rubular.com to see if I can hit paydirt on this, but I can't seem to match what I need. Thoughts?
Considering a file packets.txt in /your/path, containing the example you posted... Here's a solution.
import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.InputStreamReader;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// ...

try {
    // trivial file operations
    String path = "/your/path/packets.txt";
    File file = new File(path);
    BufferedReader br = new BufferedReader(new InputStreamReader(new FileInputStream(file)));
    String line = null;
    StringBuilder contents = new StringBuilder();
    // note: readLine() strips line terminators, so contents ends up as one long line
    while ((line = br.readLine()) != null) {
        contents.append(line);
    }
    br.close();
    // the Pattern
    Pattern p = Pattern.compile("Frame\\s\\d\\s(.+?(?=Frame|$))", Pattern.MULTILINE);
    // If you actually need the "Frame etc." header matched as well, here's
    // an alternate Pattern:
    // Pattern p = Pattern.compile("(Frame\\s\\d\\s.+?(?=Frame|$))", Pattern.MULTILINE);
    // matching...
    Matcher m = p.matcher(contents);
    // iterating over matches and printing out group 1
    while (m.find()) {
        System.out.println("Found: " + m.group(1));
    }
}
// "handling" FileNotFoundException
catch (Throwable t) {
    t.printStackTrace();
}
Output:
Found: (186 bytes on wire, 186 bytes captured) Arrival Time: Sep 19, 2013 13:25:19.937150000 [Time delta from previous captured frame: 0.000000000 seconds] [Time delta from previous displayed frame: 0.000000000 seconds] [Time since reference or first frame: 0.000000000 seconds]
Found: (60 bytes on wire, 60 bytes captured) Arrival Time: Sep 19, 2013 13:25:19.938495000 [Time delta from previous captured frame: 0.001345000 seconds] [Time delta from previous displayed frame: 0.001345000 seconds]
Explanation of the Pattern:

Frame\s\d\s matches the literal "Frame", followed by a whitespace character, a digit, and another whitespace character. Group 1, (.+?(?=Frame|$)), then matches any characters reluctantly, stopping at the first position where the look-ahead sees either the next "Frame" or the end of input; because the look-ahead is zero-width, the next header is never consumed, so it remains available for the following match.
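To see the look-ahead behavior in isolation, here is a minimal, self-contained demo of the same pattern run against an inline sample (two abbreviated frames joined on one line, mirroring the newline-stripped contents the reader above produces; the class and method names are just for illustration):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LookaheadDemo {

    // Collects every group-1 capture of the frame-splitting pattern.
    static List<String> frames(String contents) {
        Pattern p = Pattern.compile("Frame\\s\\d\\s(.+?(?=Frame|$))", Pattern.MULTILINE);
        Matcher m = p.matcher(contents);
        List<String> out = new ArrayList<>();
        while (m.find()) {
            out.add(m.group(1));
        }
        return out;
    }

    public static void main(String[] args) {
        String sample = "Frame 1 (186 bytes) header data Frame 2 (60 bytes) more data";
        for (String f : frames(sample)) {
            System.out.println("Found: " + f);
        }
        // The first capture stops right before "Frame 2" without consuming it,
        // so the second header is still found by the next m.find() call.
    }
}
```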
Edit: hints on performance and memory optimization

- Small step but obvious: declare the Pattern as a constant so it compiles only once
- Instead of populating an ArrayList which will grow with every match, write each match to a single file in some folder - this will perform slowly, but if well implemented, should allow garbage collection to take place for the matched String at each iteration of the while (m.find()) loop
- Once the iteration has terminated, you will have to process each small file iteratively again
- If this is not sufficient or just doesn't work against the size of your data, you might want to implement your own custom parser, or pre-chunk the data somehow, but this is quite out of scope, considering your original question was about the Pattern itself, not the performance
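A rough sketch of the "one file per match" idea above, assuming an output folder of your choosing (the class name, method name, and file-naming scheme here are all illustrative, not part of the original answer):

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MatchSpiller {

    // Compiled once, as a constant, per the first hint above.
    private static final Pattern FRAME =
            Pattern.compile("Frame\\s\\d\\s(.+?(?=Frame|$))", Pattern.MULTILINE);

    // Writes each match to its own file in outDir and returns the match count.
    static int spill(String contents, Path outDir) throws IOException {
        Files.createDirectories(outDir);
        Matcher m = FRAME.matcher(contents);
        int i = 0;
        while (m.find()) {
            // Each match is written out immediately rather than kept in a
            // growing list, so the matched String becomes collectable after
            // every loop iteration.
            Files.write(outDir.resolve("packet-" + (++i) + ".txt"),
                    m.group(1).getBytes(StandardCharsets.UTF_8));
        }
        return i;
    }
}
```

Each packet then lives in its own small file (packet-1.txt, packet-2.txt, ...) for the second pass of processing.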