
How do you capture the text that follows a Regex match in Java?

I am working on an assignment in which I need to search through a web site and extract conversion rates.

If I could simply match the rates themselves, this would be easy to capture and extract, but I need to be able to hit an update button and have the program search for the updated conversion rates, so I cannot hard-code a value to search for.

Is there a way in which I can match the text that precedes the rates and capture all text that follows the match?

If there is a better way to do this, I am also open to suggestions. I just need help with how to get the rates from the website when I do not know exactly what the rates will be. I only know the format of the rates and where they are located within the site.

Here is what I have so far:

String regex = "(?<=EUR'>)\\d+(?:\\.\\d*)?(?=<)";

Pattern pattern = Pattern.compile(regex);
Matcher match = pattern.matcher(?);

while (match.find()) {   
  System.out.println("Found a match: " + match.group(1).toString());  
  System.out.println("Start position: " + match.start(1)); 
  System.out.println("End position: " + match.end(1)); 
} 

I think I understand how to set up the pattern, but I am unsure what I should put for the match string if I only know what the text before and after the rate will be, and not the rate itself...

An example of what I would need to grab is the line below:

<td class='rtRates'><a href='/graph/?from=USD&amp;to=EUR'>0.772000</a></td>

I need to grab the rate in this line, but it will constantly be changing.

Do not use regex to parse HTML, or a velociraptor will come and eat you. Use something like jsoup and query the value of the <a> element that is inside the <td> with class rtRates.

I am not sure what your problem is, because your expression is matching what you expect (I think). See it on Regexr.

If you want to be more flexible about what the part between the tags looks like, you can use this:

(?<=EUR'>)[^<]*

The [^<] is a negated character class. It will match any character except <. Because the match then stops on its own before the next tag, you can also remove the lookahead assertion.

See it on Regexr.
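As a rough sketch of how the lookbehind pattern above could be used from Java: the string passed to matcher() is simply the page's HTML text, and since this pattern has no capturing groups, the matched text is read with group() rather than group(1). The class name LookbehindRateDemo and the helper extractRate are made up for illustration, and the HTML is the sample line from the question rather than a live page.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LookbehindRateDemo {
    // The lookbehind anchors on the text before the rate;
    // [^<]* then consumes everything up to the next tag.
    private static final Pattern RATE = Pattern.compile("(?<=EUR'>)[^<]*");

    // Hypothetical helper: returns the text between "EUR'>" and the next '<', if any.
    static Optional<String> extractRate(String html) {
        Matcher m = RATE.matcher(html);
        // group() with no argument is the whole match; there is no group 1 here.
        return m.find() ? Optional.of(m.group()) : Optional.empty();
    }

    public static void main(String[] args) {
        // Sample line from the question; in practice this would be the downloaded page source.
        String html = "<td class='rtRates'><a href='/graph/?from=USD&amp;to=EUR'>0.772000</a></td>";
        System.out.println(extractRate(html).orElse("no match"));
    }
}
```

Re-running the same extraction after each update picks up whatever rate is currently in the page, because only the surrounding text is hard-coded.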

Can't you just use this?

EUR'>(\d+(?:\.\d+)?)<

The rate is captured in group #1, which is handy since you're already using group(1) to extract it. ;) But seriously, there are no capturing groups in your regex, so calling group(1) on the Matcher results in an exception. What gives?

P.S. Notice that I changed your \\d* to \\d+. Almost everyone who tries to match decimal numbers requires at least one digit after the decimal point. If that's not the case here, go ahead and change it back.
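A minimal sketch of the capturing-group version dropped into the loop from the question might look like the following. The class name CaptureGroupRateDemo and the helper extractRates are invented for illustration, and the input is the sample line from the question. With the parentheses in place, group(1), start(1) and end(1) all refer to the number alone.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CaptureGroupRateDemo {
    // Group 1 wraps only the number; "EUR'>" and "<" are matched but not captured.
    private static final Pattern RATE = Pattern.compile("EUR'>(\\d+(?:\\.\\d+)?)<");

    // Hypothetical helper: collects every rate that appears after "EUR'>" in the input.
    static List<String> extractRates(String html) {
        List<String> rates = new ArrayList<>();
        Matcher m = RATE.matcher(html);
        while (m.find()) {
            rates.add(m.group(1));
        }
        return rates;
    }

    public static void main(String[] args) {
        // Sample line from the question; in practice this would be the downloaded page source.
        String html = "<td class='rtRates'><a href='/graph/?from=USD&amp;to=EUR'>0.772000</a></td>";
        Matcher m = RATE.matcher(html);
        while (m.find()) {
            System.out.println("Found a match: " + m.group(1));
            System.out.println("Start position: " + m.start(1));
            System.out.println("End position: " + m.end(1));
        }
    }
}
```

Note that with \\d+ after the decimal point, an input such as EUR'>5.< is not matched at all; switching back to \\d* would accept it.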
