Regex for matching one url but not the other

I'm a completely new programmer and I'm having trouble with regular expressions, despite trying various online regex testers. I'm working in Eclipse on an Android project. I'm querying an OpenX ad server for a text ad and getting this in return:

var OX_abced445 = '';
OX_abced445 += "<"+"a href=\'http://the.server.url/openx/www/delivery/ck.php?oaparams=2__bannerid=29__zoneid=3__cb=e3efa8b703__oadest=http%3A%2F%2Fsomesite.com\'target=\'_blank\'>This is some sample text to test with!<"+"/a><"+"div id=\'beacon_e3efa8b703\'style=\'position: absolute; left: 0px; top: 0px; visibility:hidden;\'><"+"img src=\'http://the.server.url/openx/www/delivery/lg.php?bannerid=29&amp;campaignid=23&amp;zoneid=3&amp;loc=1&amp;cb=e3efa8b703\' width=\'0\'height=\'0\' alt=\'\' style=\'width: 0px; height: 0px;\' /><"+"/div>\n";
document.write(OX_abced445);

I need to extract the first href URL but not the img src URL, so I figure I should have a regex that looks for everything between href=\\' and ' . I also need to extract the target text, i.e. "This is some sample text to test with!", which is enclosed between _blank\\'> and <"+"/a> . I've found plenty of regexes for extracting URLs and the like, but I have struggled to get one working in Eclipse for this particular case. Any assistance would be appreciated.

It is a very bad idea to try to parse JavaScript that generates HTML with regex. Use something like JSoup or Validator.nu for Java, or Nokogiri for Ruby, instead. If you must use a regex:

Plain regex:
^.*? href=\\'([^'\\]+)\\'[^>]*>([^<]*)<

or, in Java:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

Pattern p = Pattern.compile("^.*? href=\\\\'([^'\\\\]+)\\\\'[^>]*>([^<]*)<",
                            Pattern.MULTILINE);
Matcher m = p.matcher(hideousString);
if (m.find()) {
    // m.group(1) is the URL and m.group(2) is the link text
}

Either form captures the href URL in capture group 1 and the link text in capture group 2, but it will break quickly if the site changes its response format.
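For anyone who wants to try this outside Eclipse, here is a minimal, self-contained sketch of the regex approach. The class name AdLinkExtractor and the method extract are hypothetical names chosen for illustration, and the sample string is a shortened copy of the response from the question. The sketch assumes the response is handled as raw text that still contains the \' escapes, so the pattern matches the backslash before the closing quote outside group 1.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; these names are not from the original answer.
class AdLinkExtractor {

    // Group 1 is the href URL, group 2 is the link text.
    // Assumption: the raw response still contains the \' escapes, so the
    // backslash is excluded from group 1 and matched explicitly after it.
    private static final Pattern LINK = Pattern.compile(
            "^.*? href=\\\\'([^'\\\\]+)\\\\'[^>]*>([^<]*)<",
            Pattern.MULTILINE);

    /** Returns {url, text}, or null when the response does not match. */
    static String[] extract(String response) {
        Matcher m = LINK.matcher(response);
        if (!m.find()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2) };
    }

    public static void main(String[] args) {
        // Shortened copy of the sample response from the question.
        String sample =
              "var OX_abced445 = '';\n"
            + "OX_abced445 += \"<\"+\"a href=\\'http://the.server.url/openx/www/delivery/ck.php"
            + "?oaparams=2__bannerid=29__zoneid=3__cb=e3efa8b703__oadest=http%3A%2F%2Fsomesite.com"
            + "\\'target=\\'_blank\\'>This is some sample text to test with!<\"+\"/a>\";\n"
            + "document.write(OX_abced445);\n";

        String[] link = extract(sample);
        if (link == null) {
            System.out.println("no link found");
        } else {
            System.out.println("url:  " + link[0]);
            System.out.println("text: " + link[1]);
        }
    }
}
```

Returning null on a failed match keeps the caller from reading groups that were never set, which is the first thing to go wrong if the ad server changes its output.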
