Java Regex does not work with special characters

Question

I have a problem with my parser. I want to read an image link from a website, and this normally works fine. But today I got a link that contains special characters, and the usual regex did not work.

This is what my code looks like:

Pattern t = Pattern.compile(regex.trim());

Matcher x = t.matcher(content[i].toString());
if(x.find())
{
    values[i] = x.group(1);
}

And this is the part of the HTML that causes trouble:

<div class="open-zoomview zoomlink" itemscope="" itemtype="http://schema.org/Product"> 
<img class="zoomLink productImage" src="

http://tnm.scene7.com/is/image/TNM/template_335x300?$plus_335x300$&amp;$image=is{TNM/1098845000_prod_001}&amp;$ausverkauft=1&amp;$0prozent=1&amp;$versandkostenfrei=0" alt="Produkt Atika HB 60 Benzin-Heckenschere" title="Produkt Atika HB 60 Benzin-Heckenschere" itemprop="image" /> 
</div>

And this is the regex I am using to get the part in the src attribute:

<img .*src="(.*?)" .*>

I believe it has something to do with all the special characters inside the link, but I'm not sure how to escape all of them. I already tried

Pattern.quote(content[i].toString())

But the outcome was the same: nothing found.

Answer 1

The . character matches everything except newline characters by default. Therefore, your pattern won't match if there are newlines inside the img tag.

Use Pattern.compile(..., Pattern.DOTALL) or prepend (?s) to your pattern.

In dotall mode, the expression . matches any character, including a line terminator. By default this expression does not match line terminators.

http://docs.oracle.com/javase/1.5.0/docs/api/java/util/regex/Pattern.html#DOTALL
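
As a minimal sketch of the difference, the snippet below runs the question's pattern against a shortened version of the problematic tag, once with default flags and once with Pattern.DOTALL. The class name DotAllDemo and the helper firstSrc are made up for illustration, and the HTML is abbreviated from the question.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DotAllDemo {
    // Shortened version of the tag from the question; note the line breaks inside src="...".
    public static final String HTML =
            "<img class=\"zoomLink productImage\" src=\"\n\n"
            + "http://tnm.scene7.com/is/image/TNM/template_335x300?$plus_335x300$\" "
            + "alt=\"Produkt\" itemprop=\"image\" />";

    // The pattern from the question, unchanged.
    public static final String REGEX = "<img .*src=\"(.*?)\" .*>";

    // Hypothetical helper: returns the trimmed first capture group, or null if nothing matches.
    public static String firstSrc(Pattern pattern, String html) {
        Matcher m = pattern.matcher(html);
        return m.find() ? m.group(1).trim() : null;
    }

    public static void main(String[] args) {
        // Default flags: "." stops at the line break inside src, so there is no match.
        System.out.println(firstSrc(Pattern.compile(REGEX), HTML));
        // DOTALL: "." also matches line terminators, so the URL is captured.
        System.out.println(firstSrc(Pattern.compile(REGEX, Pattern.DOTALL), HTML));
    }
}
```

The captured group still contains the leading line breaks from the attribute, which is why the helper calls trim() before returning it.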

Answer 2

You should actually use <img\s.*?\bsrc=["'](.*?)["'].*?> together with the (?s) modifier (double each backslash when writing the pattern as a Java string literal).
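
To see why lazy quantifiers matter once (?s) is switched on, here is a small sketch with two img tags. When the dot matches newlines, the greedy .* in the question's pattern can run across tag boundaries and settle on the last src in the input, while a lazy pattern of the kind suggested in this answer stops at the first one. The HTML snippet, the class name LazySrcDemo, and the helper firstSrc are invented for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LazySrcDemo {
    // Hypothetical input: two tags, the first with a line break inside src.
    public static final String HTML =
            "<img class=\"a\" src=\"\nfirst.png\" alt=\"one\" />\n"
            + "<img class=\"b\" src=\"second.png\" alt=\"two\" />";

    // The question's pattern with (?s) prepended: greedy .* before src.
    public static final Pattern GREEDY =
            Pattern.compile("(?s)<img .*src=\"(.*?)\" .*>");

    // Lazy quantifiers, either quote style, (?s) so "." matches line breaks.
    public static final Pattern LAZY =
            Pattern.compile("(?s)<img\\s.*?\\bsrc=[\"'](.*?)[\"'].*?>");

    // Hypothetical helper: trimmed first capture group, or null if nothing matches.
    public static String firstSrc(Pattern p, String html) {
        Matcher m = p.matcher(html);
        return m.find() ? m.group(1).trim() : null;
    }

    public static void main(String[] args) {
        System.out.println(firstSrc(GREEDY, HTML)); // second.png (greedy .* skipped the first tag)
        System.out.println(firstSrc(LAZY, HTML));   // first.png
        System.out.println(firstSrc(LAZY, "<img src='x.png'>")); // single quotes also match
    }
}
```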

Answer 3

Your regex should be:

String regex = "<img .*src=\"(.*?)\" .*>";

Answer 4

This is probably caused by the newline within the tag. The . character won't match it.

Did you consider not using regex to parse HTML? Using regex for HTML parsing is a notoriously fragile approach. Please consider using a parsing library such as JSoup for this.

4 answers

Answer 1: 2, accepted, 2012-09-27 13:19:32

Answer 2: 0, 2012-09-27 13:14:44

Answer 3: 0, 2012-09-27 13:21:21

Answer 4: 0, 2012-09-27 13:22:24
