Get only href and src content from HTML

Question

I am wondering how to extract only the href and src values from HTML content. I tried a regular expression, but it failed.

This is the text that I want to get the href and src content from:

<a href="http://rdmobile.fr/blog/mobile-la-pub-consomme-plus-que-les-applications-elles-memes/"><img align="left" hspace="5" width="150" height="150" src="http://rdmobile.fr/blog/wp-content/uploads/2012/03/angry-birds-150x150.jpg" class="alignleft tfe wp-post-image" alt="angry-birds" title="angry-birds" /></a>Si vous aussi vous vous étonnez de voir votre batterie fondre comme neige au soleil dès lors que jouez à Angry Birds, rassurez-vous, c’est normal. Des chercheurs de l&#8217;université de Purdue se sont intéressés aux publicités destinées majoritairement aux applications gratuites, et oui, comment les développeurs mangent-ils autrement ? Plus sérieusement, cette étude, publiée sur le [...]

I want to extract data like this:

href content : http://rdmobile.fr/blog/mobile-la-pub-consomme-plus-que-les-applications-elles-memes/
src content : http://rdmobile.fr/blog/wp-content/uploads/2012/03/angry-birds-150x150.jpg

Can anyone help me with this? I would also like to learn basic regular expressions.

Thanks, Isuru

Answer 1

A DOM parser like JSoup is great for this type of problem. It allows straightforward interaction with the document using CSS-style selectors:

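import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

// 'url' is assumed to be a String holding the page address; get() can throw IOException.
// For HTML already held in a String, Jsoup.parse(html) can be used instead of fetching.
Document document = Jsoup.connect(url).get();
// Attribute selectors: every element that has a src (or href) attribute.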
Elements elementsWithSrcAttributes = document.select("[src]");
Elements elementsWithHrefAttributes = document.select("[href]");

for (Element element: elementsWithSrcAttributes) {
    System.out.println("src content: " + element.attr("src"));
}

for (Element element: elementsWithHrefAttributes) {
    System.out.println("href content: " + element.attr("href"));
}

Answer 2

You could parse the content using an XML parser.

Look at Parsing XML Data.
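
As a rough sketch of this suggestion, the standard Java DOM parser can walk a fragment and collect one attribute. This assumes the markup is well-formed XML (quoted attributes, closed tags), which real-world HTML often is not. The names AttributeWalker and collect and the synthetic <root> wrapper are illustrative choices, and the sample string is a trimmed version of the markup in the question.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of any library.
class AttributeWalker {

    /** Returns the value of attrName from every element that carries it, in document order. */
    static List<String> collect(String fragment, String attrName) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // Wrap in a synthetic root so trailing text after </a> does not break parsing.
            String wrapped = "<root>" + fragment + "</root>";
            Document doc = builder.parse(new InputSource(new StringReader(wrapped)));

            NodeList all = doc.getElementsByTagName("*"); // "*" matches every element
            List<String> values = new ArrayList<>();
            for (int i = 0; i < all.getLength(); i++) {
                Element element = (Element) all.item(i);
                if (element.hasAttribute(attrName)) {
                    values.add(element.getAttribute(attrName));
                }
            }
            return values;
        } catch (Exception ex) {
            // Assumption: any parse failure is treated as "input was not well-formed".
            throw new IllegalArgumentException("Fragment is not well-formed XML", ex);
        }
    }

    public static void main(String[] args) {
        // Trimmed version of the markup from the question.
        String html = "<a href=\"http://rdmobile.fr/blog/mobile-la-pub-consomme-plus-que-les-applications-elles-memes/\">"
                + "<img src=\"http://rdmobile.fr/blog/wp-content/uploads/2012/03/angry-birds-150x150.jpg\" alt=\"angry-birds\" /></a>"
                + "Si vous aussi vous vous étonnez";
        System.out.println("href content : " + collect(html, "href"));
        System.out.println("src content : " + collect(html, "src"));
    }
}
```

If the input is not well-formed, the parser rejects it, which is why an HTML-aware parser is usually the better fit for pages found in the wild.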

Answer 3

You don't want to use regular expressions for that. Just... just don't. Bad things happen.

What you want to use is XPath. For a given HTML document, the //a/@href XPath expression returns the href attribute of every a node (/a/@href would match only an a element at the document root). Think of it as regular expressions for XML.

The hard part isn't XPath, which is relatively straightforward, but obtaining a valid DOM from an HTML file. I'd recommend Cyberneko, but I have no idea whether that's compatible with your Android requirement.
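
A minimal sketch of the XPath approach using only the javax.xml.xpath classes from the Java standard library. It sidesteps the hard part by assuming the fragment is already well-formed XML; for arbitrary HTML, a tolerant parser would have to produce the DOM first. The class name XPathAttributes, the select helper, and the <root> wrapper are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of any library.
class XPathAttributes {

    /** Evaluates an XPath expression that selects attribute nodes and returns their values. */
    static List<String> select(String fragment, String expression) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            // Synthetic root element so the fragment has exactly one top-level node.
            InputSource source = new InputSource(new StringReader("<root>" + fragment + "</root>"));
            NodeList nodes = (NodeList) xpath.evaluate(expression, source, XPathConstants.NODESET);

            List<String> values = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                values.add(nodes.item(i).getNodeValue()); // value of the attribute node
            }
            return values;
        } catch (Exception ex) {
            // Assumption: bad input or a bad expression is reported as one unchecked exception.
            throw new IllegalArgumentException("Could not evaluate " + expression, ex);
        }
    }

    public static void main(String[] args) {
        // Trimmed version of the markup from the question.
        String html = "<a href=\"http://rdmobile.fr/blog/mobile-la-pub-consomme-plus-que-les-applications-elles-memes/\">"
                + "<img src=\"http://rdmobile.fr/blog/wp-content/uploads/2012/03/angry-birds-150x150.jpg\" /></a>";
        System.out.println("href content : " + select(html, "//a/@href"));
        System.out.println("src content : " + select(html, "//img/@src"));
    }
}
```

The leading // matters here: because of the wrapper element, /a/@href would select nothing, while //a/@href finds a elements at any depth.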

Answer 4

Extracting data from HTML using regular expressions is not generally recommended, but the following is an example of one basic approach:

String str = "<a href=\"http://rdmobile.fr/blog/mobile-la-pub-consomme-plus-que-les-applications-elles-memes/\"><img align=\"left\" hspace=\"5\" width=\"150\" height=\"150\" src=\"http://rdmobile.fr/blog/wp-content/uploads/2012/03/angry-birds-150x150.jpg\" class=\"alignleft tfe wp-post-image\" alt=\"angry-birds\" title=\"angry-birds\" /></a>Si vous aussi vous vous étonnez de voir votre batterie fondre comme neige au soleil dès lors que jouez à Angry Birds, rassurez-vous, c’est normal. Des chercheurs de l&#8217;université de Purdue se sont intéressés aux publicités destinées majoritairement aux applications gratuites, et oui, comment les développeurs mangent-ils autrement ? Plus sérieusement, cette étude, publiée sur le [...]";        
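// Requires java.util.regex.Matcher and java.util.regex.Pattern.
// Group 1 captures the attribute value between the double quotes.
Matcher m = Pattern.compile(" (?:href|src)=\"([^\"]+)").matcher(str);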

while (m.find()) {
    System.out.println(m.group(1));
}

The above matches any sequence of one or more characters that are not ", but only when that sequence is preceded by either ' href="' or ' src="'.

Therefore it will not match if the attribute value is surrounded by single quotes or no quotes, or if there are any spaces around the =.

Further explanation on request, or see, for example, Regular-Expressions.info.
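
Those limitations (single quotes, missing quotes, spaces around the =) can be relaxed somewhat with a longer pattern. The following is only a sketch with hypothetical names (TolerantAttributeRegex, extract); it still knows nothing about comments, scripts, or malformed markup, so a parser remains the safer option.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of any library.
class TolerantAttributeRegex {

    // group 1: attribute name, not preceded by a word character or hyphen (skips e.g. data-href)
    // group 2: double-quoted value, group 3: single-quoted value, group 4: unquoted value
    private static final Pattern ATTRIBUTE = Pattern.compile(
            "(?<![\\w-])(href|src)\\s*=\\s*(?:\"([^\"]*)\"|'([^']*)'|([^\\s\"'>]+))",
            Pattern.CASE_INSENSITIVE);

    /** Returns one "name content : value" line per href or src attribute found. */
    static List<String> extract(String html) {
        List<String> lines = new ArrayList<>();
        Matcher m = ATTRIBUTE.matcher(html);
        while (m.find()) {
            String value = m.group(2) != null ? m.group(2)
                    : m.group(3) != null ? m.group(3)
                    : m.group(4);
            lines.add(m.group(1).toLowerCase(Locale.ROOT) + " content : " + value);
        }
        return lines;
    }

    public static void main(String[] args) {
        // Made-up input mixing the quoting styles the simpler pattern misses.
        String html = "<a href='page.html'><img src = \"pic.jpg\"/></a><a HREF=other.html>x</a>";
        for (String line : extract(html)) {
            System.out.println(line);
        }
    }
}
```

One known gap: an unquoted value immediately followed by /> would swallow the slash, since only whitespace, quotes, and > end an unquoted value in this pattern.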
