Regex problem: trying to extract multiple <article>s

Question

I have a web page that I need to extract information from.

There are multiple <article> tags that I need to loop through (I need to extract the content within them). Each article tag has many attributes: "id", "class", etc.

I have no idea how to write the regex that I require.

What I have so far is:

<article ([a-zA-Z\\s"\\S][^>]*)>

This is capable of extracting all the tags with their attributes; however, I don't know how to capture the information WITHIN the tags.

I feel like I need to write a regex similar to: "get everything within <article ([a-zA-Z\\s"\\S][^>]*)> until you reach the next </article> tag", but I have no idea how to do that...

Thanks for your input.

Answer 1

Regex? Please reconsider. From one of your comments: "I was building this for a Chrome Extension so it was being done with JavaScript." Then I suggest you use the browser's built-in XML DOM parser.

To load XML from a string variable xmlText:

var parser = new DOMParser();
var xmlDoc = parser.parseFromString(xmlText, "text/xml");

To load XML from a separate XML file:

var xhttp = new XMLHttpRequest();
xhttp.open("GET", "articles.xml", false);
xhttp.send();
var xmlDoc = xhttp.responseXML;

This yields a convenient object structure that you can navigate through.

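var articles = xmlDoc.getElementsByTagName('article');
for (var i = 0; i < articles.length; i++) {
    var article = articles[i];
    var id = article.getAttribute('id');
    // "class" is a reserved word in JavaScript, so use a different variable name
    var className = article.getAttribute('class');
    // nodeValue is null for element nodes; textContent returns the text inside
    var content = article.textContent;
    // ...
}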

Answer 2

Depending on your programming language, you can probably find HTML parsing libraries. If you cannot find those, you could probably use libraries that loosely parse XML (parsers that don't require a fully valid XML document). You could then simply get a list of article elements and parse through them individually. With an HTML parser you can probably also read out the attributes!

If the aforementioned does not work, maybe you could split the text on </article>, then split each resulting piece on <article and read the second index in the array. You can then split that on > and you will be left with the element attributes at the first index and the content at the second. If anybody finds a regex solution that answers this question better, please let me know!
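The splitting idea above could be sketched roughly as follows. This is only a minimal illustration, not a robust parser: extractArticles and the sample string are made-up names for the example, it assumes that articles are not nested and that no attribute value contains a ">" character, and it cuts at the first ">" with indexOf rather than splitting on every ">", because the content itself may contain further tags.

```javascript
// Sketch of the split approach; extractArticles is a hypothetical helper name.
function extractArticles(html) {
    var results = [];
    // Each piece ends just before a closing </article> tag.
    var pieces = html.split('</article>');
    for (var i = 0; i < pieces.length; i++) {
        // Index 1 (the second index) holds everything after the opening "<article".
        var afterOpen = pieces[i].split('<article')[1];
        if (afterOpen === undefined) {
            continue; // no opening tag in this piece (e.g. trailing text)
        }
        // Cut at the first ">" only, since the content may contain further tags.
        var end = afterOpen.indexOf('>');
        if (end === -1) {
            continue; // malformed opening tag
        }
        results.push({
            attributes: afterOpen.slice(0, end).trim(),
            content: afterOpen.slice(end + 1)
        });
    }
    return results;
}

// Made-up sample input for illustration.
var sample = '<p>intro</p>' +
    '<article id="a1" class="post">First <b>story</b></article>' +
    '<article id="a2">Second story</article><footer>end</footer>';
var found = extractArticles(sample);
console.log(found.length);
console.log(found[0].attributes);
console.log(found[0].content);
```

The attributes come back as one raw string per article, so they would still need to be split up if individual values such as the id are required. A real HTML parser avoids all of these caveats.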

Hope it helps.

Pim

Answer 3

Normally, I hate when people give this answer, but: jQuery can do that for you! Since you're already using the jQuery framework, take advantage of the secondary functionality of the jQuery function to parse the HTML string into a series of DOM nodes. You can then use the find function to query the children of your top node! Your final code will wind up looking something like this:

$(htmlString)
    .find('article')
    .each(function(index, article) {
        //Extract information from $(article).
    });
