Regex to Parse XML - RSS Feed

<atom:link rel="self" href="http://www.independent.co.uk/"/>
<item>
<title>
Coronavirus: Why the Covid-19 economic stimulus deal will make it to Trump&apos;s desk
</title>
<link>
https://www.independent.co.uk/news/world/americas/us-politics/coronavirus-economic-stimulus-deal-covid-19-trump-bill-senate-house-a9419976.html
</link>
<description>
<![CDATA[
News Analysis: When Senate tries to pass major bills, there's always one day of chaos. Monday appears to be that day.
]]>
</description>

For the content above, I would like to extract the title, link, and description. How can I formulate a regex rule to capture these?

The end goal is to dump the extracted content into a predefined SQL database that I created.

As suggested in the comments, you should most likely use an XML parser rather than a regex. However, since the format of this RSS feed is probably consistent and fairly simple, a regex solution might work too.

For the current example you can use:

<(.+)>\s*(?:<!\[CDATA\[)?\s*(.*)\s*(?:]]>)?\s*<\/\1>

Explanation:

  • <(.+)> - matches the opening tag and captures its name
  • \s* - matches optional whitespace characters (the newline in your example)
  • (?:<!\[CDATA\[)? - non-capturing group for <![CDATA[, matched 0 or 1 times
  • \s* - matches optional whitespace characters
  • (.*) - capturing group that matches any characters except line breaks, i.e. the content line
  • \s* - matches optional whitespace characters
  • (?:]]>)? - non-capturing group for ]]> (the CDATA terminator), matched 0 or 1 times
  • \s* - matches optional whitespace characters
  • <\/\1> - matches the closing tag with the same name as the opening tag (backreference to the first capture group)

    // The input keeps its line breaks: "." does not match newlines, so (.*)
    // captures exactly one content line per element.
    const input = `<title>
    Coronavirus: Why the Covid-19 economic stimulus deal will make it to Trump&apos;s desk
    </title>
    <link>
    https://www.independent.co.uk/news/world/americas/us-politics/coronavirus-economic-stimulus-deal-covid-19-trump-bill-senate-house-a9419976.html
    </link>
    <description>
    <![CDATA[
    News Analysis: When Senate tries to pass major bills, there's always one day of chaos. Monday appears to be that day.
    ]]>
    </description>`;
    const regex = /<(.+)>\s*(?:<!\[CDATA\[)?\s*(.*)\s*(?:]]>)?\s*<\/\1>/g;
    let result;
    // exec() with the g flag resumes after the previous match and returns null when done.
    while ((result = regex.exec(input)) !== null) {
      console.log(result[1] + ": " + result[2]);
    }
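Since the end goal is a database insert, it may help to collect the three fields of each item into one record. The following is a minimal sketch of that idea, not a replacement for a real XML parser. It swaps in a different, tag-specific pattern that tolerates attributes on the opening tag and content spanning several lines; the original pattern appears to skip such elements, because its first group would capture the attributes too. The helper names (`extractField`, `extractItems`, `toInsert`), the table name `rss_items`, its column names, and the `?` placeholder style are all assumptions made for illustration; the sample feed is the item from the question with a closing `</item>` added.

```javascript
const FIELDS = ["title", "link", "description"];

// Decode the five predefined XML entities. &amp; goes last so that
// "&amp;lt;" ends up as "&lt;" rather than "<".
function decodeEntities(text) {
  return text
    .replace(/&lt;/g, "<")
    .replace(/&gt;/g, ">")
    .replace(/&quot;/g, '"')
    .replace(/&apos;/g, "'")
    .replace(/&amp;/g, "&");
}

// Return the trimmed content of the first <tag> element in xml, or null.
function extractField(xml, tag) {
  const pattern = new RegExp(
    `<${tag}(?:\\s[^>]*)?>\\s*(<!\\[CDATA\\[)?\\s*([\\s\\S]*?)\\s*(?:\\]\\]>)?\\s*</${tag}>`
  );
  const match = pattern.exec(xml);
  if (match === null) return null;
  // Text inside CDATA is literal, so entities are decoded only outside it.
  return match[1] ? match[2] : decodeEntities(match[2]);
}

// Split the feed into <item> bodies and build one record per item.
// Assumes items are not nested and each field occurs at most once per item.
function extractItems(feed) {
  const items = [];
  for (const [, body] of feed.matchAll(/<item(?:\s[^>]*)?>([\s\S]*?)<\/item>/g)) {
    const record = {};
    for (const tag of FIELDS) record[tag] = extractField(body, tag);
    items.push(record);
  }
  return items;
}

// Hypothetical table and columns; parameters are kept separate from the SQL
// text so the database driver can bind them safely.
function toInsert(record) {
  return {
    sql: "INSERT INTO rss_items (title, link, description) VALUES (?, ?, ?)",
    params: FIELDS.map((field) => record[field]),
  };
}

const sampleFeed = `<atom:link rel="self" href="http://www.independent.co.uk/"/>
<item>
<title>
Coronavirus: Why the Covid-19 economic stimulus deal will make it to Trump&apos;s desk
</title>
<link>
https://www.independent.co.uk/news/world/americas/us-politics/coronavirus-economic-stimulus-deal-covid-19-trump-bill-senate-house-a9419976.html
</link>
<description>
<![CDATA[
News Analysis: When Senate tries to pass major bills, there's always one day of chaos. Monday appears to be that day.
]]>
</description>
</item>`;

const items = extractItems(sampleFeed);
for (const item of items) {
  console.log(toInsert(item));
}
```

Passing the values as bound parameters rather than concatenating them into the SQL string matters here, because feed text routinely contains quotes and apostrophes.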

