Regexp to find images in html (golang)
I'm parsing an XML RSS feed from a couple of different sources and I want to find the images in the HTML.
I did some research and found a regex that I think might work:
/<img[^>]+src="?([^"\s]+)"?\s*\/>/g
but I have trouble using it in Go. It gives me errors because I don't know how to make it search with that expression.
I tried using it as a string, but it doesn't escape properly with single or with double quotes. I also tried using it bare, just like that, and it gives me an error.
Any ideas?
Using a proper HTML parser is always better for parsing HTML; however, a cheap, hackish regex can also work fine. Here's an example:
var imgRE = regexp.MustCompile(`<img[^>]+\bsrc=["']([^"']+)["']`)

// If your img tags are well formed with double quotes, use this instead; it's more efficient.
// var imgRE = regexp.MustCompile(`<img[^>]+\bsrc="([^"]+)"`)

func findImages(htm string) []string {
	imgs := imgRE.FindAllStringSubmatch(htm, -1)
	out := make([]string, len(imgs))
	for i := range out {
		// Index 1 is the first capture group, i.e. the src value.
		out[i] = imgs[i][1]
	}
	return out
}
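For comparison with the parser route mentioned above, here is a minimal sketch that uses only the standard library: encoding/xml in its non-strict mode, with the settings its documentation suggests for reading HTML. The function name imgSrcs and the sample markup are made up for illustration, and this lenient decoder is an assumption of convenience, not a full HTML5 parser; badly broken markup simply ends the walk early.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"strings"
)

// imgSrcs (hypothetical helper) walks the markup token by token and collects
// the src attribute of every <img> element. It stops at the first token
// error, including io.EOF, and returns what it has collected so far.
func imgSrcs(htm string) []string {
	d := xml.NewDecoder(strings.NewReader(htm))
	// Lenient settings described in the encoding/xml docs for HTML input.
	d.Strict = false
	d.AutoClose = xml.HTMLAutoClose
	d.Entity = xml.HTMLEntity

	var out []string
	for {
		tok, err := d.Token()
		if err != nil {
			return out
		}
		start, ok := tok.(xml.StartElement)
		if !ok || !strings.EqualFold(start.Name.Local, "img") {
			continue
		}
		for _, a := range start.Attr {
			if strings.EqualFold(a.Name.Local, "src") {
				out = append(out, a.Value)
			}
		}
	}
}

func main() {
	htm := `<p>One <img src='img1single.jpg'> two <img alt="x" src="img2double.jpg"/></p>`
	fmt.Println(imgSrcs(htm))
}
```

Unlike the regex, this does not care about attribute order or quote style, at the cost of giving up on input the decoder cannot tokenize.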
Sorry, I have not worked with Go before, but this seems to work. I tried it at https://tour.golang.org/welcome/1.
package main

import (
	"fmt"
	"regexp"
)

func main() {
	var myString = `<img src='img1single.jpg'><img src="img2double.jpg">`
	var myRegex = regexp.MustCompile(`<img[^>]+\bsrc=["']([^"']+)["']`)
	var imgTags = myRegex.FindAllStringSubmatch(myString, -1)
	// Each match holds the full text at index 0 and the src value at index 1.
	for _, tag := range imgTags {
		fmt.Println(tag[1])
	}
}
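On the escaping trouble described in the question: Go has no /.../g regex literal, so the pattern has to be passed as a string, and a backtick raw string avoids escaping the quotes and backslashes. The "global" part is covered by the FindAll family with n = -1. The sketch below compiles the question's own pattern that way; the names questionRE and firstSrcs and the sample input are only for illustration.

```go
package main

import (
	"fmt"
	"regexp"
)

// The pattern from the question without the JavaScript-style /.../g wrapper.
// Inside backticks the double quotes and backslashes need no extra escaping.
var questionRE = regexp.MustCompile(`<img[^>]+src="?([^"\s]+)"?\s*\/>`)

// firstSrcs (hypothetical helper) returns the first capture group of every match.
func firstSrcs(htm string) []string {
	var out []string
	for _, m := range questionRE.FindAllStringSubmatch(htm, -1) {
		out = append(out, m[1])
	}
	return out
}

func main() {
	// Only the first tag ends in "/>", so only it can match this pattern.
	fmt.Println(firstSrcs(`<img src="a.jpg" /><img src="b.png">`))
}
```

Because that pattern insists on a closing "/>", it skips tags written as plain <img ...>, which is one reason the patterns in the answers drop that part.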
I suggest using HtmlAgilityPack to parse any DOM/XML-like content.
Read the document with:
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(sourceHtml);
Then select nodes with an XPath expression. A regex is fine, but capture groups and similar issues make the job more complex.
doc.DocumentNode.SelectSingleNode(XPath here)
or
doc.DocumentNode.SelectNodes("//img") // this should give all img tags
I suggest this because it seems the RSS feed serves some HTML content ;) So get the XML, parse it with XMLDoc, get the HTML content that you need, then get all the images this way.
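Since the question is about Go, the same two-step idea (decode the feed first, then look for images in the embedded HTML) might look like the sketch below. It assumes an RSS 2.0 layout with the HTML in each item's description element; the rss struct, the feedImages name, and the sample feed are illustrative only, and the image pattern is the one from the Go answer in this thread.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"regexp"
)

// rss mirrors only the part of an RSS 2.0 document this sketch needs
// (assumed layout: rss > channel > item > description).
type rss struct {
	Items []struct {
		Description string `xml:"description"`
	} `xml:"channel>item"`
}

var imgRE = regexp.MustCompile(`<img[^>]+\bsrc=["']([^"']+)["']`)

// feedImages (hypothetical helper) decodes the feed and returns the src of
// every img tag found in the item descriptions, in document order.
func feedImages(feed []byte) ([]string, error) {
	var doc rss
	if err := xml.Unmarshal(feed, &doc); err != nil {
		return nil, err
	}
	var out []string
	for _, it := range doc.Items {
		for _, m := range imgRE.FindAllStringSubmatch(it.Description, -1) {
			out = append(out, m[1])
		}
	}
	return out, nil
}

func main() {
	feed := []byte(`<rss version="2.0"><channel><item>
<description><![CDATA[<p><img src="img2double.jpg"> text <img src='img1single.jpg'></p>]]></description>
</item></channel></rss>`)
	imgs, err := feedImages(feed)
	if err != nil {
		fmt.Println("decode error:", err)
		return
	}
	fmt.Println(imgs)
}
```

Decoding the XML first means CDATA sections and escaped entities in the description are already unwrapped before the image search runs.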
After the comment, it seems only a regex is needed. My pattern is:
<img.+?src=[\"'](.+?)[\"'].*?>
for the input
<img src='img1single.jpg'>
<img src="img2double.jpg">
The result looks fine in .NET. Read each match in a foreach via
.Groups[1].Value
Regards.