
Regexp to find images in html (golang)

I'm parsing an XML RSS feed from a couple of different sources and I want to find the images in the HTML.

I did some research and found a regex that I think might work:

/<img[^>]+src="?([^"\s]+)"?\s*\/>/g

but I have trouble using it in Go. It gives me errors because I don't know how to make it search with that expression.

I tried using it as a string, but it doesn't escape properly with single or double quotes. I tried using it just like that, bare, and it gives me an error.

Any ideas?

Using a proper HTML parser is always the better way to parse HTML; however, a cheap, hackish regex can also work fine. Here's an example:

var imgRE = regexp.MustCompile(`<img[^>]+\bsrc=["']([^"']+)["']`)
// if your img's are properly formed with doublequotes then use this, it's more efficient.
// var imgRE = regexp.MustCompile(`<img[^>]+\bsrc="([^"]+)"`)
func findImages(htm string) []string {
    imgs := imgRE.FindAllStringSubmatch(htm, -1)
    out := make([]string, len(imgs))
    for i := range out {
        out[i] = imgs[i][1]
    }
    return out
}
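
For comparison, here is a minimal sketch of the parser-style approach mentioned above. It uses only the standard library's encoding/xml decoder in its non-strict, HTML-tolerant mode, which is an assumption made to keep the example dependency-free; for real-world HTML the golang.org/x/net/html package is the more usual choice. The function name imageSources is hypothetical.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"strings"
)

// imageSources walks the token stream and collects the src attribute
// of every <img> start tag. It stops at the first decoding error
// (including end of input) and returns whatever it has found so far.
func imageSources(htm string) []string {
	d := xml.NewDecoder(strings.NewReader(htm))
	d.Strict = false                 // tolerate HTML-style markup
	d.AutoClose = xml.HTMLAutoClose  // treat <img>, <br>, ... as self-closing
	d.Entity = xml.HTMLEntity        // understand entities such as &nbsp;

	var out []string
	for {
		tok, err := d.Token()
		if err != nil {
			break
		}
		se, ok := tok.(xml.StartElement)
		if !ok || !strings.EqualFold(se.Name.Local, "img") {
			continue
		}
		for _, a := range se.Attr {
			if strings.EqualFold(a.Name.Local, "src") {
				out = append(out, a.Value)
			}
		}
	}
	return out
}

func main() {
	fmt.Println(imageSources(`<p>hi <img src='a.jpg'><img alt="x" src="b.png"></p>`))
}
```

Unlike the regex, this does not care about attribute order or quote style, but the non-strict XML decoder can still stumble on messy markup, so treat it as a sketch rather than a robust HTML parser.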


Ah, sorry. I haven't worked with Go before, but this seems to work. I tried it at https://tour.golang.org/welcome/1.

package main

import (
    "fmt"
    "regexp"
)

func main() {
    myString := `<img src='img1single.jpg'><img src="img2double.jpg">`
    // Capture group 1 holds the src value, quoted with either ' or ".
    myRegex := regexp.MustCompile(`<img[^>]+\bsrc=["']([^"']+)["']`)
    // Each match is a slice: index 0 is the whole tag, index 1 the src value.
    for _, imgTag := range myRegex.FindAllStringSubmatch(myString, -1) {
        fmt.Println(imgTag[1])
    }
}

I suggest using HtmlAgilityPack to parse any DOM/XML-like content.

Load the document with:

HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(sourceHtml); 

Then select nodes with an XPath expression. Regex is fine, but grouping and similar issues make the job complex:

doc.DocumentNode.SelectSingleNode(XPath here)

or

doc.DocumentNode.SelectNodes("//img")  // this should give all img tags

I suggest this because it seems the RSS feed serves some HTML content ;) So get the XML, parse it with XMLDoc to extract the HTML content you need, then get all the images this way. I'm leaving this here as an open answer.
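
Since the question is about Go, the two-step flow described above (parse the feed XML first, then look for images in the embedded HTML) could be sketched roughly as follows. This assumes an RSS 2.0 layout with the HTML in each item's description element; the rss struct, the feedImages function and the sample feed are all hypothetical, and the image regex is borrowed from the earlier answer.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"regexp"
)

// rss mirrors only the parts of an RSS 2.0 document needed here.
type rss struct {
	Items []struct {
		Description string `xml:"description"`
	} `xml:"channel>item"`
}

var imgRE = regexp.MustCompile(`<img[^>]+\bsrc=["']([^"']+)["']`)

// feedImages decodes the feed and returns every image URL found in the
// item descriptions, in document order.
func feedImages(feed []byte) ([]string, error) {
	var doc rss
	if err := xml.Unmarshal(feed, &doc); err != nil {
		return nil, err
	}
	var out []string
	for _, it := range doc.Items {
		// The XML decoder has already unwrapped CDATA and entities,
		// so Description holds plain HTML at this point.
		for _, m := range imgRE.FindAllStringSubmatch(it.Description, -1) {
			out = append(out, m[1])
		}
	}
	return out, nil
}

const sampleFeed = `<?xml version="1.0"?>
<rss version="2.0"><channel><title>demo</title>
<item><description><![CDATA[<p><img src="a.jpg"> text <img class="x" src='b.png'></p>]]></description></item>
<item><description>&lt;img src="c.gif"&gt;</description></item>
</channel></rss>`

func main() {
	imgs, err := feedImages([]byte(sampleFeed))
	if err != nil {
		panic(err)
	}
	fmt.Println(imgs)
}
```

Decoding the XML first means the regex only ever sees the HTML fragment, not the escaped or CDATA-wrapped form that appears in the raw feed.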

After the comment, I think you just need a regex; my pattern is

 <img.+?src=[\"'](.+?)[\"'].*?>

for the input

<img src='img1single.jpg'>
<img src="img2double.jpg">

and the result seems fine. In .NET you get each captured value inside a foreach via

.Groups[1].Value
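
As a rough Go counterpart, the same pattern can be written as a raw string, where the backslashes before the double quotes are no longer needed. This is only a sketch: the names lazyImgRE and lazySources are hypothetical, and index 1 of each submatch plays the role of .Groups[1].Value.

```go
package main

import (
	"fmt"
	"regexp"
)

// lazyImgRE is the lazy-quantifier pattern from this answer as a Go raw string.
var lazyImgRE = regexp.MustCompile(`<img.+?src=["'](.+?)["'].*?>`)

// lazySources returns the first capture group of every match.
func lazySources(htm string) []string {
	var out []string
	for _, m := range lazyImgRE.FindAllStringSubmatch(htm, -1) {
		out = append(out, m[1]) // m[0] is the whole tag, m[1] the src value
	}
	return out
}

func main() {
	input := "<img src='img1single.jpg'>\n<img src=\"img2double.jpg\">"
	fmt.Println(lazySources(input))
}
```

Note that in Go's regexp package `.` does not match a newline by default, so this pattern will not match an img tag that is split across lines.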

Regards.
