Regular Expression to find src from IMG tag

I have a web page. From it I want to find all the IMG tags and get the SRC of each one.

What regular expression would do this?

Some explanation:

I am scraping a web page. All the data is displayed correctly except the images. To solve this, my idea is to find each SRC and replace it, e.g.

/images/header.jpg

and replace this with

www.stackoverflow/images/header.jpg

You don't want a regular expression, you want a parser. From this question:

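using System;
using HtmlAgilityPack; // HtmlWeb and HtmlNode come from the Html Agility Pack library

class Program
{
    static void Main(string[] args)
    {
        var web = new HtmlWeb();
        var doc = web.Load("http://www.stackoverflow.com");

        // Select every <img> element that has a src attribute.
        // SelectNodes returns null when nothing matches.
        var nodes = doc.DocumentNode.SelectNodes("//img[@src]");
        if (nodes == null)
            return;

        foreach (var node in nodes)
        {
            // HtmlNode has no src property; read the attribute value instead.
            Console.WriteLine(node.GetAttributeValue("src", ""));
        }
    }
}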

As pointed out, regular expressions are not the perfect solution, but you can usually build one that is good enough for the job. This is what I would use:

string newHtml = Regex.Replace(html,
      @"(?<=<img\s+[^>]*?src=(?<q>['""]))(?<url>.+?)(?=\k<q>)",
      m => "http://www.stackoverflow.com" + m.Value);

It will match src attributes delimited by single or double quotes.

Of course, you would have to change the lambda/delegate to do your own replacing logic, but you get the idea :)
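As a rough illustration of what "your own replacing logic" could look like, here is a minimal sketch in Python rather than C#. Python's `re` module does not allow the variable-length lookbehind used above, so this version captures the prefix in a group instead. The base URL and the function name `absolutize_img_srcs` are assumptions made up for the example. It uses `urljoin` so that srcs which are already absolute are left alone.

```python
import re
from urllib.parse import urljoin

# Hypothetical base URL, used only for illustration.
BASE_URL = "http://www.stackoverflow.com/"

# Group 1: everything up to and including the opening quote.
# Group 2: the quote character itself (single or double).
# Group 3: the URL, up to the matching closing quote.
IMG_SRC = re.compile(r"""(<img\s+[^>]*?src=(['"]))(.+?)(?=\2)""", re.IGNORECASE)

def absolutize_img_srcs(html, base_url=BASE_URL):
    """Rewrite relative img src values so they are absolute against base_url."""
    def repl(match):
        # urljoin returns already-absolute URLs unchanged.
        return match.group(1) + urljoin(base_url, match.group(3))
    return IMG_SRC.sub(repl, html)
```

Calling `absolutize_img_srcs('<img src="/images/header.jpg">')` rewrites the src to point at the assumed base URL. Like the pattern above, this is only good enough for fairly regular markup.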

I have to agree with the parser crowd on this one. In order of increasing input complexity, the hierarchy I choose from is:

  • substrings;
  • regexes; and
  • parsers.

While regexes can handle much more complicated inputs than simple substring operations, they tend to barf pretty easily when faced with the really hairy input possibilities of free-form markup languages.
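To make that failure mode concrete, here is a small sketch in Python. The pattern and the sample markup are illustrative assumptions, not taken from any of the answers. A pattern that handles a tidy tag finds nothing once an earlier attribute value contains a `>`.

```python
import re

# Illustrative pattern: an <img> tag, any run of non-'>' characters, then a quoted src.
NAIVE_IMG_SRC = re.compile(r"""<img\s+[^>]*?src=(['"])(.+?)\1""", re.IGNORECASE)

def naive_img_srcs(html):
    """Return the src values the naive pattern manages to find."""
    return [m.group(2) for m in NAIVE_IMG_SRC.finditer(html)]

simple = '<img class="logo" src="/images/header.jpg">'
hairy = '<img alt="a > b" src="/images/header.jpg">'

print(naive_img_srcs(simple))  # the src is found
print(naive_img_srcs(hairy))   # nothing is found: [^>]* cannot cross the '>' inside alt
```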

XML DOM parsers will be the easiest solution for this problem.

You can use regexes (and they'll work reasonably well if you restrict the input format, such as ensuring img tags don't cross line boundaries and so on), but the simplicity of a parser-based solution will blow regexes out of the water for multi-line, attributes-in-any-order DOM tags.
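For comparison, a parser-based sketch using only Python's standard `html.parser` module (an HTML tokenizer rather than a full XML DOM parser, swapped in here to keep the example dependency-free). The class name `ImgSrcCollector` and the helper `img_sources` are hypothetical names chosen for this example. The parser copes with tags that span several lines, attributes in any order, mixed case, and unquoted values.

```python
from html.parser import HTMLParser

class ImgSrcCollector(HTMLParser):
    """Collects the src attribute of every <img> tag it encounters."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lower-cases tag and attribute names;
        # attrs is a list of (name, value) pairs.
        if tag == "img":
            src = dict(attrs).get("src")
            if src is not None:
                self.sources.append(src)

def img_sources(html):
    """Return the src values of all img tags in html, in document order."""
    parser = ImgSrcCollector()
    parser.feed(html)
    parser.close()
    return parser.sources
```

Self-closing tags such as `<img ... />` are also picked up, because the default `handle_startendtag` forwards them to `handle_starttag`.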

Remember that the source could be generated through JavaScript, so you may not be able to "just" do a regex replacement for img src.

Using Mechanize/Hpricot/Nokogiri in Ruby:

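require 'mechanize'

# Mechanize.new replaces the WWW::Mechanize namespace used by old releases of the gem
agent = Mechanize.new
page  = agent.get('http://www.google.com')
# Prefix every img src with the new host and print the rewritten value
page.search('img').each { |img| puts img['src'] = "http://www.yahoo.com" + img['src'] }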

And you are done!

/// <summary>
/// Gets the src from an IMG tag
/// Assigns values to link and name if htmlTd matches the pattern
/// </summary>
/// <param name="htmlTd">Html containing IMG tag</param>
/// <param name="link">Contains the src contents</param>
/// <param name="name">Contains img element content</param>
/// <returns>true if success, false otherwise</returns>
public static bool TryGetImgDetails(string htmlTd, out string link, out string name)
{
    link = null;
    name = null;

    string pattern = "<img\\s*src\\s*=\\s*(?:\"(?<link>[^\"]*)\"|(?<link>\\S+))\\s*>(?<name>.*)\\s*</img>";

    // Match once, case-insensitively, and reuse the result for both groups.
    Match m = Regex.Match(htmlTd, pattern, RegexOptions.IgnoreCase);
    if (!m.Success)
        return false;

    link = m.Groups["link"].Value;
    name = m.Groups["name"].Value;
    return true;
}
