C# regular expression to match img src="*" type URLs

I have a regex in C# that I'm using to match image tags and pull out the URL. My code works in most situations. The code below will "fix" all relative image URLs by converting them to absolute URLs.

The issue is that the regex will not match the following:

<img height="150" width="202" alt="" src="../Image%20Files/Koala.jpg" style="border: 0px solid black; float: right;">

For example, it matches this one just fine:

<img height="147" width="197" alt="" src="../Handlers/SignatureImage.ashx?cid=5" style="border: 0px solid black;">

Any ideas on how to make it match would be great. I think the issue is the % but I could be wrong.

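// Combining the two flags with '&' evaluates to RegexOptions.None, so only IgnoreCase is passed.
// IgnorePatternWhitespace is left out on purpose: it would strip the leading space from the
// pattern, and the index arithmetic below relies on that space being matched.
Regex rxImages = new Regex(" src=\"([^\"]*)\"", RegexOptions.IgnoreCase);
mc = rxImages.Matches(html);
if (mc.Count > 0)
{
    // Only the first match is examined here.
    Match m = mc[0];
    // Skip the 6 characters of ` src="` and drop the closing quote.
    string relitiveURL = html.Substring(m.Index + 6, m.Length - 7);
    // StartsWith avoids the exception that Substring(0, 4) throws on values shorter than 4 characters.
    if (!relitiveURL.StartsWith("http", StringComparison.OrdinalIgnoreCase))
    {
        Uri absoluteUri = new Uri(baseUri, relitiveURL);
        ret += html.Substring(0, m.Index + 5);
        ret += absoluteUri.ToString();
        ret += html.Substring(m.Index + m.Length - 1, html.Length - (m.Index + m.Length - 1));
        ret = convertToAbsolute(URL, ret);
    }
}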

Using regex to parse images in this way is a bad idea. See here for a good demonstration of why.

You can use an HTML parser such as the HTML Agility Pack to parse the HTML and query it using XPath syntax.
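As a rough illustration of the parser approach, here is a minimal sketch using the HtmlAgilityPack NuGet package. The class name `ImgSrcParserSketch`, the method name `MakeImageUrlsAbsolute`, and the idea of passing the page's base `Uri` as a parameter are assumptions made for this example; they are not taken from the question. Note that the parser may normalize the markup when it is serialized back, so the output is not guaranteed to be byte-for-byte identical outside the rewritten attributes.

```csharp
using System;
using HtmlAgilityPack; // third-party NuGet package "HtmlAgilityPack"

// Hypothetical helper, named for illustration only.
public static class ImgSrcParserSketch
{
    // Resolves the src of every <img> element against baseUri and returns the new markup.
    public static string MakeImageUrlsAbsolute(string html, Uri baseUri)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null (not an empty collection) when nothing matches.
        HtmlNodeCollection images = doc.DocumentNode.SelectNodes("//img[@src]");
        if (images == null)
        {
            return html;
        }

        foreach (HtmlNode img in images)
        {
            string src = img.GetAttributeValue("src", "");
            Uri resolved;
            // TryCreate leaves absolute URLs as they are and resolves relative ones.
            if (Uri.TryCreate(baseUri, src, out resolved))
            {
                img.SetAttributeValue("src", resolved.AbsoluteUri);
            }
        }

        return doc.DocumentNode.OuterHtml;
    }
}
```

The XPath `//img[@src]` restricts the rewrite to image tags, which the bare ` src="..."` pattern in the question does not do.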

First, I would try to skip all the manual parsing and use LINQ to HTML:

HDocument document = HDocument.Load("http://www.microsoft.com");

foreach (HElement element in document.Descendants("img"))
{
   Console.WriteLine("src = " + element.Attribute("src"));
}

If that didn't work, only then would I go back to manual parsing, and I'm sure one of the fine gentle-people here has already posted a working regex for your needs.
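If it does come down to manual parsing, one option is to let `Regex.Replace` visit every match instead of slicing the string by hand around the first one. The following is only a sketch under stated assumptions: the names `ImgSrcFixer` and `MakeAbsolute` and the `example.com` base address are made up for the example, and like the pattern in the question it matches a `src` attribute on any tag, not just `img`. It uses nothing outside the .NET base class library.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, named for illustration only.
public static class ImgSrcFixer
{
    // Matches src="..." or src='...' preceded by whitespace; group "url" holds the value.
    // The flags are combined with '|'; '&' would yield RegexOptions.None.
    private static readonly Regex SrcAttribute = new Regex(
        "(?<prefix>\\ssrc\\s*=\\s*)(?<quote>[\"'])(?<url>.*?)\\k<quote>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Rewrites every src value in html so that it is absolute relative to baseUri.
    public static string MakeAbsolute(string html, Uri baseUri)
    {
        return SrcAttribute.Replace(html, m =>
        {
            Uri resolved;
            // TryCreate returns false instead of throwing on values it cannot resolve.
            if (!Uri.TryCreate(baseUri, m.Groups["url"].Value, out resolved))
            {
                return m.Value; // leave the attribute untouched
            }
            return m.Groups["prefix"].Value + m.Groups["quote"].Value
                 + resolved.AbsoluteUri + m.Groups["quote"].Value;
        });
    }

    public static void Main()
    {
        // Assumed base address of the page the markup came from.
        var baseUri = new Uri("http://example.com/site/pages/default.aspx");
        string html = "<img alt=\"\" src=\"../Image%20Files/Koala.jpg\" style=\"float: right;\">";
        Console.WriteLine(MakeAbsolute(html, baseUri));
    }
}
```

Capturing the URL in a named group avoids the `m.Index + 6` style offset arithmetic, and `AbsoluteUri` keeps escapes such as `%20` in the result, whereas `Uri.ToString()` may unescape them.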

Regex is a bad idea; better to use an HTML parser. Here is a regex I used for parsing links, though:

String body = "..."; //body of the page
Matcher m = Pattern.compile("(?im)(?:(?:(?:href)|(?:src))[ ]*?=[ ]*?[\"'])(((?:http|https)(?::\\/{2}[\\w]+)(?:[\\/|\\.]?)(?:[^\\s\"]*))|((?:\\/{0,1}[\\w\\.]+)+))[\"']").matcher(body);
while(m.find()){
  String absolute = m.group(2);
  String relative = m.group(3);
}

It's a lot easier with a parser though, and better on resources. Here is a link showing what I eventually wrote when I switched to a parser:

http://notetodogself.blogspot.com/2007/11/extract-links-using-htmlparser.html

Probably not as helpful, since that was Java and you need C#.

I don't know what your program does, but I'm guessing this is an example of something you would do in 5 minutes from the command line in Linux. You can download Windows versions of many of the same tools (sed, for instance) and save yourself the hassle of writing all that code.
