Using regular expressions to extract the first image source from HTML code?

I would like to know how this can be achieved.

Assume: there is a lot of HTML code containing tables, divs, images, etc.

Problem: How can I get matches for all occurrences? More specifically, how can I get the source of an img tag (src = ?)?

Example:

<img src="http://example.com/g.jpg" alt="" />

How can I print out http://example.com/g.jpg in this case? I want to assume that there are also other tags in the HTML code, as I mentioned, and possibly more than one image. Would it be possible to get an array of all image sources in the HTML code?

I know this can be achieved one way or another with regular expressions, but I can't get the hang of it.

Any help is greatly appreciated.

While regular expressions are good for a large variety of tasks, I find they usually fall short when parsing an HTML DOM. The problem with HTML is that the structure of your document is so variable that it is hard to accurately (and by accurately I mean a 100% success rate with no false positives) extract a tag.

What I recommend you do is use a DOM parser such as SimpleHTML and use it as such:

function get_first_image($html) {
    require_once('SimpleHTML.class.php');

    // Parse the HTML string into a DOM object.
    $post_html = str_get_html($html);

    // Index 0 asks for the first <img> element only.
    $first_img = $post_html->find('img', 0);

    if($first_img !== null) {
        return $first_img->src;
    }

    return null;
}

Some may think this is overkill, but in the end it will be easier to maintain and also allows for more extensibility. For example, using the DOM parser, I can also get the alt attribute.

A regular expression could be devised to achieve the same goal, but it would be limited in that it would force the alt attribute to come after the src attribute (or the opposite), and overcoming this limitation would add more complexity to the regular expression.
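To illustrate the point about attribute order, here is a minimal sketch. It uses PHP's built-in DOMDocument rather than the SimpleHTML class above (swapped in only so the example runs without third-party files), and get_first_image_info() is a hypothetical helper name. The sample markup is deliberately well formed, so no parser warnings are handled here.

```php
<?php
// Sketch only: built-in DOMDocument stands in for the SimpleHTML parser.
// get_first_image_info() is a hypothetical name used for illustration.
function get_first_image_info($html) {
    $dom = new DOMDocument();
    $dom->loadHTML($html);

    // item(0) returns null when the document has no <img> element.
    $img = $dom->getElementsByTagName('img')->item(0);
    if ($img === null) {
        return null;
    }

    // Attributes are looked up by name, so their order in the markup is irrelevant.
    return array(
        'src' => $img->getAttribute('src'),
        'alt' => $img->getAttribute('alt'),
    );
}

// alt comes before src here, which a naive "src then alt" regex would miss.
$info = get_first_image_info('<div><img alt="A logo" src="http://example.com/g.jpg" /></div>');
echo $info['src'], "\n";
echo $info['alt'], "\n";
```

Because each attribute is read by name, adding a third attribute such as title would be one more getAttribute() call rather than a new regular expression.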

Also, consider the following. To properly match an <img> tag using regular expressions and to get only the src attribute (captured in group 2), you need the following regular expression:

<\s*?img\s+[^>]*?\s*src\s*=\s*(["'])((\\?+.)*?)\1[^>]*?>

And then again, the above can fail if:

  • The attribute or tag name is in uppercase and the i modifier is not used.
  • Quotes are not used around the src attribute.
  • An attribute other than src uses the > character somewhere in its value.
  • Some other reason I have not foreseen.

So again, simply don't use regular expressions to parse a DOM document.
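The failure cases listed above can be checked with a short sketch. The pattern is the one quoted earlier, wrapped in ~ delimiters; first_src() and the sample strings are hypothetical names and inputs added for illustration. The last sample, a data-src attribute, is one example of an unforeseen case: the pattern picks up the wrong attribute.

```php
<?php
// The pattern from above, kept in a nowdoc so no extra escaping is needed.
// "~" is the delimiter; the "i" modifier is appended separately below.
$pattern = <<<'RE'
~<\s*?img\s+[^>]*?\s*src\s*=\s*(["'])((\\?+.)*?)\1[^>]*?>~
RE;

// Hypothetical helper: returns the captured src (group 2) or null on no match.
function first_src($pattern, $html) {
    return preg_match($pattern, $html, $m) === 1 ? $m[2] : null;
}

$samples = array(
    'plain'     => '<img src="http://example.com/g.jpg" alt="" />',
    'uppercase' => '<IMG SRC="a.png">',
    'unquoted'  => '<img src=a.png>',
    'gt in alt' => '<img alt="a > b" src="x.png">',
    'data-src'  => '<img data-src="lazy.png" src="real.png">',
);

foreach ($samples as $label => $html) {
    printf(
        "%-10s case-sensitive: %s | with i: %s\n",
        $label,
        var_export(first_src($pattern, $html), true),
        var_export(first_src($pattern . 'i', $html), true)
    );
}
```

The uppercase sample matches only with the i modifier, the unquoted and gt-in-alt samples match neither way, and the data-src sample returns lazy.png instead of real.png.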


EDIT: If you want all the images:

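function get_images($html){
    require_once('SimpleHTML.class.php');

    // Same parsing call as in get_first_image() above.
    $post_dom = str_get_html($html);

    // Without an index, find() returns every matching element.
    $img_tags = $post_dom->find('img');

    $images = array();

    foreach($img_tags as $image) {
        $images[] = $image->src;
    }

    return $images;
}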

Use this; it is more effective:

// ["\'] accepts either quote style; a | inside a character class would be a literal pipe.
preg_match_all('/<img [^>]*src=["\']([^"\']+)/i', $html, $matches);
foreach ($matches[1] as $key=>$value) {
    echo $value."<br>";
}

Example:

$html = '
<ul>     
  <li><a target="_new" href="http://www.manfromuranus.com">Man from Uranus</a></li>       
  <li><a target="_new" href="http://www.thevichygovernment.com/">The Vichy Government</a></li>      
  <li><a target="_new" href="http://www.cambridgepoetry.org/">Cambridge Poetry</a></li>      
  <img width="190" height="197" border="0" align="right" alt="upload.jpg" title="upload.jpg" class="noborder" src="value1.jpg" />
  <li><a href="http://www.verot.net/pretty/">Electronaut Records</a></li>      
  <img width="190" height="197" border="0" align="right" alt="upload.jpg" title="upload.jpg" class="noborder" src="value2.jpg" />
  <li><a target="_new" href="http://www.catseye-crew.com">Catseye Productions</a></li>     
  <img width="190" height="197" border="0" align="right" alt="upload.jpg" title="upload.jpg" class="noborder" src="value3.jpg" />
</ul>
<img width="190" height="197" border="0" align="right" alt="upload.jpg" title="upload.jpg" class="noborder" src="res/upload.jpg" />
  <li><a target="_new" href="http://www.manfromuranus.com">Man from Uranus</a></li>       
  <li><a target="_new" href="http://www.thevichygovernment.com/">The Vichy Government</a></li>      
  <li><a target="_new" href="http://www.cambridgepoetry.org/">Cambridge Poetry</a></li>      
  <img width="190" height="197" border="0" align="right" alt="upload.jpg" title="upload.jpg" class="noborder" src="value4.jpg" />
  <li><a href="http://www.verot.net/pretty/">Electronaut Records</a></li>      
  <img src="value5.jpg" />
  <li><a target="_new" href="http://www.catseye-crew.com">Catseye Productions</a></li>     
  <img width="190" height="197" border="0" align="right" alt="upload.jpg" title="upload.jpg" class="noborder" src="value6.jpg" />
';   
preg_match_all('/<img [^>]*src=["\']([^"\']+)/i', $html, $matches);
foreach ($matches[1] as $key=>$value) {
    echo $value."<br>";
} 

Output:

value1.jpg
value2.jpg
value3.jpg
res/upload.jpg
value4.jpg
value5.jpg
value6.jpg
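One design note on the pattern: the part between "<img " and "src=" matters once two images share a line. The sketch below compares a negated class, [^>]*, with a greedy .* in the same position; the one-line sample string and the variable names $bounded and $greedy are assumptions made for illustration.

```php
<?php
// Two images on one line, which the sample document above never has.
$html = '<p><img src="a.jpg" alt="first"><img class="x" src=\'b.jpg\'></p>';

// Negated class: [^>]* cannot leave the current tag.
preg_match_all('/<img [^>]*src=["\']([^"\']+)/i', $html, $bounded);

// Greedy dot: .* runs to the last src= on the line and skips the first image.
preg_match_all('/<img .*src=["\']([^"\']+)/i', $html, $greedy);

print_r($bounded[1]); // both sources
print_r($greedy[1]);  // only the last source on the line
```

With [^>]* both a.jpg and b.jpg are captured; with .* only b.jpg is, because the single match swallows the first tag.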

This works for me:

preg_match('@<img.+src="(.*)".*>@Uims', $html, $matches);
$src = $matches[1];

I assume all your src values have " around the URL:

<img[^>]+src=\"([^\"]+)\"

The other answers posted here make other assumptions about your code.

I agree with Andrew Moore. Using the DOM is much, much better. The HTML DOM images collection will return to you a reference to all image objects.

Let's say in your header you have,

<script type="text/javascript">
    function getFirstImageSource()
    {
        var img = document.images[0].src;
        return img;
    }
</script>

and then in your body you have,

<script type="text/javascript">
  alert(getFirstImageSource());
</script>

This will return the first image source. You can also loop through them along these lines (in the head section):

function getAllImageSources()
{
    var returnString = "";
    for (var i = 0; i < document.images.length; i++)
    {
        returnString += document.images[i].src + "\n";
    }
    return returnString;
}

(in the body)

<script type="text/javascript">
  alert(getAllImageSources());
</script>

If you're using JavaScript to do this, remember that you can't call a function that reads the images collection while the header is still being executed. In other words, you can't do something like this,

<script type="text/javascript">
    function getFirstImageSource()
    {
        var img = document.images[0].src;
        return img;
    }
    alert(getFirstImageSource());  // bad: called from the head, before the body exists

</script>

because this won't work. The img elements haven't been parsed yet when the header script runs, so document.images is empty and reading document.images[0].src throws an error instead of returning a source.

Hopefully this can help in some way. If possible, I'd make use of the DOM. You'll find that a good deal of your work is already done for you.

I don't know if you MUST use regex to get your results. If not, you could try SimpleXML and XPath, which would be much more reliable for your goal:

First, import the HTML into a DOMDocument object. If you get errors, turn errors off for this part and be sure to turn them back on afterward:

 $dom = new DOMDocument();
 $dom -> loadHTMLFile("filename.html");
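One way to turn errors off and back on around the import is libxml's internal error buffer. This is a sketch under assumptions: the markup is held in a string ($html, a made-up sample) and loaded with loadHTML() instead of reading filename.html, and the variable names are illustrative.

```php
<?php
// Assumption: a sample string replaces filename.html so the sketch is self-contained.
$html = '<section><img src="a.jpg"><p>Unclosed paragraph<img src="b.jpg"></section>';

$dom = new DOMDocument();

// Buffer parser problems instead of emitting PHP warnings; remember the old setting.
$previous = libxml_use_internal_errors(true);
$dom->loadHTML($html);

// Anything the parser complained about is still available for inspection.
$problems = libxml_get_errors();
libxml_clear_errors();

// Restore whatever error handling was active before.
libxml_use_internal_errors($previous);

echo count($problems), " parser message(s) buffered\n";
echo $dom->getElementsByTagName('img')->length, " image(s) found\n";
```

Restoring the previous value rather than hard-coding false keeps the snippet safe to use inside code that already buffers libxml errors.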

Next, import the DOM into a SimpleXML object, like so:

 $xml = simplexml_import_dom($dom);

Now you can use a few methods to get all of your image elements (and their attributes) into an array. XPath is the one I prefer, because I've had better luck traversing the DOM with it:

 $images = $xml -> xpath('//img/@src');

This variable can now be treated like an array of your image URLs:

 foreach($images as $image) {
    // Double quotes are required for $image to be interpolated.
    echo "<img src=\"$image\" /><br />\n";
  }

Presto: all of your images, none of the fat.

Here's the non-annotated version of the above:


 $dom = new DOMDocument();
 $dom -> loadHTMLFile("filename.html");

 $xml = simplexml_import_dom($dom);

 $images = $xml -> xpath('//img/@src');

 foreach($images as $image) {
    echo "<img src=\"$image\" /><br />\n";
  }

I really think you cannot predict all the cases with one regular expression.

The best way is to use the DOM with the PHP5 class DOMDocument and XPath. It's the cleanest way to do what you want.

$dom = new DOMDocument();
$dom->loadHTML( $htmlContent );
$xml = simplexml_import_dom($dom);
$images = $xml -> xpath('//img/@src');
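The four lines above can be wrapped into a function that returns plain strings, which also covers the original request for the first image and for an array of all sources. This is a sketch: get_image_sources() is a hypothetical name, the input is a made-up well-formed fragment, and each XPath result is cast to a string because xpath() returns SimpleXMLElement objects rather than strings.

```php
<?php
// Hypothetical helper built from the four lines above.
function get_image_sources($htmlContent) {
    $dom = new DOMDocument();
    $dom->loadHTML($htmlContent);
    $xml = simplexml_import_dom($dom);

    $sources = array();
    foreach ($xml->xpath('//img/@src') as $src) {
        // Each result is a SimpleXMLElement; the cast yields the attribute value.
        $sources[] = (string) $src;
    }
    return $sources;
}

$sources = get_image_sources(
    '<div><img src="value1.jpg" alt=""><p><img alt="x" src="res/upload.jpg"></p></div>'
);

// The first image is simply the first array entry, or null when there is none.
$first = isset($sources[0]) ? $sources[0] : null;

print_r($sources);
echo $first, "\n";
```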

You can try this:

preg_match_all("/<img\s+src=\"([^\"]+)\"/i", $html, $matches);
// $matches[1] holds the captured src values; this pattern expects src to be the first attribute.
foreach ($matches[1] as $key=>$value) {
    echo $key . ", " . $value . "<br>";
}

Since you're not worried about validating the HTML, you might try using strip_tags() on the text first to clear out most of the cruft, passing '<img>' as the allowed tag so the images themselves survive.

Then you can search for an expression like

"/\<img .+ \/\>/i"

The backslashes escape special characters like <, >, and /. The .+ requires one or more of any character inside the img tag. You can capture part of the expression by putting parentheses around it; e.g. (.+) captures the middle part of the img tag.

When you decide which part of the middle you specifically want to capture, you can change the (.+) to something more specific.
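The two steps above might look like the following sketch. The sample markup and variable names are assumptions, and the generic capture uses a lazy .+? (a small change from the .+ above) so that two images on one line are matched separately.

```php
<?php
// Made-up sample input with two self-closing img tags on one line.
$html = '<div class="post"><p>Intro <img class="hero" src="value1.jpg" /> text</p>'
      . '<p><a href="#">link</a> <img src="value2.jpg" alt="second" /></p></div>';

// Keep only <img> tags; without the second argument strip_tags() removes them too.
$imagesOnly = strip_tags($html, '<img>');

// Generic capture: everything between "<img " and " />".
preg_match_all('/<img (.+?) \/>/i', $imagesOnly, $generic);

// More specific capture: only the double-quoted src value.
preg_match_all('/<img [^>]*src="([^"]+)"[^>]* \/>/i', $imagesOnly, $specific);

print_r($generic[1]);  // full attribute strings
print_r($specific[1]); // just the src values
```

Both patterns still assume the " />" ending and double quotes, so they share the limits discussed in the other answers.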

<?php    
/* PHP Simple HTML DOM Parser @ http://simplehtmldom.sourceforge.net */

require_once('simple_html_dom.php');

$html = file_get_html('http://example.com');
$image = $html->find('img')[0]->src;

echo "<img src='{$image}'/>"; // BOOM!

PHP Simple HTML DOM Parser will do the job in a few lines of code.
