How can I extract multiple HTML tags from a MySQL table?

I have a table in a MySQL-based CMS, one of whose fields contains the text of the articles displayed in the CMS web pages.

Some of the articles contain images embedded in the text, in the form of HTML 'img' tags. There may be one or several images in the text contained in the field.

I want to write a query that extracts a list of all the images in all the articles. So far I have managed to write the following:

SELECT nid,
       SUBSTR(body,
              LOCATE('<img', body),
              LOCATE('>', body, LOCATE('<img', body)) - LOCATE('<img', body)) AS image,
       body
FROM `node_revisions`
WHERE body LIKE '%<img%'

This seems to work. However, it only extracts the first image, and I would really like to extract all of them. That would normally call for a loop, which doesn't seem possible in MySQL.

Just for reference, the CMS in question is Drupal 6, hence the names of the fields and table. However, this is really a question about MySQL, not Drupal, which is why I'm asking here and not on the Drupal Stack Exchange site.

You will drive yourself insane trying to use locate(), substring(), or regular expressions to parse HTML or XML. See https://blog.codinghorror.com/parsing-html-the-cthulhu-way/

I suggest you use PHP's DOMDocument class:

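<?php

// Sample article body containing two embedded images.
$bodyHtml = "now is the time for all <img src='good.jpg'> men to come to the <img src='aid.jpg'> of their country";

// Parse the HTML into a DOM tree instead of searching it as a string.
$dom = new DOMDocument();
$dom->loadHTML($bodyHtml);

// Visit every <img> element and print its name and attributes.
$imgs = $dom->getElementsByTagName("img");
foreach ($imgs as $img) {
    print "$img->nodeName\n";
    foreach ($img->attributes as $attr) {
        print "  $attr->name=$attr->value\n";
    }
}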

Outputs:

img
  src=good.jpg
img
  src=aid.jpg

Parsing HTML with regex is never 100% reliable; you'll never feel confident that you've captured every image, correctly formatted.

The other problem is one you hinted at in your question: a single record in node_revisions may contain 1, 2, or 10,000 images. Plain SQL has no straightforward way to return each image as a new row in your query results, so you'd have to return each image as a new column.

That means specifying each column by hand:

SELECT code_to_return_img_1 as url1
      ,code_to_return_img_2 as url2
      ,code_to_return_img_3 as url3
      ,code_to_return_img_4 as url4
      ,code_to_return_img_5 as url5
      ,code_to_return_img_6 as url6
      ....
      and so on

If you knew there would be fewer than, say, 20 images per article, you didn't have PHP/Java/Python at your disposal, and it was just a one-off hack job, then you could do it with regex and SQL. But your 30-minute job could turn into a 2-day job and a burst vein.

If Java is an option: https://jsoup.org/

If Python is an option: https://docs.python.org/2/library/htmlparser.html

If PHP is an option: http://htmlparsing.com/php.html

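$dom = new DOMDocument;
$dom->loadHTML($html);  // $html holds one article body
$images = $dom->getElementsByTagName('img');
$imgurls = array();
foreach ($images as $image) {
    $imgurl = $image->getAttribute('src');
    $imgurls[] = $imgurl;  // keep every URL, not just the last one
}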
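
If Python is the option you pick, the "one row per image" result can be built outside SQL. The following is only a minimal sketch using the standard library's html.parser module. The names ImgCollector and extract_images are made up for illustration, and the sample rows are hypothetical stand-ins for whatever `SELECT nid, body FROM node_revisions` returns through your database driver.

```python
from html.parser import HTMLParser


class ImgCollector(HTMLParser):
    """Collects the attributes of every <img> tag fed to it (illustrative name)."""

    def __init__(self):
        super().__init__()
        self.images = []  # one dict of attributes per <img> tag

    def handle_starttag(self, tag, attrs):
        # The parser lower-cases tag and attribute names before calling this.
        # Self-closing tags such as <img ... /> also end up here by default.
        if tag == "img":
            self.images.append(dict(attrs))


def extract_images(rows):
    """Yield one (nid, src) pair per image found in an iterable of (nid, body) rows."""
    for nid, body in rows:
        parser = ImgCollector()
        parser.feed(body or "")  # tolerate NULL bodies
        parser.close()
        for attrs in parser.images:
            yield nid, attrs.get("src")


# Hypothetical sample data; in practice these rows would come from the query
# SELECT nid, body FROM node_revisions WHERE body LIKE '%<img%'
rows = [
    (1, "now is the time for all <img src='good.jpg'> men to come to the <img src='aid.jpg'> of their country"),
    (2, "no images here"),
    (3, '<p><IMG alt="x" SRC="third.png" /></p>'),
]

for nid, src in extract_images(rows):
    print(nid, src)
```

Each article contributes as many (nid, src) pairs as it has images, so the number of images per article does not need to be known in advance. The pairs could then be written to a CSV file or inserted into a separate table if you want to query them with SQL afterwards.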
