How to extract multiple HTML tags from a MySQL table

Question

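I have a table in a MySQL-based CMS in which one field holds the article text that is displayed on the CMS's web pages.

Some articles contain images embedded in the text as HTML 'img' tags. The text in that field may contain one image or several.

What I want to do is create a query that extracts a list of all images in all articles. I have managed to put together some code, shown below: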

SELECT nid, 
substr(body,locate('<img', body),(locate('>',body,locate('<img', body)) - locate('<img', body))) as image,
body FROM `node_revisions` where body like '%<img%'

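This seems to work, but of course it only extracts the first image, and I would really like to extract all of them (in effect this would usually mean using a loop, but that doesn't seem possible in MySQL).

FYI, the CMS in question is Drupal 6, hence the field and table names. However, this is really a question about MySQL rather than Drupal, which is why I'm asking here and not on the Drupal Stack Exchange site.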

Answer 1

You will go insane trying to parse HTML or XML with locate(), substring(), or regular expressions. See https://blog.codinghorror.com/parsing-html-the-cthulhu-way/

I suggest you use PHP's DOMDocument class:

<?php

$bodyHtml = "now is the time for all <img src='good.jpg'> men to come to the <img src='aid.jpg'> of their country";

$dom = new DOMDocument();
$dom->loadHTML($bodyHtml);
$imgs = $dom->getElementsByTagName("img");
foreach ($imgs as $img) {
        print "$img->nodeName\n";
        foreach ($img->attributes as $attr) {
                print "  $attr->name=$attr->value\n";
        }
}

Output:

img
  src=good.jpg
img
  src=aid.jpg
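
For readers not working in PHP, the same idea can be sketched with Python's standard-library html.parser module. This is only an illustrative equivalent, not part of the original answer; the names ImgCollector and extract_images are made up for this example.

```python
from html.parser import HTMLParser


class ImgCollector(HTMLParser):
    """Collects the attributes of every <img> tag it encounters."""

    def __init__(self):
        super().__init__()
        self.images = []  # one dict of attributes per <img> tag

    def handle_starttag(self, tag, attrs):
        # The parser passes tag names in lowercase.
        if tag == "img":
            self.images.append(dict(attrs))


def extract_images(body_html):
    """Return a list of attribute dicts, one per <img> tag in body_html."""
    collector = ImgCollector()
    collector.feed(body_html)
    collector.close()
    return collector.images


body_html = ("now is the time for all <img src='good.jpg'> men "
             "to come to the <img src='aid.jpg'> of their country")

for image in extract_images(body_html):
    print("img")
    for name, value in image.items():
        print(f"  {name}={value}")
```

As with the DOMDocument version, every image in the text is visited, not just the first one.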

Answer 2

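Parsing HTML with regex will never be 100% reliable; you will never be confident that you have caught every image and extracted it correctly.

The other problem is the one you hinted at in your question. A single record in node_revisions may contain 1, 2, or 10,000 images. With a query like yours there is no straightforward way to return each image as a new row in the result, so you would have to return each image as a new column.

That means you would actually need to specify every column by hand: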

SELECT code_to_return_img_1 as url1
      ,code_to_return_img_2 as url2
      ,code_to_return_img_3 as url3
      ,code_to_return_img_4 as url4
      ,code_to_return_img_5 as url5
      ,code_to_return_img_6 as url6
      ....
      and so on

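If you know that every article has fewer than 20 images, you have no PHP/Java/Python available, and this is just a one-off hack job, then you can do it with regex and SQL, but your 30 minutes of work may turn into 2 days of work and a burst blood vessel.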
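
To illustrate why pattern matching on HTML is fragile, here is a small Python sketch (not from the original answer). It assumes a hypothetical article body in which an attribute value contains a `>` character, and compares a naive regex with the standard-library HTML parser. The class name SrcCollector is invented for this example.

```python
import re
from html.parser import HTMLParser

# Hypothetical article body: the alt text legitimately contains '>'.
html = '<p><img alt="2 > 1" src="x.jpg"></p>'

# Naive pattern: everything from "<img" up to the first ">".
naive_matches = re.findall(r"<img[^>]*>", html)


class SrcCollector(HTMLParser):
    """Collects the src attribute of every <img> tag."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src")
            if src is not None:
                self.sources.append(src)


parser = SrcCollector()
parser.feed(html)
parser.close()

print(naive_matches)   # the match is cut off inside the alt attribute
print(parser.sources)  # the parser still finds the src value
```

The naive pattern stops at the `>` inside the alt attribute, so the src never makes it into the match. The same failure applies to the locate('>') approach in the question.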

If Java is an option: https://jsoup.org/

If Python is an option: https://docs.python.org/2/library/htmlparser.html

If PHP is an option: http://htmlparsing.com/php.html

$dom = new DOMDocument;
$dom->loadHTML($html);
$images = $dom->getElementsByTagName('img');
foreach ($images as $image) {
    $imgurl = $image->getAttribute('src');
}
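
Once the parsing happens in application code, the "one row per image" problem goes away, because the loop can emit as many rows per article as it finds images. Below is a minimal Python sketch of that idea. It assumes the (nid, body) pairs have already been fetched with something like SELECT nid, body FROM node_revisions WHERE body LIKE '%<img%'. The sample rows and the names SrcCollector and image_rows are hypothetical.

```python
from html.parser import HTMLParser


class SrcCollector(HTMLParser):
    """Collects the src attribute of every <img> tag."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src")
            if src is not None:
                self.sources.append(src)


def image_rows(rows):
    """Yield one (nid, src) pair per image found in each article body."""
    for nid, body in rows:
        collector = SrcCollector()
        collector.feed(body)
        collector.close()
        for src in collector.sources:
            yield nid, src


# Hypothetical sample data standing in for rows fetched from node_revisions.
rows = [
    (1, "<p>Intro <img src='a.png'> text <img src='b.png'></p>"),
    (2, "<p>Only one <img src='c.png' alt='c'></p>"),
]

for nid, src in image_rows(rows):
    print(nid, src)
```

A fresh collector is created for each article so that images from one body are never attributed to another nid.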

2 answers

Answer 1: 1 (accepted), 2016-08-05 19:18:56
Answer 2: 0, 2016-08-05 20:09:24
