php xpath: extract tab content from a Joomla / HikaShop description

Question

[TL;DR] I need to parse HTML with PHP to extract tabs and their content

I am migrating data from a Joomla/HikaShop site that was exported as a CSV file. The tabs are defined by markers inside P tags, as follows:

<p> </p>
<p style="text-align: center;"><span style="text-decoration: underline;"><span style="font-size: 14pt;"><strong>Strong Item</strong></span></span></p>
<p> {tab=Description}</p>
<p>This is a default description</p>
<ul>
<li>It has</li>
<li>mixed content</li>
</ul>
<p>{tab=Features} </p>
<ul style="list-style-type: circle;">
<li>It's good</li>
<li>I like it</li>
</ul>
<p>It does what I want</p>
<p> </p>
<p>{/tabs}</p>

I need to extract each tab name followed by its content.

I can pull out the tab markers easily enough:

$crawler->filterXpath('//p[text()[contains(.,"tab=")]]')->each(function ($node) {

But getting the content between the tab markers is what is bugging me. The result I am after is:

Description =

<ul>
<li>It has</li>
<li>mixed content</li>
</ul>

Features =

<ul style="list-style-type: circle;">
<li>It's good</li>
<li>I like it</li>
</ul>
<p>It does what I want</p>
<p> </p>

Obviously I could use a regex and loop through the lines, etc., but that is very error prone.

Thanks

Answer 1

Thanks to mickmackusa for the links, which helped me piece this puzzle together.

Using the links, I was able to get the content between each tab opening marker:

<p>{tabs=newtab}</p>

My process was to clean the HTML with tidy and then load it into a new DOMDocument.

use Symfony\Component\DomCrawler\Crawler;

// Read the raw description HTML from a local file
$html = file_get_contents('E:\Dropbox\laragon\www\scrape\description.txt');

// Tidy options: emit XHTML for the body only and drop empty paragraphs
$config = array(
    'indent'           => true,
    'output-xhtml'     => true,
    'show-body-only'   => true,
    'drop-empty-paras' => true,
    'wrap'             => 1200
);

// Tidy
$tidy = new tidy;
$tidy->parseString($html, $config, 'utf8');
$tidy->cleanRepair();

// Load the repaired markup into a DOMDocument and wrap it in a Crawler
$doc = new DOMDocument;
$doc->loadHTML($tidy->value);

$crawler = new Crawler($doc);

The closing marker for the tabs is

<p>{/tabs}</p>

This does not match the code I have, which means it needs some extra handling. As this is a one-off project, I went for a quick fix.

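So I crawl the page and add a new paragraph element just before the closing tabs marker. The code looks for /tabs in a paragraph and then, in effect, adds a new tab section with no content.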

$crawler
    ->filterXpath('//p[text()[contains(.,"/tabs")]]')
    ->each(function (Crawler $crawler) use ($doc) {
        foreach ($crawler as $node) {
            $span = $doc->createElement('p', '{tab=end}');
            $node->parentNode->insertBefore($span, $node);
        }
    });

This results in the HTML

<p>{tab=end}</p>
<p>{/tabs}</p>

Now I take the edited HTML from $crawler->html() and look for each tab section (starting at <p>{tab=TABNAME}</p> and ending at <p>{tab=NEXTTABNAME}</p>).

First I get the headings

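$tab_headings = $crawler->filterXpath('//p[text()[contains(.,"tab=")]]')->each(function ($node) {
    $matches = [];
    // Non-greedy, so the name stops at the first closing brace
    $pattern = '/\{tab=(.*?)\}/';

    // Default to null so $tab is always defined, even if the pattern does not match
    $tab = null;
    if (preg_match($pattern, $node->text(), $matches)) {
        $tab = $matches[1];
    }

    return $tab;
});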

I remove the last one (the dummy one I added)

array_pop($tab_headings);

I can now loop through and extract the HTML. I am using Laravel, hence the dump() calls.

foreach ($tab_headings as $tab_count => $tab) {
    dump($tab);
    // XPath positions are 1-based: $first is this tab's marker, $next is the following one
    $first = $tab_count + 1;
    $next = $tab_count + 2;
    /**
     * Get content between tabs: keep the siblings after marker $first that are
     * also among the siblings before marker $next (node-set intersection)
     */
    $tab_content = $crawler
        ->filterXpath('//p[text()[contains(.,"tab=")]][' . $first . ']/following-sibling::*
        [
        count(.|//p[text()[contains(.,"tab=")]][' . $next . ']/preceding-sibling::*)
        =
        count(//p[text()[contains(.,"tab=")]][' . $next . ']/preceding-sibling::*)
        ]')
        ->each(function ($node) {
            return $node->outerHtml();
        });

    dump($tab_content);
}
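
The XPath in the loop above relies on the XPath 1.0 node-set intersection idiom, $ns1[count(. | $ns2) = count($ns2)]: a node from the first set is kept only if adding it to the second set does not make that set any larger. Below is a minimal sketch of the same query using only PHP's built-in DOMXPath, with no Symfony dependency. The function name siblingsBetweenMarkers and its parameters are made up for illustration and are not part of the original code.

```php
<?php
/**
 * Hypothetical helper: outer HTML of the sibling elements that sit between
 * the $n-th and ($n + 1)-th paragraphs whose text contains "tab=".
 */
function siblingsBetweenMarkers(DOMDocument $doc, int $n): array
{
    $xpath  = new DOMXPath($doc);
    $marker = '//p[text()[contains(., "tab=")]]';

    // Everything after the $n-th marker ...
    $after  = $marker . '[' . $n . ']/following-sibling::*';
    // ... and everything before the next marker.
    $before = $marker . '[' . ($n + 1) . ']/preceding-sibling::*';

    // Intersection: keep nodes of $after that are already members of $before.
    $query = $after . '[count(. | ' . $before . ') = count(' . $before . ')]';

    $result = [];
    foreach ($xpath->query($query) as $node) {
        $result[] = $doc->saveHTML($node);
    }

    return $result;
}
```

Positions are 1-based, so siblingsBetweenMarkers($doc, 1) would return the elements between the first and second markers. As in the loop above, the last real tab still needs a marker after it, which is why the dummy {tab=end} paragraph was added.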

From here I can insert the results into the database, etc.
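
If the positional XPath feels hard to maintain, a single pass over the top-level nodes is another option. This is a different technique from the one used in this answer: it walks the children of <body> in document order and appends each element to whichever {tab=NAME} marker was seen last, so no dummy {tab=end} paragraph is needed. It is only a sketch, assuming the markers are top-level paragraphs as in the sample; extractTabs is a hypothetical name, and any paragraph containing {tab=...} is treated as a marker.

```php
<?php
/**
 * Hypothetical helper: map each tab name to the HTML that follows its marker.
 */
function extractTabs(string $html): array
{
    if (trim($html) === '') {
        return [];
    }

    // Collect parser warnings internally instead of emitting them.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $tabs = [];
    $body = $doc->getElementsByTagName('body')->item(0);
    if ($body === null) {
        return $tabs;
    }

    $current = null; // name of the tab currently being filled
    foreach ($body->childNodes as $node) {
        if ($node->nodeType !== XML_ELEMENT_NODE) {
            continue; // skip whitespace text nodes between elements
        }

        $isParagraph = $node->nodeName === 'p';
        $text = $node->textContent;

        if ($isParagraph && preg_match('/\{tab=([^}]+)\}/', $text, $m)) {
            // Opening marker: start a new tab.
            $current = trim($m[1]);
            $tabs[$current] = '';
        } elseif ($isParagraph && strpos($text, '{/tabs}') !== false) {
            // Closing marker: stop collecting.
            $current = null;
        } elseif ($current !== null) {
            // Ordinary content: append it to the open tab.
            $tabs[$current] .= $doc->saveHTML($node);
        }
    }

    return $tabs;
}
```

Calling extractTabs() on the sample description from the question should give an array keyed by Description and Features, with the heading paragraph before the first marker left out.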

The links that helped the most:

XPath select all elements between two specific elements

XPath: How to select following siblings until a certain sibling
