PHP XPath: extracting tab content from a Joomla / Hikashop description

Question

[TL;DR] I need to parse HTML with PHP to extract the tabs and their content.

I am migrating data from a Joomla / Hikashop site, exported as a CSV file. The tabs are defined by content within P tags, as follows:

<p> </p>
<p style="text-align: center;"><span style="text-decoration: underline;"><span style="font-size: 14pt;"><strong>Strong Item</strong></span></span></p>
<p> {tab=Description}</p>
<p>This is a default description</p>
<ul>
<li>It has</li>
<li>mixed content</li>
</ul>
<p>{tab=Features} </p>
<ul style="list-style-type: circle;">
<li>It's good</li>
<li>I like it</li>
</ul>
<p>It does what I want</p>
<p> </p>
<p>{/tabs}</p>

I need to extract each tab name followed by its content.

I can pull out the tabs easily enough:

$crawler->filterXpath('//p[text()[contains(.,"tab=")]]')->each(function ($node) {

But getting the content between the tabs is what is throwing me. The result I am after is:

Description =

<ul>
<li>It has</li>
<li>mixed content</li>
</ul>

Features =

<ul style="list-style-type: circle;">
<li>It's good</li>
<li>I like it</li>
</ul>
<p>It does what I want</p>
<p> </p>

Obviously I could regex it and loop through the lines, etc., but that is error-prone.

Thanks

Answer 1

Thanks to mickmackusa for the links, which helped put the pieces of the puzzle together.

Using the links, I was able to get the content between each tab opening:

<p>{tab=newtab}</p>

My process was to clean the HTML with tidy, then load it into a new DOMDocument.

use Symfony\Component\DomCrawler\Crawler;
$html = file_get_contents('E:\Dropbox\laragon\www\scrape\description.txt');
$config = array(
    'indent'         => true,
    'output-xhtml'   => true,
    'show-body-only' => true,
    'drop-empty-paras' => true,
    'wrap'           => 1200
);

// Tidy
$tidy = new tidy;
$tidy->parseString($html, $config, 'utf8');
$tidy->cleanRepair();

$doc = new DOMDocument;
$doc->loadHTML($tidy->value);



$crawler = new Crawler($doc);

The closing tag of the tabs is

<p>{/tabs}</p>

This did not match the code I had, so it needed some additional processing. As this is a one-off project, I did a quick fix.

So I crawled the page and added a new paragraph element just BEFORE the closing tabs paragraph. The code looks for /tabs within a paragraph, then in effect adds a new tab section with no content.

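$crawler
    ->filterXpath('//p[text()[contains(.,"/tabs")]]')
    ->each(function (Crawler $closing) use ($doc) {
        // Iterating a Crawler yields the underlying DOM nodes
        foreach ($closing as $node) {
            // Dummy marker so the last real tab has a "next tab" to stop at
            $marker = $doc->createElement('p', '{tab=end}');
            $node->parentNode->insertBefore($marker, $node);
        }
    });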

This results in the HTML

<p>{tab=end}</p>
<p>{/tabs}</p>

Now I take the edited HTML provided by $crawler->html() and look for each tab section (starting with <p>{tab=TABNAME} and ending at {tab=NEXTTABNAME}).

I first get the headings:

$tab_headings = $crawler->filterXpath('//p[text()[contains(.,"tab=")]]')->each(function ($node) {
    $matches = [];
    $pattern = '/\{tab=(.*)\}/m';

    // Return the captured tab name, or null if the paragraph does not match
    if (preg_match($pattern, $node->text(), $matches)) {
        return $matches[1];
    }

    return null;
});
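
As a quick check of what a heading pattern captures, here is a minimal standalone sketch. The helper name tabName is hypothetical and not part of the original answer. It also swaps the greedy (.*) for [^}]*, on the assumption that a tab name never contains a closing brace, so a stray } later in the paragraph is not swallowed into the name.

```php
<?php
// Hypothetical helper (not from the original answer): returns the tab name
// found in a paragraph's text, or null when there is no {tab=...} marker.
function tabName(string $text): ?string
{
    // [^}]* stops at the first closing brace, unlike the greedy (.*)
    return preg_match('/\{tab=([^}]*)\}/', $text, $m) ? trim($m[1]) : null;
}

var_dump(tabName(' {tab=Description}')); // an opening marker
var_dump(tabName('{/tabs}'));            // the closing marker has no name
var_dump(tabName('{tab=A} and }'));      // stray brace after the marker
```

Returning null for non-matching paragraphs makes them easy to drop afterwards, for example with array_filter.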

I remove the last one (the dummy one I added):

array_pop($tab_headings);

I can now loop through and extract the HTML. I am using Laravel, hence the use of dump().

foreach ($tab_headings as $index => $tab) {
    dump($tab);
    // XPath positions are 1-based: $first is this tab's marker, $next is the following one
    $first = $index + 1;
    $next = $index + 2;
    /**
     * Get content between tabs: the following siblings of the current marker
     * that are also preceding siblings of the next marker
     */
    $tab_content = $crawler
        ->filterXpath('//p[text()[contains(.,"tab=")]][' . $first . ']/following-sibling::*
        [
        count(.|//p[text()[contains(.,"tab=")]][' . $next . ']/preceding-sibling::*)
        =
        count(//p[text()[contains(.,"tab=")]][' . $next . ']/preceding-sibling::*)
        ]')
        ->each(function ($node) {
            return $node->outerHtml();
        });

    dump($tab_content);
}
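
The XPath in that loop is a set intersection: a node belongs to the current tab if it is a following sibling of this marker and adding it to the set of preceding siblings of the next marker does not change that set's size. The sketch below reproduces the idea using only PHP's built-in DOMDocument and DOMXPath instead of the Symfony Crawler, so it runs without any packages. The sample markup is trimmed from the question, the {tab=end} marker is assumed to be already inserted, and the variable names are illustrative.

```php
<?php
// Sample markup trimmed from the question, with the dummy {tab=end} marker in place.
$html = <<<HTML
<p>{tab=Description}</p>
<p>This is a default description</p>
<ul><li>It has</li><li>mixed content</li></ul>
<p>{tab=Features}</p>
<ul><li>It's good</li><li>I like it</li></ul>
<p>It does what I want</p>
<p>{tab=end}</p>
<p>{/tabs}</p>
HTML;

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// Every paragraph that opens a tab; {/tabs} does not contain "tab=".
$marker = '//p[contains(., "tab=")]';
$markers = $xpath->query($marker);

$sections = [];
// Stop before the last marker: it is the dummy and has no content of its own.
for ($i = 1; $i < $markers->length; $i++) {
    if (!preg_match('/\{tab=([^}]*)\}/', $markers->item($i - 1)->textContent, $m)) {
        continue;
    }
    $after  = "{$marker}[{$i}]/following-sibling::*";
    $before = "{$marker}[" . ($i + 1) . "]/preceding-sibling::*";
    // Keep nodes of $after that are already members of $before.
    $nodes = $xpath->query("{$after}[count(. | {$before}) = count({$before})]");

    $sections[$m[1]] = '';
    foreach ($nodes as $node) {
        $sections[$m[1]] .= $doc->saveHTML($node);
    }
}

print_r($sections);
```

Note that the positional predicate counts markers among the children of the same parent, so this relies on all tab markers being siblings, as they are in the exported descriptions shown above.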

I now insert into the database, etc.

The links that helped the most:

XPath select all elements between two specific elements

XPath: how to select following siblings until a certain sibling
