
PHP XPath to extract tab content from Joomla / HikaShop descriptions

[TL;DR] I need to parse the HTML to extract the tabs and their content using PHP.

I am migrating data from a Joomla / HikaShop site, exported as a CSV file. The tabs are defined by markers inside P tags, as follows:

<p> </p>
<p style="text-align: center;"><span style="text-decoration: underline;"><span style="font-size: 14pt;"><strong>Strong Item</strong></span></span></p>
<p> {tab=Description}</p>
<p>This is a default description</p>
<ul>
<li>It has</li>
<li>mixed content</li>
</ul>
<p>{tab=Features} </p>
<ul style="list-style-type: circle;">
<li>It's good</li>
<li>I like it</li>
</ul>
<p>It does what I want</p>
<p> </p>
<p>{/tabs}</p>

I need to extract each tab name followed by its content.

I can pull out the tabs easily enough:

$crawler->filterXPath('//p[text()[contains(.,"tab=")]]')->each(function ($node) {
    // $node is a <p> whose text contains a {tab=...} marker
});

But getting the content between the tabs is what is throwing me.

Description =

<ul>
<li>It has</li>
<li>mixed content</li>
</ul>

Features =

<ul style="list-style-type: circle;">
<li>It's good</li>
<li>I like it</li>
</ul>
<p>It does what I want</p>
<p> </p>

Obviously I could regex it and loop through the lines, etc., but that is prone to error.

Thanks

Thanks to mickmackusa for the links, which helped put the pieces of the puzzle together.

Using the links, I was able to get the content between each tab opening:

<p>{tab=newtab}</p>

My process was to clean the HTML with tidy, then load it into a new DOMDocument.

use Symfony\Component\DomCrawler\Crawler;

$html = file_get_contents('E:\Dropbox\laragon\www\scrape\description.txt');

// Tidy options: output only the body as XHTML and drop empty paragraphs
$config = array(
    'indent'           => true,
    'output-xhtml'     => true,
    'show-body-only'   => true,
    'drop-empty-paras' => true,
    'wrap'             => 1200
);

// Tidy
$tidy = new tidy;
$tidy->parseString($html, $config, 'utf8');
$tidy->cleanRepair();

// Load the repaired markup into a DOMDocument and wrap it in a Crawler
$doc = new DOMDocument;
$doc->loadHTML($tidy->value);

$crawler = new Crawler($doc);

The closing tag of the tabs is:

<p>{/tabs}</p>

This did not match the code I had, which meant it needed some additional processing. As this is a one-off project, I did a quick fix.

I crawled the page and added a new paragraph element just BEFORE the closing tabs paragraph. The code looks for /tabs within a paragraph, then in effect adds a new tab section with no content.

$crawler
    ->filterXPath('//p[text()[contains(.,"/tabs")]]')
    ->each(function (Crawler $closing) use ($doc) {
        foreach ($closing as $node) {
            // Dummy tab marker, inserted immediately before <p>{/tabs}</p>
            $marker = $doc->createElement('p', '{tab=end}');
            $node->parentNode->insertBefore($marker, $node);
        }
    });

This results in the HTML:

<p>{tab=end}</p>
<p>{/tabs}</p>
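
For anyone not using DomCrawler, the same sentinel trick can be sketched with PHP's built-in DOM classes alone. This is only an illustration: the sample markup and the variable names ($closing, $sentinel, $markers) are made up for the example and are not part of the original code.

```php
<?php
// Sketch only: plain DOMDocument/DOMXPath instead of DomCrawler.
// The markup below is an invented, already-clean sample.
libxml_use_internal_errors(true);

$html = '<p>{tab=Description}</p><p>Text</p>'
      . '<p>{tab=Features}</p><ul><li>One</li></ul><p>{/tabs}</p>';

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// Find the closing {/tabs} paragraph and put a dummy {tab=end} marker in front of it.
foreach ($xpath->query('//p[text()[contains(., "/tabs")]]') as $closing) {
    $sentinel = $doc->createElement('p', '{tab=end}');
    $closing->parentNode->insertBefore($sentinel, $closing);
}

// Every tab section is now bounded by two "tab=" markers.
$markers = [];
foreach ($xpath->query('//p[text()[contains(., "tab=")]]') as $p) {
    $markers[] = trim($p->textContent);
}
```

Note that {/tabs} itself does not contain the substring "tab=", so it is not picked up by the second query; only the real markers and the dummy one are.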

Now I take the edited HTML from $crawler->html() and look for each tab section (starting with <p>{tab=TABNAME}</p> and ending with <p>{tab=NEXTTABNAME}</p>).

I first get the headings:

$tab_headings = $crawler->filterXPath('//p[text()[contains(.,"tab=")]]')->each(function ($node) {
    $matches = [];
    $pattern = '/\{tab=(.*)\}/m';

    // Return the captured tab name, or null if the marker text does not match
    if (preg_match($pattern, $node->text(), $matches)) {
        return $matches[1];
    }

    return null;
});
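
The name extraction can be tried on its own, without a crawler. The sketch below is an assumption-laden variant rather than the code above: tabName() is a hypothetical helper, and it uses [^}]+ instead of (.*) so the capture stops at the first closing brace.

```php
<?php
// Hypothetical helper: returns the tab name from a marker paragraph's text,
// or null when the text is not a {tab=NAME} marker.
function tabName(string $text): ?string
{
    // [^}]+ stops at the first closing brace, so trailing text cannot leak into the name
    return preg_match('/\{tab=([^}]+)\}/', $text, $m) ? trim($m[1]) : null;
}

// Sample paragraph texts, including the closing marker and ordinary content
$texts = [' {tab=Description}', '{tab=Features} ', '{/tabs}', 'plain text'];

// Map every text to a name, then drop the ones that were not markers
$names = array_values(array_filter(
    array_map('tabName', $texts),
    fn ($name) => $name !== null
));
```

Here $names ends up holding only Description and Features; the closing {/tabs} marker and the ordinary text are filtered out.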

I remove the last one (the dummy one I added):

array_pop($tab_headings);

I can now loop through and extract the HTML. I am using Laravel, hence the use of dump():

foreach ($tab_headings as $tab_count => $tab) {
    dump($tab);

    // XPath positions are 1-based: $first is this tab's marker, $next is the following marker
    $first = $tab_count + 1;
    $next = $tab_count + 2;

    /**
     * Get content between tabs: the siblings after the current marker
     * that are also siblings before the next marker
     */
    $tab_content = $crawler
        ->filterXPath('//p[text()[contains(.,"tab=")]][' . $first . ']/following-sibling::*
        [
        count(.|//p[text()[contains(.,"tab=")]][' . $next . ']/preceding-sibling::*)
        =
        count(//p[text()[contains(.,"tab=")]][' . $next . ']/preceding-sibling::*)
        ]')
        ->each(function ($node) {
            return $node->outerHtml();
        });

    dump($tab_content);
}
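
As a possible alternative to the XPath intersection, the sections can also be collected in a single pass over the sibling elements, which avoids the dummy {tab=end} marker. This is a different technique from the one above, shown only as a sketch: it uses the built-in DOM classes, invented sample markup, and assumes that the markers and the content are all direct children of the same parent (here, body).

```php
<?php
// Sketch: group sibling elements under the most recent {tab=NAME} marker,
// stopping at {/tabs}. Sample markup is invented for the example.
libxml_use_internal_errors(true);

$html = '<p>Intro</p><p>{tab=Description}</p><p>Default</p><ul><li>It has</li></ul>'
      . '<p>{tab=Features}</p><ul><li>Good</li></ul><p>Works</p><p>{/tabs}</p>';

$doc = new DOMDocument();
$doc->loadHTML($html);
$body = $doc->getElementsByTagName('body')->item(0);

$tabs = [];       // tab name => concatenated HTML of its content
$current = null;  // name of the tab being collected, null before the first marker

foreach ($body->childNodes as $node) {
    if (!$node instanceof DOMElement) {
        continue; // skip whitespace text nodes and comments
    }

    $text = $node->textContent;

    // Unanchored match, so stray spaces or &nbsp; around the marker do not matter
    if ($node->nodeName === 'p' && preg_match('/\{tab=([^}]+)\}/', $text, $m)) {
        $current = trim($m[1]);
        $tabs[$current] = '';
        continue;
    }

    if ($node->nodeName === 'p' && strpos($text, '{/tabs}') !== false) {
        break; // closing marker: nothing after it belongs to a tab
    }

    if ($current !== null) {
        $tabs[$current] .= $doc->saveHTML($node);
    }
}
```

Content before the first marker (the Intro paragraph here) is ignored because $current is still null at that point.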

I now insert into the database, etc.

The links that helped the most:

XPath select all elements between two specific elements

XPath: how to select following siblings until a certain sibling
