PHP XPath to extract tab content from Joomla / HikaShop descriptions

[TL;DR] I need to parse HTML with PHP to extract the tabs and their content.

I am migrating data from a Joomla / HikaShop site that was exported to a CSV file. In the product descriptions, each tab is defined by a marker inside a P tag, as follows:

<p> </p>
<p style="text-align: center;"><span style="text-decoration: underline;"><span style="font-size: 14pt;"><strong>Strong Item</strong></span></span></p>
<p> {tab=Description}</p>
<p>This is a default description</p>
<ul>
<li>It has</li>
<li>mixed content</li>
</ul>
<p>{tab=Features} </p>
<ul style="list-style-type: circle;">
<li>It's good</li>
<li>I like it</li>
</ul>
<p>It does what I want</p>
<p> </p>
<p>{/tabs}</p>

I need to extract each tab name followed by its content.

I can pull out the tab paragraphs easily enough:

$crawler->filterXpath('//p[text()[contains(.,"tab=")]]')->each(function ($node) {
    // ...
});

What is throwing me is getting the content between the tabs. The result I am after is:

Description =

<ul>
<li>It has</li>
<li>mixed content</li>
</ul>

Features=

<ul style="list-style-type: circle;">
<li>It's good</li>
<li>I like it</li>
</ul>
<p>It does what I want</p>
<p> </p>

Obviously I could use a regex and loop through the lines, but that is error-prone.

Thanks

Thanks to mickmackusa for the links, which helped put the pieces of the puzzle together.

Using the links, I was able to get the content between each pair of tab openings, where a tab opening looks like this:

<p>{tab=newtab}</p>

My process was to clean the HTML with Tidy, then load it into a new DOMDocument and wrap that in a Crawler.

use Symfony\Component\DomCrawler\Crawler;

$html = file_get_contents('E:\Dropbox\laragon\www\scrape\description.txt');

// Tidy options: output only the body content as XHTML and drop empty paragraphs
$config = [
    'indent'           => true,
    'output-xhtml'     => true,
    'show-body-only'   => true,
    'drop-empty-paras' => true,
    'wrap'             => 1200,
];

// Clean up the markup with Tidy
$tidy = new tidy;
$tidy->parseString($html, $config, 'utf8');
$tidy->cleanRepair();

// Load the repaired markup into a DOMDocument and wrap it in a Crawler
$doc = new DOMDocument;
$doc->loadHTML($tidy->value);

$crawler = new Crawler($doc);

The tabs are closed by this paragraph:

<p>{/tabs}</p>

It does not contain "tab=", so the code I had did not match it, and the last tab had no following marker to stop at. That meant some additional processing was needed. As this is a one-off project, I did a quick fix.

I crawled the page and added a new paragraph element just BEFORE the closing tabs paragraph. The code looks for "/tabs" within a paragraph, then inserts what is in effect a new tab section with no content.

$crawler
    ->filterXpath('//p[text()[contains(.,"/tabs")]]')
    ->each(function (Crawler $closing) use ($doc) {
        foreach ($closing as $node) {
            // Insert a dummy {tab=end} paragraph directly before <p>{/tabs}</p>
            $marker = $doc->createElement('p', '{tab=end}');
            $node->parentNode->insertBefore($marker, $node);
        }
    });

This results in the following HTML:

<p>{tab=end}</p>
<p>{/tabs}</p>

Now I take the edited HTML provided by $crawler->html() and look for each tab section (starting with <p>{tab=TABNAME}</p> and ending with <p>{tab=NEXTTABNAME}</p>).

I first get the headings:

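$tab_headings = $crawler->filterXpath('//p[text()[contains(.,"tab=")]]')->each(function ($node) {
    // Capture the name inside {tab=...}; [^}]* stops at the first closing brace
    $pattern = '/\{tab=([^}]*)\}/';

    if (preg_match($pattern, $node->text(), $matches)) {
        return trim($matches[1]);
    }

    // No marker found: return null instead of an undefined variable
    return null;
});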
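The name extraction can also be tried on its own, without the Crawler. The following is a minimal sketch; the helper name extractTabName is hypothetical and not part of the original code, and it assumes a marker always has the form {tab=NAME}.

```php
<?php
// Hypothetical helper (an assumption, not from the original post): returns the
// tab name found in a paragraph's text, or null when there is no {tab=...} marker.
function extractTabName(string $text): ?string
{
    // [^}]+ stops at the first closing brace, so text after the marker is ignored.
    if (preg_match('/\{tab=([^}]+)\}/', $text, $matches) === 1) {
        return trim($matches[1]);
    }

    return null;
}

// Markers in the sample description carry stray whitespace around the braces.
$names = array_map('extractTabName', [' {tab=Description}', '{tab=Features} ', '{/tabs}']);
```

The closing {/tabs} paragraph yields null because it has no "tab=" part, which is the same reason the XPath filter above skips it.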

I remove the last one (the dummy one I added):

array_pop($tab_headings);

I can now loop through the headings and extract the HTML for each one. I am using Laravel, hence the use of dump().

foreach ($tab_headings as $tab_count => $tab) {
    dump($tab);

    // XPath positions are 1-based: $first is this tab marker, $next is the one after it
    $first = $tab_count + 1;
    $next = $tab_count + 2;

    /**
     * Get content between tabs: keep the following siblings of this marker
     * that are also preceding siblings of the next marker.
     */
    $tab_content = $crawler
        ->filterXpath('//p[text()[contains(.,"tab=")]][' . $first . ']/following-sibling::*
        [
        count(.|//p[text()[contains(.,"tab=")]][' . $next . ']/preceding-sibling::*)
        =
        count(//p[text()[contains(.,"tab=")]][' . $next . ']/preceding-sibling::*)
        ]')
        ->each(function ($node) {
            return $node->outerHtml();
        });

    dump($tab_content);
}
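The XPath above is the usual XPath 1.0 intersection idiom: a node belongs to a set exactly when adding it to the set does not change the set's size, which is what the two count() calls compare. The sketch below shows the same idea with plain DOMDocument and DOMXPath, so it can be run without Symfony. The helper contentBetweenMarkers and the trimmed sample markup are assumptions for illustration. It also writes the marker as (//p[...])[N], which counts markers in document order rather than per parent; the two forms agree here because all markers are siblings.

```php
<?php
// Trimmed sample from the question, with the {tab=end} dummy marker already inserted.
$html = <<<'HTML'
<div>
<p>{tab=Description}</p>
<p>This is a default description</p>
<ul><li>It has</li><li>mixed content</li></ul>
<p>{tab=Features}</p>
<ul><li>It's good</li><li>I like it</li></ul>
<p>It does what I want</p>
<p>{tab=end}</p>
<p>{/tabs}</p>
</div>
HTML;

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// Hypothetical helper: outer HTML of every element between marker $n and
// marker $n + 1 (1-based). Returns an empty array when marker $n + 1 is missing.
function contentBetweenMarkers(DOMXPath $xpath, DOMDocument $doc, int $n): array
{
    $marker = '//p[text()[contains(., "tab=")]]';
    $start  = sprintf('(%s)[%d]', $marker, $n);
    $end    = sprintf('(%s)[%d]', $marker, $n + 1);

    // Following siblings of $start that are also preceding siblings of $end.
    $query = sprintf(
        '%1$s/following-sibling::*[count(. | %2$s/preceding-sibling::*) = count(%2$s/preceding-sibling::*)]',
        $start,
        $end
    );

    $result = [];
    foreach ($xpath->query($query) as $node) {
        $result[] = $doc->saveHTML($node);
    }

    return $result;
}

$description = contentBetweenMarkers($xpath, $doc, 1);
$features    = contentBetweenMarkers($xpath, $doc, 2);
$afterEnd    = contentBetweenMarkers($xpath, $doc, 3);
```

With the dummy marker in place, the last real tab has a "next" marker to stop at, which is why the third call returns nothing.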

I can now insert the tab names and their content into the database.
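For a database insert it can help to have one map of tab name to content. As an alternative to the dummy marker and the intersection XPath (a different technique, not the one used above), such a map could be built in a single pass over the sibling elements. This is only a sketch under assumptions: the helper collectTabs is hypothetical, and it assumes all tab markers and their content are siblings under one parent, as in the sample.

```php
<?php
// Hypothetical helper: walks the siblings from the first {tab=...} paragraph to
// the {/tabs} paragraph and returns [tab name => concatenated outer HTML].
function collectTabs(DOMDocument $doc): array
{
    $xpath = new DOMXPath($doc);
    $node  = $xpath->query('(//p[text()[contains(., "tab=")]])[1]')->item(0);

    $tabs    = [];
    $current = null;

    for (; $node !== null; $node = $node->nextSibling) {
        if (!$node instanceof DOMElement) {
            continue; // skip whitespace text nodes between elements
        }

        $text = $node->textContent;

        if ($node->nodeName === 'p' && strpos($text, '{/tabs}') !== false) {
            break; // closing paragraph reached
        }

        if ($node->nodeName === 'p' && preg_match('/\{tab=([^}]+)\}/', $text, $m) === 1) {
            $current = trim($m[1]); // a new tab starts here
            $tabs[$current] = '';
            continue;
        }

        if ($current !== null) {
            $tabs[$current] .= $doc->saveHTML($node);
        }
    }

    return $tabs;
}

// Trimmed sample from the question; no dummy marker is needed for this approach.
$sample = <<<'HTML'
<div>
<p>{tab=Description}</p>
<p>This is a default description</p>
<ul><li>It has</li><li>mixed content</li></ul>
<p>{tab=Features}</p>
<ul><li>It's good</li><li>I like it</li></ul>
<p>It does what I want</p>
<p>{/tabs}</p>
</div>
HTML;

$sampleDoc = new DOMDocument();
$sampleDoc->loadHTML($sample);
$tabs = collectTabs($sampleDoc);
```

The trade-off is that this walk only sees siblings of the first marker, whereas the XPath version can be adapted to other layouts by changing the expression.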

The links that helped the most:

XPath select all elements between two specific elements

XPath: how to select following siblings until a certain sibling

