PHP XPath: extracting tab content from a Joomla / HikaShop description

Question

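[TL;DR] I need to parse HTML with PHP to extract tabs and their content.

I am migrating data from a Joomla/HikaShop site that was exported as a CSV file. The tabs are defined by content inside P tags, like this: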

<p> </p>
<p style="text-align: center;"><span style="text-decoration: underline;"><span style="font-size: 14pt;"><strong>Strong Item</strong></span></span></p>
<p> {tab=Description}</p>
<p>This is a default description</p>
<ul>
<li>It has</li>
<li>mixed content</li>
</ul>
<p>{tab=Features} </p>
<ul style="list-style-type: circle;">
<li>It's good</li>
<li>I like it</li>
</ul>
<p>It does what I want</p>
<p> </p>
<p>{/tabs}</p>

I need to extract each tab name followed by its content.

I can pull out the tabs easily enough:

$crawler->filterXpath('//p[text()[contains(.,"tab=")]]')->each(function ($node) {

What is giving me trouble is getting the content between the tabs.

Description =

<ul>
<li>It has</li>
<li>mixed content</li>
</ul>

Features =

<ul style="list-style-type: circle;">
<li>It's good</li>
<li>I like it</li>
</ul>
<p>It does what I want</p>
<p> </p>

Obviously I could regex it and loop over the lines, etc., but that is error-prone.

Thanks

Answer 1

Thanks to mickmackusa for the links, which helped me piece this puzzle together.

Using the links, I was able to get the content between each tab opening:

<p>{tab=newtab}</p>

My process was to clean up the HTML with tidy and then load it into a new DOMDocument.

use Symfony\Component\DomCrawler\Crawler;

$html = file_get_contents('E:\Dropbox\laragon\www\scrape\description.txt');

// Tidy options: emit only the body as XHTML and drop empty paragraphs.
$config = array(
    'indent'           => true,
    'output-xhtml'     => true,
    'show-body-only'   => true,
    'drop-empty-paras' => true,
    'wrap'             => 1200
);

// Tidy
$tidy = new tidy;
$tidy->parseString($html, $config, 'utf8');
$tidy->cleanRepair();

// Load the cleaned markup into a DOMDocument and wrap it in a Crawler.
$doc = new DOMDocument;
$doc->loadHTML($tidy->value);

$crawler = new Crawler($doc);

The closing marker for the tabs is

<p>{/tabs}</p>

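This does not match the code I had, which meant it needed some extra processing. As this is a one-off project, I did a quick fix.

I crawled the page and added a new paragraph element before the closing tabs paragraph. The code looks for /tabs in a paragraph and then, in effect, adds a new tab section with no content.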

$crawler
    ->filterXpath('//p[text()[contains(.,"/tabs")]]')
    ->each(function (Crawler $crawler) use ($doc) {
        foreach ($crawler as $node) {
            // Insert a dummy <p>{tab=end}</p> marker just before <p>{/tabs}</p>.
            $marker = $doc->createElement('p', '{tab=end}');
            $node->parentNode->insertBefore($marker, $node);
        }
    });

This results in the HTML

<p>{tab=end}</p>
<p>{/tabs}</p>
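The same sentinel trick can be sketched with plain DOMDocument and DOMXPath, without the Symfony Crawler. This is only an illustration under assumptions: the sample markup and the variable names ($closers, $sentinel, $markerTexts) are made up for the example.

```php
<?php
// Sketch: insert a dummy {tab=end} marker before the closing {/tabs} paragraph.
$doc = new DOMDocument();
$doc->loadHTML('<div><p>{tab=Features}</p><p>It does what I want</p><p>{/tabs}</p></div>');
$xpath = new DOMXPath($doc);

// Copy the matches into an array first, so the document is not
// modified while the query result is still being iterated.
$closers = iterator_to_array($xpath->query('//p[text()[contains(.,"/tabs")]]'));
foreach ($closers as $closer) {
    $sentinel = $doc->createElement('p', '{tab=end}');
    $closer->parentNode->insertBefore($sentinel, $closer);
}

// Collect the text of every paragraph that now counts as a tab marker.
$markerTexts = [];
foreach ($xpath->query('//p[text()[contains(.,"tab=")]]') as $p) {
    $markerTexts[] = $p->textContent;
}
```

Because "{/tabs}" does not contain the substring "tab=", the closing paragraph is not picked up by the marker query; only the inserted sentinel is.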

Now I take the edited HTML from $crawler->html() and look for each tab section (starting with <p>{tab=TABNAME}</p> and ending at <p>{tab=NEXTTABNAME}</p>).

I first get the headings

$tab_headings = $crawler->filterXpath('//p[text()[contains(.,"tab=")]]')->each(function ($node) {
    // Capture NAME from {tab=NAME}; stop at the first closing brace.
    if (preg_match('/\{tab=([^}]*)\}/', $node->text(), $matches)) {
        return $matches[1];
    }

    // Avoid returning an undefined variable when the paragraph does not match.
    return null;
});

I remove the last one (the dummy one I added)

array_pop($tab_headings);

I can now loop through and extract the HTML. I am using Laravel, hence the use of dump.

$tab_count = 0;
foreach ($tab_headings as $tab) {
    dump($tab);
    // XPath positions are 1-based: $first is this tab marker, $next is the marker after it.
    $first = $tab_count + 1;
    $next = $tab_count + 2;
    /**
     * Get content between tabs: keep the following siblings of the current marker
     * that are also preceding siblings of the next marker.
     */
    $tab_content = $crawler
        ->filterXpath('//p[text()[contains(.,"tab=")]][' . $first . ']/following-sibling::*
        [
        count(.|//p[text()[contains(.,"tab=")]][' . $next . ']/preceding-sibling::*)
        =
        count(//p[text()[contains(.,"tab=")]][' . $next . ']/preceding-sibling::*)
        ]')
        ->each(function ($node) {
            return $node->outerHtml();
        });

    $tab_count++;

    dump($tab_content);
}
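The XPath in the loop relies on the XPath 1.0 set-intersection idiom, `$a[count(.|$b) = count($b)]`: a node from the first set is kept only if adding it to the second set does not change that set's size, that is, if it already belongs to both. A minimal sketch with plain DOMXPath follows; the helper name nodesBetween and the sample markup are assumptions made for illustration, not part of the code above.

```php
<?php
// Sketch: select the sibling elements between two tab markers with plain DOMXPath.
$html = '<div><p>{tab=A}</p><ul><li>one</li></ul><p>text</p>'
      . '<p>{tab=B}</p><p>other</p><p>{tab=end}</p></div>';

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// Hypothetical helper: returns the tag names of the elements found
// between marker number $first and marker number $next (1-based).
function nodesBetween(DOMXPath $xpath, int $first, int $next): array
{
    $marker = '//p[text()[contains(.,"tab=")]]';
    // following siblings of marker $first, intersected with
    // preceding siblings of marker $next
    $query = "{$marker}[{$first}]/following-sibling::*"
           . "[count(.|{$marker}[{$next}]/preceding-sibling::*)"
           . " = count({$marker}[{$next}]/preceding-sibling::*)]";

    $names = [];
    foreach ($xpath->query($query) as $node) {
        $names[] = $node->nodeName;
    }
    return $names;
}

$betweenAandB   = nodesBetween($xpath, 1, 2);
$betweenBandEnd = nodesBetween($xpath, 2, 3);
```

Note that the positional predicates such as `[1]` and `[2]` count markers among the children of the same parent, so this only behaves as intended when all the tab paragraphs are siblings, as they are in the exported descriptions shown in the question.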

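I can now insert into the database, etc.

The links that helped the most

XPath select all elements between two specific elements

XPath: How to select following siblings until a certain sibling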
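For comparison, here is a sketch of a different approach, not the one used in this answer: walk the top-level elements once and group them under the most recent {tab=NAME} marker, which avoids both the dummy marker and one XPath query per tab. The function name splitTabs and the wrapping div with id "root" are assumptions made for the example, and it assumes the tab paragraphs sit at the top level of the description.

```php
<?php
// Sketch (alternative to the XPath loop): group elements by the preceding tab marker.
function splitTabs(string $html): array
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true);   // keep parser warnings out of the output
    // Wrap the fragment so its top-level nodes share one known parent.
    $doc->loadHTML('<div id="root">' . $html . '</div>');
    libxml_clear_errors();

    $root = (new DOMXPath($doc))->query('//div[@id="root"]')->item(0);
    $tabs = [];
    $current = null;

    foreach ($root->childNodes as $node) {
        if ($node->nodeType !== XML_ELEMENT_NODE) {
            continue;                    // skip whitespace text nodes
        }
        $text = $node->textContent;
        if (preg_match('/\{tab=([^}]*)\}/', $text, $m)) {
            $current = $m[1];            // a new tab starts here
            $tabs[$current] = '';
            continue;
        }
        if (strpos($text, '{/tabs}') !== false) {
            $current = null;             // closing marker: stop collecting
            continue;
        }
        if ($current !== null) {
            $tabs[$current] .= $doc->saveHTML($node);
        }
    }
    return $tabs;
}

$tabs = splitTabs(
    '<p>{tab=Description}</p><p>Default</p><ul><li>It has</li></ul>'
    . '<p>{tab=Features}</p><p>It works</p><p>{/tabs}</p>'
);
```

The result is an array keyed by tab name, with the concatenated HTML of each tab's elements as the value. Content that appears before the first marker or after {/tabs} is ignored.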
