
How to use simplehtmldom to extract data from this page

I am trying to use simplehtmldom to extract information from https://benthamopen.com/browse-by-title/B/1/.

Specifically, I want to access this part of the page:

<div style="padding:10px;">
<strong>ISSN: </strong>1874-1207<br><div class="sharethis-inline-share-buttons" style="padding-top:10px;" data-url="https://benthamopen.com/TOBEJ/home/" data-title="The Open Biomedical Engineering Journal"></div>
</div>

I have this code:

$html = file_get_html('https://benthamopen.com/browse-by-title/B/1/');

foreach($html->find('div[style=padding:10px;]') as $ele) {
    echo("<pre>".print_r($ele,true)."</pre>");
}

...which returns the following (I am showing only one item from the page):

simplehtmldom\HtmlNode Object
(
    [nodetype] => HDOM_TYPE_ELEMENT (1)
    [tag] => div
    [attributes] => Array
        (
            [style] => padding:10px;
        )

    [nodes] => Array
        (
            [0] => simplehtmldom\HtmlNode Object
                (
                    [nodetype] => HDOM_TYPE_ELEMENT (1)
                    [tag] => strong
                    [attributes] => none
                    [nodes] => none
                )

            [1] => simplehtmldom\HtmlNode Object
                (
                    [nodetype] => HDOM_TYPE_TEXT (3)
                    [tag] => text
                    [attributes] => none
                    [nodes] => none
                )

            [2] => simplehtmldom\HtmlNode Object
                (
                    [nodetype] => HDOM_TYPE_ELEMENT (1)
                    [tag] => br
                    [attributes] => none
                    [nodes] => none
                )

            [3] => simplehtmldom\HtmlNode Object
                (
                    [nodetype] => HDOM_TYPE_ELEMENT (1)
                    [tag] => div
                    [attributes] => Array
                        (
                            [class] => sharethis-inline-share-buttons
                            [style] => padding-top:10px;
                            [data-url] => https://benthamopen.com/TOBEJ/home/
                            [data-title] => The Open Biomedical Engineering Journal
                        )

                    [nodes] => none
                )

        )

)

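I am not sure how to proceed from here. I would like to extract:

  • The ISSN text (not shown in the echo output, and I do not know why) [1874-1207 in the example above]. It is element zero of [nodes]
  • 'data-url' [https://benthamopen.com/TOBEJ/home/ in the example above]
  • 'data-title' [The Open Biomedical Engineering Journal in the example above]

Perhaps my understanding of PHP objects and arrays is not as good as it should be, and I do not know why the ISSN does not show up in the echo output.

I have tried various (many) things, but I am struggling to extract the data from the element.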

I'm not familiar with simplehtmldom, except to know to avoid it. So I'll present a solution using PHP's built-in DOM classes:

<?php
libxml_use_internal_errors(true);
// get the HTML
$html = file_get_contents("https://benthamopen.com/browse-by-title/B/1/");

// create a DOM object and load it up
$dom = new DOMDocument();
$dom->loadHTML($html);

// create an XPath object and query it
$xpath = new DOMXPath($dom);
$elements = $xpath->query("//div[@style='padding:10px;']");

// loop through the matches
foreach ($elements as $el) {
    // skip elements without ISSN
    $text = trim($el->textContent);
    if (strpos($text, "ISSN") !== 0) {
        continue;
    }
    // get the first div inside this thing
    $div = $el->getElementsByTagName("div")->item(0);
    if ($div === null) {
        continue; // no nested div, so there are no data-* attributes to read
    }
    // dump it out
    printf("%s %s %s<br/>\n", str_replace("ISSN: ", "", $text), $div->getAttribute("data-title"), $div->getAttribute("data-url"));
}

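The XPath part may look a bit overwhelming, but for a simple search like this it is not much different from a CSS selector. Hopefully the comments explain everything; if not, let me know!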
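To try the same approach without hitting the network, here is a minimal sketch that runs the same XPath query against the HTML fragment quoted in the question. The function name extractJournals and the $sample variable are hypothetical, introduced only for illustration. The query `//div[@style='padding:10px;']` plays the role of the CSS selector `div[style=padding:10px;]` used in the question.

```php
<?php
// Hypothetical helper (not part of the original answer): returns one
// associative array per journal found in the given HTML string.
function extractJournals(string $html): array
{
    libxml_use_internal_errors(true); // keep warnings about imperfect HTML quiet
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($dom);
    $rows = [];
    // Same query as above: every <div> whose style attribute is exactly "padding:10px;"
    foreach ($xpath->query("//div[@style='padding:10px;']") as $el) {
        $text = trim($el->textContent);
        if (strpos($text, "ISSN") !== 0) {
            continue; // this div does not describe a journal
        }
        $div = $el->getElementsByTagName("div")->item(0);
        if ($div === null) {
            continue; // no nested share-buttons div to read attributes from
        }
        $rows[] = [
            "issn"  => trim(str_replace("ISSN:", "", $text)),
            "title" => $div->getAttribute("data-title"),
            "url"   => $div->getAttribute("data-url"),
        ];
    }
    return $rows;
}

// The fragment quoted in the question, used here as offline test input.
$sample = <<<'HTML'
<div style="padding:10px;">
<strong>ISSN: </strong>1874-1207<br><div class="sharethis-inline-share-buttons" style="padding-top:10px;" data-url="https://benthamopen.com/TOBEJ/home/" data-title="The Open Biomedical Engineering Journal"></div>
</div>
HTML;

print_r(extractJournals($sample));
```

Returning an array instead of printing directly keeps the parsing separate from the output format, so the same rows could be written as HTML, CSV, or JSON.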

Output:

1874-1207 The Open Biomedical Engineering Journal https://benthamopen.com/TOBEJ/home/<br/>
1874-1967 The Open Biology Journal https://benthamopen.com/TOBIOJ/home/<br/>
1874-091X The Open Biochemistry Journal https://benthamopen.com/TOBIOCJ/home/<br/>
1875-0362 The Open Bioinformatics Journal https://benthamopen.com/TOBIOIJ/home/<br/>
1875-3183 The Open Biomarkers Journal https://benthamopen.com/TOBIOMJ/home/<br/>
2665-9956 The Open Biomaterials Science Journal https://benthamopen.com/TOBMSJ/home/<br/>
1874-0707 The Open Biotechnology Journal https://benthamopen.com/TOBIOTJ/home/<br/>
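On the question of where the ISSN went: the dump in the question lists a text node at index 1 of [nodes] (right after the strong element) but shows no content for it, and that text node is where the ISSN lives. The sketch below, again using the built-in DOM classes on the fragment from the question, lists the direct children of the div and then reads that text node directly with an XPath expression. The function names describeChildren and issnOf are hypothetical.

```php
<?php
// Hypothetical helper: loads an HTML string and returns the first
// <div style="padding:10px;"> together with an XPath object for it.
function firstJournalDiv(string $html): array
{
    libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    $xpath = new DOMXPath($dom);
    return [$xpath, $xpath->query("//div[@style='padding:10px;']")->item(0)];
}

// Lists [node name, trimmed text] for each direct child of the div,
// skipping text nodes that contain only whitespace.
function describeChildren(string $html): array
{
    [, $div] = firstJournalDiv($html);
    $out = [];
    if ($div === null) {
        return $out;
    }
    foreach ($div->childNodes as $child) {
        $value = trim((string) $child->nodeValue);
        if ($child->nodeType === XML_TEXT_NODE && $value === "") {
            continue;
        }
        $out[] = [$child->nodeName, $value];
    }
    return $out;
}

// Reads the first text node that follows <strong> inside the div.
function issnOf(string $html): string
{
    [$xpath, $div] = firstJournalDiv($html);
    if ($div === null) {
        return "";
    }
    return trim($xpath->evaluate("string(strong/following-sibling::text()[1])", $div));
}

$fragment = '<div style="padding:10px;"><strong>ISSN: </strong>1874-1207<br>'
    . '<div class="sharethis-inline-share-buttons" style="padding-top:10px;" '
    . 'data-url="https://benthamopen.com/TOBEJ/home/" '
    . 'data-title="The Open Biomedical Engineering Journal"></div></div>';

print_r(describeChildren($fragment));
echo issnOf($fragment), "\n";
```

Reading the text node this way avoids stripping the "ISSN: " label with str_replace, at the cost of depending on the strong element coming immediately before the number.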
