PHP XPath: get both the a href and the text node

I have a table that contains a number of headings like this:

<TR>
<TH CLASS="ddtitle" scope="colgroup" ><A HREF="http://foo.com">Linked text</A></TH>
</TR>

The table is thousands of lines long, so I can't share it in full, but here is the opening tag and one complete item from the table. Sadly, there is no wrapper element around each item, and the comments are mine, so it's a pain to work out where one item begins and ends.

<TABLE  CLASS="datadisplaytable" SUMMARY="Layout table" width="100%"><CAPTION class="captiontext">Items Found</CAPTION>
<!-- START of first item in the table -->
<TR>
<TH CLASS="ddtitle" scope="colgroup" ><A HREF="http://foo.com">Linked text</A></TH>
</TR>
<TR>
<TD CLASS="dddefault">
<SPAN class="fieldlabeltext">Term: </SPAN>Fall
<BR>
<SPAN class="fieldlabeltext">Registration: </SPAN>Jan 1, 2018 to Aug 1, 2018 
<BR>
<SPAN class="fieldlabeltext">Levels: </SPAN>Undergraduate 
<BR>
<BR>
Location
<BR>
Lecture Schedule Type
<BR>
       3.000 Credits
<BR>
<A HREF="foo">View Entry</A>
<BR>
<BR>
<TABLE  CLASS="datadisplaytable" SUMMARY="Meeting time table"><CAPTION class="captiontext">Scheduled Meeting Times</CAPTION>
<TR>
<TH CLASS="ddheader" scope="col" >Type</TH>
<TH CLASS="ddheader" scope="col" >Time</TH>
<TH CLASS="ddheader" scope="col" >Days</TH>
<TH CLASS="ddheader" scope="col" >Where</TH>
<TH CLASS="ddheader" scope="col" >Date Range</TH>
<TH CLASS="ddheader" scope="col" >Schedule Type</TH>
<TH CLASS="ddheader" scope="col" >Instructors</TH>
</TR>
<TR>
<TD CLASS="dddefault">Lecture</TD>
<TD CLASS="dddefault">9:20 am - 10:10 am</TD>
<TD CLASS="dddefault">MWF</TD>
<TD CLASS="dddefault">Some Building Room 101</TD>
<TD CLASS="dddefault">Aug 1, 2018 - Dec 1, 2018</TD>
<TD CLASS="dddefault">Lecture</TD>
<TD CLASS="dddefault">Instructor Name (<ABBR title= "Primary">P</ABBR>)<A HREF="mailto:email@foo.com"    target="Instructur Name" ><IMG SRC="/wtlgifs/email.png" ALIGN="middle" ALT="E-mail" CLASS="headerImg" TITLE="E-mail"  NAME="web_email" HSPACE=0 VSPACE=0 BORDER=0 HEIGHT=16 WIDTH=16></A></TD>
</TR>
</TABLE>
<BR>
<BR>
</TD>
</TR>
<!-- END first item in the table -->

I want to extract the item details, starting with the course name (the text content, "Linked text", inside th.ddtitle) and the course link (the href of the a inside th.ddtitle). Here's what I've tried for grabbing those two items:

$dom = new DOMDocument();
$myHtml = file_get_contents(__DIR__ . '/myfile.html'); // __DIR__ has no trailing slash
$dom->loadHTML($myHtml);
$xpath = new DOMXPath($dom);
// first part changes an outer table with the same class, so I can get inner tables without the outer one
$tables = $xpath->query("//table[@class='datadisplaytable']");
for($i=0; $i<1; $i++) {
    $tables[$i]->setAttribute('class', 'masterTable');
}
$html = $dom->saveHTML();
// now, the query I'm having trouble with:
$textAndLink = $xpath->query("//th[@class='ddtitle']/*");
$i=1;
foreach($textAndLink as $info) {
    foreach($info->childNodes as $child) {
        if($i%2 == 0) {
            echo $child->getAttribute('href') . '<br>';
        } else {
            echo $child->nodeValue . '<br>';
        }
    }
    $i++;
}

I've also tried print_r($child) and the only items displayed are the text nodes, no <a> tags. How can I get both the anchor's "href" attribute and the text content? What I am expecting from the code above is a list like this:

http://foo.com/<br>
Linked text<br>
http://foo.com/secondlink<br>
Second linked text<br>

and so on and so forth.

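The query //th[@class='ddtitle']/* already selects the <a> elements, so looping over their childNodes steps down into the text nodes, which have no href attribute. Select the anchors directly and read both values from each one: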

<?php

ini_set('display_errors', 1);
$string = '
<TABLE  CLASS="datadisplaytable" SUMMARY="Layout table" width="100%"><CAPTION class="captiontext">Items Found</CAPTION>
<!-- START of first item in the table -->
<TR>
<TH CLASS="ddtitle" scope="colgroup" ><A HREF="http://foo.com">Linked text</A></TH>
</TR>
<TR>
<TD CLASS="dddefault">
<SPAN class="fieldlabeltext">Term: </SPAN>Fall
<BR>
<SPAN class="fieldlabeltext">Registration: </SPAN>Jan 1, 2018 to Aug 1, 2018 
<BR>
<SPAN class="fieldlabeltext">Levels: </SPAN>Undergraduate 
<BR>
<BR>
Location
<BR>
Lecture Schedule Type
<BR>
       3.000 Credits
<BR>
<A HREF="foo">View Entry</A>
<BR>
<BR>
<TABLE  CLASS="datadisplaytable" SUMMARY="Meeting time table"><CAPTION class="captiontext">Scheduled Meeting Times</CAPTION>
<TR>
<TH CLASS="ddheader" scope="col" >Type</TH>
<TH CLASS="ddheader" scope="col" >Time</TH>
<TH CLASS="ddheader" scope="col" >Days</TH>
<TH CLASS="ddheader" scope="col" >Where</TH>
<TH CLASS="ddheader" scope="col" >Date Range</TH>
<TH CLASS="ddheader" scope="col" >Schedule Type</TH>
<TH CLASS="ddheader" scope="col" >Instructors</TH>
</TR>
<TR>
<TD CLASS="dddefault">Lecture</TD>
<TD CLASS="dddefault">9:20 am - 10:10 am</TD>
<TD CLASS="dddefault">MWF</TD>
<TD CLASS="dddefault">Some Building Room 101</TD>
<TD CLASS="dddefault">Aug 1, 2018 - Dec 1, 2018</TD>
<TD CLASS="dddefault">Lecture</TD>
<TD CLASS="dddefault">Instructor Name (<ABBR title= "Primary">P</ABBR>)<A HREF="mailto:email@foo.com"    target="Instructur Name" ><IMG SRC="/wtlgifs/email.png" ALIGN="middle" ALT="E-mail" CLASS="headerImg" TITLE="E-mail"  NAME="web_email" HSPACE=0 VSPACE=0 BORDER=0 HEIGHT=16 WIDTH=16></A></TD>
</TR>
</TABLE>
<BR>
<BR>
</TD>
</TR>';

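// Parse the fragment; libxml lowercases tag and attribute names, so the
// XPath below uses lowercase even though the markup is uppercase.
$domDocument = new DOMDocument();
$domDocument->loadHTML($string);

$domXPath = new DOMXPath($domDocument);
// Select the <a> elements themselves, not their child text nodes.
$results = $domXPath->query('//tr/th[@class="ddtitle"]/a');
foreach ($results as $result) {
    // Link first, then text, each followed by <br> as in the expected output.
    echo $result->getAttribute('href') . '<br>';
    echo $result->textContent . '<br>';
}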
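
Since the question also mentions that it is hard to tell where one item ends, here is a minimal sketch of one way to collect each title link together with the row that follows it. The function name extractTitleLinks, the array keys, and the two-item $sample markup are made up for illustration; the sketch assumes every item is a th.ddtitle row immediately followed by one td.dddefault row, as in the excerpt above.

```php
<?php
// Hypothetical helper: returns one entry per th.ddtitle link.
function extractTitleLinks(string $html): array
{
    // Collect libxml warnings internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    $items = [];
    foreach ($xpath->query('//th[@class="ddtitle"]/a') as $a) {
        // Relative query: the first <tr> after the title row holds the details.
        $cell = $xpath->query(
            'ancestor::tr[1]/following-sibling::tr[1]/td[@class="dddefault"]',
            $a
        )->item(0);

        $items[] = [
            'href'    => $a->getAttribute('href'),
            'text'    => trim($a->textContent),
            // Raw text of the details cell (includes any nested table text).
            'details' => $cell ? trim($cell->textContent) : null,
        ];
    }
    return $items;
}

// Assumed sample with two items, shaped like the table in the question.
$sample = '<table class="datadisplaytable">
<tr><th class="ddtitle"><a href="http://foo.com">Linked text</a></th></tr>
<tr><td class="dddefault">details</td></tr>
<tr><th class="ddtitle"><a href="http://foo.com/secondlink">Second linked text</a></th></tr>
<tr><td class="dddefault">more details</td></tr>
</table>';

foreach (extractTitleLinks($sample) as $item) {
    echo $item['href'], "<br>\n", $item['text'], "<br>\n";
}
```

Passing the anchor as the second argument to query() keeps the lookup relative to that item, so rows of the nested "Meeting time table" are not picked up as siblings of the title row.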
