
Scraping using PHP + SimpleXML… I can grab images but not raw text?

I'm trying to grab a specific bit of raw text from a web site. Using this site and other sources, I learned how to grab specific images using SimpleXML and XPath.

However, the same approach doesn't appear to work for grabbing raw text. Here's what's NOT working right now.

// first I set the xpath of the div that contains the text I want
$xpath = '//*[@id="storyCommentCountNumber"]';

// then I create a new DOM Document
$html = new DOMDocument();

// then I fetch the file and parse it (@ suppresses warnings).
@$html->loadHTMLFile($url);

// then convert DOM to SimpleXML
$xml = simplexml_import_dom($html);   

// run an XPath query on the div I want using the previously set xpath
$commcount = $xml->xpath($xpath);
print_r($commcount);

Now, when I'm grabbing an image, $commcount comes back as an array that contains the image's source in it somewhere.

In this case, I want it to return the raw text contained in the "storyCommentCountNumber" div. But that text doesn't appear to be in the result, just the name of the div.

What am I doing wrong? I can kind of see that this approach is only for grabbing HTML elements and the bits inside of them, not raw text. How do I get the text inside that div?

Thanks!

One thing to note is that when you use print_r or var_dump on SimpleXML objects, you won't always see the text of the object (or sometimes the attributes). To see everything, output the full XML string using $variable->asXML().

To get the text, cast the SimpleXML object to a string. This automatically pulls out the element's text.

 /* remember $commcount is always an array from the xpath */
 foreach($commcount as $str)
 {
     echo (string)$str;
 }
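
For a self-contained way to try the cast, here is a minimal sketch. The $sample markup is a hypothetical stand-in for the real page (the actual HTML isn't shown in the question), so the only assumption is that the div holds plain text directly.

```php
<?php
// Hypothetical sample markup standing in for the fetched page.
$sample = '<html><body><div id="storyCommentCountNumber">42</div></body></html>';

$html = new DOMDocument();
$html->loadHTML($sample);

// Same steps as in the question: DOM -> SimpleXML -> XPath.
$xml = simplexml_import_dom($html);
$commcount = $xml->xpath('//*[@id="storyCommentCountNumber"]');

foreach ($commcount as $node) {
    // Casting the SimpleXMLElement to a string yields its text.
    $text = trim((string) $node);
    echo $text, PHP_EOL;
}
```

With the real page you would keep your loadHTMLFile($url) call and only add the string cast.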

Hopefully the above can give you a start.

Can you include a sample of the HTML (including maybe a few lines before and after the element you are selecting) and the output from print_r()?

You might try the following to see if it helps you out:

if ( count($commcount) > 0 ) {
    $divContent = $commcount[0]->asXml();
    print $divContent;
}

I know you're trying to use SimpleXML, but I think it would be easier to get the raw text with a regular expression.
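
A rough sketch of what the regular-expression route might look like, assuming the div has no nested div inside it. The $sample string is hypothetical, and a pattern like this is brittle if the page's markup changes.

```php
<?php
// Hypothetical fragment standing in for the page source
// (for a real page this could come from file_get_contents($url)).
$sample = '<p>Comments:</p><div id="storyCommentCountNumber"> 42 </div>';

// Capture everything between the opening tag with this id and the next </div>.
$pattern = '~<div[^>]*\bid="storyCommentCountNumber"[^>]*>(.*?)</div>~s';

$count = null;
if (preg_match($pattern, $sample, $matches)) {
    // strip_tags drops any inline markup; trim removes surrounding whitespace.
    $count = trim(strip_tags($matches[1]));
}

var_dump($count);
```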

Try checking this page out.

:)

The raw text inside the div is not part of the div element itself, rather it is part of the first child node of the div element. There should be a text node within the div that contains the data you are looking for.
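
As a sketch of that idea, the text node can be selected directly by adding text() to the XPath and running it through DOMXPath instead of SimpleXML. The $sample markup is a hypothetical stand-in for the real page.

```php
<?php
// Hypothetical sample markup standing in for the fetched page.
$sample = '<html><body><div id="storyCommentCountNumber"> 42 </div></body></html>';

$doc = new DOMDocument();
$doc->loadHTML($sample);

$finder = new DOMXPath($doc);

// text() selects the text-node children of the div, not the div element itself.
$nodes = $finder->query('//*[@id="storyCommentCountNumber"]/text()');

$count = null;
if ($nodes->length > 0) {
    // nodeValue of a text node is its character data.
    $count = trim($nodes->item(0)->nodeValue);
}

var_dump($count);
```

If the div may contain nested elements, reading textContent on the div node itself collects the text of all descendants instead.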

The technical posts on this site are licensed under CC BY-SA 4.0. If you reprint, please cite the site URL or the original address.

 