Don't render HTML when scraping content with PHP

Question

I'm working on a scraper to collect contact information for a marketing project, but I'm having trouble organizing the scraped data within my script. The biggest issue is with markup like the following:

<font attribute="something">

   <font otherattribute="somethingelse">

      <font otherattribute="onemore">

         Content of Interest

      </font>
   </font>
</font>

When parsing the DOM to scrape out the content of interest, my script looks for a <font> within another <font> and saves all the content it finds to an array as unique entries. The issue, however, is that I'm finding repeat entries within the array. I tried having the script check for equality between two successive entries before pushing them into the array, but when var_dump() is called on two entries that APPEAR equal, yet are not considered equal by the script, I get results like the following:

string(76) "Content of Interest" 
string(47) "Content of Interest"

My best guess is that the PHP script is rendering the HTML rather than treating each entry as the inner text of the HTML node. I want to save only a simple text version of the content pulled from each node.

How can I ensure that the string saved to the array is ONLY the text that I can see, and not HTML that contains parts I can't see in my browser?

Answer 1

Use PHP functions like strip_tags() to get your text without any HTML.

http://php.net/manual/en/function.strip-tags.php
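
As a rough sketch of how that could look in practice: the snippet below strips the tags, decodes entities, and collapses whitespace before comparing entries, so two strings that look the same in the browser also compare as equal. The helper name visibleText() and the sample markup are hypothetical, made up for illustration, and the whitespace handling is an assumption about what is causing the length mismatch in the var_dump() output.

```php
<?php
// Hypothetical helper (not part of PHP): reduce an HTML fragment to its visible text.
function visibleText(string $html): string
{
    // Remove every tag, keeping only the text between them.
    $text = strip_tags($html);
    // Turn entities such as &amp; back into plain characters.
    $text = html_entity_decode($text, ENT_QUOTES | ENT_HTML5, 'UTF-8');
    // Collapse runs of whitespace (newlines, indentation) into single spaces.
    // preg_replace() returns null on error, so fall back to the unmodified text.
    $text = preg_replace('/\s+/u', ' ', $text) ?? $text;
    return trim($text);
}

// Hypothetical sample input, modeled on the nested <font> markup in the question.
$scraped = [
    '<font otherattribute="somethingelse"><font otherattribute="onemore"> Content of Interest </font></font>',
    "<font otherattribute=\"onemore\">\n   Content of Interest\n</font>",
];

// Normalize each entry, then drop duplicates and reindex the array.
$unique = array_values(array_unique(array_map('visibleText', $scraped)));

var_dump($unique);
```

When inspecting results in a browser, wrapping the output in htmlspecialchars() or viewing the page source shows any leftover tags instead of letting the browser render them.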

solution1
2 ACCEPTED 2014-07-17 06:34:39