
Don't render HTML when scraping content with PHP

I'm working on a scraper to collect contact information for a marketing project, but I'm having trouble organizing the scraped data within my script. The biggest problem involves nested markup like the following:

<font attribute="something">
   <font otherattribute="somethingelse">
      <font otherattribute="onemore">
         Content of Interest
      </font>
   </font>
</font>

When parsing the DOM to scrape out the content of interest, my script looks for a <font> within another <font> and saves all the content it finds to an array as unique entries. The issue, however, is that I'm finding repeat entries within the array. I tried having the script check for equality between two successive entries before pushing them into the array, but I get results like the following when var_dump() is called on two entries that APPEAR equal but are not considered equal by the script.

string(76) "Content of Interest" 
string(47) "Content of Interest" 

My best guess is that the PHP script is rendering the HTML rather than treating each entry as the inner text of the HTML node. I want to save only a simple text version of the content pulled from each node.

How can I ensure that the string saved to the array is ONLY the text that I can see, and not HTML containing parts that I can't see in my browser?

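Use a PHP function such as strip_tags() to get your text without any HTML. The visible text "Content of Interest" is only 19 characters long, so the 76- and 47-character strings still contain something the browser does not display when it shows the var_dump() output, most likely the nested <font> tags and surrounding whitespace.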

http://php.net/manual/en/function.strip-tags.php
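
As a rough sketch of how this could look in practice (not necessarily what the asker's script does), the snippet below strips the tags, decodes entities, and collapses whitespace before de-duplicating. The helper name visible_text() and the sample entries in $scraped are made up for illustration; strip_tags(), html_entity_decode(), preg_replace(), trim(), and array_unique() are standard PHP functions.

```php
<?php
// Hypothetical helper (not from the original post): reduce a scraped
// HTML fragment to the text a browser would actually display.
function visible_text(string $html): string
{
    // Remove every tag, keeping only the text between them.
    $text = strip_tags($html);

    // Turn entities such as &amp; or &nbsp; into real characters.
    $text = html_entity_decode($text, ENT_QUOTES | ENT_HTML5, 'UTF-8');

    // Collapse runs of whitespace (including non-breaking spaces) into one
    // space. preg_replace() returns null on invalid UTF-8, so fall back to
    // the unmodified text in that case.
    $text = preg_replace('/[\s\x{00A0}]+/u', ' ', $text) ?? $text;

    return trim($text);
}

// Made-up sample entries that look identical in a browser but differ in length.
$scraped = [
    '<font otherattribute="somethingelse"><font otherattribute="onemore">Content of Interest</font></font>',
    "<font otherattribute=\"onemore\">\n   Content of Interest\n</font>",
    'Content of Interest',
];

// Normalize each entry first, then drop duplicates and reindex.
$unique = array_values(array_unique(array_map('visible_text', $scraped)));

var_dump($unique);
```

With these sample entries, $unique ends up holding a single 19-character string. To see what the hidden characters in an entry actually are, it may help to print htmlspecialchars($entry) or to view the page source instead of the rendered page.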


 