Don't render HTML when scraping content with PHP

I'm working on a scraper to collect contact information for a marketing project, but I'm running into an issue organizing the scraped data within my script. One of the biggest issues is the following markup:

<font attribute="something">
   <font otherattribute="somethingelse">
      <font otherattribute="onemore">
         Content of Interest
      </font>
   </font>
</font>

When parsing the DOM to scrape out the content of interest, my script looks for a <font> within another <font> and saves all the content it finds to an array, intended as unique entries. The issue, however, is that I'm finding repeated entries in the array. I tried having the script check for equality between two successive entries before pushing them into the array, but when var_dump() is called on two entries that APPEAR equal, the script does not consider them equal, and I get results like the following:

string(76) "Content of Interest" 
string(47) "Content of Interest" 

My best guess is that the PHP script is rendering the HTML rather than treating each entry as the inner text of the HTML node. I only want to save a plain-text version of the content pulled from each node.

How can I ensure that the string saved to the array is ONLY the text that I can see, and not rendered HTML containing parts that I can't see in my browser?

Use PHP functions like strip_tags() to get your text without any HTML.

http://php.net/manual/en/function.strip-tags.php
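
The differing lengths in the var_dump() output (76 and 47 for text that displays as 19 characters) suggest that the saved strings still hold markup or whitespace that the browser hides. As a minimal sketch of the strip_tags() suggestion, assuming the array entries are raw HTML fragments: the helper name visibleText() and the sample fragments are made up for illustration, and the entity decoding and whitespace collapsing are extra steps beyond strip_tags() itself.

```php
<?php
// Hypothetical helper (not from the original post): reduce a scraped
// HTML fragment to the text a reader would actually see.
function visibleText(string $fragment): string
{
    // Remove all tags, keeping only the text between them.
    $text = strip_tags($fragment);
    // Turn entities such as &nbsp; or &amp; into real characters.
    $text = html_entity_decode($text, ENT_QUOTES | ENT_HTML5, 'UTF-8');
    // Collapse runs of whitespace (including non-breaking spaces) to one space.
    // preg_replace() returns null on invalid UTF-8, so fall back to $text.
    $text = preg_replace('/[\s\x{00A0}]+/u', ' ', $text) ?? $text;
    return trim($text);
}

// Made-up sample entries: different raw strings, same visible text.
$scraped = [
    '<font otherattribute="somethingelse"><font otherattribute="onemore">Content of Interest</font></font>',
    "<font otherattribute=\"onemore\">\n   Content of Interest\n</font>",
];

$unique = [];
foreach ($scraped as $fragment) {
    $text = visibleText($fragment);
    // Strict comparison on the cleaned text, so hidden markup no longer matters.
    if ($text !== '' && !in_array($text, $unique, true)) {
        $unique[] = $text;
    }
}

var_dump($unique); // one entry: string(19) "Content of Interest"
```

When debugging in a browser, wrapping the raw string in htmlspecialchars() before printing it shows the tags instead of letting the browser render them. Note that strip_tags() is not a full HTML parser, so badly malformed markup may need a DOM-based approach instead.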

