
PHP: Extracting HTML data from an HTML file?

What I've been trying to do recently is extract listing information from a given HTML file.

For example, I have an HTML page with a list of many companies, along with their phone numbers, addresses, etc.

Each company is in its own table, and every table starts like this: <table border="0">

I tried to use PHP to get all of the information and use it later, for example by putting it in a txt file or importing it into a database.

I assume the way to achieve my goal is with regex, which is one of the things I really have trouble with in PHP.

I would appreciate it if you guys could help me here. (I only need to know what to look for, or at least something that could help me a little, not complete code or anything like that.)

Thanks in advance!!

I recommend taking a look at PHP's DOMDocument and parsing the file using an actual HTML parser, not regex.

There are some very straightforward ways of getting the tables, such as the getElementsByTagName method.


<?php

  // Read the HTML source. 'companies.html' is a hypothetical file name;
  // replace it with the path to your own file.
  $htmlCode = file_get_contents('companies.html');

  // create a new HTML parser
  // http://php.net/manual/en/class.domdocument.php
  $dom = new DOMDocument();

  // Real-world HTML is rarely perfectly valid; collect parser warnings
  // internally instead of printing them
  libxml_use_internal_errors(true);

  // Load the HTML into the parser
  // http://www.php.net/manual/en/domdocument.loadhtml.php
  $dom->loadHTML($htmlCode);
  libxml_clear_errors();

  // Locate all the tables within the document
  // http://www.php.net/manual/en/domdocument.getelementsbytagname.php
  $tables = $dom->getElementsByTagName('table');

  // iterate over all the tables
  foreach ($tables as $table)
  {
    // you can now work with $table and find children within, check for
    // specific attributes or classes (anything that would flag this as
    // the type of table you'd like to parse), then begin grabbing
    // information from within it, treating it as a DOMElement
    // http://www.php.net/manual/en/class.domelement.php
  }
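To make the last step more concrete, here is a minimal sketch of pulling the cell text out of each company table. It assumes every listing table carries border="0" (as described in the question) and that the data sits in plain <td> cells; the sample markup, the company data, and the $companies variable are made up for illustration, and DOMXPath is used here as one convenient way to filter by attribute.

```php
<?php
// Hypothetical sample markup standing in for the real page.
$htmlCode = <<<HTML
<html><body>
<table border="0">
  <tr><td>Acme Ltd</td><td>555-0100</td><td>1 Main St</td></tr>
</table>
<table border="0">
  <tr><td>Globex</td><td>555-0199</td><td>2 Side Ave</td></tr>
</table>
<table border="1"><tr><td>not a listing</td></tr></table>
</body></html>
HTML;

$dom = new DOMDocument();
libxml_use_internal_errors(true);   // collect parser warnings quietly
$dom->loadHTML($htmlCode);
libxml_clear_errors();

$xpath = new DOMXPath($dom);

// One array of cell texts per matching table.
$companies = [];
foreach ($xpath->query('//table[@border="0"]') as $table) {
    $cells = [];
    // './/td' is evaluated relative to the current table.
    foreach ($xpath->query('.//td', $table) as $td) {
        $cells[] = trim($td->textContent);
    }
    $companies[] = $cells;
}

// Print one tab-separated line per company.
foreach ($companies as $cells) {
    echo implode("\t", $cells), PHP_EOL;
}
```

From here the rows could be written to a text file with file_put_contents() or inserted into a database. The order and meaning of the cells depend entirely on the real page, so check them against the actual markup first.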

If you're familiar with jQuery (and even if you aren't, its commands are simple), I'd also recommend this PHP counterpart: http://code.google.com/p/phpquery/

If your HTML is valid XML (as XHTML is), you can parse it with SimpleXML.
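As a rough sketch of the SimpleXML route, assuming the markup is well-formed XML and declares no default namespace: the sample data and the $names variable are hypothetical, and only the first cell of each table is read.

```php
<?php
// Hypothetical, well-formed sample markup (no XML namespace declared).
$xhtml = <<<XML
<html><body>
<table border="0"><tr><td>Acme Ltd</td><td>555-0100</td></tr></table>
<table border="0"><tr><td>Globex</td><td>555-0199</td></tr></table>
</body></html>
XML;

$doc = simplexml_load_string($xhtml);

// Collect the text of the first cell in each listing table.
$names = [];
foreach ($doc->xpath('//table[@border="0"]') as $table) {
    $names[] = (string) $table->tr->td[0];
}

print_r($names);
```

Real XHTML pages usually declare the http://www.w3.org/1999/xhtml namespace, in which case the XPath query would need a prefix registered with registerXPathNamespace() before it matches anything. If the file is not well-formed XML, simplexml_load_string() returns false, and DOMDocument::loadHTML() is the more forgiving choice.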
