Get information from a website with PHP: recursively traverse each HTML node
I have done a lot of research but haven't found an answer. I am trying to get some information from a web page that has the following HTML structure:
<div id="xxx" class="some1">
<h1>This is the time</h1>
<div class="ti12">
<div class="sss"></div>
<div class="sss">
<span class="hhh">
<div class="sded">
City:
<span class="sh">CCC</span>
</div>
</span>
</div>
</div>
.
.
.
<div class="pp12"></div>
</div>
What I am trying to do is fetch the name of the city, and other information in the same way.
I have to find the following fields in the code above:
$arr=array('City', 'Name', 'Address', 'DOB');
If a field exists, fetch its value; otherwise leave it blank.
I hope that is clear.
Here is the code I tried:
<?php
include "simple_html_dom.php";
$html = new simple_html_dom();
$listItem = array('City', 'Name', 'Address', 'DOB');
$html->load_file('simp.html');
$found=array();
foreach($listItem as $item){
$ret = $html->find('div[id=xxx] div',0);
iterateParentNode($ret, $item);
}
function iterateParentNode($ret1, $item1){
for ($node=0;$node < count($ret1->children());$node++){
$child=$ret1->children($node);
echo count($ret1->children())."<br/>";
if(count($ret1->children())==1 && strpos($child, '<span class="sh"')!==false ){
$found[$item1]=$ret1->find('span[class=sh]',0)->plaintext;
return true;
}else{
goThroughChildNode($child, $item1);
}
}
}
function goThroughChildNode($child1, $item2){
echo $child1."ITEM:".$item2;
if(strpos($child1, $item2)!==false){
iterateParentNode($child1, $item2);
}else{
return false ;
}
return true;
}
foreach ($found as $structure=>$data){
echo $structure."=>".$data."<br />";
}
?>
I know my PHP approach is not good, so please suggest a better way to do this, taking my PHP code into account.
It would probably be simplest to do this with a regex. Of course, it will break if the HTML structure changes.
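// ereg() was removed in PHP 7 and its POSIX syntax has no lazy quantifiers, so use preg_match().
// The "s" modifier lets "." match newlines, which this multi-line HTML needs.
if (preg_match('~<div.*?h1>(.*?)</h1>.*?City:.*?>(.*?)<~s', $input, $regs)) {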
$title = $regs[1];
$city = $regs[2];
} else {
$title = "";
$city = "";
}
/*
Match 1 of 1
Matched text: <div id="xxx" class="some1">
<h1>This is the time</h1>
<div class="ti12">
<div class="sss"></div>
<div class="sss">
<span class="hhh">
<div class="sded">
City:
<span class="sh">CCC<
Match offset: 0
Match length: 282
Group 1: This is the time
Group 1 offset: 42
Group 1 length: 16
Group 2: CCC
Group 2 offset: 278
Group 2 length: 3
*/
// <div.*?h1>(.*?)</h1>.*?City:.*?>(.*?)<
//
// Match the characters "<div" literally «<div»
// Match any single character «.*?»
// Between zero and unlimited times, as few times as possible, expanding as needed (lazy) «*?»
// Match the characters "h1>" literally «h1>»
// Match the regular expression below and capture its match into backreference number 1 «(.*?)»
// Match any single character «.*?»
// Between zero and unlimited times, as few times as possible, expanding as needed (lazy) «*?»
// Match the characters "</h1>" literally «</h1>»
// Match any single character «.*?»
// Between zero and unlimited times, as few times as possible, expanding as needed (lazy) «*?»
// Match the characters "City:" literally «City:»
// Match any single character «.*?»
// Between zero and unlimited times, as few times as possible, expanding as needed (lazy) «*?»
// Match the character ">" literally «>»
// Match the regular expression below and capture its match into backreference number 2 «(.*?)»
// Match any single character «.*?»
// Between zero and unlimited times, as few times as possible, expanding as needed (lazy) «*?»
// Match the character "<" literally «<»
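The same idea can be stretched to the whole list of labels from the question. The following is only a sketch: `extractFields()` is a hypothetical helper, not part of the original answer, and it assumes every value sits in the first tag after a "Label:" text, as in the sample HTML. A label that is not found is left blank.

```php
<?php
// Hypothetical helper: look up each label with its own preg_match() call.
function extractFields(string $input, array $labels): array
{
    $found = [];
    foreach ($labels as $label) {
        // "Label:", then anything up to the next ">", then capture up to the next "<".
        // preg_quote() escapes the label; the "s" modifier lets "." span newlines.
        $pattern = '~' . preg_quote($label, '~') . ':.*?>(.*?)<~s';
        $found[$label] = preg_match($pattern, $input, $m) ? trim($m[1]) : '';
    }
    return $found;
}

$input = '<div id="xxx" class="some1">
<h1>This is the time</h1>
<div class="ti12">
<div class="sss"></div>
<div class="sss">
<span class="hhh">
<div class="sded">
City:
<span class="sh">CCC</span>
</div>
</span>
</div>
</div>
<div class="pp12"></div>
</div>';

$fields = extractFields($input, ['City', 'Name', 'Address', 'DOB']);
foreach ($fields as $label => $value) {
    echo $label, ' => ', $value, "\n";
}
```

Like the single-pattern version, this breaks as soon as the markup between the label and its value changes.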
One alternative to manual traversal is querying for the data instead. With DOMDocument this is commonly done with XPath, a language dedicated to exactly that job.
The library you use does not support XPath; PHP, however, supports it out of the box. PHP also supports DOMDocument out of the box, so I think I can safely suggest it as an alternative.
So in your case you first look for the div with the ID:
//div[@id="xxx"]
and then for a div somewhere inside it:
//div
and then for any element inside that, whatever its name (children):
//*
but those need to match a specific pattern: here, the element must contain a span whose class attribute is "sh", that must be the first span in there, and there must be some text before the span:
[
span[@class="sh"]
and span = span[@class="sh"]
and span/preceding-sibling::text()
]
and of that child you want the first text node child:
/text()[1]
So just to see this at a glance:
//div[@id="xxx"]
//div
//*[
span[@class="sh"]
and span = span[@class="sh"]
and span/preceding-sibling::text()
]
/text()[1]
This will give you the label string, such as "City:", and so on. The next sibling (the span) then contains the value.
All you've got to do is wrap that into code (here I load a string, but you can also load an HTML file with loadHTMLFile(); check the DOMDocument documentation for all the gory details):
$dom = new DOMDocument();
$dom->loadHTML($string);
$xp = new DOMXPath($dom);
foreach ($xp->query('
//div[@id="xxx"]
//div
//*[
span[@class="sh"]
and span = span[@class="sh"]
and span/preceding-sibling::text()
]
/text()[1]
'
) as $node
) {
$name = trim($node->nodeValue);
$value = trim($node->nextSibling->nodeValue);
printf("%s %s\n", $name, $value);
}
The output with your example HTML:
City: CCC
I hope this motivates you to look into DOMDocument and helps you explore the power of XPath.
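To get from the printed pairs to the array the question asks for, the XPath result can be collected per label. This is a sketch under assumptions: `collectFields()` is a hypothetical helper, the sample HTML is a simplified two-field variant of the structure in the question, and a label counts as a match only after its trailing colon is stripped. Labels that do not occur stay blank.

```php
<?php
// Hypothetical helper: map each wanted label to its value, or '' when absent.
function collectFields(string $html, array $labels): array
{
    libxml_use_internal_errors(true);   // keep parser warnings out of the output
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();

    $xp    = new DOMXPath($dom);
    $found = array_fill_keys($labels, '');

    // Same query as in the answer above, written on one line.
    $query = '//div[@id="xxx"]//div//*[span[@class="sh"]'
           . ' and span = span[@class="sh"]'
           . ' and span/preceding-sibling::text()]/text()[1]';

    foreach ($xp->query($query) as $node) {
        $name = rtrim(trim($node->nodeValue), ':');   // "City: " becomes "City"
        if (array_key_exists($name, $found) && $node->nextSibling !== null) {
            $found[$name] = trim($node->nextSibling->nodeValue);
        }
    }
    return $found;
}

// Simplified sample (an assumption, not the asker's real page).
$html = '<div id="xxx" class="some1">
  <h1>This is the time</h1>
  <div class="ti12">
    <div class="sss">
      <div class="sded">City: <span class="sh">CCC</span></div>
      <div class="sded">Name: <span class="sh">NNN</span></div>
    </div>
  </div>
</div>';

$result = collectFields($html, ['City', 'Name', 'Address', 'DOB']);
foreach ($result as $label => $value) {
    echo $label, ' => ', $value, "\n";
}
```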
It took me a while to get it right, but this code traverses the entire DOM with Simple HTML DOM. I hope someone can use it.
<?php
include "simple_html_dom.php"; // the Simple HTML DOM library, as in the question
$html = new simple_html_dom();
$html->load('<html><body>'.$text.'</body></html>');
if(method_exists($html,"childNodes")){
if($html->find('html')) {
//IF NOT OK, THROW ERROR
}}
$e=$html->find('body',0);
$p=$e->childNodes(0);
if(!$p){
//BODY HAS NO CHILDNODES, THROW ERROR
}
$loop=true;
//SAFEGUARD, PREVENTS INDEFINITE LOOPS
$i=$j=0;
$i_max=500;
$j_max=500;
while($loop==true){
//SAFEGUARD, PREVENTS INDEFINITE LOOPS
$i++;if($i>$i_max){$loop=false;break;}
//TEST IF NODE HAS CHILDREN
$p=$e->childNodes(0);
//NO CHILDREN
if(!$p){
//DO SOMETHING WITH NODE
clean_dom($e->outertext);
//TEST IF NODE HAS SIBLING
$p=$e->next_sibling();
if(!$p){
//NO SIBLING
//TEST THE PARENT, LOOP TILL WE FIND A SIBLING
$j=0;$sib_loop=true;
while($sib_loop==true){
//SAFEGUARD, PREVENTS INDEFINITE LOOPS
$j++;if($j>$j_max){$sib_loop=false;break;}
//TEST IF THERE IS A PARENT
$e=$e->parent();
//NO PARENT, WE'VE REACHED THE TOP AGAIN
if(!$e){
echo'***THE END***';
$sib_loop=$loop=false;break;}
//ELSE, TEST IF PARENT HAS SIBLING
$p=$e->next_sibling();
//THERE IS A SIBLING, GO THERE
if($p){
//DO SOMETHING WITH THIS NODE
clean_dom($e->outertext);
$e=$e->next_sibling();
$sib_loop=false;break;
}
else{
$ret=clean_dom($e->outertext,$all);
$e->outertext=$ret;
}
}
}
else{
//GOTO SIBLING
$e=$e->next_sibling();
}
}
else{
//THERE IS A CHILD
$e=$e->childNodes(0);
}
}
$text=$html->save();
$html->clear();
unset($html);
function clean_dom($e){
//DO SOMETHING HERE
}
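For comparison, the same depth-first walk can be written recursively. Note that this sketch swaps in PHP's built-in DOMDocument instead of Simple HTML DOM, so that it runs without any extra library; `walkDom()` and the sample markup are hypothetical and only illustrate the traversal order.

```php
<?php
// Hypothetical helper: visit $node, then each element child, depth-first.
function walkDom(DOMNode $node, callable $visit, int $depth = 0): void
{
    $visit($node, $depth);
    foreach ($node->childNodes as $child) {
        // Recurse into elements only; text and comment nodes are skipped.
        if ($child->nodeType === XML_ELEMENT_NODE) {
            walkDom($child, $visit, $depth + 1);
        }
    }
}

libxml_use_internal_errors(true);   // keep parser warnings out of the output
$dom = new DOMDocument();
$dom->loadHTML('<div id="xxx"><h1>This is the time</h1><div class="ti12"><span class="sh">CCC</span></div></div>');
libxml_clear_errors();

$tags = [];
$body = $dom->getElementsByTagName('body')->item(0);
walkDom($body, function (DOMNode $el, int $depth) use (&$tags) {
    // "Do something with the node": here, record its name indented by depth.
    $tags[] = str_repeat('  ', $depth) . $el->nodeName;
});
echo implode("\n", $tags), "\n";
```

Recursion removes the need for the loop counters above, at the cost of one stack frame per nesting level, which is rarely a concern for ordinary HTML pages.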