
Scraping using DomXPath

I am using PHP's DomXPath to scrape some websites.

I am currently following this tutorial to traverse XPaths.

I am currently scraping this site, getting the character names and Steam IDs (the mess of an XPath below is what gets one Steam ID).

My question is: there are multiple Steam IDs and character names, but the XPath that I painstakingly created only gets one.

How should I scrape all of the Steam IDs instead of just one of them?

$xpath = new DomXPath($this->ourTeamHTML);

/* Set HTTP response header to plain text for debugging output */
header("Content-type: text/plain");

$steamName = $xpath->query('//*[@id="wrapper"]/section/div/div[1]/div[2]/div[2]/div[1]/div/div/div[1]/div/div[1]/h5/b');
/* Traverse the DOMNodeList object to output each DomNode's nodeValue */
foreach ($steamName as $node) {
    echo "Steam Name: " . $node->nodeValue . "\n";
}

Your XPath is too verbose. With a full path and element indexes, it is hard to read and tends to break when the page source changes even slightly. The positional indexes (such as div[1]) also restrict the match to a single branch of the page, which is likely why you only get one result. Try the following simpler XPath:

//*[@id="wrapper"]//div[@class='col-md-12']//h5/b

It worked for me to get all Steam IDs and character names (32 elements in total) from the linked page (tested using Firefox's FirePath add-on).
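For reference, here is a minimal sketch of how the shorter XPath might be used from PHP. The helper name `extractSteamNames` and the sample markup are made up for illustration. The real page's structure is not reproduced here, so treat the HTML as an assumption, not as what the site actually serves.

```php
<?php
// Hypothetical helper: returns the text of every <b> matched by the relative XPath.
function extractSteamNames(string $html): array
{
    $doc = new DOMDocument();
    // Real-world HTML is rarely valid; collect parser warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    // "//" matches at any depth, so no positional indexes are needed.
    $nodes = $xpath->query('//*[@id="wrapper"]//div[@class="col-md-12"]//h5/b');

    $names = [];
    foreach ($nodes as $node) {
        $names[] = trim($node->nodeValue);
    }
    return $names;
}

// Made-up sample markup standing in for the real page (an assumption).
$sampleHtml = <<<'HTML'
<html><body><div id="wrapper"><div>
  <div class="col-md-12">
    <div><h5><b>PlayerOne</b></h5></div>
    <div><h5><b>STEAM_0:1:111</b></h5></div>
  </div>
  <div class="col-md-12">
    <div><h5><b>PlayerTwo</b></h5></div>
  </div>
</div></div></body></html>
HTML;

foreach (extractSteamNames($sampleHtml) as $name) {
    echo "Steam Name: " . $name . "\n";
}
```

Note that `@class="col-md-12"` only matches when the attribute is exactly that string. If the real elements carry additional classes, a test such as `contains(concat(' ', normalize-space(@class), ' '), ' col-md-12 ')` may be needed instead.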
