HTML parsers not finding table element on a web page

I'm trying to get to this element: //*[@id="table-matches"]/table on this page: http://www.oddsportal.com/matches/soccer/20140221/

I want to get the table that contains the matches. The table starts under the "Kick off time" tab. The element I'm looking for is 'table class=" table-main"', and it is inside the element 'div id="table-matches" style="display: block;"'.

I tried getting this document with HtmlAgilityPack in C#. I can find the 'div' element, but it reports no child nodes (there should be a table child node). If I try to get the table, the result is null. Here is the code:

var webGet = new HtmlWeb();
var document = webGet.Load("http://www.oddsportal.com/matches/soccer/20140221/");
var div = document.DocumentNode.SelectNodes("//div[@id='table-matches']");
var table = document.DocumentNode.SelectNodes("//*[@id='table-matches']/table");
var table2 = document.DocumentNode.SelectNodes("//table");

So the div variable contains the div element (but it has no child nodes) and the table variable is null. The table2 variable contains 4 elements, but none of them is the desired table.

I figured there was a problem with HtmlAgilityPack and tried to get the whole web page with Python. I saved the whole HTML document to a text file and searched it: I can find the div element, but it is empty. There is no table element inside. Why is that? Why can I see the table element in Chrome or Internet Explorer, but when I download the HTML there is no such element?

Here is the Python code:

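import urllib  # Python 2; in Python 3 use urllib.request.urlopen instead

# Download the HTML exactly as the server returns it (no JavaScript is run).
url = urllib.urlopen("http://www.oddsportal.com/matches/")
document = url.read()
# Save it to a text file; the with block closes the file afterwards.
with open("htmlOddsPortal.txt", "w") as htmlOddsPortal:
    htmlOddsPortal.write(document)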

Here is the element in the final text document:

<div id="table-matches"></div>                    <!--  END PAGE BODY -->

The table is loaded with JavaScript (probably via AJAX), so you won't get it with webGet.Load(). You only get the HTML that the server returns in its response.

You can check this in Chrome: open the developer tools (F12), click on Settings, check Disable JavaScript, then refresh the page. You will see blank content.
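The same check can be sketched in code. The example below is only an illustration in Python 3 using the standard library's html.parser: it counts table tags nested inside the div with id "table-matches". The first input is the empty div quoted in the question; the second is hypothetical markup standing in for what the browser holds after the page's scripts have run. The class and function names are made up for this example.

```python
from html.parser import HTMLParser


class TableInDivFinder(HTMLParser):
    """Counts <table> tags that appear inside <div id="table-matches">."""

    def __init__(self):
        super().__init__()
        self.div_depth = 0     # <div> nesting depth inside the target div; 0 means outside it
        self.tables_found = 0

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            if self.div_depth > 0:
                self.div_depth += 1          # a nested div inside the target
            elif dict(attrs).get("id") == "table-matches":
                self.div_depth = 1           # entered the target div
        elif tag == "table" and self.div_depth > 0:
            self.tables_found += 1

    def handle_endtag(self, tag):
        if tag == "div" and self.div_depth > 0:
            self.div_depth -= 1


def count_tables(html):
    """Return how many tables sit inside the table-matches div."""
    finder = TableInDivFinder()
    finder.feed(html)
    return finder.tables_found


# What the server sends (the element quoted in the question).
static_html = '<div id="table-matches"></div>'
# Hypothetical markup after the browser has executed the page's JavaScript.
rendered_html = (
    '<div id="table-matches"><table class=" table-main">'
    "<tr><td>match</td></tr></table></div>"
)

print(count_tables(static_html))    # 0
print(count_tables(rendered_html))  # 1
```

Any parser that only reads the server response, HtmlAgilityPack included, is in the position of the first call: the div is there, but the table has not been created yet.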

I had the same problem, but I was working in Java, and I used HtmlUnit to solve it. There is probably a similar tool for C#, or you can check whether HtmlAgilityPack can make the asynchronous call itself, or use something like the WebBrowser component.
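For the Python attempt in the question, one comparable option is to drive a real browser. This is not what the answer above used: it is a rough sketch with Selenium (a third-party package, assumed to be installed along with Chrome), and the function name, selector and timeout are choices made for this example. It also assumes the page still builds the table inside the div with id "table-matches", which may no longer be true.

```python
def fetch_rendered_html(url, css_selector, timeout_seconds=15):
    """Load url in headless Chrome, wait for css_selector, return the rendered HTML."""
    # Imported here so this file can still be loaded where Selenium is not installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    options = webdriver.ChromeOptions()
    options.add_argument("--headless=new")   # run Chrome without a visible window
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        # Block until JavaScript has inserted the element, or raise TimeoutException.
        WebDriverWait(driver, timeout_seconds).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, css_selector))
        )
        return driver.page_source            # the DOM serialized after scripts ran
    finally:
        driver.quit()                        # always close the browser
```

Calling fetch_rendered_html("http://www.oddsportal.com/matches/soccer/20140221/", "#table-matches table") should return markup that includes the table, which can then be handed to any HTML parser. The explicit wait matters because the table only appears some time after the initial page load.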
