
Get value between HTML tags with XPath and HtmlAgilityPack

So far I am trying to retrieve the text between HTML tags on a certain website.

Say, for instance, I need to extract the text between these span tags. How would I go about that? I am receiving an error stating "Object reference not set to an instance of an object". Here is the HTML:

There is also HTML code before this portion; I don't know if that makes a difference.

<div class="thumbnail-details">
<ul>
    <li> … </li>
    <li class="product-title">
        <span class="thumbnail-details-grey">The Blaster Portable Wireless Speaker in Black</span>
    </li>
    <li> … </li>
</ul>
</div>

So far my C# code is:

    HtmlWeb hw = new HtmlWeb();
    HtmlAgilityPack.HtmlDocument htmlDoc = hw.Load(@"http://www.karmaloop.com/Browse.htm#Pgroup=1");
    if (htmlDoc.DocumentNode != null)
    {
        foreach (HtmlNode text in htmlDoc.DocumentNode.SelectNodes("//span[@class='thumbnail-details-grey']/text()"))
        {
            Console.WriteLine(text.InnerText);
        }
    }

Can I get some help here? I want to extract "The Blaster Portable Wireless Speaker in Black".

I'd recommend using CsQuery ( https://www.nuget.org/packages/CsQuery/1.3.4 ); then it's as simple as:

var doc = CQ.CreateFromUrl(@"http://www.karmaloop.com/Browse.htm");
var nodes = doc.Find("span.thumbnail-details-grey");
foreach(var node in nodes)
    Console.WriteLine(node.InnerText);

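Your code works just fine, but you have to load the right page. The page you are loading uses an ajax request to load the results you see in your browser, so the HTML that HtmlWeb downloads contains no matching span elements. SelectNodes returns null when nothing matches, and iterating over that null in the foreach is what raises the "object reference" error.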

So instead of the URL you are currently using, you have to use:

HtmlDocument htmlDoc = hw.Load(@"http://www.karmaloop.com/Browse?Pgroup=1&ajax=true&version=2");

Then your code works. I'm still looking for the place where this request gets put together...

But the query looks rather easy to guess. For example, the page http://www.karmaloop.com/Browse.htm#Pdept=11&PageSize=30&Pgroup=1 requests the URL http://www.karmaloop.com/Browse?Pdept=11&PageSize=30&Pgroup=1&ajax=true&version=2 . So all you have to do is take your URL, turn the part after the # into the query string of Browse, and append ajax=true&version=2 .
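The URL rewrite described above could be sketched roughly as follows. This is only an illustration based on the two example URLs in this answer: the class and method names (`AjaxUrlBuilder`, `FromBrowserUrl`) are made up, and the `ajax=true&version=2` suffix and the `Browse.htm` to `Browse` mapping are inferred from those examples, not from any documented API of the site.

```csharp
using System;

public static class AjaxUrlBuilder
{
    // Hypothetical helper: turns the browser URL (with its "#..." fragment)
    // into the ajax URL by moving the fragment into the query string.
    public static string FromBrowserUrl(string browserUrl)
    {
        int hash = browserUrl.IndexOf('#');
        string page = hash >= 0 ? browserUrl.Substring(0, hash) : browserUrl;
        string fragment = hash >= 0 ? browserUrl.Substring(hash + 1) : "";

        // Assumption from the examples: "Browse.htm" becomes "Browse".
        if (page.EndsWith(".htm", StringComparison.OrdinalIgnoreCase))
            page = page.Substring(0, page.Length - ".htm".Length);

        // Assumption from the examples: the request always ends with
        // "ajax=true&version=2".
        string query = fragment.Length > 0 ? fragment + "&" : "";
        return page + "?" + query + "ajax=true&version=2";
    }
}
```

The result can then be passed to `hw.Load(...)` in place of the original URL. If the site changes how it builds the request, the suffix and the path mapping would need to be adjusted.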

