
Extract Specific Text from an HTML Page

The HTML page looks like this:

<tr>
<th rowspan="4" scope="row">General</th>
<td class="ttl"><a href="network-bands.php3">2G Network</a></td>
<td class="nfo">GSM 850 / 900 / 1800 / 1900 </td>
</tr><tr>
<td class="ttl"><a href="network-bands.php3">3G Network</a></td>
<td class="nfo">HSDPA 900 / 1900 / 2100 </td>
</tr>

To do that, I am trying to use:

var text = document.getElementsByClassName("nfo")[0].innerHTML;

(provided by Alex)

But I am getting this error:

Error 2 The name 'document' does not exist in the current context C:\Users\Nabi Javid\Documents\Visual Studio 2008\Projects\WpfApplication2\WpfApplication2\Window1.xaml.cs 30 22 WpfApplication2

Am I missing some library or something?

Currently my code looks like this:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Windows;
using System.Windows.Controls;
using System.Windows.Data;
using System.Windows.Documents;
using System.Windows.Input;
using System.Windows.Media;
using System.Windows.Media.Imaging;
using System.Windows.Navigation;
using System.Windows.Shapes;

namespace WpfApplication1
{
    /// <summary>
    /// Interaction logic for Window1.xaml
    /// </summary>
    public partial class Window1 : Window
    {
        public Window1()
        {
            InitializeComponent();
        }

        private void button1_Click(object sender, RoutedEventArgs e)
        {
            HtmlAgilityPack.HtmlDocument htmlDoc = new HtmlAgilityPack.HtmlDocument();
            htmlDoc.Load("nokia_c5_03-3578.html");
            var text = document.getElementsByClassName("nfo")[0].innerHTML;

        } 
    }

}

You are mixing C# code with JavaScript code.

Instead of this:

var text = document.getElementsByClassName("nfo")[0].innerHTML;

write this:

var text = htmlDoc.DocumentNode.SelectNodes("//td[@class='nfo']")[0].InnerHtml;

To keep it simple, I have left out error checking (for example, SelectNodes returns null when no node matches, so indexing [0] would then throw).
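If you want the null check as well, here is a minimal sketch. The class name NfoExtractor and the method FirstNfoText are made up for illustration, and it parses a string with LoadHtml instead of loading your file so that it is easy to try on its own. It assumes the HtmlAgilityPack assembly is referenced in the project.

```csharp
using System;
using HtmlAgilityPack;

// Hypothetical helper, not part of HtmlAgilityPack.
public static class NfoExtractor
{
    // Returns the trimmed text of the first <td class="nfo"> cell,
    // or null when the page contains no such cell.
    public static string FirstNfoText(string html)
    {
        HtmlDocument htmlDoc = new HtmlDocument();
        htmlDoc.LoadHtml(html);

        // SelectNodes returns null (not an empty collection) when nothing matches.
        HtmlNodeCollection cells =
            htmlDoc.DocumentNode.SelectNodes("//td[@class='nfo']");
        if (cells == null || cells.Count == 0)
        {
            return null;
        }

        // InnerText drops any nested markup; InnerHtml would keep it.
        return cells[0].InnerText.Trim();
    }
}
```

In the click handler you could then check the result for null before using it, instead of letting the indexer throw.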

I'm not very deep into .NET, but it looks like you are trying to mix JavaScript code

var text = document.getElementsByClassName("nfo")[0].innerHTML;

with your .NET code...?

You can get elements by class name using the following method, which also matches elements that have several classes defined in one class attribute:

private HtmlNodeCollection GetElementsByClassName(HtmlDocument htmlDocument, string className)
{
    string xpath =
        String.Format(
            "//*[contains(concat(' ', normalize-space(@class), ' '), ' {0} ')]",
            className);
    return htmlDocument.DocumentNode.SelectNodes(xpath);
}
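To see why the longer XPath matters, here is a small sketch that counts matches with the same expression. The class ClassNameDemo and the method CountByClass are hypothetical names used only for this illustration, and the code assumes HtmlAgilityPack is referenced.

```csharp
using System;
using HtmlAgilityPack;

// Hypothetical demo class, not part of HtmlAgilityPack.
public static class ClassNameDemo
{
    // Counts the elements whose class attribute contains className
    // as a whole, space-separated token.
    public static int CountByClass(string html, string className)
    {
        HtmlDocument doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Padding both sides with spaces stops "nfo" from matching "info".
        string xpath = String.Format(
            "//*[contains(concat(' ', normalize-space(@class), ' '), ' {0} ')]",
            className);

        // SelectNodes returns null when there is no match.
        HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes(xpath);
        return nodes == null ? 0 : nodes.Count;
    }
}
```

With this, a cell written as class="nfo wide" is still found when searching for "nfo", while the simpler //td[@class='nfo'] only matches when the attribute is exactly "nfo".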

You must use the htmlDoc variable to call methods in your case. By the way, the HtmlDocument class does not have a method with that name. Try to see if you can find another match for your needs in this list.

As the error says, the document variable does not exist in your code.

Do you want

var text = htmlDoc.getElementsByClassName("nfo")[0].innerHTML;

? I'm not familiar with HTML Agility Pack, but that would seem to make sense.
