How do I find the text within a div in the source of a web page using C#

How can I get the HTML code from a website, save it, and find some text by using a LINQ expression?

I'm using the following code to get the source of a web page:


public static String code(string Url)
{
    HttpWebRequest myRequest = (HttpWebRequest)WebRequest.Create(Url);
    myRequest.Method = "GET";
    WebResponse myResponse = myRequest.GetResponse();
    StreamReader sr = new StreamReader(myResponse.GetResponseStream(),
        System.Text.Encoding.UTF8);
    string result = sr.ReadToEnd();
    sr.Close();
    myResponse.Close();
    
    return result;
}

How do I find the text within a div in the source of the web page?

You can use the WebClient class to simplify the task:

using System.Net;

using (WebClient client = new WebClient())
{
    string htmlCode = client.DownloadString("http://somesite.com/default.html");
}

To get the HTML code from a website, you can use code like this:

string urlAddress = "http://google.com";

HttpWebRequest request = (HttpWebRequest)WebRequest.Create(urlAddress);
HttpWebResponse response = (HttpWebResponse)request.GetResponse();

if (response.StatusCode == HttpStatusCode.OK)
{
    Stream receiveStream = response.GetResponseStream();
    StreamReader readStream = null;
    if (String.IsNullOrWhiteSpace(response.CharacterSet))
        readStream = new StreamReader(receiveStream);
    else
        readStream = new StreamReader(receiveStream,
            Encoding.GetEncoding(response.CharacterSet));
    string data = readStream.ReadToEnd();
    response.Close();
    readStream.Close();
}

This gives you the HTML returned by the website. Finding text in it with LINQ is not that easy, though. A regular expression might seem like a better fit, but regular expressions do not play well with HTML.

The best tool for this is HtmlAgilityPack. Depending on how you need to select elements from the retrieved page, you can also look into Fizzler or CsQuery. Using LINQ or regular expressions on the raw markup is just too error prone, especially when the HTML is malformed, is missing closing tags, has nested child elements, and so on.

You need to load the page into an HtmlDocument object and then select the required element.

// Call the page and get the generated HTML
var doc = new HtmlAgilityPack.HtmlDocument();
HtmlAgilityPack.HtmlNode.ElementsFlags["br"] = HtmlAgilityPack.HtmlElementFlag.Empty;
doc.OptionWriteEmptyNodes = true;

try
{
    var webRequest = HttpWebRequest.Create(pageUrl);
    Stream stream = webRequest.GetResponse().GetResponseStream();
    doc.Load(stream);
    stream.Close();
}
catch (System.UriFormatException uex)
{
    // Log is assumed to be an application logger (for example log4net)
    Log.Fatal("There was an error in the format of the url: " + pageUrl, uex);
    throw;
}
catch (System.Net.WebException wex)
{
    Log.Fatal("There was an error connecting to the url: " + pageUrl, wex);
    throw;
}

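// Get the div by id and read its inner HTML (use InnerText for the text without tags).
// SelectSingleNode returns null when no div matches, so check before dereferencing in real code.
string testDivSelector = "//div[@id='test']";
var divString = doc.DocumentNode.SelectSingleNode(testDivSelector).InnerHtml;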
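
Since the question asks for the text inside the div rather than its markup, here is a minimal sketch of a null-safe variant. It parses an inline HTML string instead of a live page so it can be tried without a network call. The class name DivTextExample and the helper GetDivText are hypothetical names chosen for illustration, and the sketch assumes the HtmlAgilityPack NuGet package is installed.

```csharp
using System;
using HtmlAgilityPack; // NuGet package: HtmlAgilityPack

public static class DivTextExample
{
    // Hypothetical helper: returns the text of the div with the given id,
    // or null when no such div exists.
    // Assumption: divId contains no single quote, since it is pasted into the XPath.
    public static string GetDivText(string html, string divId)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectSingleNode returns null when nothing matches the XPath.
        HtmlNode div = doc.DocumentNode.SelectSingleNode("//div[@id='" + divId + "']");
        if (div == null)
        {
            return null;
        }

        // InnerText drops the tags; DeEntitize turns entities such as &amp;
        // back into plain characters.
        return HtmlEntity.DeEntitize(div.InnerText).Trim();
    }

    public static void Main()
    {
        string html = "<html><body><div id='test'>Hello <b>world</b> &amp; more</div></body></html>";

        Console.WriteLine(GetDivText(html, "test"));
        Console.WriteLine(GetDivText(html, "missing") ?? "(not found)");
    }
}
```

To use it on a downloaded page, pass the string returned by any of the download snippets in this thread as the html argument.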

[EDIT] Actually, scrap that. The simplest method is to use FizzlerEx, an updated jQuery/CSS3-selectors implementation of the original Fizzler project.

Code sample directly from their site:

using HtmlAgilityPack;
using Fizzler.Systems.HtmlAgilityPack;

//get the page
var web = new HtmlWeb();
var document = web.Load("http://example.com/page.html");
var page = document.DocumentNode;

//loop through all div tags with item css class
foreach(var item in page.QuerySelectorAll("div.item"))
{
    var title = item.QuerySelector("h3:not(.share)").InnerText;
    var date = DateTime.Parse(item.QuerySelector("span:eq(2)").InnerText);
    var description = item.QuerySelector("span:has(b)").InnerHtml;
}

I don't think it can get any simpler than that.

I am using AngleSharp and have been very satisfied with it.

Here is a simple example of how to fetch a page:

var config = Configuration.Default.WithDefaultLoader();
var document = await BrowsingContext.New(config).OpenAsync("https://www.google.com");

Now you have the web page in the document variable, and you can easily query it with LINQ or other methods. For example, to get a string value from an HTML table:

var someStringValue = document.All.Where(m =>
        m.LocalName == "td" &&
        m.HasAttribute("class") &&
        m.GetAttribute("class").Contains("pid-1-bid")
    ).ElementAt(0).TextContent.ToString();

To use CSS selectors, please see the AngleSharp examples.
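
As a rough illustration of the CSS selector route, the sketch below reads the text of a div with QuerySelector. It parses an inline HTML string through OpenAsync with a virtual response, so no loader or network access is needed. The class name AngleSharpDivExample and the helper GetDivTextAsync are hypothetical, and the sketch assumes the AngleSharp NuGet package is installed.

```csharp
using System;
using System.Threading.Tasks;
using AngleSharp;     // NuGet package: AngleSharp
using AngleSharp.Dom;

public static class AngleSharpDivExample
{
    // Hypothetical helper: returns the trimmed text of the first element
    // matching the CSS selector, or null when nothing matches.
    public static async Task<string> GetDivTextAsync(string html, string cssSelector)
    {
        IBrowsingContext context = BrowsingContext.New(Configuration.Default);

        // Feed the HTML string to the parser as a virtual response.
        IDocument document = await context.OpenAsync(req => req.Content(html));

        // QuerySelector returns null when no element matches.
        IElement element = document.QuerySelector(cssSelector);
        return element?.TextContent.Trim();
    }

    public static async Task Main()
    {
        string html = "<html><body><div id='test'>Hello <b>world</b></div></body></html>";

        Console.WriteLine(await GetDivTextAsync(html, "div#test"));
        Console.WriteLine(await GetDivTextAsync(html, "div.missing") ?? "(not found)");
    }
}
```

For a live page, the document returned by the OpenAsync call shown earlier in this answer can be queried with QuerySelector in the same way.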

Here's an example of using the HttpWebRequest class to fetch a URL:

private void button1_Click(object sender, EventArgs e)
{ 
    String url = TextBox_url.Text;
    HttpWebRequest request = (HttpWebRequest) WebRequest.Create(url); 
    HttpWebResponse response = (HttpWebResponse) request.GetResponse(); 
    StreamReader sr = new StreamReader(response.GetResponseStream()); 
    richTextBox1.Text = sr.ReadToEnd(); 
    sr.Close(); 
} 

You can use WebClient to download the HTML for any URL. Once you have the HTML, you can use a third-party library like HtmlAgilityPack to look up values in it, as in the code below:

public static string GetInnerHtmlFromDiv(string url)
{
    string html;
    using (var wc = new WebClient())
    {
        // Download the page source as a string
        html = wc.DownloadString(url);
    }

    var doc = new HtmlAgilityPack.HtmlDocument();
    doc.LoadHtml(html);

    // Replace the placeholder with the id of the div you are looking for
    HtmlAgilityPack.HtmlNode element = doc.DocumentNode.SelectSingleNode("//div[@id='<div id here>']");

    // SelectSingleNode returns null when no node matches
    return element != null ? element.InnerHtml : null;
}
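
Because the question asks specifically about LINQ, it may help to see that HtmlAgilityPack nodes can also be queried with ordinary LINQ operators instead of XPath. The following is only a sketch: the class name LinqDivExample and the helper GetDivTextsByClass are hypothetical, the HTML is an inline sample, and the HtmlAgilityPack NuGet package is assumed to be installed.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack; // NuGet package: HtmlAgilityPack

public static class LinqDivExample
{
    // Hypothetical helper: returns the text of every div whose class
    // attribute contains the given class name.
    public static List<string> GetDivTextsByClass(string html, string cssClass)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        return doc.DocumentNode
            .Descendants("div")
            // The class attribute may hold several space-separated names.
            .Where(d => d.GetAttributeValue("class", "")
                         .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries)
                         .Contains(cssClass))
            .Select(d => HtmlEntity.DeEntitize(d.InnerText).Trim())
            .ToList();
    }

    public static void Main()
    {
        string html =
            "<div class='item first'>Alpha</div>" +
            "<div class='other'>Beta</div>" +
            "<div class='item'>Gamma</div>";

        foreach (string text in GetDivTextsByClass(html, "item"))
        {
            Console.WriteLine(text);
        }
    }
}
```

The LINQ query runs over the parsed node tree rather than the raw string, which avoids the fragility of matching markup with regular expressions.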

Try this solution. It works fine.

try
{
    String url = textBox1.Text;
    HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
    HttpWebResponse response = (HttpWebResponse)request.GetResponse();
    StreamReader sr = new StreamReader(response.GetResponseStream());
    HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
    doc.Load(sr);

    // SelectNodes returns null when the page has no <a> elements
    var aTags = doc.DocumentNode.SelectNodes("//a");
    if (aTags != null)
    {
        foreach (var aTag in aTags)
        {
            richTextBox1.Text += aTag.InnerHtml + "\n";
        }
    }
    sr.Close();
}
catch (Exception ex)
{
    MessageBox.Show("Failed to retrieve related keywords. " + ex);
}
