
How to extract one or more words from an HTML page in C#

I'm trying to extract one word from an HTML page. There are two text boxes (1 and 2): I enter a Stack Overflow question ID in textBox1 and want the question's "asked" value to appear in textBox2. For example, entering 36 in textBox1 should give "9 years, 4 months ago" in textBox2.

WebClient webpage = new WebClient();
String html = webpage.DownloadString("https://stackoverflow.com/questions/" + textBox1.Text);
MatchCollection match = Regex.Matches(html, FILTERHERE, RegexOptions.Singleline);

The problem is that I don't know what pattern to use as the filter (FILTERHERE). Also, how can I send the output to textBox2?

In a Windows Forms application, the WebBrowser control can be used; it wraps the mshtml library and exposes a managed HTML DOM. Here is an example of a function that retrieves the asked text:

private static string GetAskedText(HtmlDocument doc)
{
    if (doc == null)
        return "document-null";
    IEnumerable<mshtml.HTMLDivElement> divs = doc.GetElementsByTagName("div")
        .OfType<HtmlElement>()
        .Select(e => e.DomElement as mshtml.HTMLDivElement);
    foreach (var div in divs)
    {
        if (string.IsNullOrWhiteSpace(div?.className))
            continue;
        if (div.className.Trim().ToLower() != "user-info")
            continue;
        var spans = div.getElementsByTagName("span").OfType<mshtml.HTMLSpanElement>();
        foreach (var span in spans)
        {
            if (string.IsNullOrWhiteSpace(span?.className))
                continue;
            if (span.className == "relativetime")
            {
                return span.innerText;
            }
        }
    }

    return "not-found";
}

A complete example as a Windows Forms application can be downloaded from my Dropbox.
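For the Regex route attempted in the question, a minimal sketch of one possible FILTERHERE pattern follows. It assumes the asked time sits in a span whose class attribute is exactly relativetime, the same element GetAskedText looks for; the class name AskedRegex and the method ExtractAsked are hypothetical names used only for illustration, and the pattern will stop matching if the page markup changes.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

public static class AskedRegex
{
    // Assumption: the page contains <span ... class="relativetime" ...>TEXT</span>.
    // The named group "asked" captures TEXT (lazily, up to the first </span>).
    private static readonly Regex RelativeTime = new Regex(
        @"<span[^>]*\bclass=""relativetime""[^>]*>(?<asked>.*?)</span>",
        RegexOptions.Singleline | RegexOptions.IgnoreCase);

    // Returns the text of the first matching span, or "not-found" if there is none.
    public static string ExtractAsked(string html)
    {
        Match m = RelativeTime.Match(html);
        if (!m.Success)
            return "not-found";
        // Decode entities such as &amp; and trim surrounding whitespace.
        return WebUtility.HtmlDecode(m.Groups["asked"].Value).Trim();
    }
}
```

With the html string downloaded as in the question, the result can be shown in the second text box with textBox2.Text = AskedRegex.ExtractAsked(html);. A regular expression is the most fragile of the options here, so the DOM-based approaches are usually the safer choice.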


With HtmlAgilityPack:

string url = "https://stackoverflow.com/questions/";
var web = new HtmlWeb();
var doc = web.Load(url + textBox1.Text); //the text is "36"
var tag = doc.DocumentNode.SelectSingleNode("//*[@id='qinfo']//td[./p[@class='label-key' and text()='asked']]/following-sibling::td//b");
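// SelectSingleNode returns null when nothing matches, so guard against it
textBox2.Text = tag != null ? tag.InnerText : "not-found";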

If you don't know XPath, there are browser extensions for Chrome and Firefox that get the XPath of any HTML tag for you (I personally write them by hand to make them less sensitive to changes in the page structure).
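To illustrate why a hand-written XPath can be less sensitive to page changes, here is a small sketch. It uses System.Xml's XmlDocument on a tiny well-formed snippet instead of HtmlAgilityPack, so it runs without extra packages; the XPath syntax is the same. The snippet markup, the extra "active" row, the positional path, and the XPathSketch class are all made up for illustration and are not the real page structure.

```csharp
using System;
using System.Xml;

public static class XPathSketch
{
    // Hand-written expression from the answer: anchored on an id and a label text.
    public const string Robust =
        "//*[@id='qinfo']//td[./p[@class='label-key' and text()='asked']]/following-sibling::td//b";

    // The kind of purely positional path a tool might generate (hypothetical).
    public const string Positional = "/table/tr[1]/td[2]/p/b";

    // Hypothetical markup containing only the "asked" row.
    public const string Original =
        "<table id='qinfo'>" +
        "<tr><td><p class='label-key'>asked</p></td>" +
        "<td><p class='label-key'><b>9 years, 4 months ago</b></p></td></tr>" +
        "</table>";

    // Same markup after the page gains a new first row (also hypothetical).
    public const string Changed =
        "<table id='qinfo'>" +
        "<tr><td><p class='label-key'>active</p></td>" +
        "<td><p class='label-key'><b>today</b></p></td></tr>" +
        "<tr><td><p class='label-key'>asked</p></td>" +
        "<td><p class='label-key'><b>9 years, 4 months ago</b></p></td></tr>" +
        "</table>";

    // Returns the text of the first node matching xpath, or null if none matches.
    public static string Select(string xml, string xpath)
    {
        var doc = new XmlDocument();
        doc.LoadXml(xml);
        XmlNode node = doc.SelectSingleNode(xpath);
        return node == null ? null : node.InnerText;
    }
}
```

Both expressions find the asked value in Original. In Changed, Select(Changed, Positional) picks up the new first row instead, while Select(Changed, Robust) still returns the asked value because it locates the cell by its label rather than by its position.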
