How to get only the parent tag's text from HTML in C#

Question

I am trying to grab the text from a tag that also has some child tags.

For example:

<p><span>Child Text </span><span class="price">Child Text</span><br />
I need this text</p>

This is what I am trying:

HtmlElement menuElement = browser.Document.GetElementsByTagName("p")[0];   // GetElementsByTagName returns a collection
String mytext = menuElement.InnerHtml;   // also tried InnerText, OuterHtml, OuterText

UPDATE: I think I have to use HtmlAgilityPack, so now my question is how to do this with the HtmlAgilityPack library. I'm new to it.

Thanks

Answer 1

There are many approaches to this, from regular expressions to web scraping libraries. I recommend HtmlAgilityPack, which lets you address exactly what you need with XPath. Add a reference to HtmlAgilityPack and import its namespace. I'm also using LINQ (this requires .NET 3.5 or later). With the code below you can do that.

using HtmlAgilityPack;
using System.Linq;

// these references must be available.

        private void Form1_Load(object sender, EventArgs e)
        {
            var rawData = "<p><span>Child Text </span><span class=\"price\">Child Text</span><br />I need this text</p>";
            var html = new HtmlAgilityPack.HtmlDocument();
            html.LoadHtml(rawData);
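            // SelectNodes returns null when nothing matches, so guard against that.
            var textNodes = html.DocumentNode.SelectNodes("//p/text()");
            if (textNodes != null)
                textNodes.ToList().ForEach(x => MessageBox.Show(x.InnerHtml));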
        }
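
The same idea can be wrapped in a small helper that returns the text instead of showing message boxes. This is only a minimal sketch, assuming the HtmlAgilityPack NuGet package is referenced. The names `ParentTextExample` and `GetOwnText` are hypothetical and not part of the library. It walks the direct child nodes of the first matching element and keeps only the text nodes, which should have the same effect as the `//p/text()` XPath used above.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

static class ParentTextExample
{
    // Hypothetical helper: returns the text that belongs directly to the
    // first element matched by xpath, ignoring text inside child elements.
    public static string GetOwnText(string html, string xpath)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectSingleNode returns null when nothing matches.
        var node = doc.DocumentNode.SelectSingleNode(xpath);
        if (node == null)
            return string.Empty;

        // Keep only direct text children, decode entities, drop empty pieces.
        var parts = node.ChildNodes
            .Where(n => n.NodeType == HtmlNodeType.Text)
            .Select(n => HtmlEntity.DeEntitize(n.InnerText).Trim())
            .Where(t => t.Length > 0);

        return string.Join(" ", parts);
    }

    static void Main()
    {
        var rawData = "<p><span>Child Text </span><span class=\"price\">Child Text</span><br />I need this text</p>";
        Console.WriteLine(GetOwnText(rawData, "//p"));
    }
}
```

Returning a string rather than calling `MessageBox.Show` keeps the helper usable outside a WinForms form, for example in a console program or a unit test.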

Answer 2

It's much, much easier if you can put the "need this text" inside a span with an id; then you just grab that element's InnerHtml. If you can't change the markup, you can grab menuElement's InnerHtml and string match for the content after "<br />", but that's quite fragile.
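
The string-matching fallback could look roughly like the sketch below. It uses only the .NET base class library. `TextAfterLastBreak` is a hypothetical helper name, and the code assumes the wanted text follows the last `<br />` in the paragraph's inner HTML, which is exactly the assumption that makes this approach fragile.

```csharp
using System;

static class BreakSplitExample
{
    // Hypothetical helper: returns whatever follows the last "<br />" marker
    // in innerHtml, or the whole trimmed string if the marker is absent.
    public static string TextAfterLastBreak(string innerHtml)
    {
        const string marker = "<br />";
        int index = innerHtml.LastIndexOf(marker, StringComparison.OrdinalIgnoreCase);
        if (index < 0)
            return innerHtml.Trim();
        return innerHtml.Substring(index + marker.Length).Trim();
    }

    static void Main()
    {
        // Inner HTML of the <p> element from the question.
        var innerHtml = "<span>Child Text </span><span class=\"price\">Child Text</span><br />I need this text";
        Console.WriteLine(TextAfterLastBreak(innerHtml)); // I need this text
    }
}
```

A browser control may serialize the tag differently from the source markup (for example as `<BR>`), so the marker may need adjusting for real pages.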

Answer 3

You can get the text by splitting the DocumentText up into different parts.

string text = "<p><span>Child Text </span><span class=\"price\">Child Text</span><br />I need this text</p>";
// Split on the markup that precedes the wanted text, leaving "I need this text</p>".
text = text.Split(new string[] { "<p><span>Child Text </span><span class=\"price\">Child Text</span><br />" }, StringSplitOptions.None)[1];
// The closing </p> can be removed in many ways; here is one.
text = text.Split(new string[] { "</p>" }, StringSplitOptions.None)[0];
// text now has the value "I need this text"

Hope this helps!

3 answers

Answer 1: 2 votes, accepted, 2012-04-28 19:49:18

Answer 2: 0 votes, 2012-04-28 19:33:57

Answer 3: 0 votes, 2012-04-28 21:00:51
