
How to get only the parent tag text from HTML in C#

I am trying to grab the text from a tag that also has some child tags.

For example:

<p><span>Child Text </span><span class="price">Child Text</span><br />
I need this text</p>

This is what I am trying:

HtmlElement menuElement = browser.Document.GetElementsByTagName("p")[0];
string mytext = menuElement.InnerHtml;   // also tried InnerText, OuterHtml, OuterText

UPDATE: I think I have to use HtmlAgilityPack, so now my question is how to do this with the HtmlAgilityPack library. I'm new to it.

Thanks

There are many approaches to this, from regular expressions to web scraping libraries. I recommend HtmlAgilityPack, which lets you address exactly the nodes you need with XPath. Add a reference to HtmlAgilityPack and import its namespace. The code below also uses LINQ, which requires .NET 3.5 or later.

using HtmlAgilityPack;
using System.Linq;

// These references must be available.

        private void Form1_Load(object sender, EventArgs e)
        {
            var rawData = "<p><span>Child Text </span><span class=\"price\">Child Text</span><br />I need this text</p>";
            var html = new HtmlAgilityPack.HtmlDocument();
            html.LoadHtml(rawData);
            html.DocumentNode.SelectNodes("//p/text()").ToList().ForEach(x=>MessageBox.Show(x.InnerHtml));
        }
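
As a variation on the snippet above, here is a minimal console sketch of the same XPath idea. The names ParentTextExtractor and GetOwnText are hypothetical and only used for illustration. It assumes the HtmlAgilityPack NuGet package is referenced, and it adds a null check because SelectNodes returns null rather than an empty collection when nothing matches.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class ParentTextExtractor
{
    // Hypothetical helper: returns the text nodes that are direct children
    // of each <p>, decoding entities and skipping whitespace-only nodes.
    public static List<string> GetOwnText(string rawHtml)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(rawHtml);

        // SelectNodes returns null (not an empty collection) when nothing matches.
        var textNodes = doc.DocumentNode.SelectNodes("//p/text()");
        if (textNodes == null)
        {
            return new List<string>();
        }

        return textNodes
            .Select(node => HtmlEntity.DeEntitize(node.InnerText).Trim())
            .Where(text => text.Length > 0)
            .ToList();
    }
}

public static class Program
{
    public static void Main()
    {
        var rawData = "<p><span>Child Text </span><span class=\"price\">Child Text</span><br />I need this text</p>";
        foreach (var text in ParentTextExtractor.GetOwnText(rawData))
        {
            Console.WriteLine(text);
        }
    }
}
```

The XPath //p/text() selects only text nodes whose parent is a p element, so the text inside the two span children is not included.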

It's much, much easier if you can put the "need this text" inside a span with an id; then you just read that element's InnerHtml. If you can't change the markup, you can take menuElement's InnerHtml and string-match the content after "<br />", but that's quite fragile.

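The string-matching fallback mentioned above could look roughly like the sketch below. TrailingTextExtractor and TextAfterLastBreak are hypothetical names, and the code assumes the markup contains the literal tag <br /> exactly as in the question. A browser control may normalize it to another spelling, which is one reason this approach is fragile.

```csharp
using System;

public static class TrailingTextExtractor
{
    // Hypothetical helper: returns the text between the last "<br />" and the
    // closing "</p>" (or the end of the string), or null if no "<br />" is found.
    public static string TextAfterLastBreak(string html)
    {
        const string breakTag = "<br />";   // assumed literal spelling of the tag
        const string closeTag = "</p>";

        int breakIndex = html.LastIndexOf(breakTag, StringComparison.OrdinalIgnoreCase);
        if (breakIndex < 0)
        {
            return null;
        }

        int start = breakIndex + breakTag.Length;
        int end = html.IndexOf(closeTag, start, StringComparison.OrdinalIgnoreCase);
        if (end < 0)
        {
            // InnerHtml does not include the closing </p>, so read to the end.
            end = html.Length;
        }

        return html.Substring(start, end - start).Trim();
    }
}

public static class Program
{
    public static void Main()
    {
        var html = "<p><span>Child Text </span><span class=\"price\">Child Text</span><br />\nI need this text</p>";
        Console.WriteLine(TrailingTextExtractor.TextAfterLastBreak(html));
    }
}
```
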
You can get the text by splitting the DocumentText up into different parts.

string text = "<p><span>Child Text </span><span class=\"price\">Child Text</span><br />I need this text</p>";
text = text.Split(new string[] { "<p><span>Child Text </span><span class=\"price\">Child Text</span><br />" }, StringSplitOptions.None)[1];
// Splitting on the leading markup leaves element [1] as "I need this text</p>".
// There are many ways to remove the trailing </p>; this is one of them.
text = text.Split(new string[] { "</p>" }, StringSplitOptions.None)[0];
// text now has the value "I need this text"

Hope this helps!
