C#: remove specific <a> tags (e-mail only) in a string

So I get a string like this from an external method:

var myString = "<p>Lorem &sect; 5 ipsum</p>\r\n<p><p>E-Mail: <a href=\"email@domain.com\">email@domain.com</a></p>\r\n<p>Lorem ipsum dolor sit amet</p><p><a href=\"http://www.adress.com\">name</a></p>\r\n";

I want to replace all e-mail address links (no other links) with plain text. So afterwards my string should look something like this:

var myClearedString = "<p>Lorem &sect; 5 ipsum</p>\r\n<p><p>E-Mail: email@domain.com</p>\r\n<p>Lorem ipsum dolor sit amet</p><p><a href=\"http://www.adress.com\">name</a></p>\r\n";

There could be 1 to n occurrences in the string. I already searched Stack Overflow, but the only related thing I found was this question: Replace mailto-links

In my opinion the best way would be to convert the string into XML and search it. Unfortunately it seems that some characters in my string are causing trouble (I assume it might be \n or \r).

You should look into Html Agility Pack for this. I'm sure there are many regular expressions that could get you most of the way, but parsing HTML with regex is generally a bad idea. See https://stackoverflow.com/a/1732454/880642 for some reasons why.

Html Agility Pack will safely parse the document for you and let you traverse it to find the links that meet your criteria.

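using HtmlAgilityPack;

var htmlDocument = new HtmlDocument();
htmlDocument.LoadHtml(htmlPage); // htmlPage is the input string (myString in the question)

// SelectNodes returns null, not an empty collection, when nothing matches.
var links = htmlDocument.DocumentNode.SelectNodes("//a[@href]");
if (links != null)
{
    foreach (var node in links)
    {
        HtmlAttribute attribute = node.Attributes["href"];
        // IsEmail is a helper you supply; it is not part of Html Agility Pack.
        if (IsEmail(attribute.Value))
            node.ParentNode.RemoveChild(node, true); // true = keepGrandChildren, so the link text stays
    }
}
var newhtml = htmlDocument.DocumentNode.OuterHtml;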

You can probably use a regex to verify that the attribute value is an e-mail address, or any number of .NET functions that check whether a string is an e-mail address. I'm surprised that these aren't mailto: links, but you have to work with the data that you have.
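As one possible way to fill in the IsEmail check used above, here is a minimal sketch built on System.Net.Mail.MailAddress. The class name EmailCheck and the decision to accept an optional "mailto:" prefix are assumptions for illustration, not something the question specifies, and MailAddress is more lenient than a strict validator.

```csharp
using System;
using System.Net.Mail;

public static class EmailCheck
{
    // Hypothetical helper: true when an href value is a bare e-mail address,
    // optionally prefixed with "mailto:".
    public static bool IsEmail(string href)
    {
        if (string.IsNullOrWhiteSpace(href))
            return false;

        var value = href.Trim();
        if (value.StartsWith("mailto:", StringComparison.OrdinalIgnoreCase))
            value = value.Substring("mailto:".Length);
        if (value.Length == 0)
            return false;

        try
        {
            // MailAddress also accepts forms like "Name <a@b.com>", so require
            // the parsed address to equal the input exactly.
            return new MailAddress(value).Address == value;
        }
        catch (FormatException)
        {
            // Thrown when the string is not in a recognizable address form.
            return false;
        }
    }
}
```

With this in place, the loop above would unwrap href="email@domain.com" and leave href="http://www.adress.com" alone.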

I'll probably be hanged for suggesting this, but you could use regular expressions.

Start by importing the necessary namespace:

using System.Text.RegularExpressions;

Then we need to figure out the regular expression that will identify the substrings that match your criteria. Several sites offer regular expression testing; just search for "regular expression tester".

This will match every anchor tag and create 3 groups:

(<a[^>]+>)(.*?)(<\/a>)

Now we need to get all the matches and replace them with the plain text value.

We can use the Regex.Replace method to complete the task:

string newValue = Regex.Replace(test, @"(<a[^>]+>)(.*?)(<\/a>)", (m) => 
{
    return m.Groups[2].Value;
});

This snippet runs the lambda expression for every match and returns the value of the second group (the content of the tag).
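The pattern above unwraps every anchor, while the question only wants the e-mail links replaced. A rough sketch of a narrower variant follows. The class and method names are made up for illustration, and it assumes double-quoted href values and no nested anchors, so treat it as a starting point rather than a general HTML solution.

```csharp
using System;
using System.Text.RegularExpressions;

public static class EmailLinkStripper
{
    // Matches an <a> element whose href is a bare e-mail address (optionally
    // "mailto:"-prefixed) and captures the element's content in group 1.
    private static readonly Regex EmailAnchor = new Regex(
        @"<a\s[^>]*?href\s*=\s*""(?:mailto:)?[^""@\s]+@[^""@\s]+""[^>]*>(.*?)</a>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Replaces each matching anchor with its content; other links are untouched.
    public static string Unwrap(string html)
    {
        return EmailAnchor.Replace(html, m => m.Groups[1].Value);
    }
}
```

Calling EmailLinkStripper.Unwrap(myString) on the sample from the question should leave the http://www.adress.com link in place and reduce the e-mail link to its text.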

This would be a good use of Regex and Replace:

Regex.Replace(myString, @"(<a.*?>)", "").Replace("</a>","")

You can use:

// matches both opening <a ...> and closing </a> tags
Regex.Replace(source, @"</?a\b[^>]*>", string.Empty);

or, if you need to run the replacement many times, you can use a compiled regex:

Regex removeRegex = new Regex(@"</?a\b[^>]*>", RegexOptions.Compiled);

and use it as below:

removeRegex.Replace(source, string.Empty);
