Highlight a list of words using a regular expression in C#

I have some site content that contains abbreviations. I have a list of recognised abbreviations for the site, along with their explanations. I want to create a regular expression that lets me replace every recognised abbreviation found in the content with some markup.

For example:

content:

This is just a little test of the memb to see if it gets picked up. 
Deb of course should also be caught here.

abbreviations:

memb = Member; deb = Debut; 

result:

This is just a little test of the [a title="Member"]memb[/a] to see if it gets picked up. 
[a title="Debut"]Deb[/a] of course should also be caught here.

(This is just example markup, for simplicity.)

Thanks.

EDIT:

CraigD's answer is nearly there, but there are issues. I only want to match whole words. I also want to keep the original capitalisation of each replaced word, so that deb is still deb and Deb is still Deb, as in the original text. For example, this input:

This is just a little test of the memb. 
And another memb, but not amemba. 
Deb of course should also be caught here.deb!

First you would need to Regex.Escape() all the input strings.

Then you can look for them in the string and iteratively replace them with the markup you have in mind:

string abbr       = "memb";
string word       = "Member";
// Verbatim string: in a regular C# string literal, "\b" is a backspace character, not a word boundary.
string pattern    = String.Format(@"\b{0}\b", Regex.Escape(abbr));
string substitute = String.Format("[a title=\"{0}\"]{1}[/a]", word, abbr);
string output     = Regex.Replace(input, pattern, substitute);

EDIT: I asked whether a simple String.Replace() wouldn't be enough, but I can see why a regex is desirable: by building a pattern with word boundary anchors, you can restrict replacements to whole words only.
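To make the difference concrete, here is a small sketch comparing the two approaches. The class and method names (WholeWordDemo, NaiveReplace, WholeWordReplace) are made up for illustration and do not come from the question.

```csharp
using System;
using System.Text.RegularExpressions;

public static class WholeWordDemo
{
    // Plain substring replacement: also rewrites "memb" inside "amemba".
    public static string NaiveReplace(string input, string abbr, string markup)
    {
        return input.Replace(abbr, markup);
    }

    // Regex replacement with \b anchors: only standalone occurrences are rewritten.
    public static string WholeWordReplace(string input, string abbr, string markup)
    {
        string pattern = @"\b" + Regex.Escape(abbr) + @"\b";
        // Double any '$' so the markup is inserted literally instead of being
        // interpreted as a substitution pattern such as $1.
        return Regex.Replace(input, pattern, markup.Replace("$", "$$"));
    }
}
```

Called with the input "And another memb, but not amemba.", the first method also touches the middle of "amemba", while the second leaves that word alone.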

You can go as far as building a single pattern from all your escaped input strings, like this:

\b(?:{abbr_1}|{abbr_2}|{abbr_3}|{abbr_n})\b

and then using a match evaluator to find the right replacement. This way you avoid iterating over the input string more than once.
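A minimal sketch of that single-pass idea follows. The names AbbreviationMarkup and Apply are hypothetical, and the [a title="..."] output is just the example markup from the question. It assumes the abbreviation keys differ by more than letter case, since they are copied into a case-insensitive dictionary.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class AbbreviationMarkup
{
    public static string Apply(string input, IDictionary<string, string> abbreviations)
    {
        // Case-insensitive copy so that "Deb" finds the "deb" entry.
        var lookup = new Dictionary<string, string>(abbreviations, StringComparer.OrdinalIgnoreCase);
        if (lookup.Count == 0) return input;

        // Builds \b(?:memb|deb)\b with every key escaped.
        string alternatives = string.Join("|", lookup.Keys.Select(k => Regex.Escape(k)).ToArray());
        string pattern = @"\b(?:" + alternatives + @")\b";

        // The evaluator runs once per match; m.Value keeps the casing found in the input.
        return Regex.Replace(input, pattern, m =>
        {
            string explanation;
            return lookup.TryGetValue(m.Value, out explanation)
                ? string.Format("[a title=\"{0}\"]{1}[/a]", explanation, m.Value)
                : m.Value; // leave the text untouched if no entry is found
        }, RegexOptions.IgnoreCase);
    }
}
```

The dictionary lookup is keyed on the matched text, so the pattern only has to find candidates and the evaluator decides what each one becomes.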

Not sure how well this will scale to a big word list, but I think it should give the output you want (although in your question the 'result' seems identical to 'content')?

Anyway, let me know if this is what you're after.

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

namespace ConsoleApplication1
{
    class Program
    {
        static void Main(string[] args)
        {
            var input = @"This is just a little test of the memb to see if it gets picked up. 
Deb of course should also be caught here.";
            var dictionary = new Dictionary<string,string>
            {
                {"memb", "Member"}
                ,{"deb","Debut"}
            };
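            // Escape each key so regex metacharacters in an abbreviation are matched literally.
            var regex = "(" + String.Join(")|(", dictionary.Keys.Select(k => Regex.Escape(k)).ToArray()) + ")";
            foreach (Match metamatch in Regex.Matches(input
               , regex  /*@"(memb)|(deb)"*/
               , RegexOptions.IgnoreCase | RegexOptions.ExplicitCapture))
            { 
                // Replaces every occurrence of the matched text with its explanation.
                input = input.Replace(metamatch.Value, dictionary[metamatch.Value.ToLower()]);
            }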
            Console.Write (input);
            Console.ReadLine();
        }
    }
}

I doubt it will perform better than a plain string.Replace, so if performance is critical, measure it (after refactoring a bit to use a compiled regex). You can do the regex version as:

var abbrsWithPipes = "(abbr1|abbr2)";
var regex = new Regex(abbrsWithPipes);
return regex.Replace(html, m => GetReplaceForAbbr(m.Value));

You need to implement GetReplaceForAbbr, which receives the specific abbreviation being matched.
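One possible shape for GetReplaceForAbbr, together with a regex that is compiled once and reused, is sketched below. The AbbrReplacer class, the hard-coded dictionary, and the [a title="..."] markup are assumptions for illustration; a real site would load its own abbreviation list and build the pattern from it.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class AbbrReplacer
{
    // Hypothetical sample data standing in for the site's abbreviation list.
    private static readonly Dictionary<string, string> Explanations =
        new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase)
        {
            { "memb", "Member" },
            { "deb", "Debut" }
        };

    // Constructed once and reused. RegexOptions.Compiled costs more at construction
    // time in exchange for faster matching; whether that pays off should be measured.
    private static readonly Regex AbbrRegex =
        new Regex(@"\b(?:memb|deb)\b", RegexOptions.IgnoreCase | RegexOptions.Compiled);

    // Receives the matched text and returns its markup, keeping the matched casing.
    public static string GetReplaceForAbbr(string abbr)
    {
        string explanation;
        return Explanations.TryGetValue(abbr, out explanation)
            ? string.Format("[a title=\"{0}\"]{1}[/a]", explanation, abbr)
            : abbr; // unknown abbreviation: return the text unchanged
    }

    public static string Replace(string html)
    {
        return AbbrRegex.Replace(html, m => GetReplaceForAbbr(m.Value));
    }
}
```

Keeping the Regex in a static field means the construction cost is paid once per process rather than once per call.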

I'm doing pretty much exactly what you're looking for in my application, and this works for me. The parameter str is your content:

public static string GetGlossaryString(string str)
{
    // This collection would contain your abbreviations. You could make it a Dictionary
    // to hold abbreviation/full-term pairs and use them in the loop below.
    List<string> glossaryWords = GetGlossaryItems();

    // Quick and dirty way to also match the first and last word in the content.
    str = string.Format(" {0} ", str);

    // Escape each word so regex metacharacters in a glossary term are matched literally.
    foreach (string word in glossaryWords)
        str = Regex.Replace(str, "([\\W])(" + Regex.Escape(word) + ")([\\W])", "$1<span class='glossaryItem'>$2</span>$3", RegexOptions.IgnoreCase);

    return str.Trim();
}
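One caveat with the pattern above: it consumes the non-word character on each side of a match, so when the same glossary term appears twice in a row separated by a single space, only the first occurrence is wrapped. A variant using lookarounds avoids that and does not need the padding trick. This is only a sketch; GlossaryHighlighter and Highlight are hypothetical names, and the word list is passed in rather than read from GetGlossaryItems().

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class GlossaryHighlighter
{
    public static string Highlight(string str, IEnumerable<string> glossaryWords)
    {
        foreach (string word in glossaryWords)
        {
            // (?<!\w) and (?!\w) assert that no word character is adjacent without
            // consuming anything, so the first word, the last word, and back-to-back
            // terms all match.
            string pattern = @"(?<!\w)" + Regex.Escape(word) + @"(?!\w)";
            // $0 is the whole match, which keeps the casing found in the text.
            str = Regex.Replace(str, pattern, "<span class='glossaryItem'>$0</span>", RegexOptions.IgnoreCase);
        }
        return str;
    }
}
```

Like the loop version above, this makes one pass per word, so a glossary term that happens to appear in the inserted markup itself (for example "span" or "class") would be matched by a later iteration.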

For anyone interested, here is my final solution. It is for a .NET user control. It uses a single pattern with a match evaluator, as suggested by Tomalak, so there is no foreach loop. It's an elegant solution, and it gives me the correct output for the sample input while preserving the casing of matched strings.

public partial class Abbreviations : System.Web.UI.UserControl
{
    private Dictionary<String, String> dictionary = DataHelper.GetAbbreviations();

    protected void Page_Load(object sender, EventArgs e)
    {
        string input = "This is just a little test of the memb. And another memb, but not amemba to see if it gets picked up. Deb of course should also be caught here.deb!";

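        // Escape each key so regex metacharacters in an abbreviation are matched literally.
        var regex = "\\b(?:" + String.Join("|", dictionary.Keys.Select(k => Regex.Escape(k)).ToArray()) + ")\\b";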

        MatchEvaluator myEvaluator = new MatchEvaluator(GetExplanationMarkup);

        input = Regex.Replace(input, regex, myEvaluator, RegexOptions.IgnoreCase);

        litContent.Text = input;
    }

    private string GetExplanationMarkup(Match m)
    {
        return string.Format("<b title='{0}'>{1}</b>", dictionary[m.Value.ToLower()], m.Value);
    }
}

The output looks like this (below). Note that it only matches whole words, and that the casing of the original string is preserved:

This is just a little test of the <b title='Member'>memb</b>. And another <b title='Member'>memb</b>, but not amemba to see if it gets picked up. <b title='Debut'>Deb</b> of course should also be caught here.<b title='Debut'>deb</b>!
