How to parse marked up text in C#

I am trying to make a simple text formatter that uses MigraDoc to do the actual typesetting. I'd like to specify formatting by marking up the text. For example, the input might look something like this:

"The \i{quick} brown fox jumps over the lazy dog^{note}"

which would denote "quick" being italicized and "note" being superscript. To make the splits I have made a dictionary in my TextFormatter:

// A static constructor cannot carry an access modifier such as "internal".
static TextFormatter()
    {
        FormatDictionary = new Dictionary<string, TextFormats>()
        {
            {@"^", TextFormats.superscript},
            {@"_", TextFormats.subscript},
            {@"\i", TextFormats.italic}
        };
    }

I'm then hoping to split using some regexes that look for the modifier strings and match what is enclosed in braces.

But as multiple formats can exist in a string, I also need to keep track of which regex was matched, e.g. by getting a List<string, TextFormats> (where string is the enclosed string, TextFormats is the TextFormats value corresponding to the appropriate special sequence, and the items are sorted in order of appearance), which I could then iterate over, applying formatting based on the TextFormats.

Thank you for any suggestions.

Consider the following code...

string inputMessage = @"The \i{quick} brown fox jumps over the lazy dog^{note}";
MatchCollection matches = Regex.Matches(inputMessage, @"(?<=(\\i|_|\^)\{)\w*(?=\})");

foreach (Match match in matches)
{
    string textformat = match.Groups[1].Value;
    string enclosedstring = match.Value;
    // Add to Dictionary<string, TextFormats> 
}
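
To get the ordered list the question asks for, the loop above could be fleshed out roughly as follows. This is only a sketch: the `TextFormats` enum members are guessed from the question, and the `MarkupParser` class, its `Parse` helper and the tuple list are hypothetical names introduced here. The pattern is the one from the snippet above, so it still only matches word characters between the braces.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Assumed shape of the asker's enum; the real definition is not shown in the question.
enum TextFormats { superscript, subscript, italic }

static class MarkupParser
{
    // Marker-to-format table, mirroring the dictionary in the question.
    static readonly Dictionary<string, TextFormats> FormatDictionary =
        new Dictionary<string, TextFormats>
        {
            { "^", TextFormats.superscript },
            { "_", TextFormats.subscript },
            { @"\i", TextFormats.italic }
        };

    // Hypothetical helper: returns (enclosed text, format) pairs in order of appearance.
    public static List<(string Text, TextFormats Format)> Parse(string input)
    {
        var result = new List<(string Text, TextFormats Format)>();
        // Group 1 sits inside the lookbehind and captures the marker;
        // the match itself is the text between the braces.
        foreach (Match match in Regex.Matches(input, @"(?<=(\\i|_|\^)\{)\w*(?=\})"))
        {
            result.Add((match.Value, FormatDictionary[match.Groups[1].Value]));
        }
        return result;
    }
}
```

Regex.Matches returns matches in the order they occur in the input, so the list comes out sorted by position without any extra work.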

Good luck!

I'm not sure if callbacks are available in .NET, but if you have strings like "The \\i{quick} brown fox jumps over the lazy dog^{note}" and you just want to do the substitution as you find them, you could use a regex replace with a callback:

 #  @"(\\i|_|\^){([^}]*)}"

 ( \\i | _ | \^ )         # (1)
 {
 ( [^}]* )                # (2)
 }

then in the callback examine capture group 1 for the format and replace with {fmtCodeStart}\\2{fmtCodeEnd}
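
In .NET the callback is a MatchEvaluator delegate passed to Regex.Replace. A minimal sketch of the idea, assuming HTML-like tags as stand-ins for the {fmtCodeStart}/{fmtCodeEnd} placeholders (the `MarkupReplacer` class and `ToTags` method are hypothetical names, and nothing here calls MigraDoc):

```csharp
using System;
using System.Text.RegularExpressions;

static class MarkupReplacer
{
    // Hypothetical helper: rewrites each marked-up run using placeholder tags.
    public static string ToTags(string input)
    {
        // The lambda is converted to a MatchEvaluator and called once per match.
        return Regex.Replace(input, @"(\\i|_|\^)\{([^}]*)\}", match =>
        {
            string marker = match.Groups[1].Value; // "\i", "_" or "^"
            string inner = match.Groups[2].Value;  // text between the braces
            switch (marker)
            {
                case @"\i": return "<i>" + inner + "</i>";
                case "_":   return "<sub>" + inner + "</sub>";
                default:    return "<sup>" + inner + "</sup>"; // the "^" marker
            }
        });
    }
}
```

For the sample input this yields "The <i>quick</i> brown fox jumps over the lazy dog<sup>note</sup>". Because the inner group is `[^}]*`, it also accepts spaces and punctuation inside the braces, but not nested braces.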


or you could use

 #  @"(?:(\\i)|(_)|(\^)){([^}]*)}"

 (?:
      ( \\i )             # (1)
   |  ( _ )               # (2)
   |  ( \^ )              # (3)
 )
 {
 ( [^}]* )                # (4)
 }

then in the callback

 if (match.Groups[1].Success) 
   // return "{fmtCode1Start}\4{fmtCode1End}"
 else if (match.Groups[2].Success) 
   // return "{fmtCode2Start}\4{fmtCode2End}"
 else if (match.Groups[3].Success) 
   // return "{fmtCode3Start}\4{fmtCode3End}"
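
Written out as a named callback, the variant with one group per marker might look like the sketch below. The tags again stand in for the fmtCode placeholders, and `GroupedReplacer`, `Evaluate` and `ToTags` are hypothetical names chosen for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

static class GroupedReplacer
{
    // Groups 1-3 each capture one marker; group 4 captures the enclosed text.
    static readonly Regex Pattern = new Regex(@"(?:(\\i)|(_)|(\^))\{([^}]*)\}");

    // Callback: exactly one of groups 1-3 succeeds for any given match.
    static string Evaluate(Match match)
    {
        string inner = match.Groups[4].Value;
        if (match.Groups[1].Success) return "<i>" + inner + "</i>";
        if (match.Groups[2].Success) return "<sub>" + inner + "</sub>";
        return "<sup>" + inner + "</sup>"; // group 3, the "^" marker
    }

    // The method group converts to the MatchEvaluator that Regex.Replace expects.
    public static string ToTags(string input) => Pattern.Replace(input, Evaluate);
}
```

Testing Success on separate groups avoids comparing marker strings in the callback, at the cost of a slightly longer pattern.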
