In C#, what is the best way to parse this WIKI markup?

I need to take data that I am reading in from a WIKI markup page and store it as a table structure. I am trying to figure out how to properly parse the markup syntax below into some table data structure in C#.

Here is an example table:

 || Owner || Action || Status || Comments ||
 | Bill | Fix the lobby | In Progress | This is easy |
 | Joe | Fix the bathroom | In Progress | Plumbing \\
 \\
  Electric \\
 \\
 Painting \\
 \\
 \\ | 
 | Scott | Fix the roof | Complete | This is expensive |

And here is how it comes in directly:

|| Owner|| Action || Status || Comments || | Bill\\ | fix the lobby |In Progress | This is eary| | Joe\\ |fix the bathroom\\ | In progress| plumbing  \\Electric \\Painting \\ \\ | | Scott \\ | fix the roof \\ | Complete | this is expensive| 

So as you can see:

  • The column headers have "||" as the separator.
  • The columns of a row have "|" as the separator.
  • A row might span multiple lines (as in the second data row of the example above), so I would have to keep reading until I hit the same number of "|" (columns) that I have in the header row.

I tried reading line by line and then concatenating lines that had "\\" in between them, but that seemed a bit hacky.

I also tried simply reading the input as a full string, parsing by "||" first, and then reading until I hit the same number of "|" before going on to the next row. This seemed to work, but it feels like there might be a more elegant way using regular expressions or something similar.

Can anyone suggest the correct way to parse this data?

I have largely replaced the previous answer, because the format of the input after your edit is substantially different from the one posted before. This leads to a somewhat different solution.

Because there are no longer any line breaks after a row, the only way to determine for sure where a row ends is to require that each row has the same number of columns as the table header. That is, at least, if you don't want to rely on a potentially fragile whitespace convention present in the one and only example string provided (i.e. that the row separator is the only | not preceded by a space). Your question does not give this as the specification for a row delimiter.

The "parser" below provides at least the error handling and validity checks that can be derived from your format specification and example string, and it also allows for tables that have no rows. The comments explain what it is doing in basic steps.

using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

public class TableParser
{
    const StringSplitOptions SplitOpts = StringSplitOptions.None;
    const string RowColSep = "|";
    static readonly string[] HeaderColSplit = { "||" };
    static readonly string[] RowColSplit = { RowColSep };
    static readonly string[] MLColSplit = { @"\\" };

    public class TableRow
    {
        public List<string[]> Cells;
    }

    public class Table
    {
        public string[] Header;
        public TableRow[] Rows;
    }

    public static Table Parse(string text)
    {
        // Isolate the header columns and rows remainder.
        var headerSplit = text.Split(HeaderColSplit, SplitOpts);
        Ensure(headerSplit.Length > 1, "At least 1 header column is required in the input");

        // Need to check whether there are any rows.
        var hasRows = headerSplit.Last().IndexOf(RowColSep) >= 0;
        var header = headerSplit.Skip(1)
            .Take(headerSplit.Length - (hasRows ? 2 : 1))
            .Select(c => c.Trim())
            .ToArray();

        if (!hasRows) // If no rows for this table, we are done.
            return new Table() { Header = header, Rows = new TableRow[0] };

        // Get all row columns from the remainder.
        var rowsCols = headerSplit.Last().Split(RowColSplit, SplitOpts);

        // Require same amount of columns for a row as the header.
        Ensure((rowsCols.Length % (header.Length + 1)) == 1, 
            "The number of row colums does not match the number of header columns");
        var rows = new TableRow[(rowsCols.Length - 1) / (header.Length + 1)];

        // Fill rows by sequentially taking # header column cells 
        for (int ri = 0, start = 1; ri < rows.Length; ri++, start += header.Length + 1)
        {
            rows[ri] = new TableRow() { 
                Cells = rowsCols.Skip(start).Take(header.Length)
                    .Select(c => c.Split(MLColSplit, SplitOpts).Select(p => p.Trim()).ToArray())
                    .ToList()
            };
        }

        return new Table { Header = header, Rows = rows };
    }

    private static void Ensure(bool check, string errorMsg)
    {
        if (!check)
            throw new InvalidDataException(errorMsg);
    }
}

When used like this:

public static void Main(params string[] args)
{
        var wikiLine = @"|| Owner|| Action || Status || Comments || | Bill\\ | fix the lobby |In Progress | This is eary| | Joe\\ |fix the bathroom\\ | In progress| plumbing  \\Electric \\Painting \\ \\ | | Scott \\ | fix the roof \\ | Complete | this is expensive|";
        var table = TableParser.Parse(wikiLine);

        Console.WriteLine(string.Join(", ", table.Header));
        foreach (var r in table.Rows)
            Console.WriteLine(string.Join(", ", r.Cells.Select(c => string.Join(Environment.NewLine + "\t# ", c))));
}

It prints the header on one line, followed by the cells of each row separated by ", ". Within a cell, "\t# " marks a newline caused by the presence of \\ in the input.

Here's a solution which populates a DataTable. It does require a little bit of data massaging (Trim), but the main parsing is Splits and LINQ.

var str = @"|| Owner|| Action || Status || Comments || | Bill\\ | fix the lobby |In Progress | This is eary| | Joe\\ |fix the bathroom\\ | In progress| plumbing  \\Electric \\Painting \\ \\ | | Scott \\ | fix the roof \\ | Complete | this is expensive|";

var headerStop = str.LastIndexOf("||");
var headers = str.Substring(0, headerStop).Split(new string[1] { "||" }, StringSplitOptions.None).Skip(1).ToList();
var records = str.Substring(headerStop + 4).TrimEnd(new char[2] { ' ', '|' }).Split(new string[1] { "| |" }, StringSplitOptions.None).ToList();

var tbl = new DataTable();
headers.ForEach(h => tbl.Columns.Add(h.Trim()));
records.ForEach(r =>  tbl.Rows.Add(r.Split('|')));
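The rows added above still carry the \\ markers and the surrounding whitespace. A minimal sketch of that clean-up step follows, assuming you want the pieces of a multi-line cell joined with a single space. The class CellCleanup and its methods CleanCell and CleanRecord are hypothetical names for illustration, not part of the snippet above.

```csharp
using System;
using System.Linq;

public static class CellCleanup
{
    // Hypothetical helper: removes the "\\" line-break markers from one raw cell
    // and trims the whitespace around each remaining piece.
    public static string CleanCell(string rawCell)
    {
        var pieces = rawCell
            .Split(new[] { @"\\" }, StringSplitOptions.None)
            .Select(p => p.Trim())
            .Where(p => p.Length > 0);   // drop pieces that were only whitespace
        return string.Join(" ", pieces);
    }

    // Splits one record on '|' and cleans every cell.
    public static string[] CleanRecord(string record)
    {
        return record.Split('|').Select(c => CleanCell(c)).ToArray();
    }
}
```

With this in place, the last line of the snippet could become records.ForEach(r => tbl.Rows.Add(CellCleanup.CleanRecord(r))); instead of adding the raw split result.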

This makes some assumptions but seems to work for your sample data. I'm sure if I worked at it I could combine the expressions and clean it up, but you'll get the idea. It will also allow for rows that do not have the same number of cells as the header, which I think is something Confluence can do.

List<List<string>> table = new List<List<string>>();


var match = Regex.Match(raw, @"(?:(?:\|\|([^|]*))*\n)?");
if (match.Success)
{
    var headersWithExtra = match.Groups[1].Captures.Cast<Capture>().Select(c=>c.Value);
    List<String> headerRow = headersWithExtra.Take(headersWithExtra.Count()-1).ToList();
    if (headerRow.Count > 0)
    {
        table.Add(headerRow);
    }
}

match = Regex.Match(raw + "\r\n", @"[^\n]*\n" + @"(?:\|([^|]*))*");
var cellsWithExtra = match.Groups[1].Captures.Cast<Capture>().Select(c=>c.Value);

List<string> row = new List<string>();
foreach (string cell in cellsWithExtra)
{
    if (cell.Trim(' ', '\t') == "\r\n")
    {
        if (!table.Contains(row) && row.Count > 0)
        {
            table.Add(row);
        }
        row = new List<string>();
    }
    else
    {

        row.Add(cell);
    }
}

This ended up very similar to Jon Tirjan's answer, although it cuts the LINQ down to a single statement (the code to replace that last one was horrifically ugly) and is a bit more extensible. For example, it will replace the Confluence line breaks \\ with a string of your choosing, and you can choose whether or not to trim whitespace from around elements.

private void ParseWikiTable(string input, string newLineReplacement = " ")
{
    string separatorHeader = "||";
    string separatorRow = "| |";
    string separatorElement = "|";

    input = Regex.Replace(input, @"[ \\]{2,}", newLineReplacement);

    string inputHeader = input.Substring(0, input.LastIndexOf(separatorHeader));
    string inputContent = input.Substring(input.LastIndexOf(separatorHeader) + separatorHeader.Length);

    string[] headerArray = SimpleSplit(inputHeader, separatorHeader);
    string[][] rowArray = SimpleSplit(inputContent, separatorRow).Select(r => SimpleSplit(r, separatorElement)).ToArray();

    // do something with output data
    TestPrint(headerArray);
    foreach (var r in rowArray) { TestPrint(r); }
}

private string[] SimpleSplit(string input, string separator, bool trimWhitespace = true)
{
    input = input.Trim();
    if (input.StartsWith(separator)) { input = input.Substring(separator.Length); }
    if (input.EndsWith(separator)) { input = input.Substring(0, input.Length - separator.Length); }

    string[] segments = input.Split(new string[] { separator }, StringSplitOptions.None);
    if (trimWhitespace)
    {
        for (int i = 0; i < segments.Length; i++)
        {
            segments[i] = segments[i].Trim();
        }
    }

    return segments;
}

private void TestPrint(string[] lst)
{
    string joined = "[" + String.Join("::", lst) + "]";
    Console.WriteLine(joined);
}

Console output from your direct input string:

[Owner::Action::Status::Comments]
[Bill::fix the lobby::In Progress::This is eary]
[Joe::fix the bathroom::In progress::plumbing Electric Painting]
[Scott::fix the roof::Complete::this is expensive]

A generic regex solution that populates a DataTable and is a little flexible with the syntax.

           var text = @"|| Owner|| Action || Status || Comments || | Bill\\ | fix the lobby |In Progress | This is eary| | Joe\\ |fix the bathroom\\ | In progress| plumbing  \\Electric \\Painting \\ \\ | | Scott \\ | fix the roof \\ | Complete | this is expensive|";

        // Get Headers
        var regHeaders = new Regex(@"\|\|\s*(\w[^\|]+)", RegexOptions.Compiled);
        var headers = regHeaders.Matches(text);

        //Get Rows, based on number of headers columns
        var regLinhas = new Regex(String.Format(@"(?:\|\s*(\w[^\|]+)){{{0}}}", headers.Count));
        var rows = regLinhas.Matches(text);

        var tbl = new DataTable();

        foreach (Match header in headers)
        {
            tbl.Columns.Add(header.Groups[1].Value);
        }

        foreach (Match row in rows)
        {
            tbl.Rows.Add(row.Groups[1].Captures.OfType<Capture>().Select(col => col.Value).ToArray());
        }

Here's a solution involving regular expressions. It takes a single string as input and produces a List<string> of headers and a List<List<string>> of rows/columns. It also trims white space, which may or may not be the desired behavior, so be aware of that. It even prints things nicely :)


using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

namespace parseWiki
{
    class Program
    {
        static void Main(string[] args)
        {
            string content = @"|| Owner || Action || Status || Comments || | Bill\\ | fix the lobby |In Progress | This is eary| | Joe\\ |fix the bathroom\\ | In progress| plumbing  \\Electric \\Painting \\ \\ | | Scott \\ | fix the roof \\ | Complete | this is expensive|";
            content = content.Replace(@"\\", "");
            string headerContent = content.Substring(0, content.LastIndexOf("||") + 2);
            string cellContent = content.Substring(content.LastIndexOf("||") + 2);
            MatchCollection headerMatches = new Regex(@"\|\|([^|]*)(?=\|\|)", RegexOptions.Singleline).Matches(headerContent);
            MatchCollection cellMatches = new Regex(@"\|([^|]*)(?=\|)", RegexOptions.Singleline).Matches(cellContent);

            List<string> headers = new List<string>();
            foreach (Match match in headerMatches)
            {
                if (match.Groups.Count > 1)
                {
                    headers.Add(match.Groups[1].Value.Trim());
                }
            }

            List<List<string>> body = new List<List<string>>();
            List<string> newRow = new List<string>();
            foreach (Match match in cellMatches)
            {
                if (newRow.Count > 0 && newRow.Count % headers.Count == 0)
                {
                    body.Add(newRow);
                    newRow = new List<string>();
                }
                else
                {
                    newRow.Add(match.Groups[1].Value.Trim());
                }
            }
            body.Add(newRow);

            print(headers, body);
        }

        static void print(List<string> headers, List<List<string>> body)
        {
            var CELL_SIZE = 20;

            for (int i = 0; i < headers.Count; i++)
            {
                Console.Write(headers[i].Truncate(CELL_SIZE).PadRight(CELL_SIZE) + "  ");
            }
            Console.WriteLine("\n" + "".PadRight( (CELL_SIZE + 2) * headers.Count, '-'));

            for (int r = 0; r < body.Count; r++)
            {
                List<string> row = body[r];
                for (int c = 0; c < row.Count; c++)
                {
                    Console.Write(row[c].Truncate(CELL_SIZE).PadRight(CELL_SIZE) + "  ");
                }
                Console.WriteLine("");
            }

            Console.WriteLine("\n\n\n");
            Console.ReadKey(false);
        }
    }

    public static class StringExt
    {
        public static string Truncate(this string value, int maxLength)
        {
            if (string.IsNullOrEmpty(value) || value.Length <= maxLength) return value;
            return value.Substring(0, maxLength - 3) + "...";

        }
    }
}

Read the input string one character at a time and use a state machine to decide what should be done with each input character. This approach probably needs more code, but it will be easier to maintain and to extend than regular expressions.
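As a rough illustration of that idea, here is a minimal sketch of such a state machine. It assumes the single-line format from the question: header cells are delimited by "||", every body cell is closed by "|", rows are separated by "| |", and a row is complete once it has as many cells as the header. The names ParsedTable and WikiTableStateMachine are made up for this example, and \\ is turned into a newline inside the cell.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

// Hypothetical result type for this sketch.
public class ParsedTable
{
    public List<string> Header = new List<string>();
    public List<List<string>> Rows = new List<List<string>>();
}

public static class WikiTableStateMachine
{
    private enum State { Start, Header, Body }

    public static ParsedTable Parse(string input)
    {
        var table = new ParsedTable();
        var cell = new StringBuilder();   // text of the cell being read
        List<string> row = null;          // row being filled, once in Body
        var state = State.Start;

        for (int i = 0; i < input.Length; i++)
        {
            char c = input[i];
            char next = i + 1 < input.Length ? input[i + 1] : '\0';

            if (c == '\\' && next == '\\')
            {
                cell.Append('\n');        // "\\" is a line break inside a cell
                i++;
            }
            else if (c != '|')
            {
                cell.Append(c);           // ordinary character
            }
            else if (next == '|' && state != State.Body)
            {
                // "||" closes the previous header cell, if there was one.
                if (state == State.Header) table.Header.Add(Clean(cell));
                state = State.Header;
                cell.Clear();
                i++;
            }
            else if (state == State.Header)
            {
                // First single '|' after the header opens the first row.
                state = State.Body;
                row = new List<string>();
                cell.Clear();
            }
            else if (state == State.Body)
            {
                if (row.Count == table.Header.Count)
                {
                    // Row is full, so this '|' opens the next row.
                    table.Rows.Add(row);
                    row = new List<string>();
                }
                else
                {
                    row.Add(Clean(cell));
                }
                cell.Clear();
            }
            else
            {
                throw new FormatException("Input must start with a '||' header row.");
            }
        }

        // Keep the last row; a trailing partial row is kept as-is.
        if (row != null && row.Count > 0) table.Rows.Add(row);
        return table;
    }

    // Trims each line of a cell and drops lines that are empty.
    private static string Clean(StringBuilder cell)
    {
        var lines = cell.ToString().Split('\n')
            .Select(l => l.Trim())
            .Where(l => l.Length > 0);
        return string.Join("\n", lines);
    }
}
```

Calling WikiTableStateMachine.Parse on the direct input string from the question gives four header cells and three rows. Supporting another token (say, empty body cells, which this sketch does not handle) means adding a branch or a state rather than reworking a pattern.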

