
In C#, what is the best way to parse this WIKI markup?

I need to take data that I am reading in from a WIKI markup page and store it as a table structure. I am trying to figure out how to properly parse the markup syntax below into some table data structure in C#.

Here is an example table:

 || Owner || Action || Status || Comments ||
 | Bill | Fix the lobby | In Progress | This is easy |
 | Joe | Fix the bathroom | In Progress | Plumbing \\
 \\
  Electric \\
 \\
 Painting \\
 \\
 \\ | 
 | Scott | Fix the roof | Complete | This is expensive |

and here is how it comes in directly:

|| Owner|| Action || Status || Comments || | Bill\\ | fix the lobby |In Progress | This is eary| | Joe\\ |fix the bathroom\\ | In progress| plumbing  \\Electric \\Painting \\ \\ | | Scott \\ | fix the roof \\ | Complete | this is expensive| 

So as you can see:

  • The column headers have "||" as the separator
  • The columns of a row have "|" as the separator
  • A row might span multiple lines (as in the second data row example above), so I would have to keep reading until I hit the same number of "|" (cols) that I have in the header row.

I tried reading in line by line and then concatenating lines that had "\\" in between them, but that seemed a bit hacky.

I also tried to simply read in the input as a full string, parse by "||" first, and then keep reading until I hit the same number of "|" before going to the next row. This seemed to work, but it feels like there might be a more elegant way using regular expressions or something similar.

Can anyone suggest the correct way to parse this data?

I have largely replaced the previous answer, due to the fact that the format of the input after your edit is substantially different from the one posted before. This leads to a somewhat different solution.

Because there are no longer any line breaks after a row, the only way to determine for sure where a row ends is to require that each row has the same number of columns as the table header. That is at least the case if you don't want to rely on some potentially fragile whitespace convention present in the one and only provided example string (i.e. that the row separator is the only | not preceded by a space). Your question, at least, does not give this as the specification for a row delimiter.

The "parser" below provides at least the error handling and validity checks that can be derived from your format specification and example string, and it also allows for tables that have no rows. The comments explain what it is doing in basic steps.

using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

public class TableParser
{
    const StringSplitOptions SplitOpts = StringSplitOptions.None;
    const string RowColSep = "|";
    static readonly string[] HeaderColSplit = { "||" };
    static readonly string[] RowColSplit = { RowColSep };
    static readonly string[] MLColSplit = { @"\\" };

    public class TableRow
    {
        public List<string[]> Cells;
    }

    public class Table
    {
        public string[] Header;
        public TableRow[] Rows;
    }

    public static Table Parse(string text)
    {
        // Isolate the header columns and rows remainder.
        var headerSplit = text.Split(HeaderColSplit, SplitOpts);
        Ensure(headerSplit.Length > 1, "At least 1 header column is required in the input");

        // Need to check whether there are any rows.
        var hasRows = headerSplit.Last().IndexOf(RowColSep) >= 0;
        var header = headerSplit.Skip(1)
            .Take(headerSplit.Length - (hasRows ? 2 : 1))
            .Select(c => c.Trim())
            .ToArray();

        if (!hasRows) // If no rows for this table, we are done.
            return new Table() { Header = header, Rows = new TableRow[0] };

        // Get all row columns from the remainder.
        var rowsCols = headerSplit.Last().Split(RowColSplit, SplitOpts);

        // Require same amount of columns for a row as the header.
        Ensure((rowsCols.Length % (header.Length + 1)) == 1,
            "The number of row columns does not match the number of header columns");
        var rows = new TableRow[(rowsCols.Length - 1) / (header.Length + 1)];

        // Fill rows by sequentially taking # header column cells 
        for (int ri = 0, start = 1; ri < rows.Length; ri++, start += header.Length + 1)
        {
            rows[ri] = new TableRow() { 
                Cells = rowsCols.Skip(start).Take(header.Length)
                    .Select(c => c.Split(MLColSplit, SplitOpts).Select(p => p.Trim()).ToArray())
                    .ToList()
            };
        }

        return new Table { Header = header, Rows = rows };
    }

    private static void Ensure(bool check, string errorMsg)
    {
        if (!check)
            throw new InvalidDataException(errorMsg);
    }
}

When used like this:

public static void Main(params string[] args)
{
        var wikiLine = @"|| Owner|| Action || Status || Comments || | Bill\\ | fix the lobby |In Progress | This is eary| | Joe\\ |fix the bathroom\\ | In progress| plumbing  \\Electric \\Painting \\ \\ | | Scott \\ | fix the roof \\ | Complete | this is expensive|";
        var table = TableParser.Parse(wikiLine);

        Console.WriteLine(string.Join(", ", table.Header));
        foreach (var r in table.Rows)
            Console.WriteLine(string.Join(", ", r.Cells.Select(c => string.Join(Environment.NewLine + "\t# ", c))));
}

In the output this produces, "\t# " marks a newline caused by the presence of \\ in the input.

Here's a solution which populates a DataTable. It does require a little bit of data massaging (Trim), but the main parsing is Splits and Linq.

var str = @"|| Owner|| Action || Status || Comments || | Bill\\ | fix the lobby |In Progress | This is eary| | Joe\\ |fix the bathroom\\ | In progress| plumbing  \\Electric \\Painting \\ \\ | | Scott \\ | fix the roof \\ | Complete | this is expensive|";

var headerStop = str.LastIndexOf("||");
var headers = str.Substring(0, headerStop).Split(new string[1] { "||" }, StringSplitOptions.None).Skip(1).ToList();
var records = str.Substring(headerStop + 4).TrimEnd(new char[2] { ' ', '|' }).Split(new string[1] { "| |" }, StringSplitOptions.None).ToList();

var tbl = new DataTable();
headers.ForEach(h => tbl.Columns.Add(h.Trim()));
records.ForEach(r =>  tbl.Rows.Add(r.Split('|')));
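The cells added above still carry their surrounding whitespace and the \\ line-break markers. As a minimal sketch of the "data massaging" mentioned, assuming you want the line breaks collapsed into single spaces, something like the following could be applied to each cell before it is added. The names CellCleanup, CleanCell and AddCleanRow are hypothetical and not part of the answer.

```csharp
using System;
using System.Data;
using System.Linq;

// Hypothetical helper, not part of the original answer.
public static class CellCleanup
{
    // Splits a cell on the "\\" line-break marker, trims each part,
    // drops empty parts and joins the rest with a single space.
    public static string CleanCell(string cell) =>
        string.Join(" ", cell.Split(new[] { @"\\" }, StringSplitOptions.None)
                             .Select(p => p.Trim())
                             .Where(p => p.Length > 0));

    // Splits one record on '|' and adds the cleaned cells as a row.
    // Assumes the table already has at least as many columns as the record has cells.
    public static void AddCleanRow(DataTable tbl, string record) =>
        tbl.Rows.Add(record.Split('|').Select(c => (object)CleanCell(c)).ToArray());
}
```

With this in place, the last statement of the snippet could become records.ForEach(r => CellCleanup.AddCleanRow(tbl, r)); whether collapsing line breaks into spaces is acceptable depends on what you do with the comments column.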

The approach below makes some assumptions but seems to work for your sample data. I'm sure that if I worked at it I could combine the expressions and clean it up, but you'll get the idea. It also allows for rows that do not have the same number of cells as the header, which I think is something Confluence can do.

List<List<string>> table = new List<List<string>>();


var match = Regex.Match(raw, @"(?:(?:\|\|([^|]*))*\n)?");
if (match.Success)
{
    var headersWithExtra = match.Groups[1].Captures.Cast<Capture>().Select(c=>c.Value);
    List<String> headerRow = headersWithExtra.Take(headersWithExtra.Count()-1).ToList();
    if (headerRow.Count > 0)
    {
        table.Add(headerRow);
    }
}

match = Regex.Match(raw + "\r\n", @"[^\n]*\n" + @"(?:\|([^|]*))*");
var cellsWithExtra = match.Groups[1].Captures.Cast<Capture>().Select(c=>c.Value);

List<string> row = new List<string>();
foreach (string cell in cellsWithExtra)
{
    if (cell.Trim(' ', '\t') == "\r\n")
    {
        if (!table.Contains(row) && row.Count > 0)
        {
            table.Add(row);
        }
        row = new List<string>();
    }
    else
    {

        row.Add(cell);
    }
}

This ended up very similar to Jon Tirjan's answer, although it cuts the LINQ to a single statement (the code to replace that last one was horrifically ugly) and is a bit more extensible. For example, it will replace the Confluence line breaks \\ with a string of your choosing, you can choose to trim or not trim whitespace from around elements, etc.

private void ParseWikiTable(string input, string newLineReplacement = " ")
{
    string separatorHeader = "||";
    string separatorRow = "| |";
    string separatorElement = "|";

    input = Regex.Replace(input, @"[ \\]{2,}", newLineReplacement);

    string inputHeader = input.Substring(0, input.LastIndexOf(separatorHeader));
    string inputContent = input.Substring(input.LastIndexOf(separatorHeader) + separatorHeader.Length);

    string[] headerArray = SimpleSplit(inputHeader, separatorHeader);
    string[][] rowArray = SimpleSplit(inputContent, separatorRow).Select(r => SimpleSplit(r, separatorElement)).ToArray();

    // do something with output data
    TestPrint(headerArray);
    foreach (var r in rowArray) { TestPrint(r); }
}

private string[] SimpleSplit(string input, string separator, bool trimWhitespace = true)
{
    input = input.Trim();
    if (input.StartsWith(separator)) { input = input.Substring(separator.Length); }
    if (input.EndsWith(separator)) { input = input.Substring(0, input.Length - separator.Length); }

    string[] segments = input.Split(new string[] { separator }, StringSplitOptions.None);
    if (trimWhitespace)
    {
        for (int i = 0; i < segments.Length; i++)
        {
            segments[i] = segments[i].Trim();
        }
    }

    return segments;
}

private void TestPrint(string[] lst)
{
    string joined = "[" + String.Join("::", lst) + "]";
    Console.WriteLine(joined);
}

Console output from your direct input string:

[Owner::Action::Status::Comments]

[Bill::fix the lobby::In Progress::This is eary]

[Joe::fix the bathroom::In progress::plumbing Electric Painting]

[Scott::fix the roof::Complete::this is expensive]

A generic regex solution that populates a DataTable and is a little flexible with the syntax.

        var text = @"|| Owner|| Action || Status || Comments || | Bill\\ | fix the lobby |In Progress | This is eary| | Joe\\ |fix the bathroom\\ | In progress| plumbing  \\Electric \\Painting \\ \\ | | Scott \\ | fix the roof \\ | Complete | this is expensive|";

        // Get Headers
        var regHeaders = new Regex(@"\|\|\s*(\w[^\|]+)", RegexOptions.Compiled);
        var headers = regHeaders.Matches(text);

        //Get Rows, based on number of headers columns
        var regLinhas = new Regex(String.Format(@"(?:\|\s*(\w[^\|]+)){{{0}}}", headers.Count));
        var rows = regLinhas.Matches(text);

        var tbl = new DataTable();

        foreach (Match header in headers)
        {
            tbl.Columns.Add(header.Groups[1].Value);
        }

        foreach (Match row in rows)
        {
            tbl.Rows.Add(row.Groups[1].Captures.OfType<Capture>().Select(col => col.Value).ToArray());
        }

Here's a solution involving regular expressions. It takes a single string as input and returns a List<string> of headers and a List<List<string>> of rows/columns. It also trims white space, which may or may not be the desired behavior, so be aware of that. It even prints things nicely :)


using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

namespace parseWiki
{
    class Program
    {
        static void Main(string[] args)
        {
            string content = @"|| Owner || Action || Status || Comments || | Bill\\ | fix the lobby |In Progress | This is eary| | Joe\\ |fix the bathroom\\ | In progress| plumbing  \\Electric \\Painting \\ \\ | | Scott \\ | fix the roof \\ | Complete | this is expensive|";
            content = content.Replace(@"\\", "");
            string headerContent = content.Substring(0, content.LastIndexOf("||") + 2);
            string cellContent = content.Substring(content.LastIndexOf("||") + 2);
            MatchCollection headerMatches = new Regex(@"\|\|([^|]*)(?=\|\|)", RegexOptions.Singleline).Matches(headerContent);
            MatchCollection cellMatches = new Regex(@"\|([^|]*)(?=\|)", RegexOptions.Singleline).Matches(cellContent);

            List<string> headers = new List<string>();
            foreach (Match match in headerMatches)
            {
                if (match.Groups.Count > 1)
                {
                    headers.Add(match.Groups[1].Value.Trim());
                }
            }

            List<List<string>> body = new List<List<string>>();
            List<string> newRow = new List<string>();
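            foreach (Match match in cellMatches)
            {
                // Once a row is full, the next match is the empty cell between
                // the "| |" row separators: skip it and start a new row.
                if (newRow.Count > 0 && newRow.Count % headers.Count == 0)
                {
                    body.Add(newRow);
                    newRow = new List<string>();
                }
                else
                {
                    newRow.Add(match.Groups[1].Value.Trim());
                }
            }
            // Add the final row, which has no separator after it.
            body.Add(newRow);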

            print(headers, body);
        }

        static void print(List<string> headers, List<List<string>> body)
        {
            var CELL_SIZE = 20;

            for (int i = 0; i < headers.Count; i++)
            {
                Console.Write(headers[i].Truncate(CELL_SIZE).PadRight(CELL_SIZE) + "  ");
            }
            Console.WriteLine("\n" + "".PadRight( (CELL_SIZE + 2) * headers.Count, '-'));

            for (int r = 0; r < body.Count; r++)
            {
                List<string> row = body[r];
                for (int c = 0; c < row.Count; c++)
                {
                    Console.Write(row[c].Truncate(CELL_SIZE).PadRight(CELL_SIZE) + "  ");
                }
                Console.WriteLine("");
            }

            Console.WriteLine("\n\n\n");
            Console.ReadKey(false);
        }
    }

    public static class StringExt
    {
        public static string Truncate(this string value, int maxLength)
        {
            if (string.IsNullOrEmpty(value) || value.Length <= maxLength) return value;
            return value.Substring(0, maxLength - 3) + "...";

        }
    }
}

Read the input string one character at a time and use a state-machine to decide what should be done with each input character. This approach probably needs more code, but it will be easier to maintain and to extend than regular expressions.
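A minimal sketch of what such a state machine might look like, assuming the single-line input format from the question, "||" between header cells, "|" between row cells, and a row that ends once it has as many cells as the header. The names WikiTableScanner, State and Clean are hypothetical, and turning each \\ into a newline inside a cell is an assumption rather than a requirement stated in the question.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text;

// Hypothetical sketch: none of these names come from the answers above.
public static class WikiTableScanner
{
    private enum State { Start, HeaderCell, RowCell, BetweenRows }

    public static (List<string> Header, List<List<string>> Rows) Parse(string text)
    {
        var header = new List<string>();
        var rows = new List<List<string>>();
        var row = new List<string>();
        var cell = new StringBuilder();
        var state = State.Start;

        for (int i = 0; i < text.Length; i++)
        {
            char ch = text[i];
            bool hasNext = i + 1 < text.Length;
            bool pipe = ch == '|';
            bool doublePipe = pipe && hasNext && text[i + 1] == '|';
            bool lineBreak = ch == '\\' && hasNext && text[i + 1] == '\\';

            switch (state)
            {
                case State.Start:
                    // Ignore everything until the first "||" opens the header.
                    if (doublePipe) { i++; state = State.HeaderCell; }
                    break;

                case State.HeaderCell:
                    // "||" closes a header cell; a single "|" opens the first row.
                    if (doublePipe) { header.Add(Clean(cell)); cell.Clear(); i++; }
                    else if (pipe) { cell.Clear(); state = State.RowCell; }
                    else cell.Append(ch);
                    break;

                case State.RowCell:
                    if (pipe)
                    {
                        row.Add(Clean(cell));
                        cell.Clear();
                        // A row is complete once it has one cell per header column.
                        if (row.Count == header.Count)
                        {
                            rows.Add(row);
                            row = new List<string>();
                            state = State.BetweenRows;
                        }
                    }
                    else if (lineBreak) { cell.Append('\n'); i++; }
                    else cell.Append(ch);
                    break;

                case State.BetweenRows:
                    // Skip whitespace until the "|" that opens the next row.
                    if (pipe) state = State.RowCell;
                    break;
            }
        }

        if (row.Count > 0)
            throw new InvalidDataException("The last row has fewer cells than the header.");
        return (header, rows);
    }

    // Trims every line of a cell and drops the empty ones.
    private static string Clean(StringBuilder cell) =>
        string.Join("\n", cell.ToString().Split('\n')
                              .Select(p => p.Trim())
                              .Where(p => p.Length > 0));
}
```

Calling WikiTableScanner.Parse on the direct input string from the question should give four header names and three rows, with the comments cell of the second row holding three lines. Adding support for more markup would mean adding states or transitions rather than reworking a pattern, which is the maintainability argument made above.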
