
Parse HTML table to JSON in ASP.NET C#

I'm having a problem. I have a survey builder that supports different types of questions (lists, checkboxes, dropdowns...) and one type with a table to fill in. The problem is that storing the table currently takes 3 different tables in my DB, and rebuilding the table later is a bit complex. I'm trying to do it with a JavaScript editor so the user can create the table as in Word. In my code-behind I receive the table markup as a string, and I want to store it as JSON in the DB.

I have this code to convert the table to a DataSet object; from there I can convert it to XML, and from XML to JSON using the JSON.NET library. All of this works fine, but when the table has "colspan" and "rowspan" attributes, the code does not handle them. Can you help me finish this piece of code, so everyone can have it as an example of parsing an HTML table to JSON in C#?

protected void Page_Load(object sender, EventArgs e)
{
    string table = @"<table>
                <tbody><tr>
                    <th rowspan='4'>Project</th><th rowspan='4'>Country</th><th colspan='3' rowspan='1'>Header 1</th><th colspan='2' rowspan='1'>Header 2</th>
                </tr><tr>
                    <th colspan='1' rowspan='1'>Child Header 1</th><th colspan='2' rowspan='1'>Child Header 2</th><th colspan='1' rowspan='3'>Child Header 3</th><th colspan='1' rowspan='3'>Child Header 4</th>
                </tr><tr>
                    <th colspan='1' rowspan='2'>Child Child Header 1</th><th colspan='1' rowspan='1'>tee</th><th colspan='1' rowspan='2'>ssss</th>
                </tr><tr>
                    <th colspan='1' rowspan='1'>aas</th>
                </tr><tr>
                    <td>EUS</td><td>ES</td><td>&nbsp;</td><td>&nbsp;</td><td>&nbsp;</td><td>&nbsp;</td><td>&nbsp;</td>
                </tr><tr>
                    <td>ARP</td><td>IE</td><td>&nbsp;</td><td>&nbsp;</td><td>&nbsp;</td><td>&nbsp;</td><td>&nbsp;</td>
                </tr><tr>
                    <td>ARM</td><td>UK</td><td>&nbsp;</td><td>&nbsp;</td><td>&nbsp;</td><td>&nbsp;</td><td>&nbsp;</td>
                </tr><tr>
                    <td>SMRT</td><td>US</td><td>&nbsp;</td><td>&nbsp;</td><td>&nbsp;</td><td>&nbsp;</td><td>&nbsp;</td>
                </tr><tr>
                    <td>CM</td><td></td><td>&nbsp;</td><td>&nbsp;</td><td>&nbsp;</td><td>&nbsp;</td><td>&nbsp;</td>
                </tr>
            </tbody></table>";

    DataSet dataSet = HtmlTableParser.ParseDataSet(table);
    StringWriter sw = new StringWriter();
    dataSet.WriteXml(sw, XmlWriteMode.IgnoreSchema);
    XmlDocument xd = new XmlDocument();
    xd.LoadXml(sw.ToString());

    string jsonText = JsonConvert.SerializeXmlNode(xd).Replace("_x0020_", " ");
}

/// <summary>
/// HtmlTableParser parses the contents of an html string into a System.Data DataSet or DataTable.
/// </summary>
public class HtmlTableParser
{
    private const RegexOptions ExpressionOptions = RegexOptions.Singleline | RegexOptions.Multiline | RegexOptions.IgnoreCase;
    private const string CommentPattern = "<!--(.*?)-->";
    private const string TablePattern = "<table[^>]*>(.*?)</table>";
    private const string HeaderPattern = "<th[^>]*>(.*?)</th>";
    private const string RowPattern = "<tr[^>]*>(.*?)</tr>";
    private const string CellPattern = "<td[^>]*>(.*?)</td>";

    /// <summary>
    /// Given an HTML string containing n tables, parse them into a DataSet containing n DataTables.
    /// </summary>
    /// <param name="html">An HTML string containing n HTML tables</param>
    /// <returns>A DataSet containing a DataTable for each HTML table in the input HTML</returns>
    public static DataSet ParseDataSet(string html)
    {
        DataSet dataSet = new DataSet();
        MatchCollection tableMatches = Regex.Matches(
            WithoutComments(html),
            TablePattern,
            ExpressionOptions);

        foreach (Match tableMatch in tableMatches)
            dataSet.Tables.Add(ParseTable(tableMatch.Value));

        return dataSet;
    }

    /// <summary>
    /// Given an HTML string containing a single table, parse that table to form a DataTable.
    /// </summary>
    /// <param name="tableHtml">An HTML string containing a single HTML table</param>
    /// <returns>A DataTable which matches the input HTML table</returns>
    public static DataTable ParseTable(string tableHtml)
    {
        string tableHtmlWithoutComments = WithoutComments(tableHtml);

        DataTable dataTable = new DataTable();

        MatchCollection rowMatches = Regex.Matches(
            tableHtmlWithoutComments,
            RowPattern,
            ExpressionOptions);

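        // Use the comment-stripped markup so header cells inside HTML comments are not picked up as columns.
        dataTable.Columns.AddRange(tableHtmlWithoutComments.Contains("<th")
                                       ? ParseColumns(tableHtmlWithoutComments)
                                       : GenerateColumns(rowMatches));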

        ParseRows(rowMatches, dataTable);

        return dataTable;
    }

    /// <summary>
    /// Strip comments from an HTML string
    /// </summary>
    /// <param name="html">An HTML string potentially containing comments</param>
    /// <returns>The input HTML string with comments removed</returns>
    private static string WithoutComments(string html)
    {
        return Regex.Replace(html, CommentPattern, string.Empty, ExpressionOptions);
    }

    /// <summary>
    /// Add a row to the input DataTable for each row match in the input MatchCollection
    /// </summary>
    /// <param name="rowMatches">A collection of all the rows to add to the DataTable</param>
    /// <param name="dataTable">The DataTable to which we add rows</param>
    private static void ParseRows(MatchCollection rowMatches, DataTable dataTable)
    {
        foreach (Match rowMatch in rowMatches)
        {
            // if the row contains header tags don't use it - it is a header not a row
            if (!rowMatch.Value.Contains("<th"))
            {
                DataRow dataRow = dataTable.NewRow();

                MatchCollection cellMatches = Regex.Matches(
                    rowMatch.Value,
                    CellPattern,
                    ExpressionOptions);

                for (int columnIndex = 0; columnIndex < cellMatches.Count; columnIndex++)
                    dataRow[columnIndex] = cellMatches[columnIndex].Groups[1].ToString();

                dataTable.Rows.Add(dataRow);
            }
        }
    }

    /// <summary>
    /// Given a string containing an HTML table, parse the header cells to create a set of DataColumns
    /// which define the columns in a DataTable.
    /// </summary>
    /// <param name="tableHtml">An HTML string containing a single HTML table</param>
    /// <returns>A set of DataColumns based on the HTML table header cells</returns>
    private static DataColumn[] ParseColumns(string tableHtml)
    {
        MatchCollection headerMatches = Regex.Matches(
            tableHtml,
            HeaderPattern,
            ExpressionOptions);

        return (from Match headerMatch in headerMatches
                select new DataColumn(headerMatch.Groups[1].ToString())).ToArray();
    }

    /// <summary>
    /// For tables which do not specify header cells we must generate DataColumns based on the number
    /// of cells in a row (we assume all rows have the same number of cells).
    /// </summary>
    /// <param name="rowMatches">A collection of all the rows in the HTML table we wish to generate columns for</param>
    /// <returns>A set of DataColumns based on the number of cells in the first row of the input HTML table</returns>
    private static DataColumn[] GenerateColumns(MatchCollection rowMatches)
    {
        int columnCount = Regex.Matches(
            rowMatches[0].ToString(),
            CellPattern,
            ExpressionOptions).Count;

        return (from index in Enumerable.Range(0, columnCount)
                select new DataColumn("Column " + Convert.ToString(index))).ToArray();
    }
}

This smells like very bad things. Why are you using regular expressions to clean up HTML? The most famous answer ever on Stack Overflow pertains to this very thing. Do not do this.
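To make the concern concrete, here is a small sketch of one way the header regex from the question can go wrong. It reuses the `HeaderPattern` and options from `HtmlTableParser`; the class name `RegexPitfallSketch`, the method `FirstHeader`, and the `<thead>` sample markup are hypothetical and only there for illustration. Because `<th[^>]*>` also matches `<thead>`, the capture starts at the wrong tag.

```csharp
using System;
using System.Text.RegularExpressions;

public static class RegexPitfallSketch
{
    // Same header pattern and options as HtmlTableParser in the question.
    private const string HeaderPattern = "<th[^>]*>(.*?)</th>";
    private const RegexOptions Options =
        RegexOptions.Singleline | RegexOptions.Multiline | RegexOptions.IgnoreCase;

    // Hypothetical helper: returns the text captured for the first "header" match, or null.
    public static string FirstHeader(string html)
    {
        Match match = Regex.Match(html, HeaderPattern, Options);
        return match.Success ? match.Groups[1].Value : null;
    }

    public static void Main()
    {
        // "<thead>" begins with "<th", so the pattern treats it as a header cell
        // and captures markup instead of the header text "Name".
        string html = "<table><thead><tr><th>Name</th></tr></thead></table>";
        Console.WriteLine(FirstHeader(html));
    }
}
```

A table without `<thead>` would not hit this particular case, but an editor that lets users build tables freely gives no guarantee about the markup it emits.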

Your requirements for parsing this HTML into this DataTable do not really make sense to me. What are you doing with the HTML table later that requires you to parse it into JSON now? From your question, it sounds like this is a fairly straightforward survey editor.

If you really need to parse this data into objects and store each individual field from this table in the database, please tell us why. It's possible to do something like this, but I seriously urge you to reconsider parsing HTML.
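If the parsing really is needed, one possible direction is sketched below, with no regular expressions involved. It is not a finished solution. It assumes the editor emits well-formed XHTML (so `System.Xml.Linq` can load it once `&nbsp;` is swapped for a numeric entity), that tables are not nested, and that no spans overlap. `TableGridSketch`, `ExpandToGrid` and `ReadSpan` are hypothetical names. The idea is to copy each cell's text into every grid position its `rowspan`/`colspan` covers, so that every row ends up with the same number of entries.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class TableGridSketch
{
    // Hypothetical helper: expands rowspan/colspan into a rectangular grid of cell texts.
    public static List<List<string>> ExpandToGrid(string tableXhtml)
    {
        // Assumption: input is well-formed XHTML. &nbsp; is not a predefined XML entity.
        XElement table = XElement.Parse(tableXhtml.Replace("&nbsp;", "&#160;"));
        List<XElement> rows = table.Descendants()
            .Where(e => e.Name.LocalName == "tr").ToList();

        // One sparse row (column index -> text) per <tr>.
        var grid = rows.Select(_ => new Dictionary<int, string>()).ToList();

        for (int r = 0; r < rows.Count; r++)
        {
            int col = 0;
            foreach (XElement cell in rows[r].Elements()
                .Where(e => e.Name.LocalName == "th" || e.Name.LocalName == "td"))
            {
                // Skip positions already filled by a rowspan from an earlier row.
                while (grid[r].ContainsKey(col)) col++;

                int colspan = ReadSpan(cell, "colspan");
                int rowspan = ReadSpan(cell, "rowspan");
                string text = cell.Value.Trim();

                // Repeat the text in every position the cell spans.
                for (int dr = 0; dr < rowspan && r + dr < rows.Count; dr++)
                    for (int dc = 0; dc < colspan; dc++)
                        grid[r + dr][col + dc] = text;

                col += colspan;
            }
        }

        int width = grid.Where(g => g.Count > 0)
            .Select(g => g.Keys.Max() + 1).DefaultIfEmpty(0).Max();

        // Positions that no cell covers become empty strings.
        return grid.Select(g => Enumerable.Range(0, width)
            .Select(c => g.ContainsKey(c) ? g[c] : string.Empty).ToList()).ToList();
    }

    // Missing or invalid span attributes count as 1.
    private static int ReadSpan(XElement cell, string name)
    {
        int span;
        XAttribute attribute = cell.Attribute(name);
        return attribute != null && int.TryParse(attribute.Value, out span) && span > 0
            ? span : 1;
    }
}
```

The resulting list of lists could then be passed to JSON.NET's `JsonConvert.SerializeObject` to get a JSON array of rows. Repeating the text of a spanned cell is only one way to represent spans; storing the span sizes alongside each cell would preserve more of the original layout.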
