
How to parse HTML nodes

My website flow:

  1. An authenticated user uploads a docx.
  2. I use the OpenXmlPowerTools API to convert this docx to HTML.
  3. Save the file.
  4. Save each node of the HTML page into the database.

Database:

tblNodeCollection
  • NodeId
  • Node Type (expected values: <p>, <h1>, <h3>, <table>)
  • NodeContent (expected value: <p> This is p content </p>)

No issues up to step 3, but I have no idea how to save the node collection into the table.

I googled and found HtmlAgilityPack, but I don't know much about it.

using System.IO;
using System.Xml.Linq;
using DocumentFormat.OpenXml.Packaging;
using HtmlAgilityPack;
using OpenXmlPowerTools;

namespace ExportData
{
    public class ExportHandler
    {
        public void GenerateHTML()
        {
            // Copy the docx into a writable in-memory stream before opening it
            byte[] byteArray = File.ReadAllBytes(@"d:\test.docx");
            using (MemoryStream memoryStream = new MemoryStream())
            {
                memoryStream.Write(byteArray, 0, byteArray.Length);
                using (WordprocessingDocument doc =
                    WordprocessingDocument.Open(memoryStream, true))
                {
                    HtmlConverterSettings settings = new HtmlConverterSettings()
                    {
                        PageTitle = "My Page Title"
                    };
                    XElement html = HtmlConverter.ConvertToHtml(doc, settings);

                    File.WriteAllText(@"d:\Test.html", html.ToStringNewLineOnAttributes());
                }
            }

            //now how do I proceed from here
        }
    }
}

Any kind of help or guidance is highly appreciated.

From the discussion we've had in the comments, and the part you seem to be stuck on, I'd recommend the following:

This question here on SO may provide some help with how to convert to HTML.

Of course, you still face the issue of needing to split each page (as you mentioned in the comments); you may be able to export each page to HTML individually.

As for your database structure, I'd recommend something akin to:

[Document Table]
  - Document ID
  - Document Name
  - Any other data you need per-document

[Node Table]
  - Node ID
  - Document ID (foreign key)
  - Node Content (string)

Make sure you've got sensible indexes on the node table (particularly one on the document ID), as you could be seeking across thousands, if not millions, of rows as time goes on.

It might also be useful to have an index property against each node (e.g. a bigint position) so you can reconstitute a document by putting the nodes back together in order.
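
To illustrate the position idea, here is a minimal sketch of rebuilding a document from its rows. The names NodeRow, DocumentRebuilder and Position are hypothetical (they are not from the question), and the rows are assumed to be already loaded from the node table into memory.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical row shape: the suggested Node table plus a Position column.
public class NodeRow
{
    public long NodeId { get; }
    public int DocumentId { get; }
    public long Position { get; }
    public string NodeContent { get; }

    public NodeRow(long nodeId, int documentId, long position, string nodeContent)
    {
        NodeId = nodeId;
        DocumentId = documentId;
        Position = position;
        NodeContent = nodeContent;
    }
}

public static class DocumentRebuilder
{
    // Concatenates the content of one document's nodes in their stored order.
    public static string Rebuild(IEnumerable<NodeRow> rows, int documentId)
    {
        IEnumerable<string> ordered = rows
            .Where(r => r.DocumentId == documentId)
            .OrderBy(r => r.Position)
            .Select(r => r.NodeContent);
        return string.Concat(ordered);
    }
}
```

In practice the filtering and ordering would more likely happen in the SQL query itself (a WHERE on the document ID and an ORDER BY on the position), which is also where the index on the document ID pays off.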

Overall though, my advice would be to try and make your boss see reason and really push against this silly design decision.

Here is a simplified procedure for parsing HTML and saving it to a database. I hope this helps and/or gives you an idea of how to solve your problem.

        // requires: using HtmlAgilityPack; using System.Data.SqlClient;
        HtmlWeb h = new HtmlWeb();
        HtmlAgilityPack.HtmlDocument doc = h.Load("http://stackoverflow.com/questions/41183837/how-to-store-html-nodes-into-database");
        // SelectNodes returns null (not an empty collection) when nothing matches
        HtmlNodeCollection tableNodes = doc.DocumentNode.SelectNodes("//table");
        HtmlNodeCollection h1Nodes = doc.DocumentNode.SelectNodes("//h1");
        HtmlNodeCollection pNodes = doc.DocumentNode.SelectNodes("//p");
        //get other nodes here

        if (pNodes != null)
        {
            // open one connection for the whole batch instead of one per node
            using (SqlConnection conn = new SqlConnection("connection string goes here"))
            {
                conn.Open();

                foreach (var pNode in pNodes)
                {
                    string id = pNode.Id;
                    string content = pNode.InnerText; // use OuterHtml to keep the surrounding tags
                    string tag = pNode.Name;

                    //do other stuff here and then save to database

                    //just an example...
                    using (SqlCommand cmd = new SqlCommand(
                        "INSERT INTO tblNodeCollection (Tag, Id, Content) VALUES (@tag, @id, @content)", conn))
                    {
                        cmd.Parameters.AddWithValue("@tag", tag);
                        cmd.Parameters.AddWithValue("@id", id);
                        cmd.Parameters.AddWithValue("@content", content);

                        cmd.ExecuteNonQuery();
                    }
                }
            }
        }
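
Since the converter in the question already returns an XElement, another option is to walk that element directly with LINQ to XML instead of loading the HTML a second time. The following is only a sketch of that alternative, not the approach shown above: NodeExtractor and TopLevelNodes are hypothetical names, and it assumes the nodes of interest are direct children of the body element. If the converter's output nests the content inside wrapper elements, the starting element would need to be adjusted.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class NodeExtractor
{
    // Returns (tag, markup) pairs for the direct children of <body>, in document order.
    public static List<(string Tag, string Markup)> TopLevelNodes(XElement html)
    {
        // Match on the local name so the lookup works whether or not
        // the elements carry an XML namespace.
        XElement body = html.DescendantsAndSelf()
            .FirstOrDefault(e => e.Name.LocalName == "body");
        if (body == null)
            return new List<(string Tag, string Markup)>();

        return body.Elements()
            .Select(e => (Tag: e.Name.LocalName,
                          Markup: e.ToString(SaveOptions.DisableFormatting)))
            .ToList();
    }
}
```

Each pair maps onto the node type and node content columns from the question, and the list index can serve as the node's position. Note that when the elements are in a namespace, the serialized markup of each element will include an xmlns attribute.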
