
Turn an HTML string into an organized object

Lang: Node.js

I'm using a text editor and I get an output string like this:

<p>This is <strong>a <a href="#">test</a></strong></p>

It could contain other HTML tags such as H1, H2, etc., but nothing more exotic than ordinary HTML text tags.

Now I want to turn that string into an object that I can work with and send to my database. Ideally, it would be transformed into something like this:

[{type: "text", text: "This is ", bold: false}, {type: "text", text: "a ", bold: true}, {type: "link", text: "test", bold: true, href: "#"}]

and so on.

I tried the regex approach: splitting the string and applying all sorts of logic to turn it into a structured object. But that can't be the best way to do it, since it would fail if, for example, I later wrote <h1>Test</h1> in the middle of the text.

How would you approach this?

If you want to keep it simple, jsdom, or htmlparser2 together with domhandler, will help with that. For example, using htmlparser2 and domhandler (adapted from some of my apps):

// Parsers helpers
import { Parser } from 'htmlparser2';
import { DomHandler } from 'domhandler';

// Get all text contents, recursively
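const getAllText = (node) => {
  // Nodes without children (e.g. comments) contribute no text
  return (node.children ?? []).map( n => {
    if (n.type === 'text') {
      // trim() takes no arguments; it strips all leading/trailing whitespace
      return n.data.trim();
    }

    // Discard `small` tags
    if (n.name === 'small') {
      return ''
    }

    return getAllText(n);
  }).join('')
}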

// Parses HTML data containing a UL/LI/A tree
const parseMenu = (data) => {

  const parseLink = (link) => {
    const name = getAllText(link);
    const code = link.attribs['data-value']?.trim();
    return {
      name,
      ...(code ? {code} : {}),
    }
  }

  const parseLi = (li) => {
    const ul = li.children.find(({type, name}) => type === 'tag' && name === 'ul' );
    const link = li.children.find(({type, name}) => type === 'tag' && name === 'a' );
    return {
      ...(link ? parseLink(link) : {}),
      ...(ul ? {children:  parseUl(ul)} : {}),
    }
  }

  const parseUl = (ul) => {
    return ul.children.filter(({type, name}) => type === 'tag' && name === 'li' ).map( child => {
      return parseLi(child);
    });
  }

  let result;
  const handler = new DomHandler( (error, dom) => {
    if (error) {
      // Handle error
    } else {
      // Parsing completed, do something
      result = parseUl(dom[0]);
    }
  });

  const parser = new Parser(handler);
  parser.write(data);
  parser.end();
  return result;
}
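
The example above targets a UL/LI/A menu, so here is a minimal sketch of how the same recursive idea might be applied to the string in the question. It does not call the parser. Instead it walks a hand-written tree in the shape domhandler nodes have in the code above (`type`, `name`, `attribs`, `children`, and `data` for text). The names `flattenNodes`, `BOLD_TAGS` and `sampleDom` are hypothetical, and treating only `strong` and `b` as bold is an assumption.

```javascript
// Tags assumed to mean "bold" (an assumption; extend as needed).
const BOLD_TAGS = new Set(['strong', 'b']);

// Recursively flatten domhandler-style nodes into text/link runs.
// `ctx` carries the formatting inherited from ancestor tags.
const flattenNodes = (nodes, ctx = { bold: false, href: null }) =>
  nodes.flatMap((node) => {
    if (node.type === 'text') {
      if (node.data === '') return [];
      return ctx.href !== null
        ? [{ type: 'link', text: node.data, bold: ctx.bold, href: ctx.href }]
        : [{ type: 'text', text: node.data, bold: ctx.bold }];
    }

    // Skip anything that is not an element (comments, directives, ...)
    if (node.type !== 'tag') return [];

    // Unknown tags (p, h1, ...) simply pass the inherited context down
    const next = {
      bold: ctx.bold || BOLD_TAGS.has(node.name),
      href: node.name === 'a' ? (node.attribs?.href ?? '') : ctx.href,
    };
    return flattenNodes(node.children ?? [], next);
  });

// Hand-written tree for:
// <p>This is <strong>a <a href="#">test</a></strong></p>
const sampleDom = [
  { type: 'tag', name: 'p', attribs: {}, children: [
    { type: 'text', data: 'This is ' },
    { type: 'tag', name: 'strong', attribs: {}, children: [
      { type: 'text', data: 'a ' },
      { type: 'tag', name: 'a', attribs: { href: '#' }, children: [
        { type: 'text', data: 'test' },
      ] },
    ] },
  ] },
];

const runs = flattenNodes(sampleDom);
console.log(JSON.stringify(runs, null, 2));
```

In a real setup, the `dom` array received in the DomHandler callback could presumably be passed to `flattenNodes` in place of `sampleDom`. Because formatting is tracked as inherited context rather than matched with patterns, a tag such as `<h1>` in the middle of the text does not break the walk. Recording the heading level would need an extra field in `ctx`.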

Use the cheerio library (or any other HTML parser library of your choice) and work with the "DOM Node" objects as you wish.
