Split an HTML string into sections based on a specific tag?

Question

I have a string representing a snippet of HTML, like this:

const bookString = `<h1>Chapter 1: The Beginning</h1>
<p>It was a dark and stormy night...</p>
<p>Tom ran up the stairs...</p>
<p>A shot rang out!</p>

<h1>Chapter 2: A Day at the Zoo</h1>
<p>The door swung open...</p>`;

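You get the idea: it's a book in which I only expect to see h1, p, and em/strong/i/b tags. (This comes from the Mammoth library, which takes a Word document and gives me an HTML string.) I'd like to write some JS that splits it up by chapter, like so: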

const chapters = [
  {
    title: "The Beginning",
    content:
      `<p>It was a dark and stormy night...</p>
      <p>Tom ran up the stairs...</p>
      <p>A shot rang out!</p>`
  }
];

I can then pass this to an ebook generation library.

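Should I use an HTML parsing library such as Cheerio for this? I can't quite figure out how to select things, e.g. "for each h1, save a title, then for each p after that h1, push to an array...". Or should I use regular expressions, despite the usual advice not to use regex on HTML?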

Answer 1

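One approach is to use a series of split calls to break the string into pieces, do some cleanup, and then build a new Array by mapping over the initial "broken" strings and splitting each one again to get the (clean) title and the content.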

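var bookString = `<h1>Chapter 1: The Beginning</h1>
<p>It was a dark and stormy night...</p>
<p>Tom ran up the stairs...</p>
<p>A shot rang out!</p>
<h1>Chapter 2: A Day at the Zoo</h1>
<p>The door swung open...</p>`;

var chapters = bookString
  .split('<h1>')      // one piece per chapter
  .filter(n => n)     // drop the empty piece before the first <h1>
  .map(text => {
    // Remove newlines, drop the "Chapter N" prefix, then separate title from content.
    // Note: this assumes ': ' appears only in the heading, not in the chapter text.
    var cut = text.replace(/\n/g, '').split(': ')[1].split('</h1>');
    return { title: cut[0], content: cut[1] };
  });

console.log(chapters);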
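Since the markup here is flat and limited to a few known tags, the same idea can also be written as a single regular expression that captures each heading together with everything up to the next heading. This is only a sketch under that assumption. The function name splitChapters and the "Chapter N: " prefix pattern are illustrative choices, not part of the original answer, and it would not cope with nested or attribute-bearing h1 tags.

```javascript
// Sample input in the shape described in the question.
const bookString = `<h1>Chapter 1: The Beginning</h1>
<p>It was a dark and stormy night...</p>
<p>Tom ran up the stairs...</p>
<p>A shot rang out!</p>

<h1>Chapter 2: A Day at the Zoo</h1>
<p>The door swung open...</p>`;

// Hypothetical helper: assumes plain <h1> tags with no attributes.
function splitChapters(html) {
  const result = [];
  // Group 1: heading text. Group 2: everything up to the next <h1> or end of input.
  const re = /<h1>([\s\S]*?)<\/h1>([\s\S]*?)(?=<h1>|$)/g;
  let match;
  while ((match = re.exec(html)) !== null) {
    const heading = match[1].trim();
    // Assumed heading format "Chapter N: Title"; other headings are kept unchanged.
    const title = heading.replace(/^Chapter\s+\d+:\s*/, '');
    result.push({ title: title, content: match[2].trim() });
  }
  return result;
}

const chapters = splitChapters(bookString);
console.log(chapters);
```

Unlike splitting on ': ', this leaves a colon inside the chapter text untouched, because the prefix is only stripped from the heading itself.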

Answer 2

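If you want to use Cheerio, you can use the nextUntil() method to get all following siblings up to, but not including, the element matched by a selector.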

//get all elements until the next h1 is encountered
$('h1').nextUntil('h1')

You can then map() over the h1 collection to get each group of content and build your objects.

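const cheerio = require('cheerio');
const $ = cheerio.load(bookString); // bookString is the HTML string from the question

const chapters = $('h1').map((index, h1) => {
  // Serialize every sibling between this h1 and the next one
  const content = $(h1).nextUntil('h1').map((index, p) => $.html(p)).get().join('');
  return {
    title: $(h1).html(), // full heading text, e.g. "Chapter 1: The Beginning"
    content: content
  };
}).get();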
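To make the grouping logic explicit, here is a dependency-free sketch of what the h1 / nextUntil('h1') combination amounts to: walk the top-level elements in order, start a new chapter at each h1, and append everything else to the current chapter. The nodes array and the groupByHeading function are hypothetical stand-ins for the parsed document, not Cheerio APIs.

```javascript
// Hypothetical flat list of top-level elements, in document order.
const nodes = [
  { tag: 'h1', html: 'Chapter 1: The Beginning' },
  { tag: 'p', html: '<p>It was a dark and stormy night...</p>' },
  { tag: 'p', html: '<p>A shot rang out!</p>' },
  { tag: 'h1', html: 'Chapter 2: A Day at the Zoo' },
  { tag: 'p', html: '<p>The door swung open...</p>' },
];

// Illustrative helper: groups each run of non-h1 nodes under the preceding h1.
function groupByHeading(list) {
  const groups = [];
  for (const node of list) {
    if (node.tag === 'h1') {
      // A heading opens a new chapter.
      groups.push({ title: node.html, content: '' });
    } else if (groups.length > 0) {
      // Anything else belongs to the most recent chapter.
      groups[groups.length - 1].content += node.html;
    }
    // Nodes that appear before the first h1 are ignored.
  }
  return groups;
}

const grouped = groupByHeading(nodes);
console.log(grouped);
```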

Solution 1
2 2018-06-30 12:34:49

Solution 2
2 Accepted 2018-06-30 13:16:09