How to get HTML elements between two independent tags

Question

I am using Puppeteer. I need to get the content between two tags that are not in a parent-child relationship.

<h1>neverchangeA</h1>
<span>abc</span>
<span>abc2</span>
<h1>neverchangeB</h1>

Expected elements

<span>abc</span>
<span>abc2</span>

Simply put, I need something like a regex along these lines:

regex.matchBetween(<h1>neverchangeA</h1>,<h1>neverchangeB</h1>)

Answer 1

The question "Getting the sibling of an elementHandle in Puppeteer" explains how to get the previous sibling of an element with Puppeteer. The next sibling can be obtained in a similar way. You can apply this to your situation by writing a loop that starts at the first <h1> element and repeatedly gets the next sibling until it reaches the second <h1> element.
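
The loop described above could look roughly like the sketch below. It is only an illustration under a few assumptions: the function name htmlBetweenHeadings is made up for this example, it relies on the standard DOM property nextElementSibling, and it expects both headings to share the same parent. The function is written to be self-contained because Puppeteer serializes it before running it in the page.

```javascript
// Runs inside the page: walk from the first <h1> to the second one
// via nextElementSibling and collect the HTML of everything in between.
// htmlBetweenHeadings is a hypothetical helper name, not a Puppeteer API.
function htmlBetweenHeadings(startText, endText) {
  const start = [...document.querySelectorAll('h1')]
    .find(el => el.textContent.trim() === startText);
  if (!start) return null; // first heading not found

  const collected = [];
  let node = start.nextElementSibling;
  // Stop at the closing heading, or when the siblings run out.
  while (node && !(node.tagName === 'H1' && node.textContent.trim() === endText)) {
    collected.push(node.outerHTML);
    node = node.nextElementSibling;
  }
  // node is null if the second heading was never reached.
  return node ? collected : null;
}

// Assumed usage with an existing Puppeteer `page` object:
// const html = await page.evaluate(htmlBetweenHeadings, 'neverchangeA', 'neverchangeB');
```

Returning null when either heading is missing makes a changed page layout easy to tell apart from an empty result.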

Answer 2

You could do this with JS and the evaluate method.

https://github.com/puppeteer/puppeteer/blob/master/docs/api.md#pageevaluatepagefunction-args

This example returns the HTML of the desired elements as a string.

const result = await page.evaluate(() => {
  // Locate the first heading by its text.
  const h1s = [...document.querySelectorAll('h1')]
  const neverChangeA = h1s.find(elem => elem.innerText === 'neverchangeA')
  if (!neverChangeA) {
    return null
  }
  // Find both headings among the children of their shared parent.
  const siblings = [...neverChangeA.parentNode.children]
  const indexOfFirstH1 = siblings.indexOf(neverChangeA)
  const indexOfSecondH1 = siblings.findIndex(elem => elem.innerText === 'neverchangeB')
  if (indexOfSecondH1 === -1) {
    // Without this check, slice(start, -1) would return the wrong elements.
    return null
  }
  // Everything strictly between the two headings, joined into one HTML string.
  const betweenElems = siblings.slice(indexOfFirstH1 + 1, indexOfSecondH1)
  return betweenElems.map(elem => elem.outerHTML).join('')
})
console.log(result)

Answer 3

Solution using XPath

This is a good use case for XPath. The following query looks for span elements that are preceded by an h1 with the content neverchangeA and followed by an h1 with the content neverchangeB:

//span[preceding::h1="neverchangeA" and following::h1="neverchangeB"]

To use an XPath expression within Puppeteer, use page.$x.

Code Sample

const spans = await page.$x('//span[preceding::h1="neverchangeA" and following::h1="neverchangeB"]');
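
The call above yields element handles rather than HTML. As a rough sketch, the handles could be turned into strings like this. The helper name handlesToHtml is an assumption made for this example, and it relies on each handle exposing an evaluate method, as Puppeteer's ElementHandle does. Newer Puppeteer releases appear to have replaced page.$x with an xpath/ selector prefix, so check the documentation for the version you are running.

```javascript
// Convert element handles (e.g. the array returned by page.$x)
// into their outerHTML strings. handlesToHtml is a hypothetical helper.
async function handlesToHtml(handles) {
  return Promise.all(
    handles.map(handle => handle.evaluate(el => el.outerHTML))
  );
}

// Assumed usage with an existing Puppeteer `page` object:
// const spans = await page.$x('//span[preceding::h1="neverchangeA" and following::h1="neverchangeB"]');
// console.log(await handlesToHtml(spans));
```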

Answer 4

You could use a regex. The pattern <h1>.*</h1> matches an h1 tag and whatever is inside it. One way is to remove everything this pattern matches from the text, which leaves the content you need.
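
A regex-based approach could be sketched as follows. This is a variation on the idea above rather than the exact method described: instead of deleting the headings, it captures the text between them, which is closer to the matchBetween call in the question. The helper escapeRegExp is hypothetical, and the sketch assumes the headings are plain <h1>text</h1> tags with no attributes. Regexes over HTML are brittle, so treat this as a fallback.

```javascript
// Escape characters that have a special meaning in a regular expression.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Return the raw HTML between <h1>startText</h1> and <h1>endText</h1>,
// or null when the pair is not found. Assumes attribute-free <h1> tags.
function matchBetween(html, startText, endText) {
  const pattern = new RegExp(
    `<h1>${escapeRegExp(startText)}</h1>([\\s\\S]*?)<h1>${escapeRegExp(endText)}</h1>`
  );
  const match = html.match(pattern);
  return match ? match[1].trim() : null;
}

// Assumed usage with an existing Puppeteer `page` object:
// const between = matchBetween(await page.content(), 'neverchangeA', 'neverchangeB');
```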

Answer 1
2 2020-05-08 15:43:53

Answer 2
1 Accepted 2020-05-08 15:56:30

Answer 3
1 2020-05-08 16:19:49

Answer 4
0 2020-05-08 15:39:39
