
Extract text blocks between <p> tags separated by <br>

I want to parse all of the text blocks (TITLE CONTENT, BODY CONTENT, and EXTRA CONTENT) in the example below. As you may notice, these text blocks sit in different positions inside each "p" tag.

<p class="plans">
      <strong>
       TITLE CONTENT #1
      </strong>
      <br/>
      BODY CONTENT #1
      <br/>
      EXTRA CONTENT #1
</p>

<p class="plans">
      <strong>
       TITLE CONTENT #2
       <br/>
      </strong>
      BODY CONTENT #2
      <br/>
      EXTRA CONTENT #2
</p>

<p class="plans">
      <strong>
       TITLE CONTENT #3
      </strong>
      <br/>
      BODY CONTENT #3
      <br/>
      EXTRA CONTENT #3
</p>

I want to display the final result in a tabular format:

       Col1             Col2               Col3
TITLE CONTENT #1     BODY CONTENT #1     EXTRA CONTENT #1
TITLE CONTENT #2     BODY CONTENT #2     EXTRA CONTENT #2
TITLE CONTENT #3     BODY CONTENT #3     EXTRA CONTENT #3

I tried:

from bs4 import BeautifulSoup, NavigableString

# soup is built from the markup shown above
soup = BeautifulSoup(html, 'html.parser')

for i in soup.find_all('p'):
    title = i.find('strong')
    if not isinstance(title.next_sibling, NavigableString):
        # <strong> is followed directly by a tag (e.g. <br/>), so the
        # body text sits two siblings away
        body = title.next_sibling.next_sibling
        extra = body.next_sibling.next_sibling
    else:
        if len(title.next_sibling) > 3:
            # the sibling string is real text, not just whitespace
            body = title.next_sibling
            extra = body.next_sibling.next_sibling
        else:
            body = title.next_sibling.next_sibling.next_sibling
            extra = body.next_sibling.next_sibling

But this does not look very efficient. I was wondering if anyone has a better solution?
Any help would be greatly appreciated!

Thanks!

It is important to note that .next_sibling can work as well, but since you may need to gather multiple text nodes, you would have to employ some logic to know how many times to call it. In this example, I found it easier to just walk through the descendants, watching for a few important landmarks that signal I should do something different.
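
For contrast, here is a minimal sketch of what the .next_sibling route looks like (the markup is a trimmed-down stand-in for the question's HTML): you walk forward from a tag and do the text-node bookkeeping yourself, which is exactly what the descendants approach below avoids.

from bs4 import BeautifulSoup, NavigableString

html = '<p class="plans"><strong>TITLE</strong><br/>BODY<br/>EXTRA</p>'
soup = BeautifulSoup(html, 'html.parser')

# Walk forward from <strong> through its siblings, collecting text nodes;
# deciding which slot (body vs. extra) each one belongs to is left to you.
node = soup.find('strong')
texts = []
while node is not None:
    if isinstance(node, NavigableString) and node.strip():
        texts.append(node.strip())
    node = node.next_sibling

print(texts)  # ['BODY', 'EXTRA'] -- the title still needs separate handling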

You just need to break down the characteristics of the content you are scraping. In this simple case, we know:

  1. When we see a strong element, we want to capture the "title".
  2. When we see the first br element, we want to start capturing the "body".
  3. When we see the second br element, we want to start capturing the "extra content".

So we can:

  1. Target the plans class to get all of the plans.
  2. Then iterate through all of the descendant nodes of each plan.
  3. If we see a tag, check whether it matches one of the criteria above, and prepare to capture text nodes into the correct container.
  4. If we see a text node and a container is ready, store the text.
  5. Strip unnecessary leading and trailing whitespace and store the plan's data.

from bs4 import BeautifulSoup as bs
from bs4 import Tag, NavigableString

html = """
<p class="plans">
      <strong>
       TITLE CONTENT #1
      </strong>
      <br/>
      BODY CONTENT #1
      <br/>
      EXTRA CONTENT #1
</p>

<p class="plans">
      <strong>
       TITLE CONTENT #2
       <br/>
      </strong>
      BODY CONTENT #2
      <br/>
      EXTRA CONTENT #2
</p>

<p class="plans">
      <strong>
       TITLE CONTENT #3
      </strong>
      <br/>
      BODY CONTENT #3
      <br/>
      EXTRA CONTENT #3
</p>
"""

soup = bs(html, 'html.parser')

content = []

# Iterate through all the plans
for plans in soup.select('.plans'):
    # Lists that will hold the text nodes of interest
    title = []
    body = []
    extra = []

    current = None  # Reference to one of the above lists to store data
    br = 0  # Count number of br tags

    # Iterate through all the descendant nodes of a plan
    for node in plans.descendants:
        # See if the node is a Tag/Element
        if isinstance(node, Tag):
            if node.name == 'strong':
                # Strong tags/elements contain our title
                # So set the current container for text to the title list
                current = title
            elif node.name == 'br':
                # We've found a br Tag/Element
                br += 1
                if br == 1:
                    # If this is the first, we need to set the current
                    # container for text to the body list
                    current = body
                elif br == 2:
                    # If this is the second, we need to set the current
                    # container for text to the extra list
                    current = extra
        elif isinstance(node, NavigableString) and current is not None:
            # We've found a navigable string (not a tag/element), so let's
            # store the text node in the current list container.
            # NOTE: You may have to filter out things like HTML comments in a real world example.
            current.append(node)

    # Store the captured title, body, and extra text for the current plan.
    # For each list, join the text into one string and strip leading and trailing whitespace
    # from each entry in the row.
    content.append([''.join(entry).strip() for entry in (title, body, extra)])

print(content)

You can then print the data however you like, but it should now be captured in a nice, logical structure, as shown below:

[['TITLE CONTENT #1', 'BODY CONTENT #1', 'EXTRA CONTENT #1'], ['TITLE CONTENT #2', 'BODY CONTENT #2', 'EXTRA CONTENT #2'], ['TITLE CONTENT #3', 'BODY CONTENT #3', 'EXTRA CONTENT #3']]
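
If you want the tabular layout from the question, the captured content list can be printed with simple fixed-width formatting (the column width of 25 used here is an arbitrary choice):

# Render the captured rows as the table requested in the question
print('{: ^25}{: ^25}{: ^25}'.format('Col1', 'Col2', 'Col3'))
for row in content:
    print(''.join('{: <25}'.format(cell) for cell in row))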

There are multiple ways to do this; this is just one of them.

Another way, using slicing, assuming the structure of your list does not vary:

from bs4 import BeautifulSoup
soup = BeautifulSoup(open("test.html"), "html.parser")

def slicing(l):
    new_list = []
    for i in range(0, len(l), 3):
        new_list.append(l[i:i+3])
    return new_list

result = slicing(list(soup.stripped_strings))
print(result)

Which yields:

[['TITLE CONTENT #1', 'BODY CONTENT #1', 'EXTRA CONTENT #1'], ['TITLE CONTENT #2', 'BODY CONTENT #2', 'EXTRA CONTENT #2'], ['TITLE CONTENT #3', 'BODY CONTENT #3', 'EXTRA CONTENT #3']]
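
A slightly shorter way to do the same chunking is the common zip-on-one-iterator grouping idiom (this assumes, as above, that the number of stripped strings is an exact multiple of three):

from bs4 import BeautifulSoup

with open("test.html") as f:
    soup = BeautifulSoup(f, "html.parser")

strings = list(soup.stripped_strings)
# zip three references to the same iterator to form rows of three
result = [list(row) for row in zip(*[iter(strings)] * 3)]
print(result)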

In this case, you can use BeautifulSoup's get_text() method with the separator= parameter:

data = '''<p class="plans">
      <strong>
       TITLE CONTENT #1
      </strong>
      <br/>
      BODY CONTENT #1
      <br/>
      EXTRA CONTENT #1
</p>

<p class="plans">
      <strong>
       TITLE CONTENT #2
       <br/>
      </strong>
      BODY CONTENT #2
      <br/>
      EXTRA CONTENT #2
</p>

<p class="plans">
      <strong>
       TITLE CONTENT #3
      </strong>
      <br/>
      BODY CONTENT #3
      <br/>
      EXTRA CONTENT #3
</p>'''


from bs4 import BeautifulSoup

soup = BeautifulSoup(data, 'lxml')

print('{: ^25}{: ^25}{: ^25}'.format('Col1', 'Col2', 'Col3'))
rows = [[i.strip() for i in p.get_text(separator='|').split('|') if i.strip()]
        for p in soup.select('p.plans')]
for row in rows:
    print(''.join('{: ^25}'.format(i) for i in row))

This prints:

      Col1                     Col2                     Col3           
TITLE CONTENT #1          BODY CONTENT #1         EXTRA CONTENT #1     
TITLE CONTENT #2          BODY CONTENT #2         EXTRA CONTENT #2     
TITLE CONTENT #3          BODY CONTENT #3         EXTRA CONTENT #3     
