
Extract text blocks between <p> tags separated by <br>

I want to parse all of the text blocks (TITLE CONTENT, BODY CONTENT, and EXTRA CONTENT) in the example below. As you may notice, these text blocks sit in different positions inside each "p" tag.

<p class="plans">
      <strong>
       TITLE CONTENT #1
      </strong>
      <br/>
      BODY CONTENT #1
      <br/>
      EXTRA CONTENT #1
</p>

<p class="plans">
      <strong>
       TITLE CONTENT #2
       <br/>
      </strong>
      BODY CONTENT #2
      <br/>
      EXTRA CONTENT #2
</p>

<p class="plans">
      <strong>
       TITLE CONTENT #3
      </strong>
      <br/>
      BODY CONTENT #3
      <br/>
      EXTRA CONTENT #3
</p>

I want to display the final result in a tabular format:

       Col1             Col2               Col3
TITLE CONTENT #1     BODY CONTENT #1     EXTRA CONTENT #1
TITLE CONTENT #2     BODY CONTENT #2     EXTRA CONTENT #2
TITLE CONTENT #3     BODY CONTENT #3     EXTRA CONTENT #3

I tried:

from bs4 import BeautifulSoup, NavigableString

# soup is built from the markup shown above
soup = BeautifulSoup(html, 'html.parser')

for i in soup.find_all('p'):
    title = i.find('strong')
    if not isinstance(title.next_sibling, NavigableString):
        # <strong> is followed directly by a tag (e.g. <br/>), so the
        # body text sits two siblings away
        body = title.next_sibling.next_sibling
        extra = body.next_sibling.next_sibling
    else:
        if len(title.next_sibling) > 3:
            # the sibling string is real text, not just whitespace
            body = title.next_sibling
            extra = body.next_sibling.next_sibling
        else:
            body = title.next_sibling.next_sibling.next_sibling
            extra = body.next_sibling.next_sibling

But this does not look very efficient. I was wondering if anyone has a better solution?
Any help would be greatly appreciated!

Thanks!

It is important to note that .next_sibling can work as well, but since you may need to gather multiple text nodes, you would have to employ some logic to know how many times to call it. In this example, I found it easier to just walk through the descendants, watching for a few important landmarks that signal I should do something different.
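
For contrast, here is a minimal sketch of what the .next_sibling route looks like (the markup is a trimmed-down stand-in for the question's HTML): you walk forward from a tag and do the text-node bookkeeping yourself, which is exactly what the descendants approach below avoids.

from bs4 import BeautifulSoup, NavigableString

html = '<p class="plans"><strong>TITLE</strong><br/>BODY<br/>EXTRA</p>'
soup = BeautifulSoup(html, 'html.parser')

# Walk forward from <strong> through its siblings, collecting text nodes;
# deciding which slot (body vs. extra) each one belongs to is left to you.
node = soup.find('strong')
texts = []
while node is not None:
    if isinstance(node, NavigableString) and node.strip():
        texts.append(node.strip())
    node = node.next_sibling

print(texts)  # ['BODY', 'EXTRA'] -- the title still needs separate handling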

You just need to break down the characteristics of the content you are scraping. In this simple case, we know:

  1. When we see a strong element, we want to capture the "title".
  2. When we see the first br element, we want to start capturing the "body".
  3. When we see the second br element, we want to start capturing the "extra content".

So we can:

  1. Target the plans class to get all of the plans.
  2. Then iterate through all of the descendant nodes of each plan.
  3. If we see a tag, check whether it matches one of the criteria above, and prepare to capture text nodes into the correct container.
  4. If we see a text node and a container is ready, store the text.
  5. Strip unnecessary leading and trailing whitespace and store the plan's data.

from bs4 import BeautifulSoup as bs
from bs4 import Tag, NavigableString

html = """
<p class="plans">
      <strong>
       TITLE CONTENT #1
      </strong>
      <br/>
      BODY CONTENT #1
      <br/>
      EXTRA CONTENT #1
</p>

<p class="plans">
      <strong>
       TITLE CONTENT #2
       <br/>
      </strong>
      BODY CONTENT #2
      <br/>
      EXTRA CONTENT #2
</p>

<p class="plans">
      <strong>
       TITLE CONTENT #3
      </strong>
      <br/>
      BODY CONTENT #3
      <br/>
      EXTRA CONTENT #3
</p>
"""

soup = bs(html, 'html.parser')

content = []

# Iterate through all the plans
for plans in soup.select('.plans'):
    # Lists that will hold the text nodes of interest
    title = []
    body = []
    extra = []

    current = None  # Reference to one of the above lists to store data
    br = 0  # Count number of br tags

    # Iterate through all the descendant nodes of a plan
    for node in plans.descendants:
        # See if the node is a Tag/Element
        if isinstance(node, Tag):
            if node.name == 'strong':
                # Strong tags/elements contain our title
                # So set the current container for text to the title list
                current = title
            elif node.name == 'br':
                # We've found a br Tag/Element
                br += 1
                if br == 1:
                    # If this is the first, we need to set the current
                    # container for text to the body list
                    current = body
                elif br == 2:
                    # If this is the second, we need to set the current
                    # container for text to the extra list
                    current = extra
        elif isinstance(node, NavigableString) and current is not None:
            # We've found a navigable string (not a tag/element), so let's
            # store the text node in the current list container.
            # NOTE: You may have to filter out things like HTML comments in a real world example.
            current.append(node)

    # Store the captured title, body, and extra text for the current plan.
    # For each list, join the text into one string and strip leading and trailing whitespace
    # from each entry in the row.
    content.append([''.join(entry).strip() for entry in (title, body, extra)])

print(content)

You can then print the data however you like, but it should now be captured in a nice, logical structure, as shown below:

[['TITLE CONTENT #1', 'BODY CONTENT #1', 'EXTRA CONTENT #1'], ['TITLE CONTENT #2', 'BODY CONTENT #2', 'EXTRA CONTENT #2'], ['TITLE CONTENT #3', 'BODY CONTENT #3', 'EXTRA CONTENT #3']]
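
If you want the tabular layout from the question, the captured content list can be printed with simple fixed-width formatting (the column width of 25 used here is an arbitrary choice):

# Render the captured rows as the table requested in the question
print('{: ^25}{: ^25}{: ^25}'.format('Col1', 'Col2', 'Col3'))
for row in content:
    print(''.join('{: <25}'.format(cell) for cell in row))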

There are multiple ways to do this; this is just one of them.

Another way, using slicing, assuming the structure of your list does not vary:

from bs4 import BeautifulSoup
soup = BeautifulSoup(open("test.html"), "html.parser")

def slicing(l):
    new_list = []
    for i in range(0, len(l), 3):
        new_list.append(l[i:i+3])
    return new_list

result = slicing(list(soup.stripped_strings))
print(result)

Which yields:

[['TITLE CONTENT #1', 'BODY CONTENT #1', 'EXTRA CONTENT #1'], ['TITLE CONTENT #2', 'BODY CONTENT #2', 'EXTRA CONTENT #2'], ['TITLE CONTENT #3', 'BODY CONTENT #3', 'EXTRA CONTENT #3']]
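
A slightly shorter way to do the same chunking is the common zip-on-one-iterator grouping idiom (this assumes, as above, that the number of stripped strings is an exact multiple of three):

from bs4 import BeautifulSoup

with open("test.html") as f:
    soup = BeautifulSoup(f, "html.parser")

strings = list(soup.stripped_strings)
# zip three references to the same iterator to form rows of three
result = [list(row) for row in zip(*[iter(strings)] * 3)]
print(result)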

In this case, you can use BeautifulSoup's get_text() method with the separator= parameter:

data = '''<p class="plans">
      <strong>
       TITLE CONTENT #1
      </strong>
      <br/>
      BODY CONTENT #1
      <br/>
      EXTRA CONTENT #1
</p>

<p class="plans">
      <strong>
       TITLE CONTENT #2
       <br/>
      </strong>
      BODY CONTENT #2
      <br/>
      EXTRA CONTENT #2
</p>

<p class="plans">
      <strong>
       TITLE CONTENT #3
      </strong>
      <br/>
      BODY CONTENT #3
      <br/>
      EXTRA CONTENT #3
</p>'''


from bs4 import BeautifulSoup

soup = BeautifulSoup(data, 'lxml')

print('{: ^25}{: ^25}{: ^25}'.format('Col1', 'Col2', 'Col3'))
rows = [[i.strip() for i in p.get_text(separator='|').split('|') if i.strip()]
        for p in soup.select('p.plans')]
for row in rows:
    print(''.join('{: ^25}'.format(i) for i in row))

This prints:

      Col1                     Col2                     Col3           
TITLE CONTENT #1          BODY CONTENT #1         EXTRA CONTENT #1     
TITLE CONTENT #2          BODY CONTENT #2         EXTRA CONTENT #2     
TITLE CONTENT #3          BODY CONTENT #3         EXTRA CONTENT #3     
