Extract text blocks between tags separated by <br> without a tag title

I have a web page containing a series of tags with a specific class. The tags I'm interested in look like this:

<span class="my-span-class">
  "Text of interest before break"
  <br>
  "Text of interest after break"    
</span>

These elements have no title; they are just tags filled with text, each broken up by a single <br> tag. I want my end result to have "Text of interest before break" in a separate list from "Text of interest after break", like this:

my_list_1 [Text of interest before break #1, Text of interest before break #2, Text of interest before break #3, etc...]

my_list_2 [Text of interest after break #1, Text of interest after break #2, Text of interest after break #3, etc...]

However, I'm struggling to get from what's below to having two separate lists. This currently outputs the two strings together, like so: "Text of interest before breakText of interest after break"

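from bs4 import BeautifulSoup
import urllib.request

# "html.html" is a placeholder; urlopen() needs a full URL (http://... or file:///...)
f = urllib.request.urlopen("html.html")

# name the parser explicitly so bs4 does not have to guess one
soup = BeautifulSoup(f, "html.parser")

# get the tag type that looks like the element shown above
myText = soup.find_all("span", class_="my-span-class")

results = []

for i in myText:
    # .text joins every string inside the tag into a single string
    results.append(i.text.strip())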

I want to have a separate list initialized (i.e. results_2 = []) and have "Text of interest after break" stored there, with the first results list reserved only for "Text of interest before break".

You can use itertools.groupby to group nodes before and after <br>.

I've gone ahead and made it a bit more robust by also handling non-text elements (nested tags) before and after <br>.

from bs4 import BeautifulSoup
import itertools

soup = BeautifulSoup('''
<span class="my-span-class">
  before break 1
  <span>before break 1.1</span>
  <br>
  after break 1
</span>

<span class="my-span-class">
  before break 2
  <br>
  after break 2
  <span>after break 2.1</span>
</span>

''', 'html.parser')


befores, afters = [], []
for it in soup.select('.my-span-class'):
    # this will give you three groups
    groups = [list(g) for _, g in itertools.groupby(it.children, lambda c: c.name != 'br')]
    # we just need items before br and after br
    before, after = [g for g in groups if g[0].name != 'br']
    
    befores.extend(before)
    afters.extend(after)
             
print(befores)
print(afters)

which prints:

['\n  before break 1\n  ', <span>before break 1.1</span>, '\n  ', '\n  before break 2\n  ']
['\n  after break 1\n', '\n  after break 2\n  ', <span>after break 2.1</span>, '\n']

This should be enough to demonstrate how you can partition the children of an element.

The only thing left to do is to loop over befores and afters and clean up each item.
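The clean-up step could look something like the sketch below. It is only one possible approach, not part of the original answer: the helper names node_text and clean are hypothetical, and the choice to merge each group into one whitespace-collapsed string per span is an assumption about what "clean up" should mean here.

```python
import itertools
from bs4 import BeautifulSoup, Tag

html = '''
<span class="my-span-class">
  before break 1
  <span>before break 1.1</span>
  <br>
  after break 1
</span>
<span class="my-span-class">
  before break 2
  <br>
  after break 2
  <span>after break 2.1</span>
</span>
'''
soup = BeautifulSoup(html, 'html.parser')


def node_text(node):
    # hypothetical helper: nested tags contribute their text, strings are used as-is
    return node.get_text() if isinstance(node, Tag) else str(node)


def clean(nodes):
    # hypothetical helper: join the group's text and collapse runs of whitespace
    text = ' '.join(node_text(node) for node in nodes)
    return ' '.join(text.split())


befores, afters = [], []
for tag in soup.select('.my-span-class'):
    # same partitioning as above: consecutive non-<br> children form one group
    groups = [list(g) for _, g in itertools.groupby(tag.children, lambda c: c.name != 'br')]
    before, after = [g for g in groups if g[0].name != 'br']
    befores.append(clean(before))
    afters.append(clean(after))

print(befores)
print(afters)
```

With this variant each span contributes exactly one entry to each list, so befores[i] and afters[i] always belong to the same span. Like the code above, it assumes every span has text on both sides of exactly one <br>; otherwise the two-way unpacking raises a ValueError.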

Based on your HTML, you can use contents to get the values from the tag.

contents[0] will return the first string.

contents[-1] will return the last string.

from bs4 import BeautifulSoup
html='''<span class="my-span-class">
  Text of interest before break
  <br>
  Text of interest after break   
</span>
<span class="my-span-class">
  Text of interest before break 1
  <br>
  Text of interest after break 1   
</span>
<span class="my-span-class">
  Text of interest before break 2
  <br>
  Text of interest after break 2    
</span>
'''
soup = BeautifulSoup(html, 'html.parser')
Beforelist=[]
Afterlist=[]
for item in soup.find_all("span", class_="my-span-class"):
    Beforelist.append(item.contents[0].strip())
    Afterlist.append(item.contents[-1].strip())
    
print(Beforelist)
print(Afterlist)

Output:

['Text of interest before break', 'Text of interest before break 1', 'Text of interest before break 2']
['Text of interest after break', 'Text of interest after break 1', 'Text of interest after break 2']
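To see why indexing into contents works here, it may help to inspect what the list holds for a single span. The snippet below is a small illustration with made-up sample HTML, not part of the original answer; the variable names children and kinds are arbitrary.

```python
from bs4 import BeautifulSoup

# made-up sample with the same shape as the spans in the question
html = '<span class="my-span-class">\n  before\n  <br>\n  after\n</span>'
soup = BeautifulSoup(html, 'html.parser')

span = soup.find('span', class_='my-span-class')
children = span.contents  # direct children of the span, as a list

# record the node type of every child
kinds = [type(child).__name__ for child in children]
print(len(children), kinds)

# the first and last children are the strings on either side of <br>
first = children[0].strip()
last = children[-1].strip()
print(first, '|', last)
```

The span has three direct children: a string, the <br> tag, and another string. Indexing with 0 and -1 therefore relies on the span starting and ending with text; if a span began or ended with a nested tag, those indexes would return a Tag rather than a string, and .strip() would not behave as intended.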

You could also use .stripped_strings in combination with zip(*iterable) to unpack them separately.

myTexts = (tag.stripped_strings for tag in soup.find_all("span", class_="my-span-class"))
before, after = zip(*myTexts)

>>> before
('Text of interest before break', 'Text of interest before break 1', 'Text of interest before break 2')

>>> after
('Text of interest after break', 'Text of interest after break 1', 'Text of interest after break 2')
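Note that unpacking into before, after only works when every matched span yields exactly two strings. A guarded variant might look like the sketch below; the sample HTML, the pairs and skipped names, and the decision to skip (rather than report) malformed spans are all assumptions made for illustration.

```python
from bs4 import BeautifulSoup

# made-up sample: the second span has no <br>, so it yields only one string
html = '''
<span class="my-span-class">before 1<br>after 1</span>
<span class="my-span-class">no break here</span>
<span class="my-span-class">before 2<br>after 2</span>
'''
soup = BeautifulSoup(html, 'html.parser')

pairs, skipped = [], 0
for tag in soup.find_all("span", class_="my-span-class"):
    strings = list(tag.stripped_strings)
    if len(strings) != 2:
        # assumption: spans without exactly two text pieces are ignored
        skipped += 1
        continue
    pairs.append(strings)

# zip(*pairs) transposes rows of (before, after) into two columns;
# fall back to empty lists when nothing matched
before, after = map(list, zip(*pairs)) if pairs else ([], [])

print(before)
print(after)
print(skipped)
```

Filtering first keeps the two lists aligned, at the cost of silently dropping spans that do not fit the expected shape.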

You can try htql:

import htql

page="""
<span class="my-span-class">
  Text of interest before break #1
  <br> 
  Text of interest after break #1
</span>
<span class="my-span-class">
  Text of interest before break #2
  <br> 
  Text of interest after break #2
</span>
"""

results1 = htql.query(page, "<span (class='my-span-class')>.<br>1:px &trim ")

results2 = htql.query(page, "<span (class='my-span-class')>.<br>1:fx &trim ")

It produces:

>>> results1
[('Text of interest before break #1',), ('Text of interest before break #2',)]
>>> results2
[('Text of interest after break #1',), ('Text of interest after break #2',)]
