Python3 - BeautifulSoup - Get value between two tags, where values between
I have the HTML blocks below, which were generated by pdftotext using the -bbox-layout option:
<flow>
<block xMin="21.600000" yMin="86.356000" xMax="178.647000" yMax="116.233001">
<line xMin="21.600000" yMin="86.356000" xMax="178.647000" yMax="101.833000">
<word xMin="21.600000" yMin="86.356000" xMax="178.647000" yMax="101.833000">
My text string located here!</word>
</line>
</block>
</flow>
[...]
<flow>
<block xMin="223.560000" yMin="323.675000" xMax="345.563500" yMax="339.855500">
<line xMin="223.560000" yMin="323.675000" xMax="345.563500" yMax="339.855500">
<word xMin="223.560000" yMin="323.675000" xMax="316.836500" yMax="339.855500">Another string
</word>
<word xMin="320.022000" yMin="323.675000" xMax="345.563500" yMax="339.855500">And another!</word>
</line>
</block>
</flow>
Now I am trying to parse the above structure dynamically and get the content of each <block>[...]</block> whose xMin and xMax values lie between two numbers.
Suppose I have the following numbers:
areas[0] = (100, 0, 200, 792)
areas[1] = (200, 0, 612, 792)
with open(path_to_html_document) as html_file:
    parsed_html = BeautifulSoup(html_file)

for (i, area) in enumerate(areas):
    xMinValue, xMaxValue = areas[i][0], areas[i][2]
    block_tags = parsed_html.find_all(
        "block", attrs={"xMin": xMinValue, "xMax": xMaxValue})
    print(block_tags)
The code above doesn't return anything, because there are no matching tags. find_all() looks for block tags whose attributes exactly match the given numbers, but I am trying to find block tags whose xMin and xMax fall within a range:
for areas[0]: between 100 and 200
for areas[1]: between 200 and 612
Is this possible with BeautifulSoup?
Replace this part of your code:
block_tags = parsed_html.find_all(
    "block", attrs={"xMin": xMinValue, "xMax": xMaxValue})
print(block_tags)
with:
block_tags = parsed_html.find_all("block")
for block in block_tags:
    # Keep blocks that lie entirely inside [xMinValue, xMaxValue].
    if float(block['xmin']) >= xMinValue and float(block['xmax']) <= xMaxValue:
        print(block)
If you debug the parsed document with print(parsed_html), you will see that all attribute names on the block tags are lowercase, because HTML parsers normalize attribute names to lowercase. That is why the code above reads block['xmin'] and block['xmax'].
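The lowercase behaviour is easy to check directly. The sketch below is only an illustration, assuming the built-in html.parser backend and a shortened, made-up sample in the same shape as the pdftotext output; in_range is a hypothetical helper name, not part of BeautifulSoup.

```python
from bs4 import BeautifulSoup

# Shortened, made-up sample in the same shape as the pdftotext output.
html = (
    '<flow><block xMin="21.6" yMin="86.3" xMax="178.6" yMax="116.2">'
    '<word xMin="21.6" xMax="178.6">My text</word></block></flow>'
    '<flow><block xMin="223.5" yMin="323.6" xMax="345.5" yMax="339.8">'
    '<word xMin="223.5" xMax="316.8">Another string</word></block></flow>'
)

parsed_html = BeautifulSoup(html, "html.parser")

# HTML parsers normalize attribute names to lowercase.
first_block = parsed_html.find("block")
attr_names = sorted(first_block.attrs)
print(attr_names)


def in_range(block, x_min, x_max):
    """Hypothetical helper: True if the block lies inside [x_min, x_max]."""
    return float(block["xmin"]) >= x_min and float(block["xmax"]) <= x_max


areas = [(100, 0, 200, 792), (200, 0, 612, 792)]

# One list of matching blocks per area, in the same order as `areas`.
matches = [
    [b for b in parsed_html.find_all("block") if in_range(b, area[0], area[2])]
    for area in areas
]
print([len(m) for m in matches])
```

With these sample numbers only the second area contains a block, since the first block starts at 21.6, which is left of the first area's lower bound of 100.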
Try parsed_html.select("block") and filter the result by the xMin and xMax attributes (stored under the lowercase keys "xmin" and "xmax" when an HTML parser is used).
For example, if you want to get <block xMin="1" xMax="2"></block>, you can first get all block tags with:
all_blocks = parsed_html.select("block")
Then, to get the block whose xMin is 1 and xMax is 2, you can filter them like this:
# Attribute values are strings, and an HTML parser lowercases attribute names.
target = filter(lambda x: x.get("xmin") == "1" and x.get("xmax") == "2", all_blocks)
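As a variation on the filter() call, find_all() also accepts a function as an attribute filter, so the range test can be handed to BeautifulSoup itself. The following is a minimal sketch, assuming the html.parser backend and a made-up three-block sample; between is a hypothetical factory name introduced here for illustration.

```python
from bs4 import BeautifulSoup

# Made-up sample: the third block has no xMin/xMax at all.
html = (
    '<block xMin="1" xMax="2"></block>'
    '<block xMin="5" xMax="9"></block>'
    '<block yMin="3"></block>'
)
parsed_html = BeautifulSoup(html, "html.parser")


def between(low, high):
    """Hypothetical factory: build a filter for a single attribute value."""
    def check(value):
        # value is None when the tag does not have the attribute.
        return value is not None and low <= float(value) <= high
    return check


# Keys are lowercase because the HTML parser lowercases attribute names.
targets = parsed_html.find_all(
    "block", attrs={"xmin": between(0, 3), "xmax": between(0, 3)}
)
print(targets)
```

Converting with float() before comparing avoids string comparisons such as "21.6" < "100", which would sort lexicographically rather than numerically.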
You can select <block> elements that have both the xMin and xMax attributes with the CSS selector block[xMin][xMax]. Then filter them with a list comprehension:
data = '''<flow>
<block xMin="21.600000" yMin="86.356000" xMax="178.647000" yMax="116.233001">
<line xMin="21.600000" yMin="86.356000" xMax="178.647000" yMax="101.833000">
<word xMin="21.600000" yMin="86.356000" xMax="178.647000" yMax="101.833000">
My text string located here!</word>
</line>
</block>
</flow>
<flow>
<block xMin="223.560000" yMin="323.675000" xMax="345.563500" yMax="339.855500">
<line xMin="223.560000" yMin="323.675000" xMax="345.563500" yMax="339.855500">
<word xMin="223.560000" yMin="323.675000" xMax="316.836500" yMax="339.855500">Another string
</word>
<word xMin="320.022000" yMin="323.675000" xMax="345.563500" yMax="339.855500">And another!</word>
</line>
</block>
</flow>'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(data, 'lxml')

def blocks_min_max(soup, x_min, x_max):
    return [b for b in soup.select('block[xMin][xMax]') if float(b['xmin']) >= x_min and float(b['xmax']) <= x_max]

for b in blocks_min_max(soup, 10, 200):
    print(b.prettify())
Prints:
<block xmax="178.647000" xmin="21.600000" ymax="116.233001" ymin="86.356000">
<line xmax="178.647000" xmin="21.600000" ymax="101.833000" ymin="86.356000">
<word xmax="178.647000" xmin="21.600000" ymax="101.833000" ymin="86.356000">
My text string located here!
</word>
</line>
</block>
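If only the text inside each matching block is needed, the same filter can be combined with get_text(). This is a sketch under a few assumptions: it uses the html.parser backend instead of lxml, block_texts is a hypothetical helper name, and the first area starts at 10 rather than 100 so that the first sample block falls inside it.

```python
from bs4 import BeautifulSoup

data = """<flow>
<block xMin="21.600000" yMin="86.356000" xMax="178.647000" yMax="116.233001">
<line xMin="21.600000" yMin="86.356000" xMax="178.647000" yMax="101.833000">
<word xMin="21.600000" yMin="86.356000" xMax="178.647000" yMax="101.833000">
My text string located here!</word>
</line>
</block>
</flow>
<flow>
<block xMin="223.560000" yMin="323.675000" xMax="345.563500" yMax="339.855500">
<line xMin="223.560000" yMin="323.675000" xMax="345.563500" yMax="339.855500">
<word xMin="223.560000" yMin="323.675000" xMax="316.836500" yMax="339.855500">Another string
</word>
<word xMin="320.022000" yMin="323.675000" xMax="345.563500" yMax="339.855500">And another!</word>
</line>
</block>
</flow>"""

soup = BeautifulSoup(data, "html.parser")


def block_texts(soup, x_min, x_max):
    """Hypothetical helper: text of each block lying inside [x_min, x_max]."""
    return [
        # Join the words with single spaces and drop surrounding whitespace.
        b.get_text(" ", strip=True)
        for b in soup.select("block[xmin][xmax]")
        if float(b["xmin"]) >= x_min and float(b["xmax"]) <= x_max
    ]


# Assumed areas as (x_min, y_min, x_max, y_max); only the x values are used.
areas = [(10, 0, 200, 792), (200, 0, 612, 792)]
texts_by_area = {i: block_texts(soup, a[0], a[2]) for i, a in enumerate(areas)}
print(texts_by_area)
```

Keying the result by the index of each area keeps the link between an area and the text found inside it.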