Parsing and sorting HTML tags with Beautiful Soup

Question

I have the HTML file below, which contains bounding-box (bbox) information extracted from a PDF file:

<flow>
  <block xMin="53.879997" yMin="369.965298" xMax="63.939976" yMax="380.991433">
    <line xMin="53.879997" yMin="369.965298" xMax="63.939976" yMax="380.991433">
      <word xMin="53.879997" yMin="369.965298" xMax="63.939976" yMax="380.991433">10</word>
    </line>
  </block>
</flow>
<flow>
  <block xMin="53.879997" yMin="417.965298" xMax="63.939976" yMax="428.991433">
    <line xMin="53.879997" yMin="417.965298" xMax="63.939976" yMax="428.991433">
      <word xMin="53.879997" yMin="417.965298" xMax="63.939976" yMax="428.991433">20</word>
    </line>
  </block>
</flow>
<flow>
  <block xMin="111.351361" yMin="369.965298" xMax="134.220382" yMax="380.991433">
    <line xMin="111.351361" yMin="369.965298" xMax="134.220382" yMax="380.991433">
      <word xMin="111.351361" yMin="369.965298" xMax="116.331548" yMax="380.991433">1</word>
      <word xMin="121.909358" yMin="369.965298" xMax="134.220382" yMax="380.991433">PC</word>
    </line>
  </block>
</flow>

Above are the bounding boxes for the words: 10 20 1 PC

In the original document, they are laid out like this:

10 1 PC
20

Hence, I would like to parse the HTML file above, extract all <line> tags, and sort them by their yMin value. The final output would then be 10 1 PC 20 instead.

What I've tried so far

I have not gotten very far, as I am still learning Python. I am using BeautifulSoup4:

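from bs4 import BeautifulSoup

with open("test.html", "r") as f:
    contents = f.read()

soup = BeautifulSoup(contents, 'lxml')

# The parser lowercases attribute names, so yMin is accessed as "ymin".
for line in soup.find_all("line", attrs={"ymin": True}):
    print(line.get('ymin'))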

The above simply prints the yMin value of each <line> tag.

I am unsure how to sort the <line> tags, though.

Any help would be highly appreciated.

Answer 1

You can use BeautifulSoup's find_all to collect the <line> tags and sort them by yMin:

from bs4 import BeautifulSoup as soup

# `html` is assumed to hold the markup from the question as a string.
lines = sorted(soup(html, 'html.parser').find_all('line'), key=lambda x: float(x['ymin']))
result = [w.text for line in lines for w in line.find_all('word')]

Output:

['10', '1', 'PC', '20']
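
The sort above relies on Python's sort being stable: lines that share a yMin keep their order from the file. If the file order cannot be trusted, one option is to use xMin as a tie-breaker. The following is a minimal sketch under that assumption; the function name words_in_reading_order and the shortened, reordered markup are made up for illustration.

```python
from bs4 import BeautifulSoup

# Shortened, deliberately shuffled copy of the markup from the question
# (only the <line> and <word> tags, only the attributes used for sorting).
html = """
<line xMin="111.351361" yMin="369.965298"><word>1</word><word>PC</word></line>
<line xMin="53.879997" yMin="417.965298"><word>20</word></line>
<line xMin="53.879997" yMin="369.965298"><word>10</word></line>
"""

def words_in_reading_order(markup):
    """Return the word texts sorted top to bottom, then left to right."""
    soup = BeautifulSoup(markup, "html.parser")
    # html.parser lowercases attribute names, hence "ymin" and "xmin".
    lines = sorted(
        soup.find_all("line"),
        key=lambda tag: (float(tag["ymin"]), float(tag["xmin"])),
    )
    return [word.get_text(strip=True) for line in lines for word in line.find_all("word")]

print(words_in_reading_order(html))
```

The tuple key compares yMin first and only falls back to xMin when two lines start at the same height.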

Answer 2

Try the code below. It takes the yMin of the first <line> as a reference value and compares the yMin of every line against it: lines at or above the reference go first, lines below it go last.

from bs4 import BeautifulSoup
html='''<flow>
  <block xMin="53.879997" yMin="369.965298" xMax="63.939976" yMax="380.991433">
    <line xMin="53.879997" yMin="369.965298" xMax="63.939976" yMax="380.991433">
      <word xMin="53.879997" yMin="369.965298" xMax="63.939976" yMax="380.991433">10</word>
    </line>
  </block>
</flow>
<flow>
  <block xMin="53.879997" yMin="417.965298" xMax="63.939976" yMax="428.991433">
    <line xMin="53.879997" yMin="417.965298" xMax="63.939976" yMax="428.991433">
      <word xMin="53.879997" yMin="417.965298" xMax="63.939976" yMax="428.991433">20</word>
    </line>
  </block>
</flow>
<flow>
  <block xMin="111.351361" yMin="369.965298" xMax="134.220382" yMax="380.991433">
    <line xMin="111.351361" yMin="369.965298" xMax="134.220382" yMax="380.991433">
      <word xMin="111.351361" yMin="369.965298" xMax="116.331548" yMax="380.991433">1</word>
      <word xMin="121.909358" yMin="369.965298" xMax="134.220382" yMax="380.991433">PC</word>
    </line>
  </block>
</flow>'''

soup = BeautifulSoup(html, 'lxml')
# Reference value: yMin of the first <line> (lxml lowercases attribute names).
pricemin = soup.select_one('line[yMin]')['ymin']
list1 = []
list_last = []
for item in soup.select('line[yMin]'):
    if float(pricemin) < float(item['ymin']):
        # Lines below the reference row are collected last.
        for w in item.select('word'):
            list_last.append(w.text)
    else:
        for w in item.select('word'):
            list1.append(w.text)

print(list1 + list_last)

Output:

['10', '1', 'PC', '20']

To print this as a single string:

print(' '.join(list1+list_last))

Output:

10 1 PC 20
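
Comparing against a single reference value only separates two rows. As a rough sketch of how this might generalize, the lines could be grouped into rows by yMin once each <line> has been reduced to a (yMin, xMin, text) tuple. The helper group_into_rows and its tolerance parameter are hypothetical names chosen for this example, and the tolerance of 1.0 is an arbitrary assumption, not a value taken from the PDF.

```python
def group_into_rows(lines, tolerance=1.0):
    """Group (y_min, x_min, text) tuples into rows of text, top to bottom.

    Two lines land in the same row when their y_min values differ by at
    most `tolerance` from the first line of that row.
    """
    rows = []
    for y_min, x_min, text in sorted(lines):
        if rows and abs(y_min - rows[-1]["y"]) <= tolerance:
            rows[-1]["cells"].append((x_min, text))
        else:
            rows.append({"y": y_min, "cells": [(x_min, text)]})
    # Within each row, order the cells left to right by x_min.
    return [" ".join(text for _, text in sorted(row["cells"])) for row in rows]

# Values copied from the three <line> tags in the question.
lines = [
    (369.965298, 53.879997, "10"),
    (417.965298, 53.879997, "20"),
    (369.965298, 111.351361, "1 PC"),
]

rows = group_into_rows(lines)
print("\n".join(rows))
```

For the sample data this yields two rows, "10 1 PC" and "20", which matches the layout of the original document described in the question.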
