BeautifulSoup: split a tag (containing other tags) in two on a string

Question

I'm massaging some HTML dictionary data into XML so that it can be imported into dictionary software.

The original HTML looks like this:

<div class="entry">
  <span class="headword">word</span> 
  <span class="pos">part of speech</span> 
  <span class="definition">sense1; sense2 
    <span class="example">(example2.1; example2.2)</span>
    ; sense3 <span class="example">(example3.1; example3.2)</span>
  </span> 
</div>

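Edit: In fact, the input classes do not exactly match the output XML tags. In my example they match only to illustrate the relationship. I need to replace specific classes with specific XML tags, and the names differ.

The ideal end result looks like this: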

<entry>
  <headword>word</headword>
  <pos>part of speech</pos>
  <sense>
    <definition>sense1</definition>
  </sense>
  <sense>
    <definition>sense2</definition>
    <example>example2.1</example>
    <example>example2.2</example>
  </sense>
  <sense>
    <definition>sense3</definition>
    <example>example3.1</example>
    <example>example3.2</example>
  </sense>
</entry>

The current state of my soup (with the straightforward replacements done) is:

<entry>
  <headword>word</headword>
  <pos>part of speech</pos>
  <definition>sense1; sense2
    <example>example2.1</example>
    <example>example2.2</example>
    ; sense3 
    <example>example3.1</example>
    <example>example3.2</example>
  </definition>
</entry>

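Mapping the 1:1 divisions is easy, and wrapping each definition plus its examples in a sense tag should be too. The problem is that the original uses ; indiscriminately to separate both senses and examples. This means I need to split the example tags first and then the definition tags on ; (i.e. effectively replace ; with </example>\n<example> or </definition>\n<definition>). Since I started writing this question I've worked out how to do this for the examples (because they contain only strings), but a definition may well contain <example> tags itself, so I can't just use split(): .contents returns a list, and 'list' object has no attribute 'split'.

Is there an easier way to split a tag that contains other tags, or do I have to loop through the result set and recreate all the tags?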

import re

# soup is the BeautifulSoup object after the tag names have been replaced
tags = soup.find_all("example")
for tag in tags:
    tag.string = re.sub(r"[()]", "", tag.string)     # remove parentheses
    egs = tag.string.split("; ")     # or str(tag.contents).split("; ") ?
    if len(egs) > 1:
        # insert in reverse so the new tags end up in the original order
        for eg in reversed(egs[1:]):
            new = soup.new_tag("example")
            new.string = eg
            tag.insert_after(new)
        tag.string = egs[0]             # orig tag becomes 1st seg only
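
To illustrate the failure described above, here is a minimal sketch of why the same approach does not carry over to a definition tag with mixed content. The one-line document and the variable names are assumptions made up for the illustration, not part of the real data.

```python
from bs4 import BeautifulSoup, NavigableString

# Hypothetical one-line document: a definition with text and a nested example.
html = "<definition>sense1; sense2 <example>example2.1</example>; sense3</definition>"
soup = BeautifulSoup(html, "html.parser")
tag = soup.find("definition")

# A tag with more than one child has no single string, so .string is None
# and tag.string.split(...) cannot be used.
print(tag.string)

# .contents is a plain Python list of strings and tags, so it has no .split().
print(type(tag.contents))

# The string children can still be split one at a time.
pieces = [str(child).split("; ")
          for child in tag.contents
          if isinstance(child, NavigableString)]
print(pieces)
```

Splitting each string child separately, while keeping track of which tags sit between them, is essentially what any solution has to do.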

Answer 1

You can inspect each element's .contents and build up the structure by recursively traversing the non-string elements found there:

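from bs4 import BeautifulSoup, NavigableString
import re

# Requires Python 3.8+ (uses the := operator).
def to_xml(d):
   # r holds (text segments, child tags that follow them) pairs
   r, s, k = [], None, []
   # skip whitespace-only strings between tags so they don't create empty segments
   for i in filter(lambda x: not (isinstance(x, NavigableString) and not x.strip()), d.contents):
      if isinstance(i, NavigableString):
         if s is not None:
            r.append((s, k))
         # drop enclosing parentheses, split on '; ', trim non-word characters from each piece
         s = [j for i in re.sub(r'^\(|\)$', '', i).split('; ') if (j:=re.sub(r'^\W+|\W+$', '', i))]
         k = []
      else:
         k.append(i)
   r.append((s, k))
   # NOTE: tag names come from the class attribute, and the wrapper tag name is derived
   # from the text itself by stripping trailing digits and dots ("sense1" -> "sense"),
   # so this relies on the shape of the sample data.
   for a, b in r:
      if a is not None:
         if len(a) == 1 and not b:
            yield f'<{(c:=" ".join(d["class"]))}>{a[0]}</{c}>\n'
         elif not b:
            yield from ["<{}>\n<{}>{}</{}>\n</{}>\n".format(c, c1, i, c1, c) if (c:=re.sub(r'[\d+\.]+$', '', i)) != (c1:=" ".join(d["class"])) else f"<{c}>{i}</{c}>" for i in a]
         else:
            yield from ["<{}>\n<{}>{}</{}>\n</{}>\n".format((c:=re.sub(r'[\d+\.]+$', '', i)), (c1:=" ".join(d["class"])), i, c1, c) for i in a[:-1]]
            yield "<{}>\n<{}>{}</{}>\n{}\n</{}>\n".format((c:=re.sub(r'[\d+\.]+$', '', a[-1])), (c1:=' '.join(d['class'])), a[-1], c1, '\n'.join(j for k in b for j in to_xml(k)), c)
      else:
         # no leading text: this element only wraps child tags
         yield '<{}>{}</{}>'.format((c1:=" ".join(d["class"])), "\n".join(j for k in b for j in to_xml(k)), c1)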

s = """
 <div class="entry">
   <span class="headword">word</span> 
   <span class="pos">part of speech</span> 
   <span class="definition">sense1; sense2 
   <span class="example">(example2.1; example2.2)</span>
    ; sense3 <span class="example">(example3.1; example3.2)</span>
   </span> 
 </div>
"""
r = BeautifulSoup(''.join(to_xml(BeautifulSoup(s, 'html.parser').div)), 'html.parser')
print(r)

Output (indented here for readability):

<entry>
   <headword>word</headword>
   <pos>part of speech</pos>
   <sense>
      <definition>sense1</definition>
   </sense>
   <sense>
      <definition>sense2</definition>
      <example>example2.1</example>
      <example>example2.2</example>
   </sense>
   <sense>
      <definition>sense3</definition>
      <example>example3.1</example>
      <example>example3.2</example>
   </sense>
</entry>
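
Since the question's edit notes that the class names do not match the target XML tags, a more explicit variant of the same idea may be easier to adapt. The following is only a sketch under stated assumptions: the TAG_FOR_CLASS mapping, the helper names split_items and convert_entry, and the fixed "sense" wrapper name are all made up for illustration, and it assumes that every text piece inside a definition starts a new sense and that each example span belongs to the sense just before it.

```python
from bs4 import BeautifulSoup, NavigableString

# Hypothetical class -> XML tag mapping; replace the values with the real tag names.
TAG_FOR_CLASS = {
    "entry": "entry",
    "headword": "headword",
    "pos": "pos",
    "definition": "definition",
    "example": "example",
}

def split_items(text):
    """Split on ';', trimming whitespace and parentheses, dropping empty pieces."""
    pieces = (piece.strip(" ()\n\t") for piece in text.split(";"))
    return [piece for piece in pieces if piece]

def convert_entry(entry_div, out):
    """Build a new <entry> tree from one div.entry, using `out` to create tags."""
    entry = out.new_tag(TAG_FOR_CLASS["entry"])
    for span in entry_div.find_all("span", recursive=False):
        cls = span["class"][0]
        if cls != "definition":
            # simple 1:1 replacement for headword, pos, ...
            tag = out.new_tag(TAG_FOR_CLASS[cls])
            tag.string = span.get_text(strip=True)
            entry.append(tag)
            continue
        sense = None
        for child in span.children:
            if isinstance(child, NavigableString):
                # every text piece between ';' starts a new sense
                for text in split_items(child):
                    sense = out.new_tag("sense")
                    definition = out.new_tag(TAG_FOR_CLASS["definition"])
                    definition.string = text
                    sense.append(definition)
                    entry.append(sense)
            elif sense is not None:
                # an example span attaches to the most recent sense
                for text in split_items(child.get_text()):
                    example = out.new_tag(TAG_FOR_CLASS["example"])
                    example.string = text
                    sense.append(example)
    return entry

html = """
<div class="entry">
  <span class="headword">word</span>
  <span class="pos">part of speech</span>
  <span class="definition">sense1; sense2
    <span class="example">(example2.1; example2.2)</span>
    ; sense3 <span class="example">(example3.1; example3.2)</span>
  </span>
</div>
"""
source = BeautifulSoup(html, "html.parser")
out = BeautifulSoup("", "html.parser")   # only used as a factory for new tags
entry = convert_entry(source.div, out)
print(entry.prettify())
```

Because split_items strips parentheses from both ends of every piece, a definition that legitimately ends in a parenthesis would lose it; real data may need a stricter rule.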

Solution 1
-1 2021-05-16 16:20:19
