
BeautifulSoup: split tag (containing other tags) into two at string

I'm massaging some HTML dictionary data into XML for import into some dictionary software.

The original HTML looks like this:

<div class="entry">
  <span class="headword">word</span> 
  <span class="pos">part of speech</span> 
  <span class="definition">sense1; sense2 
    <span class="example">(example2.1; example2.2)</span>
    ; sense3 <span class="example">(example3.1; example3.2)</span>
  </span> 
</div>

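Edit: In reality, the input classes do not exactly match the output XML tags. They match in my example only to illustrate the relationship. I need to replace specific classes with specific XML tags, and the names differ.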

The ideal final result looks like this:

<entry>
  <headword>word</headword>
  <pos>part of speech</pos>
  <sense>
    <definition>sense1</definition>
  </sense>
  <sense>
    <definition>sense2</definition>
    <example>example2.1</example>
    <example>example2.2</example>
  </sense>
  <sense>
    <definition>sense3</definition>
    <example>example3.1</example>
    <example>example3.2</example>
  </sense>
</entry>

The current state of my soup (with the straightforward replacements done) is:

<entry>
  <headword>word</headword>
  <pos>part of speech</pos>
  <definition>sense1; sense2
    <example>example2.1</example>
    <example>example2.2</example>
    ; sense3 
    <example>example3.1</example>
    <example>example3.2</example>
  </definition>
</entry>
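For readers following along, the straightforward class-to-tag replacement mentioned above could look roughly like the sketch below. The mapping CLASS_TO_TAG and the helper rename_by_class are hypothetical names introduced here for illustration. In the sample the class names happen to equal the tag names, whereas the real data would use different names on each side.

```python
from bs4 import BeautifulSoup

# Hypothetical mapping from input class to output tag name (assumption:
# in the real data the keys would differ from the values).
CLASS_TO_TAG = {
    "entry": "entry",
    "headword": "headword",
    "pos": "pos",
    "definition": "definition",
    "example": "example",
}

def rename_by_class(soup, mapping):
    """Rename every tag whose first class is in `mapping` and drop its class attribute."""
    for tag in soup.find_all(class_=True):
        first_class = tag["class"][0]   # html.parser stores class as a list
        if first_class in mapping:
            tag.name = mapping[first_class]
            del tag["class"]
    return soup

html = (
    '<div class="entry"><span class="headword">word</span>'
    '<span class="pos">part of speech</span></div>'
)
soup = rename_by_class(BeautifulSoup(html, "html.parser"), CLASS_TO_TAG)
print(soup)
# <entry><headword>word</headword><pos>part of speech</pos></entry>
```

Tags whose class is not in the mapping are left untouched, so unmapped markup stays visible for later inspection.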

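Converting the elements that map 1:1 is easy, and wrapping a definition plus its examples in a sense tag should be too. The problem is that the original uses ; indiscriminately to separate both senses and examples. This means I need to split the example tags first and then the definition tags as well (i.e. effectively replace ; with </example>\n<example> or </definition>\n<definition>). Since I started writing this question, I've worked out how to do this for the examples (because they contain only strings), but a definition may well contain <example> tags itself, so I can't just use split(): its contents come back as a list, and 'list' object has no attribute 'split'.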

Is there an easier way to split a tag that contains other tags, or do I have to iterate through the list of contents and recreate all the tags?

import re

# assumes `soup` is the BeautifulSoup object in the state described above
tags = soup.find_all("example")
for tag in tags:
    tag.string = re.sub(r"[()]", "", tag.string)     # remove parentheses
    egs = tag.string.split("; ")     # or str(tag.contents).split("; ") ?
    if len(egs) > 1:
        # insert the later segments in reverse so they end up in their original order
        for eg in reversed(egs[1:]):
            new = soup.new_tag("example")
            new.string = eg
            tag.insert_after(new)
        tag.string = egs[0]             # orig tag becomes 1st seg only
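To make the obstacle concrete, here is a minimal sketch of why the string-based approach stops working once a tag has mixed content. The one-line XML sample is a shortened, hypothetical version of the definition tag from the question.

```python
from bs4 import BeautifulSoup, NavigableString

# Shortened, hypothetical version of the <definition> tag from the question.
xml = "<definition>sense1; sense2<example>example2.1</example>; sense3</definition>"
definition = BeautifulSoup(xml, "html.parser").definition

# With mixed content there is no single string to split.
print(definition.string)
# None

# .contents is a plain list of string nodes and tags, so it has no .split().
print([type(node).__name__ for node in definition.contents])
# ['NavigableString', 'Tag', 'NavigableString']

# Only the string nodes can be split; the tags have to be carried along as they are.
pieces = [
    node.split(";") if isinstance(node, NavigableString) else node.name
    for node in definition.contents
]
print(pieces)
# [['sense1', ' sense2'], 'example', ['', ' sense3']]
```

The last line shows the shape of the problem: the semicolons live in the string nodes, and each tag has to be attached to whichever piece of text precedes it.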

You can use recursion: check each element of soup.contents and build the structure by recursing into the non-string elements of soup.contents:

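from bs4 import BeautifulSoup, NavigableString
import re

def to_xml(d):
   # r: list of (text pieces, child tags that follow them); s and k hold the group being built
   r, s, k = [], None, []
   # skip whitespace-only strings between tags (newlines, indentation, trailing spaces)
   for i in filter(lambda x: not (isinstance(x, NavigableString) and not x.strip()), d.contents):
      if isinstance(i, NavigableString):
         if s is not None:
            r.append((s, k))
         # drop enclosing parentheses, split at "; ", trim non-word characters from each piece
         s = [j for i in re.sub(r'^\(|\)$', '', i).split('; ') if (j:=re.sub(r'^\W+|\W+$', '', i))]
         k = []
      else:
         k.append(i)
   r.append((s, k))
   for a, b in r:
      if a is not None:
         if len(a) == 1 and not b:
            # a single string with no child tags: emit it under the class name
            yield f'<{(c:=" ".join(d["class"]))}>{a[0]}</{c}>\n'
         elif not b:
            # several pieces, no child tags: the wrapper name is the piece minus its trailing digits/dots
            yield from ["<{}>\n<{}>{}</{}>\n</{}>\n".format(c, c1, i, c1, c) if (c:=re.sub(r'[\d+\.]+$', '', i)) != (c1:=" ".join(d["class"])) else f"<{c}>{i}</{c}>" for i in a]
         else:
            # every piece but the last stands alone; the child tags belong to the last piece
            yield from ["<{}>\n<{}>{}</{}>\n</{}>\n".format((c:=re.sub(r'[\d+\.]+$', '', i)), (c1:=" ".join(d["class"])), i, c1, c) for i in a[:-1]]
            yield "<{}>\n<{}>{}</{}>\n{}\n</{}>\n".format((c:=re.sub(r'[\d+\.]+$', '', a[-1])), (c1:=' '.join(d['class'])), a[-1], c1, '\n'.join(j for k in b for j in to_xml(k)), c)
      else:
          # no text of its own: wrap the converted children in the class name
          yield '<{}>{}</{}>'.format((c1:=" ".join(d["class"])), "\n".join(j for k in b for j in to_xml(k)), c1)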
        

s = """
 <div class="entry">
   <span class="headword">word</span> 
   <span class="pos">part of speech</span> 
   <span class="definition">sense1; sense2 
   <span class="example">(example2.1; example2.2)</span>
    ; sense3 <span class="example">(example3.1; example3.2)</span>
   </span> 
 </div>
"""
r = BeautifulSoup(''.join(to_xml(BeautifulSoup(s, 'html.parser').div)), 'html.parser')
print(r)

Output:

<entry>
   <headword>word</headword>
   <pos>part of speech</pos>
   <sense>
      <definition>sense1</definition>
   </sense>
   <sense>
      <definition>sense2</definition>
      <example>example2.1</example>
      <example>example2.2</example>
   </sense>
   <sense>
      <definition>sense3</definition>
      <example>example3.1</example>
      <example>example3.2</example>
   </sense>
</entry>
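Note that to_xml above takes the wrapper name sense from the sample text itself, by stripping the trailing digits from "sense1", so it relies on the placeholder data. As a hedged alternative, the sketch below works on the soup in its "current state" from the question: it walks the children of a definition tag, starts a new sense at each semicolon-separated piece of text, and attaches the following example tags to the most recent sense. The helper name split_definition is hypothetical, and the sketch assumes that semicolons never occur inside a sense's own text.

```python
from bs4 import BeautifulSoup, NavigableString

def split_definition(soup, definition):
    """Replace one mixed-content <definition> with a sequence of <sense> tags."""
    senses = []                          # items: [definition text, list of child tags]
    for node in list(definition.contents):
        if isinstance(node, NavigableString):
            # every non-empty piece between semicolons starts a new sense
            for piece in node.split(";"):
                if piece.strip():
                    senses.append([piece.strip(), []])
        elif senses:
            # a child tag belongs to the most recent sense
            # (a tag that precedes all text is dropped in this sketch)
            senses[-1][1].append(node)
    for text, children in senses:
        sense = soup.new_tag("sense")
        new_def = soup.new_tag("definition")
        new_def.string = text
        sense.append(new_def)
        for child in children:
            sense.append(child)          # moves the tag out of the old definition
        definition.insert_before(sense)
    definition.decompose()               # discard the emptied original

xml = (
    "<entry><headword>word</headword><pos>part of speech</pos>"
    "<definition>sense1; sense2"
    "<example>example2.1</example><example>example2.2</example>"
    "; sense3"
    "<example>example3.1</example><example>example3.2</example>"
    "</definition></entry>"
)
soup = BeautifulSoup(xml, "html.parser")
for definition in soup.find_all("definition"):
    split_definition(soup, definition)
print(soup)
```

Because find_all returns its results up front, the definition tags created inside split_definition are not visited again by the loop. Calling soup.prettify() instead of printing the soup directly gives indented output.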

