BeautifulSoup: split tag (containing other tags) into two at string
I'm massaging some HTML dictionary data into XML so it can be imported into some dictionary software.
The original HTML looks something like this:
<div class="entry">
<span class="headword">word</span>
<span class="pos">part of speech</span>
<span class="definition">sense1; sense2
<span class="example">(example2.1; example2.2)</span>
; sense3 <span class="example">(example3.1; example3.2)</span>
</span>
</div>
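EDIT: In fact, the input classes don't exactly match the output XML tags. In my example they match only to illustrate the relationship. I need to replace specific classes with specific XML tags, and the names differ.
The ideal end result looks like this: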
<entry>
<headword>word</headword>
<pos>part of speech</pos>
<sense>
<definition>sense1</definition>
</sense>
<sense>
<definition>sense2</definition>
<example>example2.1</example>
<example>example2.2</example>
</sense>
<sense>
<definition>sense3</definition>
<example>example3.1</example>
<example>example3.2</example>
</sense>
</entry>
The current state of my soup (with the direct replacements done) is:
<entry>
<headword>word</headword>
<pos>part of speech</pos>
<definition>sense1; sense2
<example>example2.1</example>
<example>example2.2</example>
; sense3
<example>example3.1</example>
<example>example3.2</example>
</definition>
</entry>
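For reference, the direct replacement step mentioned above might look something like the following minimal sketch. The class names in CLASS_TO_TAG (hw, ps, def, eg) and the helper rename_by_class are hypothetical, made up purely for illustration, since the real input classes are not shown in the question.

```python
from bs4 import BeautifulSoup

# Hypothetical mapping from input class names to output XML tag names.
CLASS_TO_TAG = {
    "entry": "entry",
    "hw": "headword",
    "ps": "pos",
    "def": "definition",
    "eg": "example",
}

def rename_by_class(soup, mapping):
    """Rename each tag whose first class is in `mapping` and drop its attributes."""
    for tag in soup.find_all(class_=True):
        classes = tag.get("class", [])
        new_name = mapping.get(classes[0]) if classes else None
        if new_name:
            tag.name = new_name   # Tag.name is writable
            tag.attrs = {}        # the class attribute is no longer needed
    return soup

html = '<div class="entry"><span class="hw">word</span><span class="ps">noun</span></div>'
soup = rename_by_class(BeautifulSoup(html, "html.parser"), CLASS_TO_TAG)
print(soup)
```

Tags whose class is not in the mapping are left untouched, so unexpected markup stays visible rather than being silently renamed.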
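Handling the elements that map 1:1 is easy, and wrapping a definition plus its examples in a sense tag should be too. The problem is that the original indiscriminately uses a semicolon (;) to separate both senses and examples. This means I need to split the example tags first and then the definition tags at each ; (i.e. effectively replace ; with </example>\n<example> or </definition>\n<definition>). Since I started writing this question I've worked out how to do this for the examples (since they only contain strings), but a definition may well contain <example> tags itself, so I can't just use split(): what I get back is a list, and 'list' object has no attribute 'split'.
Is there an easier way to split a tag that contains other tags, or do I have to iterate over the result-set list and recreate all the tags?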
tags = soup.find_all("example")
for tag in tags:
    tag.string = re.sub(r"[()]", "", tag.string)  # remove parentheses
    egs = tag.string.split("; ")  # or str(tag.contents).split("; ") ?
    new = ""
    if len(egs) > 1:
        for eg in reversed(egs[1:]):
            new = soup.new_tag("example")
            new.string = eg
            tag.insert_after(new)
        tag.string = egs[0]  # orig tag becomes 1st seg only
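To illustrate why split() fails on the definition, here is a small sketch under the assumption that the definition holds both text and an <example> child. With more than one child, tag.string is None and tag.contents is a list, so the text nodes have to be picked out and split individually.

```python
from bs4 import BeautifulSoup, NavigableString

soup = BeautifulSoup(
    "<definition>sense1; sense2<example>eg</example></definition>",
    "html.parser",
)
tag = soup.definition

print(tag.string)          # None, because the tag has more than one child
print(type(tag.contents))  # a plain list of child nodes

# Only the NavigableString children are str-like and can be split.
strings = [c for c in tag.contents if isinstance(c, NavigableString)]
parts = [p for s in strings for p in s.split("; ")]
print(parts)
```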
You can inspect the contents of each element (soup.contents) and build up the structure by recursively traversing the non-string elements in soup.contents:
from bs4 import BeautifulSoup, NavigableString
import re

def to_xml(d):
    # r: list of (strings, child tags) groups; s: current strings; k: tags that follow them
    r, s, k = [], None, []
    for i in filter(lambda x: x != '\n', d.contents):
        if isinstance(i, NavigableString):
            if s is not None:
                r.append((s, k))
            # drop enclosing parentheses, split on '; ', trim non-word characters
            s = [j for i in re.sub(r'^\(|\)$', '', i).split('; ') if (j:=re.sub(r'^\W+|\W+$', '', i))]
            k = []
        else:
            k.append(i)
    r.append((s, k))
    # The wrapper tag name below comes from stripping trailing digits and dots
    # off the text ('sense1' -> 'sense'), so it relies on the sample data's naming.
    for a, b in r:
        if a is not None:
            if len(a) == 1 and not b:
                yield f'<{(c:=" ".join(d["class"]))}>{a[0]}</{c}>\n'
            elif not b:
                yield from ["<{}>\n<{}>{}</{}>\n</{}>\n".format(c, c1, i, c1, c) if (c:=re.sub(r'[\d+\.]+$', '', i)) != (c1:=" ".join(d["class"])) else f"<{c}>{i}</{c}>" for i in a]
            else:
                yield from ["<{}>\n<{}>{}</{}>\n</{}>\n".format((c:=re.sub(r'[\d+\.]+$', '', i)), (c1:=" ".join(d["class"])), i, c1, c) for i in a[:-1]]
                yield "<{}>\n<{}>{}</{}>\n{}\n</{}>\n".format((c:=re.sub(r'[\d+\.]+$', '', a[-1])), (c1:=' '.join(d['class'])), a[-1], c1, '\n'.join(j for k in b for j in to_xml(k)), c)
        else:
            # no text of its own: emit the element and recurse into its child tags
            yield '<{}>{}</{}>'.format((c1:=" ".join(d["class"])), "\n".join(j for k in b for j in to_xml(k)), c1)
s = """
<div class="entry">
<span class="headword">word</span>
<span class="pos">part of speech</span>
<span class="definition">sense1; sense2
<span class="example">(example2.1; example2.2)</span>
; sense3 <span class="example">(example3.1; example3.2)</span>
</span>
</div>
"""
r = BeautifulSoup(''.join(to_xml(BeautifulSoup(s, 'html.parser').div)), 'html.parser')
print(r)
Output:
<entry>
<headword>word</headword>
<pos>part of speech</pos>
<sense>
<definition>sense1</definition>
</sense>
<sense>
<definition>sense2</definition>
<example>example2.1</example>
<example>example2.2</example>
</sense>
<sense>
<definition>sense3</definition>
<example>example3.1</example>
<example>example3.2</example>
</sense>
</entry>
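As an alternative to regenerating the markup, the split can also be done in place on a soup that has already had its tags renamed and its examples split. The following is only a rough sketch: split_definition is a hypothetical helper, and it assumes that every child tag belongs to the sense whose text most recently preceded it.

```python
from bs4 import BeautifulSoup, NavigableString

def split_definition(soup, definition, sep=";"):
    """Replace one <definition> with a run of <sense> elements."""
    senses = []
    for child in list(definition.contents):  # copy, since children are moved below
        if isinstance(child, NavigableString):
            for piece in child.split(sep):
                text = piece.strip()
                if not text:
                    continue
                sense = soup.new_tag("sense")
                new_def = soup.new_tag("definition")
                new_def.string = text
                sense.append(new_def)
                senses.append(sense)
        elif senses:
            # attach the child tag (e.g. <example>) to the most recent sense;
            # a tag that appears before any text is dropped with the old definition
            senses[-1].append(child.extract())
    for sense in senses:
        definition.insert_before(sense)
    definition.decompose()

xml = (
    "<entry><headword>word</headword><pos>part of speech</pos>"
    "<definition>sense1; sense2"
    "<example>example2.1</example><example>example2.2</example>"
    "; sense3"
    "<example>example3.1</example><example>example3.2</example>"
    "</definition></entry>"
)
soup = BeautifulSoup(xml, "html.parser")
for definition in soup.find_all("definition"):
    split_definition(soup, definition)
print(soup.prettify())
```

Because find_all returns its results up front, the newly created definition tags are not revisited by the loop.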