Python - Problem understanding the BeautifulSoup XML structure and (re-)writing to file
I am trying to use Python to rewrite a (filtered) XML file as a properties file (comment, key, value).
My XML file looks like this:
<?xml version="1.0" encoding="utf-8"?>
<martif type="TBX" xml:lang="en">
<text>
<body>
<termEntry id="tid_db6_0A1B2A71C557C74BB35B3259595175B1">
<descrip type="definition">A short note to explain the term</descrip>
<admin type="conceptOrigin">de</admin>
<descrip type="characteristic">standardEntry</descrip>
<admin type="sapAddProductSubset">TM; EHS-MGM; </admin>
<descrip type="sapAddProductSubsetSubjectField">Transportation Management; ; </descrip>
<descrip type="sapNonTranslatable"/>
<descrip type="sapLegalRestriction"/>
<descrip type="sapProprietaryRestriction"/>
<descrip type="saptermCategory"/>
<descrip type="entryNote"/>
<admin type="productSubset">AC</admin>
<descrip type="subjectField">Environment, Health, and Safety / Product Compliance</descrip>
<langSet xml:lang="ES">
<ntig>
<termGrp>
<term id="oid_db6_3A8FB437DC603F4BA18F360BC33166BD">compte</term>
<termNote type="partOfSpeech">Noun</termNote>
</termGrp>
<admin type="annotatedNote"/>
</ntig>
</langSet>
</termEntry>
<termEntry id="tid_db6_0A1B2A71C557C74BB35B3259595175BF">
<descrip type="definition"/>
<admin type="conceptOrigin">de</admin>
<descrip type="characteristic">standardEntry</descrip>
<admin type="sapAddProductSubset">TM; EHS-MGM; </admin>
<descrip type="sapAddProductSubsetSubjectField">Transportation Management; ; </descrip>
<descrip type="sapNonTranslatable"/>
<descrip type="sapLegalRestriction"/>
<descrip type="sapProprietaryRestriction"/>
<descrip type="saptermCategory"/>
<descrip type="entryNote"/>
<admin type="productSubset">EHS</admin>
<descrip type="subjectField">Environment, Health, and Safety / Product Compliance</descrip>
<langSet xml:lang="ES">
<ntig>
<termGrp>
<term id="oid_db6_3A8FB437DC603F4BA18F360BC33166BD">daño</term>
<termNote type="partOfSpeech">Noun</termNote>
</termGrp>
<admin type="annotatedNote"/>
</ntig>
</langSet>
</termEntry>
</body>
</text>
</martif>
I wrote the following code:
from bs4 import BeautifulSoup
import io

with open('SAPterm_TEST_ES.tbx', 'r') as f:
    file = f.read()

soup = BeautifulSoup(file, 'xml')

with io.open('ES.properties', 'w+') as f:
    Term = soup.find_all('termEntry')
    for termEntry in Term:
        print(termEntry('admin', {'type': 'productSubset'}))
        #f.write(termEntry('admin', {'type': 'productSubset'}).text)
        #f.write(' - ')
        #f.write(termEntry('descrip', {'type': 'definition'}).text)
        #f.write('\n')
        f.write(termEntry['id'])
        f.write(' = ')
        f.write(termEntry.term.text)
        f.write('\n')
The result looks like this:
A) Output in the console:
TBX_2_properties.py
[<admin type="productSubset">AC</admin>]
[<admin type="productSubset">EHS</admin>]
B) The resulting file:
tid_db6_0A1B2A71C557C74BB35B3259595175B1 = compte
tid_db6_0A1B2A71C557C74BB35B3259595175BF = daño
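The problem I have: I can print the tags that I want to include in the properties file as comments to the console, but writing them to the file always fails.
Error:
ResultSet object has no attribute 'text'. You're probably treating a list of elements like a single element. Did you call find_all() when you meant to call find()?
File "C:\temp\Rewrite_TBX_2_properties.py", line 13, in <module>
f.write(termEntry('admin', {'type': 'productSubset'}).text)
(This happens, for example, when I uncomment line 13, "f.write(termEntry('admin', {'type': 'productSubset'}).text)".)
I don't understand this and need your help (I am new to Python ;-)).
In addition, when I try this on a larger XML file (the one above is just a small version I use to test the basics), I get key/value pairs in which every key is extracted correctly from the termEntry id attribute, but the value is always the same: the one from the first entry.
Does anyone have a suggestion?
Many thanks!
By the way, the result for the larger XML file looks like this: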
tid_db6_015DAA6C5610D311AE6500A0C9EAAA94 = megbontási maradvány
tid_db6_01A0763DDC77D3118F330060B03CA38C = megbontási maradvány
tid_db6_01ADF9FDE40072439FF56E731E7EA2F6 = megbontási maradvány
tid_db6_01BCEA3AD3E2D3119B4F0060B0671ACC = megbontási maradvány
tid_db6_02BF9A6D9898D511AE780800062AFB0F = megbontási maradvány
tid_db6_0381F77095126448887E05013CBB4682 = megbontási maradvány
tid_db6_03CFFC4C6B64D311B60F0060B03C2BFF = megbontási maradvány
tid_db6_043968D0FAB9484DA122122A31C7A95C = megbontási maradvány
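termEntry('admin', {'type': 'productSubset'}) returns a ResultSet (calling a tag like a function is shorthand for find_all()), and that object has no .text attribute, which is why you get this error. You should iterate over the result set and read .text from each element, or use find() or select_one() to get a single element.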
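To make the difference concrete, here is a minimal sketch. It assumes a two-element stand-in snippet rather than the real TBX file, and it uses Python's built-in html.parser only so that it runs without lxml; the names snippet, entry, matches and single are illustrative and do not come from the question.

```python
from bs4 import BeautifulSoup

# Tiny stand-in for one entry; this is not the real TBX file.
snippet = """
<entry>
  <admin type="conceptOrigin">de</admin>
  <admin type="productSubset">AC</admin>
</entry>
"""

# Assumption: html.parser is used here only to avoid the lxml dependency.
soup = BeautifulSoup(snippet, "html.parser")
entry = soup.find("entry")

# Calling a tag is shorthand for find_all(): the result is a list-like ResultSet,
# so .text has to be read from each element.
matches = entry("admin", {"type": "productSubset"})
texts = [tag.text for tag in matches]

# find() returns a single Tag (or None), so .text can be read directly.
single = entry.find("admin", {"type": "productSubset"})
first_text = single.text if single is not None else ""

print(texts)
print(first_text)
```

Checking for None before reading .text avoids an AttributeError for entries that lack the element.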
If soup contains the XML document from the question, you can do the following:
with open("out.txt", "w", encoding="utf-8") as f_out:
    for term_entry in soup.select("termEntry"):
        # select_one() returns a single Tag, so .text is available
        admin = term_entry.select_one('admin[type="productSubset"]')
        desc = term_entry.select_one('descrip[type="definition"]')
        print("{} - {}".format(admin.text, desc.text), file=f_out)
        # look up terms inside the current entry only
        for term in term_entry.select("term"):
            print("{} = {}".format(term_entry["id"], term.text), file=f_out)
This creates out.txt with the following content:
AC - A short note to explain the term
tid_db6_0A1B2A71C557C74BB35B3259595175B1 = compte
EHS -
tid_db6_0A1B2A71C557C74BB35B3259595175BF = daño
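As an alternative sketch, the same output can be produced with the standard library's xml.etree.ElementTree instead of BeautifulSoup. This is a swapped-in technique, not the approach from the answer above. The shortened TBX string, the helper names text_of and to_properties, and the "#" comment prefix are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Assumption: a shortened stand-in for the TBX file from the question.
TBX = """<martif type="TBX" xml:lang="en"><text><body>
<termEntry id="tid_1">
  <descrip type="definition">A short note to explain the term</descrip>
  <admin type="productSubset">AC</admin>
  <langSet xml:lang="ES"><ntig><termGrp><term id="oid_1">compte</term></termGrp></ntig></langSet>
</termEntry>
<termEntry id="tid_2">
  <descrip type="definition"/>
  <admin type="productSubset">EHS</admin>
  <langSet xml:lang="ES"><ntig><termGrp><term id="oid_2">daño</term></termGrp></ntig></langSet>
</termEntry>
</body></text></martif>"""


def text_of(parent, path):
    """Return the stripped text of the first match, or '' if missing or empty."""
    node = parent.find(path)
    return (node.text or "").strip() if node is not None else ""


def to_properties(root):
    """Build properties-style lines: a '#' comment, then key = value pairs."""
    lines = []
    for entry in root.iter("termEntry"):
        subset = text_of(entry, "admin[@type='productSubset']")
        definition = text_of(entry, "descrip[@type='definition']")
        # Assumption: '#' marks a comment line in the target properties file.
        lines.append("# {} - {}".format(subset, definition))
        # iter() is scoped to this entry, so each key gets its own terms.
        for term in entry.iter("term"):
            lines.append("{} = {}".format(entry.get("id"), term.text or ""))
    return lines


root = ET.fromstring(TBX)
lines = to_properties(root)
print("\n".join(lines))
```

To write the result to a file, the lines could be joined with "\n" and written to a file opened with encoding="utf-8", so that non-ASCII terms are stored consistently.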