Python - Problem understanding BeautifulSoup XML structure and (re)writing to a file

Question

I am trying to use Python to rewrite a (filtered) XML file as a properties file (comment, key, value).

My XML file looks like this:

<?xml version="1.0" encoding="utf-8"?><martif type="TBX" xml:lang="en">
    <text>
      <body>
        <termEntry id="tid_db6_0A1B2A71C557C74BB35B3259595175B1">
            <descrip type="definition">A short note to explain the term</descrip>
            <admin type="conceptOrigin">de</admin>
            <descrip type="characteristic">standardEntry</descrip>
            <admin type="sapAddProductSubset">TM; EHS-MGM; </admin>
            <descrip type="sapAddProductSubsetSubjectField">Transportation Management; ; </descrip>
            <descrip type="sapNonTranslatable"/>
            <descrip type="sapLegalRestriction"/>
            <descrip type="sapProprietaryRestriction"/>
            <descrip type="saptermCategory"/>
            <descrip type="entryNote"/>
            <admin type="productSubset">AC</admin>
            <descrip type="subjectField">Environment, Health, and Safety / Product Compliance</descrip>
            <langSet xml:lang="ES">
                <ntig>
                  <termGrp>
                    <term id="oid_db6_3A8FB437DC603F4BA18F360BC33166BD">compte</term>
                    <termNote type="partOfSpeech">Noun</termNote>
                  </termGrp>
                  <admin type="annotatedNote"/>
                </ntig>
            </langSet>
          </termEntry>
        <termEntry id="tid_db6_0A1B2A71C557C74BB35B3259595175BF">
            <descrip type="definition"/>
            <admin type="conceptOrigin">de</admin>
            <descrip type="characteristic">standardEntry</descrip>
            <admin type="sapAddProductSubset">TM; EHS-MGM; </admin>
            <descrip type="sapAddProductSubsetSubjectField">Transportation Management; ; </descrip>
            <descrip type="sapNonTranslatable"/>
            <descrip type="sapLegalRestriction"/>
            <descrip type="sapProprietaryRestriction"/>
            <descrip type="saptermCategory"/>
            <descrip type="entryNote"/>
            <admin type="productSubset">EHS</admin>
            <descrip type="subjectField">Environment, Health, and Safety / Product Compliance</descrip>
            <langSet xml:lang="ES">
                <ntig>
                  <termGrp>
                    <term id="oid_db6_3A8FB437DC603F4BA18F360BC33166BD">daño</term>
                    <termNote type="partOfSpeech">Noun</termNote>
                  </termGrp>
                  <admin type="annotatedNote"/>
                </ntig>
            </langSet>
          </termEntry>            
        </body>
    </text>
</martif>

I wrote the following code:

from bs4 import BeautifulSoup
import io
with open('SAPterm_TEST_ES.tbx', 'r') as f:
    file = f.read()

soup = BeautifulSoup(file, 'xml')

with io.open('ES.properties', 'w+') as f:

    Term = soup.find_all('termEntry')
    for termEntry in Term:
        print(termEntry('admin', {'type': 'productSubset'}))
        #f.write(termEntry('admin', {'type': 'productSubset'}).text)
        #f.write(' - ')        
        #f.write(termEntry('descrip', {'type': 'definition'}).text)
        #f.write('\n')
        f.write(termEntry['id'])
        f.write(' = ')
        f.write(termEntry.term.text)
        f.write('\n')

The result looks like this:

A) Output in the console:

TBX_2_properties.py
[<admin type="productSubset">AC</admin>]
[<admin type="productSubset">EHS</admin>]

B) Resulting file:

tid_db6_0A1B2A71C557C74BB35B3259595175B1 = compte
tid_db6_0A1B2A71C557C74BB35B3259595175BF = daño

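The problem I have: I can print the tags that I want to include as comments in the properties file to the console, but writing them to the file always fails.

Error: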

ResultSet object has no attribute 'text'. You're probably treating a list of elements like a single element. Did you call find_all() when you meant to call find()?


 File "C:\temp\Rewrite_TBX_2_properties.py", line 13, in <module>
    f.write(termEntry('admin', {'type': 'productSubset'}).text)

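(for example, when I uncomment line 13, "f.write(termEntry('admin', {'type': 'productSubset'}).text)")

I don't understand this and need your help (I'm new to Python ;-))

Also, when I try this on a larger XML file (the one above is just a small version I use to test the basics), I get key/value pairs in which all keys are correctly extracted from the termEntry id attribute, but the value is always the same: the one from the first entry.

Any suggestions?

Many thanks!

By the way, the result for the larger XML file looks like this: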

tid_db6_015DAA6C5610D311AE6500A0C9EAAA94 = megbontási maradvány
tid_db6_01A0763DDC77D3118F330060B03CA38C = megbontási maradvány
tid_db6_01ADF9FDE40072439FF56E731E7EA2F6 = megbontási maradvány
tid_db6_01BCEA3AD3E2D3119B4F0060B0671ACC = megbontási maradvány
tid_db6_02BF9A6D9898D511AE780800062AFB0F = megbontási maradvány
tid_db6_0381F77095126448887E05013CBB4682 = megbontási maradvány
tid_db6_03CFFC4C6B64D311B60F0060B03C2BFF = megbontási maradvány
tid_db6_043968D0FAB9484DA122122A31C7A95C = megbontási maradvány

Answer 1

termEntry('admin', {'type': 'productSubset'}) returns a ResultSet, and this object has no .text attribute, which is why you get this error. You should iterate over the result set and use .text on each element.
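
To make the difference concrete, here is a minimal sketch. It assumes a trimmed-down sample modeled on the XML from the question (the ids and values are illustrative) and that lxml is installed, since the "xml" parser depends on it. Calling a tag like a function is shorthand for find_all(), so it yields a ResultSet; find() yields a single Tag, or None if nothing matches.

```python
from bs4 import BeautifulSoup

# Trimmed-down sample modeled on the XML in the question (illustrative only).
SAMPLE = """<martif type="TBX">
  <text><body>
    <termEntry id="tid_1">
      <admin type="conceptOrigin">de</admin>
      <admin type="productSubset">AC</admin>
    </termEntry>
  </body></text>
</martif>"""

soup = BeautifulSoup(SAMPLE, "xml")
entry = soup.find("termEntry")

# Calling a tag is shorthand for find_all(): the result is a ResultSet (a list).
matches = entry("admin", {"type": "productSubset"})
print(type(matches).__name__)          # ResultSet
print([tag.text for tag in matches])   # ['AC']

# find() returns the first matching Tag (or None), which does have .text.
first = entry.find("admin", {"type": "productSubset"})
print(first.text)                      # AC
```

So in the question's code, either loop over the ResultSet or switch to find() when exactly one element per entry is expected.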

If soup contains the XML document from the question, you can do the following:

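# Assumes `soup` was created as in the question: BeautifulSoup(file, 'xml')
with open("out.txt", "w") as f_out:
    for term_entry in soup.select("termEntry"):
        # select_one() returns a single Tag, so .text is available
        admin = term_entry.select_one('admin[type="productSubset"]')
        desc = term_entry.select_one('descrip[type="definition"]')
        print("{} - {}".format(admin.text, desc.text), file=f_out)

        # Write one "key = value" line per <term> inside this entry
        for term in term_entry.select("term"):
            print("{} = {}".format(term_entry["id"], term.text), file=f_out)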

This creates out.txt with the following content:

AC - A short note to explain the term
tid_db6_0A1B2A71C557C74BB35B3259595175B1 = compte
EHS - 
tid_db6_0A1B2A71C557C74BB35B3259595175BF = daño
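
Since the stated goal is a properties file with comment, key, and value, the same loop could be adapted along the following lines. This is only a sketch under assumptions: comment lines are prefixed with "#" (a common .properties convention), the helper names text_or_empty and to_properties are hypothetical, missing elements are written as empty strings because find() returns None when nothing matches, and UTF-8 is assumed to be an acceptable output encoding for whatever reads the file. The "xml" parser again requires lxml.

```python
from bs4 import BeautifulSoup


def text_or_empty(tag):
    """Return the stripped text of a tag, or "" when the tag was not found."""
    return tag.get_text(strip=True) if tag is not None else ""


def to_properties(xml_text):
    """Build .properties lines (comment, then key = value) from TBX-like XML."""
    soup = BeautifulSoup(xml_text, "xml")  # the "xml" parser requires lxml
    lines = []
    for entry in soup.find_all("termEntry"):
        subset = text_or_empty(entry.find("admin", {"type": "productSubset"}))
        definition = text_or_empty(entry.find("descrip", {"type": "definition"}))
        lines.append("# {} - {}".format(subset, definition))
        # Search only inside this entry, so each key is paired with its own term(s).
        for term in entry.find_all("term"):
            lines.append("{} = {}".format(entry["id"], term.get_text(strip=True)))
    return lines


if __name__ == "__main__":
    # Small illustrative sample; the second entry has no productSubset element.
    sample = """<martif type="TBX"><text><body>
      <termEntry id="tid_1">
        <descrip type="definition">A short note</descrip>
        <admin type="productSubset">AC</admin>
        <langSet><ntig><termGrp><term id="oid_1">compte</term></termGrp></ntig></langSet>
      </termEntry>
      <termEntry id="tid_2">
        <descrip type="definition"/>
        <langSet><ntig><termGrp><term id="oid_2">daño</term></termGrp></ntig></langSet>
      </termEntry>
    </body></text></martif>"""
    # Explicit UTF-8 so non-ASCII terms such as "daño" are written consistently.
    with open("ES.properties", "w", encoding="utf-8") as f_out:
        f_out.write("\n".join(to_properties(sample)) + "\n")
```

Keeping the line-building logic in a function that returns a list makes it easy to inspect the result before anything is written to disk.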

Solution 1
0 · Accepted · 2022-08-19 16:28:17
