Python - Problem understanding the BeautifulSoup XML structure and (re-)writing to file

I am trying to use Python to rewrite a (filtered) XML file as a properties file (comment, key, value).

My XML file looks like this:

<?xml version="1.0" encoding="utf-8"?><martif type="TBX" xml:lang="en">
    <text>
      <body>
        <termEntry id="tid_db6_0A1B2A71C557C74BB35B3259595175B1">
            <descrip type="definition">A short note to explain the term</descrip>
            <admin type="conceptOrigin">de</admin>
            <descrip type="characteristic">standardEntry</descrip>
            <admin type="sapAddProductSubset">TM; EHS-MGM; </admin>
            <descrip type="sapAddProductSubsetSubjectField">Transportation Management; ; </descrip>
            <descrip type="sapNonTranslatable"/>
            <descrip type="sapLegalRestriction"/>
            <descrip type="sapProprietaryRestriction"/>
            <descrip type="saptermCategory"/>
            <descrip type="entryNote"/>
            <admin type="productSubset">AC</admin>
            <descrip type="subjectField">Environment, Health, and Safety / Product Compliance</descrip>
            <langSet xml:lang="ES">
                <ntig>
                  <termGrp>
                    <term id="oid_db6_3A8FB437DC603F4BA18F360BC33166BD">compte</term>
                    <termNote type="partOfSpeech">Noun</termNote>
                  </termGrp>
                  <admin type="annotatedNote"/>
                </ntig>
            </langSet>
          </termEntry>
        <termEntry id="tid_db6_0A1B2A71C557C74BB35B3259595175BF">
            <descrip type="definition"/>
            <admin type="conceptOrigin">de</admin>
            <descrip type="characteristic">standardEntry</descrip>
            <admin type="sapAddProductSubset">TM; EHS-MGM; </admin>
            <descrip type="sapAddProductSubsetSubjectField">Transportation Management; ; </descrip>
            <descrip type="sapNonTranslatable"/>
            <descrip type="sapLegalRestriction"/>
            <descrip type="sapProprietaryRestriction"/>
            <descrip type="saptermCategory"/>
            <descrip type="entryNote"/>
            <admin type="productSubset">EHS</admin>
            <descrip type="subjectField">Environment, Health, and Safety / Product Compliance</descrip>
            <langSet xml:lang="ES">
                <ntig>
                  <termGrp>
                    <term id="oid_db6_3A8FB437DC603F4BA18F360BC33166BD">daño</term>
                    <termNote type="partOfSpeech">Noun</termNote>
                  </termGrp>
                  <admin type="annotatedNote"/>
                </ntig>
            </langSet>
          </termEntry>            
        </body>
    </text>
</martif>

I wrote the following code:

from bs4 import BeautifulSoup
import io
with open('SAPterm_TEST_ES.tbx', 'r') as f:
    file = f.read()

soup = BeautifulSoup(file, 'xml')

with io.open('ES.properties', 'w+') as f:

    Term = soup.find_all('termEntry')
    for termEntry in Term:
        print(termEntry('admin', {'type': 'productSubset'}))
        #f.write(termEntry('admin', {'type': 'productSubset'}).text)
        #f.write(' - ')        
        #f.write(termEntry('descrip', {'type': 'definition'}).text)
        #f.write('\n')
        f.write(termEntry['id'])
        f.write(' = ')
        f.write(termEntry.term.text)
        f.write('\n')

The result looks like this:

A) Output in the console:

TBX_2_properties.py
[<admin type="productSubset">AC</admin>]
[<admin type="productSubset">EHS</admin>]

B) Result file:

tid_db6_0A1B2A71C557C74BB35B3259595175B1 = compte
tid_db6_0A1B2A71C557C74BB35B3259595175BF = daño

My problem: I can print the tags that I want to include as comments in the properties file to the console, but writing them to the file always fails.

Error:

ResultSet object has no attribute 'text'. You're probably treating a list of elements like a single element. Did you call find_all() when you meant to call find()?


 File "C:\temp\Rewrite_TBX_2_properties.py", line 13, in <module>
    f.write(termEntry('admin', {'type': 'productSubset'}).text)

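(This happens, for example, when I uncomment line 13: f.write(termEntry('admin', {'type': 'productSubset'}).text).)

I don't understand this and need your help (I am new to Python ;-))

In addition, when I try this on a larger XML file (the one above is just a small version I use to test the basics), I get key/value pairs in which every key is correctly extracted from the termEntry id attribute, but the value is always the same: the one from the first entry.

Does anyone have a suggestion?

Thanks a lot!

By the way, the result for the larger XML file looks like this: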

tid_db6_015DAA6C5610D311AE6500A0C9EAAA94 = megbontási maradvány
tid_db6_01A0763DDC77D3118F330060B03CA38C = megbontási maradvány
tid_db6_01ADF9FDE40072439FF56E731E7EA2F6 = megbontási maradvány
tid_db6_01BCEA3AD3E2D3119B4F0060B0671ACC = megbontási maradvány
tid_db6_02BF9A6D9898D511AE780800062AFB0F = megbontási maradvány
tid_db6_0381F77095126448887E05013CBB4682 = megbontási maradvány
tid_db6_03CFFC4C6B64D311B60F0060B03C2BFF = megbontási maradvány
tid_db6_043968D0FAB9484DA122122A31C7A95C = megbontási maradvány

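termEntry('admin', {'type': 'productSubset'}) returns a ResultSet, because calling a tag is shorthand for find_all(). A ResultSet is a list of elements and has no .text attribute, which is why you get this error. Either iterate over the result set and use .text on each element, or use find() / select_one() to get a single element.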
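The difference can be seen in a minimal sketch like the one below. It assumes a trimmed-down version of the XML from the question (the id tid_1 is a shortened, hypothetical value) and that lxml is installed, since the "xml" parser depends on it.

```python
from bs4 import BeautifulSoup

# Trimmed-down sample based on the XML in the question (id shortened for illustration).
xml = """<martif type="TBX" xml:lang="en">
  <text><body>
    <termEntry id="tid_1">
      <descrip type="definition">A short note to explain the term</descrip>
      <admin type="productSubset">AC</admin>
    </termEntry>
  </body></text>
</martif>"""

soup = BeautifulSoup(xml, "xml")  # the "xml" parser requires lxml
entry = soup.find("termEntry")

# Calling a tag is shorthand for find_all(): the result is a list-like ResultSet,
# so .text has to be read from each element, not from the ResultSet itself.
matches = entry("admin", {"type": "productSubset"})
texts = [tag.text for tag in matches]

# find() returns the first matching Tag, or None if nothing matches.
first = entry.find("admin", {"type": "productSubset"})
first_text = first.text if first is not None else ""

print(texts)
print(first_text)
```

Checking for None before reading .text is a defensive choice here, useful when some entries lack the element.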

If soup contains the XML document from the question, you can do the following:

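# An explicit encoding keeps non-ASCII terms such as "daño" independent of the platform default.
with open("out.txt", "w", encoding="utf-8") as f_out:
    for term_entry in soup.select("termEntry"):
        # select_one() returns a single Tag, so .text is available
        admin = term_entry.select_one('admin[type="productSubset"]')
        desc = term_entry.select_one('descrip[type="definition"]')
        print("{} - {}".format(admin.text, desc.text), file=f_out)

        # search for <term> only inside the current entry
        for term in term_entry.select("term"):
            print("{} = {}".format(term_entry["id"], term.text), file=f_out)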

This creates out.txt with the following content:

AC - A short note to explain the term
tid_db6_0A1B2A71C557C74BB35B3259595175B1 = compte
EHS - 
tid_db6_0A1B2A71C557C74BB35B3259595175BF = daño
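Since the question asks for a properties file with comment, key and value, the same idea could be adapted roughly as follows. This is only a sketch under a few assumptions: the helper name to_properties is hypothetical, the ids are shortened stand-ins for the real ones, comment lines start with "#", and entries without the expected elements yield empty strings instead of raising an error.

```python
from bs4 import BeautifulSoup

# Two shortened entries modelled on the XML in the question (ids are illustrative).
xml = """<martif type="TBX" xml:lang="en">
  <text><body>
    <termEntry id="tid_1">
      <descrip type="definition">A short note to explain the term</descrip>
      <admin type="productSubset">AC</admin>
      <langSet xml:lang="ES"><ntig><termGrp>
        <term id="oid_1">compte</term>
      </termGrp></ntig></langSet>
    </termEntry>
    <termEntry id="tid_2">
      <descrip type="definition"/>
      <admin type="productSubset">EHS</admin>
      <langSet xml:lang="ES"><ntig><termGrp>
        <term id="oid_2">daño</term>
      </termGrp></ntig></langSet>
    </termEntry>
  </body></text>
</martif>"""


def to_properties(soup):
    """Hypothetical helper: build '# comment' and 'key = value' lines per termEntry."""
    lines = []
    for entry in soup.find_all("termEntry"):
        # find() yields one Tag or None; every search is scoped to the current entry.
        admin = entry.find("admin", {"type": "productSubset"})
        descrip = entry.find("descrip", {"type": "definition"})
        subset = admin.get_text(strip=True) if admin is not None else ""
        definition = descrip.get_text(strip=True) if descrip is not None else ""
        # rstrip() drops the trailing space left behind by an empty definition.
        lines.append("# {} - {}".format(subset, definition).rstrip())
        for term in entry.find_all("term"):
            lines.append("{} = {}".format(entry["id"], term.get_text(strip=True)))
    return "\n".join(lines) + "\n"


soup = BeautifulSoup(xml, "xml")  # the "xml" parser requires lxml
result = to_properties(soup)
print(result, end="")
```

To store the result, the returned string can be written with open("ES.properties", "w", encoding="utf-8"). Whether UTF-8 is the right encoding depends on the tool that later reads the properties file.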
