Beautiful Soup XML formatting in Python

Question

I have an XML dataset whose tags are marked up in the following format:

<catchphrase "id=c0">unconscionable conduct</catchphrase>

I think whoever built the dataset did not format the id attribute correctly, because it should be:

<catchphrase id="c0">unconscionable conduct</catchphrase>

However, when I process it with the Beautiful Soup library in Python, I get the following:

 soup = BeautifulSoup(content, 'xml')

The result is

 <catchphrase>
   "id=c0"&gt;application for leave to appeal
  </catchphrase>

Or

soup = BeautifulSoup(content, 'lxml')

The result is

<html>
   <body>
    ...
     <catchphrase>
         application for leave to appeal
     </catchphrase>
    ....

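I would like the output to look like the second one, but without the html and body tags (this is an XML document). I don't need the id attribute. I also call soup.prettify('utf-8') before writing it to a file, but I think the markup has already been mangled by that point.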

Answer 1

There is no standard way to do this, but what you can do is replace the faulty part with the correct form before parsing, as follows:

from bs4 import BeautifulSoup
content = '<catchphrase "id=c0">unconscionable conduct</catchphrase>'

soup = BeautifulSoup(content.replace('"id=', 'id="'), 'xml')
print(soup)

The result is:

<catchphrase id="c0">unconscionable conduct</catchphrase>

This is definitely a hack: there is no standard way to handle this, mainly because the XML should be well-formed before BeautifulSoup parses it.
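
If the dataset has many such tags, the same pre-parse repair could be done with a regular expression instead of a literal replace. The sketch below is only an illustration under some assumptions: the helper names fix_misquoted_ids and drop_misquoted_ids are made up for this example, id values are assumed to contain no quotes, whitespace, or angle brackets, and the pattern "id=..." is assumed not to occur in ordinary text content. The second helper removes the attribute entirely, since the question says the id is not needed.

```python
import re

# Matches the misplaced-quote pattern "id=VALUE", e.g. "id=c0".
# Assumption: id values contain no quotes, whitespace, or angle brackets.
MISQUOTED_ID = re.compile(r'"id=([^"\s<>]+)"')


def fix_misquoted_ids(content):
    """Rewrite "id=c0" as id="c0" so the tag becomes well-formed XML."""
    return MISQUOTED_ID.sub(r'id="\1"', content)


def drop_misquoted_ids(content):
    """Remove the malformed attribute and the whitespace in front of it."""
    return re.sub(r'\s*"id=[^"\s<>]+"', '', content)


content = '<catchphrase "id=c0">unconscionable conduct</catchphrase>'
fixed = fix_misquoted_ids(content)
stripped = drop_misquoted_ids(content)
print(fixed)     # <catchphrase id="c0">unconscionable conduct</catchphrase>
print(stripped)  # <catchphrase>unconscionable conduct</catchphrase>
```

Either string can then be passed to BeautifulSoup(..., 'xml') as in the snippet above; the 'xml' parser should not add the html and body wrappers that 'lxml' does.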
