Extracting data from XML with similar tag names using Beautiful Soup
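I am having trouble extracting data from the XML response of a web API using Beautiful Soup.
I need to iterate over the whole XML response, pull data from several different tags, and store it in a DataFrame.
Below are the kinds of values that need to be extracted from each record in the XML and loaded into the DataFrame:
The ref value from <Value ref="52f3623a-497c0b0a154b">
The Org value from <UniqueAlias><![CDATA[ORG=ABCD/I|David ]]></UniqueAlias> (there are two tags with the same name, UniqueAlias, and their order is not even consistent)
The value from the Hierarchy tag <Hierarchy><![CDATA[Guide]]></Hierarchy>
The value from the AdditionalField tag whose label attribute is "Country": <AdditionalField label="Country"><![CDATA[Singapore]]></AdditionalField>
The value from the AdditionalField tag whose label attribute is "PrStatus": <AdditionalField label="PrStatus"><![CDATA[DActive]]></AdditionalField>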
Sample XML format:
<valueList>
<Value ref="52f3623a-497c0b0a154b"><DisplayName origin="UID"><![CDATA[10056546]]></DisplayName><DisplayName origin="Default"><![CDATA[Guide]]></DisplayName><UniqueAlias><![CDATA[STATUS=Active]]></UniqueAlias><UniqueAlias><![CDATA[ORG=ABCD/I|David ]]></UniqueAlias><Hierarchy><![CDATA[Guide]]></Hierarchy><AdditionalField label="Organisation"><![CDATA[ABCD/I]]></AdditionalField><AdditionalField label="Country"><![CDATA[Singapore]]></AdditionalField><AdditionalField label="PrStatus"><![CDATA[DActive]]></AdditionalField></Value>
<Value ref="4b0444e0-43137db45c1a"><DisplayName origin="Default"><![CDATA[Guide 3]]></DisplayName><UniqueAlias><![CDATA[ORG=EFG/C|Lim]]></UniqueAlias><UniqueAlias><![CDATA[STATUS=PMFDActive]]></UniqueAlias><Hierarchy><![CDATA[Guide 3]]></Hierarchy><AdditionalField label="Organisation"><![CDATA[EFG/C]]></AdditionalField><AdditionalField label="Country"><![CDATA[Malaysia]]></AdditionalField><AdditionalField label="PrStatus"><![CDATA[Active]]></AdditionalField></Value>
<Value ref="4d43bb96-c6b0ad9709ec"><DisplayName origin="GERL"><![CDATA[Salmon]]></DisplayName><DisplayName origin="UID"><![CDATA[1184797]]></DisplayName><DisplayName origin="Default"><![CDATA[Salmon]]></DisplayName><UniqueAlias><![CDATA[STATUS=Active]]></UniqueAlias><UniqueAlias><![CDATA[ORG=LJK/N|Yuly ]]></UniqueAlias><Hierarchy><![CDATA[Salmon]]></Hierarchy><AdditionalField label="Field"><![CDATA[Salmon 1]]></AdditionalField><AdditionalField label="Organisation"><![CDATA[LJK/N|Yuly ]]></AdditionalField><AdditionalField label="Country"><![CDATA[India]]></AdditionalField><AdditionalField label="PrStatus"><![CDATA[DActive]]></AdditionalField></Value>
<Value ref="1c0d6493-8f63c9043b5f"><DisplayName origin="Default"><![CDATA[Mini comp]]></DisplayName><UniqueAlias><![CDATA[STATUS=Active]]></UniqueAlias><UniqueAlias><![CDATA[ORG=xyz/C|Jason]]></UniqueAlias><Hierarchy><![CDATA[Mini comp]]></Hierarchy><AdditionalField label="Organisation"><![CDATA[xyz/C]]></AdditionalField><AdditionalField label="Country"><![CDATA[gorgeia]]></AdditionalField><AdditionalField label="PrStatus"><![CDATA[Active]]></AdditionalField></Value>
</valueList>
Python code:
import pandas as pd
from bs4 import BeautifulSoup

# response is the web API response obtained earlier
text = response.content
soup = BeautifulSoup(text, "html.parser")
data = []
for value in soup.valuelist.find_all('value'):
    additional_fields = [field.text for field in soup.find_all('additionalfield')]
    data.append([
        value['ref'],
        value.uniquealias.text,
        value.hierarchy.text,
        additional_fields[1],
        additional_fields[0],
        additional_fields[2],
    ])
df = pd.DataFrame(data, columns=['ID', 'Status', 'Name', 'Country', 'org', 'ST'])
print(df)
Thanks in advance.
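pandas 1.3 added a read_xml() function that you can try. You will have to experiment with its parameters, because the data it returns by default does not seem to be what you want. There may be some nested nodes involved: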
import pandas as pd
text = '''<valueList><Value ref="52f3623a-497c0b0a154b"><DisplayName origin="UID"><![CDATA[10056546]]></DisplayName><DisplayName origin="Default"><![CDATA[Guide]]></DisplayName><UniqueAlias><![CDATA[STATUS=Active]]></UniqueAlias><UniqueAlias><![CDATA[ORG=ABCD/I|David ]]></UniqueAlias><Hierarchy><![CDATA[Guide]]></Hierarchy><AdditionalField label="Organisation"><![CDATA[ABCD/I]]></AdditionalField><AdditionalField label="Country"><![CDATA[Singapore]]></AdditionalField><AdditionalField label="PrStatus"><![CDATA[DActive]]></AdditionalField></Value>
<Value ref="4b0444e0-43137db45c1a"><DisplayName origin="Default"><![CDATA[Guide 3]]></DisplayName><UniqueAlias><![CDATA[ORG=EFG/C|Lim]]></UniqueAlias><UniqueAlias><![CDATA[STATUS=PMFDActive]]></UniqueAlias><Hierarchy><![CDATA[Guide 3]]></Hierarchy><AdditionalField label="Organisation"><![CDATA[EFG/C]]></AdditionalField><AdditionalField label="Country"><![CDATA[Malaysia]]></AdditionalField><AdditionalField label="PrStatus"><![CDATA[Active]]></AdditionalField></Value>
<Value ref="4d43bb96-c6b0ad9709ec"><DisplayName origin="GERL"><![CDATA[Salmon]]></DisplayName><DisplayName origin="UID"><![CDATA[1184797]]></DisplayName><DisplayName origin="Default"><![CDATA[Salmon]]></DisplayName><UniqueAlias><![CDATA[STATUS=Active]]></UniqueAlias><UniqueAlias><![CDATA[ORG=LJK/N|Yuly ]]></UniqueAlias><Hierarchy><![CDATA[Salmon]]></Hierarchy><AdditionalField label="Field"><![CDATA[Salmon 1]]></AdditionalField><AdditionalField label="Organisation"><![CDATA[LJK/N|Yuly ]]></AdditionalField><AdditionalField label="Country"><![CDATA[India]]></AdditionalField><AdditionalField label="PrStatus"><![CDATA[DActive]]></AdditionalField></Value>
<Value ref="1c0d6493-8f63c9043b5f"><DisplayName origin="Default"><![CDATA[Mini comp]]></DisplayName><UniqueAlias><![CDATA[STATUS=Active]]></UniqueAlias><UniqueAlias><![CDATA[ORG=xyz/C|Jason]]></UniqueAlias><Hierarchy><![CDATA[Mini comp]]></Hierarchy><AdditionalField label="Organisation"><![CDATA[xyz/C]]></AdditionalField><AdditionalField label="Country"><![CDATA[gorgeia]]></AdditionalField><AdditionalField label="PrStatus"><![CDATA[Active]]></AdditionalField></Value>
</valueList>'''
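from io import StringIO  # newer pandas versions expect a file-like object rather than a raw XML string
df = pd.read_xml(StringIO(text))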
print(df)
Output:
ref DisplayName ... Hierarchy AdditionalField
0 52f3623a-497c0b0a154b Guide ... Guide DActive
1 4b0444e0-43137db45c1a Guide 3 ... Guide 3 Active
2 4d43bb96-c6b0ad9709ec Salmon ... Salmon DActive
3 1c0d6493-8f63c9043b5f Mini comp ... Mini comp Active
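If read_xml() proves awkward for the repeated tag names, one alternative is to walk the records with the standard library's xml.etree.ElementTree and look up each AdditionalField by its label attribute. The following is only a minimal sketch under stated assumptions: the function name extract_rows and the shortened two-record sample are illustrative and not part of the original post, and the response is assumed to be well-formed XML with a single valueList root.

```python
import xml.etree.ElementTree as ET

# Shortened two-record sample with the same structure as the XML in the question.
xml_text = """<valueList>
<Value ref="52f3623a-497c0b0a154b"><UniqueAlias><![CDATA[STATUS=Active]]></UniqueAlias><UniqueAlias><![CDATA[ORG=ABCD/I|David ]]></UniqueAlias><Hierarchy><![CDATA[Guide]]></Hierarchy><AdditionalField label="Organisation"><![CDATA[ABCD/I]]></AdditionalField><AdditionalField label="Country"><![CDATA[Singapore]]></AdditionalField><AdditionalField label="PrStatus"><![CDATA[DActive]]></AdditionalField></Value>
<Value ref="4b0444e0-43137db45c1a"><UniqueAlias><![CDATA[ORG=EFG/C|Lim]]></UniqueAlias><UniqueAlias><![CDATA[STATUS=PMFDActive]]></UniqueAlias><Hierarchy><![CDATA[Guide 3]]></Hierarchy><AdditionalField label="Organisation"><![CDATA[EFG/C]]></AdditionalField><AdditionalField label="Country"><![CDATA[Malaysia]]></AdditionalField><AdditionalField label="PrStatus"><![CDATA[Active]]></AdditionalField></Value>
</valueList>"""


def extract_rows(xml_string):
    """Return one dict per <Value> element (hypothetical helper, not from the post)."""
    root = ET.fromstring(xml_string)
    rows = []
    for value in root.iter("Value"):
        # Index the AdditionalField tags by label so their order does not matter.
        fields = {
            field.get("label"): (field.text or "").strip()
            for field in value.findall("AdditionalField")
        }
        rows.append({
            "ID": value.get("ref"),
            "Name": (value.findtext("Hierarchy") or "").strip(),
            "Country": fields.get("Country"),
            "org": fields.get("Organisation"),
            "ST": fields.get("PrStatus"),
        })
    return rows


rows = extract_rows(xml_text)
for row in rows:
    print(row)
```

The resulting list of dicts can be passed straight to pd.DataFrame(rows) if a DataFrame is needed.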
Option 2:
You can keep your own code, since it is mostly fine, but a few things in it need to change:
import pandas as pd
from bs4 import BeautifulSoup
text = '''<Value ref="52f3623a-497c0b0a154b"><DisplayName origin="UID"><![CDATA[10056546]]></DisplayName><DisplayName origin="Default"><![CDATA[Guide]]></DisplayName><UniqueAlias><![CDATA[STATUS=Active]]></UniqueAlias><UniqueAlias><![CDATA[ORG=ABCD/I|David ]]></UniqueAlias><Hierarchy><![CDATA[Guide]]></Hierarchy><AdditionalField label="Organisation"><![CDATA[ABCD/I]]></AdditionalField><AdditionalField label="Country"><![CDATA[Singapore]]></AdditionalField><AdditionalField label="PrStatus"><![CDATA[DActive]]></AdditionalField></Value>
<Value ref="4b0444e0-43137db45c1a"><DisplayName origin="Default"><![CDATA[Guide 3]]></DisplayName><UniqueAlias><![CDATA[ORG=EFG/C|Lim]]></UniqueAlias><UniqueAlias><![CDATA[STATUS=PMFDActive]]></UniqueAlias><Hierarchy><![CDATA[Guide 3]]></Hierarchy><AdditionalField label="Organisation"><![CDATA[EFG/C]]></AdditionalField><AdditionalField label="Country"><![CDATA[Malaysia]]></AdditionalField><AdditionalField label="PrStatus"><![CDATA[Active]]></AdditionalField></Value>
<Value ref="4d43bb96-c6b0ad9709ec"><DisplayName origin="GERL"><![CDATA[Salmon]]></DisplayName><DisplayName origin="UID"><![CDATA[1184797]]></DisplayName><DisplayName origin="Default"><![CDATA[Salmon]]></DisplayName><UniqueAlias><![CDATA[STATUS=Active]]></UniqueAlias><UniqueAlias><![CDATA[ORG=LJK/N|Yuly ]]></UniqueAlias><Hierarchy><![CDATA[Salmon]]></Hierarchy><AdditionalField label="Field"><![CDATA[Salmon 1]]></AdditionalField><AdditionalField label="Organisation"><![CDATA[LJK/N|Yuly ]]></AdditionalField><AdditionalField label="Country"><![CDATA[India]]></AdditionalField><AdditionalField label="PrStatus"><![CDATA[DActive]]></AdditionalField></Value>
<Value ref="1c0d6493-8f63c9043b5f"><DisplayName origin="Default"><![CDATA[Mini comp]]></DisplayName><UniqueAlias><![CDATA[STATUS=Active]]></UniqueAlias><UniqueAlias><![CDATA[ORG=xyz/C|Jason]]></UniqueAlias><Hierarchy><![CDATA[Mini comp]]></Hierarchy><AdditionalField label="Organisation"><![CDATA[xyz/C]]></AdditionalField><AdditionalField label="Country"><![CDATA[gorgeia]]></AdditionalField><AdditionalField label="PrStatus"><![CDATA[Active]]></AdditionalField></Value>
</valueList>'''
soup = BeautifulSoup(text, "html.parser")
data = []
# html.parser lowercases tag names, so search for 'value', not 'Value'
for value in soup.find_all('value'):
    try:
        vId = value['ref']
    except KeyError:
        vId = 'N/A'
        print('id not found')
    try:
        # returns the first UniqueAlias tag only
        status = value.uniquealias.text
    except AttributeError:
        status = 'N/A'
        print('status not found')
    try:
        name = value.hierarchy.text
    except AttributeError:
        name = 'N/A'
        print('name not found')
    # look up each AdditionalField by its label, within this record only
    try:
        country = value.find('additionalfield', {'label': 'Country'}).text
    except AttributeError:
        country = 'N/A'
        print('country not found')
    try:
        org = value.find('additionalfield', {'label': 'Organisation'}).text
    except AttributeError:
        org = 'N/A'
        print('org not found')
    try:
        st = value.find('additionalfield', {'label': 'PrStatus'}).text
    except AttributeError:
        st = 'N/A'
        print('st not found')
    row = {
        'ID': vId,
        'Status': status,
        'Name': name,
        'Country': country,
        'org': org,
        'ST': st}
    data.append(row)
df = pd.DataFrame(data)
print(df)
Output:
ID Status ... org ST
0 52f3623a-497c0b0a154b STATUS=Active ... ABCD/I DActive
1 4b0444e0-43137db45c1a ORG=EFG/C|Lim ... EFG/C Active
2 4d43bb96-c6b0ad9709ec STATUS=Active ... LJK/N|Yuly DActive
3 1c0d6493-8f63c9043b5f STATUS=Active ... xyz/C Active
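Note that row 1 of the output above shows ORG=EFG/C|Lim in the Status column, because value.uniquealias only returns the first UniqueAlias tag and the order of the two tags differs between records. One possible way around this is to split every alias string on its first "=" and look values up by key. This is a minimal sketch: the helper name parse_aliases is hypothetical, and it assumes each alias text has the form KEY=value as in the sample data.

```python
def parse_aliases(alias_texts):
    """Map alias strings such as 'ORG=ABCD/I|David ' to {'ORG': 'ABCD/I|David'}.

    Hypothetical helper, assuming every alias has the form KEY=value.
    """
    aliases = {}
    for text in alias_texts:
        key, sep, val = text.partition("=")
        if sep:  # ignore strings that contain no '='
            aliases[key.strip()] = val.strip()
    return aliases


# The two orderings that appear in the sample XML.
first = parse_aliases(["STATUS=Active", "ORG=ABCD/I|David "])
second = parse_aliases(["ORG=EFG/C|Lim", "STATUS=PMFDActive"])

print(first["ORG"], first["STATUS"])
print(second["ORG"], second["STATUS"])
```

Inside the loop, the input list could be built with [a.text for a in value.find_all('uniquealias')], after which aliases.get('ORG', 'N/A') and aliases.get('STATUS', 'N/A') give the two values regardless of tag order.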