I have an XML file and need to extract some of the elements in it.
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<!DOCTYPE label-information SYSTEM ".\Label-Information.dtd">
<label-information date="20230116" time="0530" record-count="53866" run-number="2571">
<stroke date-last-modified="21-SEP-22">
<stroke-number>5358I</stroke-number>
<stroke-description>Regular Fit Linen Blend Trouser</stroke-description>
<contract-number>20880409</contract-number>
<contract-status>A</contract-status>
<department-number>T17</department-number>
<season>WI22</season>
<supplier-series>9763</supplier-series>
<country-code>BD</country-code>
<Factory-code>1000003577</Factory-code>
<productdesc>TROUSERS</productdesc>
<stroke-label>
<label-ref>TRL02</label-ref>
<label-category>Hanging</label-category>
<label-type>T</label-type>
<label-order>1</label-order>
</stroke-label>
<stroke-label>
<label-ref>K8F/M12302</label-ref>
<label-category>Care Label (Format\Size)</label-category>
<label-type>K</label-type>
<label-order>1</label-order>
<set-name>Value Linen Blend Trouser</set-name>
</stroke-label>
</stroke>
<stroke date-last-modified="21-SEP-22">
<stroke-number>5358I</stroke-number>
<stroke-description>Regular Fit Linen Blend Trouser</stroke-description>
<contract-number>20880408</contract-number>
<contract-status>A</contract-status>
<department-number>T17</department-number>
<season>WI22</season>
<supplier-series>4563</supplier-series>
<country-code>BD</country-code>
<Factory-code>1000003577</Factory-code>
<productdesc>TROUSERS</productdesc>
<stroke-label>
<label-ref>TRL02</label-ref>
<label-category>Hanging</label-category>
<label-type>T</label-type>
<label-order>1</label-order>
</stroke-label>
</stroke>
<stroke date-last-modified="13-APR-21">
<stroke-number>6388O</stroke-number>
<stroke-description>TRAVEL CHINO SLIM FIT</stroke-description>
<contract-number>20851293</contract-number>
<contract-status>A</contract-status>
<department-number>T17</department-number>
<season>WI22</season>
<supplier-series>9763</supplier-series>
<country-code>BD</country-code>
<Factory-code>1000003577</Factory-code>
<productdesc>TROUSERS</productdesc>
<stroke-label>
<label-ref>TRL02</label-ref>
<label-category>Hanging</label-category>
<label-type>T</label-type>
<label-order>1</label-order>
</stroke-label>
<stroke-label>
<label-ref>MS-CL1836B</label-ref>
<label-category>Frames / Hangers / Hooks &amp; Loops</label-category>
<label-type>C</label-type>
</stroke-label>
<stroke-label>
<label-ref>MS-ZPCU</label-ref>
<label-category>Frames / Hangers / Hooks &amp; Loops</label-category>
<label-type>C</label-type>
</stroke-label>
<stroke-label>
<label-ref>UPC/M11394</label-ref>
<label-category>UPC / Barcode &amp; Price Labels &amp; Tickets</label-category>
<label-type>C</label-type>
</stroke-label>
</stroke>
</label-information>
I need to extract each element separately and save it into a pandas DataFrame. This is what I have so far:
import xml.etree.ElementTree as ET
import pandas as pd

# parse the file once and grab the root element
tree = ET.parse('PLML.xml')
root = tree.getroot()
# create a list to store the data
data = []
# find all elements with tag "stroke"
strokes = root.findall('.//stroke')
# iterate through the strokes and store the data in a list
for stroke in strokes:
    stroke_number = stroke.find("stroke-number").text
    contract_number = stroke.find("contract-number").text
    department_number = stroke.find("department-number").text
    country_code = stroke.find("country-code").text
    data.append([stroke_number, contract_number, department_number,
                 country_code])
# convert the list to a DataFrame
df = pd.DataFrame(data, columns=["stroke_number",
                                 "contract_number", "department_number", "country_code"])
# write the DataFrame to an Excel file
df.to_excel("plmlfile.xlsx", index=False)
I need to extract elements such as stroke-number, contract-number and productdesc, together with the fields nested under each stroke-label (label-ref, label-category, label-type, label-order).
I have tried iterating over the root variable with a for loop, but I was unable to capture the specified elements and their values as I expected. Can someone help me with this?
Final output has to be
Given your expected output, it can be done fairly simply using pandas.read_xml:
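from io import StringIO
import pandas as pd
strokes_doc = """[your xml above]"""
# pandas 2.1+ deprecates passing a literal XML string, so wrap it in StringIO
df = pd.read_xml(StringIO(strokes_doc), xpath="//stroke")
df.iloc[:, [1, 3, 5, 8]]  # stroke-number, contract-number, department-number, country-code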
Output, based on your sample xml (pardon the formatting):
  stroke-number  contract-number  department-number  country-code
0         5358I         20880409                T17            BD
1         5358I         20880408                T17            BD
2         6388O         20851293                T17            BD
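The read_xml call above only returns the stroke-level columns. If you also want the nested stroke-label fields mentioned in the question (label-ref, label-category, label-type, label-order), one possible approach is to walk the tree with ElementTree and emit one row per stroke-label, repeating the parent stroke's fields on each row. The following is a minimal sketch, not a definitive solution: the helper name strokes_to_frame, the two field lists, and the trimmed SAMPLE string are made up for illustration, and it assumes every `&` in the file is escaped as `&amp;` so that the XML parses.

```python
import xml.etree.ElementTree as ET
import pandas as pd

# Trimmed, hypothetical sample in the same shape as the XML in the question.
SAMPLE = """<label-information>
  <stroke date-last-modified="21-SEP-22">
    <stroke-number>5358I</stroke-number>
    <contract-number>20880409</contract-number>
    <productdesc>TROUSERS</productdesc>
    <stroke-label>
      <label-ref>TRL02</label-ref>
      <label-category>Hanging</label-category>
      <label-type>T</label-type>
      <label-order>1</label-order>
    </stroke-label>
    <stroke-label>
      <label-ref>MS-ZPCU</label-ref>
      <label-category>Frames / Hangers / Hooks &amp; Loops</label-category>
      <label-type>C</label-type>
    </stroke-label>
  </stroke>
</label-information>"""

# Fields read from each <stroke> and from each nested <stroke-label>.
STROKE_FIELDS = ["stroke-number", "contract-number", "productdesc"]
LABEL_FIELDS = ["label-ref", "label-category", "label-type", "label-order"]


def strokes_to_frame(root):
    """Return one row per <stroke-label>, carrying its parent stroke's fields."""
    rows = []
    for stroke in root.iter("stroke"):
        # findtext returns None when a child element is absent
        base = {name: stroke.findtext(name) for name in STROKE_FIELDS}
        for label in stroke.findall("stroke-label"):
            row = dict(base)
            row.update({name: label.findtext(name) for name in LABEL_FIELDS})
            rows.append(row)
    return pd.DataFrame(rows, columns=STROKE_FIELDS + LABEL_FIELDS)


df = strokes_to_frame(ET.fromstring(SAMPLE))
print(df)
```

For the real file, ET.parse('PLML.xml').getroot() can be passed in place of ET.fromstring(SAMPLE). Note that in this sketch a stroke with no stroke-label children produces no rows at all, and a missing field such as label-order simply comes out empty; adjust that if you need one row per stroke regardless.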