Extracting tags and attributes from an XML file

I have an XML file and need to extract some of the elements in it.

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<!DOCTYPE label-information SYSTEM ".\Label-Information.dtd">
<label-information date="20230116" time="0530" record-count="53866" run-number="2571">
    <stroke date-last-modified="21-SEP-22">
            <stroke-number>5358I</stroke-number>
            <stroke-description>Regular Fit Linen Blend Trouser</stroke-description>
            <contract-number>20880409</contract-number>
            <contract-status>A</contract-status>
            <department-number>T17</department-number>
            <season>WI22</season>
            <supplier-series>9763</supplier-series>
            <country-code>BD</country-code>
            <Factory-code>1000003577</Factory-code>
            <productdesc>TROUSERS</productdesc>
            <stroke-label>
                <label-ref>TRL02</label-ref>
                <label-category>Hanging</label-category>
                <label-type>T</label-type>
                <label-order>1</label-order>
            </stroke-label>
            <stroke-label>
                <label-ref>K8F/M12302</label-ref>
                <label-category>Care Label (Format\Size)</label-category>
                <label-type>K</label-type>
                <label-order>1</label-order>
                <set-name>Value Linen Blend Trouser</set-name>
            </stroke-label>
    </stroke>
    <stroke date-last-modified="21-SEP-22">
            <stroke-number>5358I</stroke-number>
            <stroke-description>Regular Fit Linen Blend Trouser</stroke-description>
            <contract-number>20880408</contract-number>
            <contract-status>A</contract-status>
            <department-number>T17</department-number>
            <season>WI22</season>
            <supplier-series>4563</supplier-series>
            <country-code>BD</country-code>
            <Factory-code>1000003577</Factory-code>
            <productdesc>TROUSERS</productdesc>
            <stroke-label>
                <label-ref>TRL02</label-ref>
                <label-category>Hanging</label-category>
                <label-type>T</label-type>
                <label-order>1</label-order>
            </stroke-label>
    </stroke>
    <stroke date-last-modified="13-APR-21">
            <stroke-number>6388O</stroke-number>
            <stroke-description>TRAVEL CHINO SLIM FIT</stroke-description>
            <contract-number>20851293</contract-number>
            <contract-status>A</contract-status>
            <department-number>T17</department-number>
            <season>WI22</season>
            <supplier-series>9763</supplier-series>
            <country-code>BD</country-code>
            <Factory-code>1000003577</Factory-code>
            <productdesc>TROUSERS</productdesc>
            <stroke-label>
                <label-ref>TRL02</label-ref>
                <label-category>Hanging</label-category>
                <label-type>T</label-type>
                <label-order>1</label-order>
            </stroke-label>
            <stroke-label>
                <label-ref>MS-CL1836B</label-ref>
                <label-category>Frames / Hangers / Hooks &amp; Loops</label-category>
                <label-type>C</label-type>
            </stroke-label>
            <stroke-label>
                <label-ref>MS-ZPCU</label-ref>
                <label-category>Frames / Hangers / Hooks &amp; Loops</label-category>
                <label-type>C</label-type>
            </stroke-label>
            <stroke-label>
                <label-ref>UPC/M11394</label-ref>
                <label-category>UPC / Barcode &amp; Price Labels &amp; Tickets</label-category>
                <label-type>C</label-type>
            </stroke-label>
    </stroke>
</label-information>

I need to extract each element separately and save it into a pandas DataFrame.

import xml.etree.ElementTree as ET
import pandas as pd

# parse the file once and get the root element
tree = ET.parse('PLML.xml')
root = tree.getroot()

# create a list to store the data
data = []

# find all elements with tag "stroke"
strokes = root.findall('.//stroke')

# iterate through the strokes and store the data in a list
for stroke in strokes:
    stroke_number = stroke.find("stroke-number").text
    contract_number = stroke.find("contract-number").text
    department_number = stroke.find("department-number").text
    country_code = stroke.find("country-code").text
    data.append([stroke_number, contract_number, department_number,
                 country_code])

# convert the list to a DataFrame (once, after the loop has finished)
df = pd.DataFrame(data, columns=["stroke_number",
                                 "contract_number", "department_number", "country_code"])

# write the DataFrame to an Excel file
df.to_excel("plmlfile.xlsx", index=False)

I need to extract elements such as stroke-number, contract-number and productdesc, together with the label fields (label-ref, label-category, label-type, label-order).

I tried iterating over the root variable with a for loop, but I could not capture the specified elements and their values as I expected. Can someone help me with this?

The final output has to be:

[image of the expected output not included]

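Given your expected output, it can be done fairly simply using pandas.read_xml:

from io import StringIO
import pandas as pd
strokes_doc = """[your xml above]"""
# recent pandas versions expect a file-like object rather than a raw XML string
df = pd.read_xml(StringIO(strokes_doc), xpath="//stroke")
df.iloc[:, [1, 3, 5, 8]]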

Output, based on your sample XML:

   stroke-number  contract-number  department-number  country-code
0  5358I          20880409         T17                BD
1  5358I          20880408         T17                BD
2  6388O          20851293         T17                BD
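
If the label fields mentioned in the question (label-ref, label-category, label-type, label-order) are also needed, one possible approach is to build one row per stroke-label and repeat the parent stroke's values on each row. The following is only a sketch under some assumptions: the helper name labels_to_frame and the constants STROKE_FIELDS and LABEL_FIELDS are made up for illustration, the sample string is a cut-down copy of the third stroke from the question, and a field missing from a label (label-order here) is simply left empty.

```python
import xml.etree.ElementTree as ET
import pandas as pd

# Cut-down copy of one <stroke> from the question, used only for illustration
SAMPLE_XML = """<label-information>
    <stroke date-last-modified="13-APR-21">
        <stroke-number>6388O</stroke-number>
        <contract-number>20851293</contract-number>
        <productdesc>TROUSERS</productdesc>
        <stroke-label>
            <label-ref>TRL02</label-ref>
            <label-category>Hanging</label-category>
            <label-type>T</label-type>
            <label-order>1</label-order>
        </stroke-label>
        <stroke-label>
            <label-ref>MS-CL1836B</label-ref>
            <label-category>Frames / Hangers / Hooks &amp; Loops</label-category>
            <label-type>C</label-type>
        </stroke-label>
    </stroke>
</label-information>"""

# Hypothetical field lists; adjust to the columns actually required
STROKE_FIELDS = ["stroke-number", "contract-number", "productdesc"]
LABEL_FIELDS = ["label-ref", "label-category", "label-type", "label-order"]


def labels_to_frame(root):
    """Return one row per <stroke-label>, repeating the parent stroke's fields."""
    rows = []
    for stroke in root.iter("stroke"):
        # findtext returns None when the child element is absent
        stroke_values = {name: stroke.findtext(name) for name in STROKE_FIELDS}
        for label in stroke.findall("stroke-label"):
            row = dict(stroke_values)
            row.update({name: label.findtext(name) for name in LABEL_FIELDS})
            rows.append(row)
    return pd.DataFrame(rows, columns=STROKE_FIELDS + LABEL_FIELDS)


df_labels = labels_to_frame(ET.fromstring(SAMPLE_XML))
print(df_labels)
```

For the real file, the same function should work with labels_to_frame(ET.parse('PLML.xml').getroot()). Note that a stroke with no stroke-label children produces no rows in this sketch.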

