
Parse xml w/ xsd to CSV with Python?

I am trying to parse a very large XML file that I downloaded from OSHA's website and convert it to a CSV so I can use it in a SQLite database along with some other spreadsheets. I would just use an online converter, but the OSHA file is apparently too big for all of them.

I wrote a script in Python which looks like this:

import csv
import xml.etree.cElementTree as ET
tree = ET.parse('data.xml')
root = tree.getroot()

xml_data_to_csv =open('Out.csv', 'w')

list_head=[]

Csv_writer=csv.writer(xml_data_to_csv)

count=0
for element in root.findall('data'): 
    List_nodes =[]

    if count== 0:
        inspection_number = element.find('inspection_number').tag
        list_head.append(inspection_number)
        
        establishment_name = element.find('establishment_name').tag
        list_head.append(establishment_name)
        
        city = element.find('city')
        list_head.append(city)

        state = element.find('state')
        list_head.append(state)
        
        zip_code = element.find('zip_code')
        list_head.append(zip_code)
        
        sic_code = element.find('sic_code')
        list_head.append(sic_code)
        
        naics_code = element.find('naics_code')
        list_head.append(naics_code)
        
        sampling_number = element.find('sampling_number')
        list_head.append(sampling_number)
        
        office_id = element.find('office_id')
        list_head.append(office_id)
        
        date_sampled = element.find('date_sampled')
        list_head.append(date_sampled)
        
        date_reported = element.find('date_reported')
        list_head.append(date_reported)
        
        eight_hour_twa_calc = element.find('eight_hour_twa_calc')
        list_head.append(eight_hour_twa_calc)
        
        instrument_type = element.find('instrument_type')
        list_head.append(instrument_type)
        
        lab_number = element.find('lab_number')
        list_head.append(lab_number)
        
        field_number = element.find('field_number')
        list_head.append(field_number)
        
        sample_type = element.find('sample_type')
        list_head.append(sample_type)
        
        blank_used = element.find('blank_used')
        list_head.append(blank_used)
        
        time_sampled = element.find('time_sampled')
        list_head.append(time_sampled)
        
        air_volume_sampled = element.find('air_volume_sampled')
        list_head.append(air_volume_sampled)
        
        sample_weight = element.find('sample_weight')
        list_head.append(sample_weight)
        
        imis_substance_code = element.find('imis_substance_code')
        list_head.append(imis_substance_code)
        
        substance = element.find('substance')
        list_head.append(substance)
        
        sample_result = element.find('sample_result')
        list_head.append(sample_result)
        
        unit_of_measurement = element.find('unit_of_measurement')
        list_head.append(unit_of_measurement)
        
        qualifier = element.find('qualifier')
        list_head.append(qualifier)

        Csv_writer.writerow(list_head)
        count = +1

    inspection_number = element.find('inspection_number').text
    List_nodes.append(inspection_number)

    establishment_name = element.find('establishment_name').text
    List_nodes.append(establishment_name)

    city = element.find('city').text
    List_nodes.append(city)

    state = element.find('state').text
    List_nodes.append(state)

    zip_code = element.find('zip_code').text
    List_nodes.append(zip_code)    

    sic_code = element.find('sic_code').text
    List_nodes.append(sic_code)

    naics_code = element.find('naics_code').text
    List_nodes.append(naics_code)

    sampling_number = element.find('sampling_number').text
    List_nodes.append(sampling_number)

    office_id = element.find('office_id').text
    List_nodes.append(office_id)

    date_sampled = element.find('date_sampled').text
    List_nodes.append(date_sampled)

    date_reported = element.find('date_reported').text
    List_nodes.append(date_reported)

    eight_hour_twa_calc = element.find('eight_hour_twa_calc').text
    List_nodes.append(eight_hour_twa_calc)    
    
    instrument_type = element.find('instrument_type').text
    List_nodes.append(instrument_type)

    lab_number = element.find('lab_number').text
    List_nodes.append(lab_number)

    field_number = element.find('field_number').text
    List_nodes.append(field_number)

    sample_type = element.find('sample_type').text
    List_nodes.append(sample_type)

    blank_used = element.find('blank_used').text
    List_nodes.append()

    time_sampled = element.find('time_sampled').text
    List_nodes.append(time_sampled)

    air_volume_sampled = element.find('air_volume_sampled').text
    List_nodes.append(air_volume_sampled)    
    
    sample_weight = element.find('sample_weight').text
    List_nodes.append(sample_weight)

    imis_substance_code = element.find('imis_substance_code').text
    List_nodes.append(imis_substance_code)

    substance = element.find('substance').text
    List_nodes.append(substance)

    sample_result = element.find('sample_result').text
    List_nodes.append(sample_result)

    unit_of_measurement = element.find('unit_of_measurement').text 
    List_nodes.append(unit_of_measurement)

    qualifier= element.find('qualifier').text
    List_nodes.append(qualifier)

    Csv_writer.writerow(List_nodes)

xml_data_to_csv.close()

But when I run the code I get a CSV with nothing in it. I suspect this may have something to do with the XSD file associated with the XML, but I'm not totally sure.

Does anyone know what the issue is here?

The code below is a 'compact' version of your code.

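It assumes that the XML has the structure shown in the script's xml variable (based on https://www.osha.gov/opengov/sample_data_2011.zip ). If your file has that structure, each record is a DATA_RECORD element, so root.findall('data') in your script matches nothing, the loop body never runs, and the CSV stays empty. The XSD is not the cause: ElementTree does not read or validate against it.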
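A quick way to confirm which tag names a file really uses is to list the root's children before writing any extraction code. The sketch below runs on a hypothetical, cut-down two-record document in the same shape as the sample (the variable names sample, child_tags, missing and found are mine, not from the original script); for the real file you would build root with ET.parse instead.

```python
import xml.etree.ElementTree as ET

# Hypothetical, cut-down document in the same shape as the OSHA sample.
sample = '''<main>
  <DATA_RECORD><lab_number>L13645</lab_number></DATA_RECORD>
  <DATA_RECORD><lab_number>L13355</lab_number></DATA_RECORD>
</main>'''

root = ET.fromstring(sample)

# Distinct tag names of the root's direct children.
child_tags = sorted({child.tag for child in root})
print(child_tags)

# findall() matches tag names exactly and case-sensitively, so a wrong
# name yields an empty list without any error and the loop never runs.
missing = root.findall('data')
found = root.findall('DATA_RECORD')
print(len(missing), len(found))
```

An empty list from findall raises no exception, which is why a wrong tag name shows up only as an empty output file.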

The main difference between this sample code and yours is that I define the fields I want to collect once (see FIELDS ) and reuse that definition throughout the script.

import xml.etree.ElementTree as ET

FIELDS = ['lab_number', 'instrument_type']  # TODO add more fields

xml = '''<main xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="health_sample_data.xsd">
  <DATA_RECORD>
    <inspection_number>316180165</inspection_number>
    <establishment_name>PROFESSIONAL ENGINEERING SERVICES, LLC.</establishment_name>
    <city>EUFAULA</city>
    <state>AL</state>
    <zip_code>36027</zip_code>
    <sic_code>1799</sic_code>
    <naics_code>238990</naics_code>
    <sampling_number>434866166</sampling_number>
    <office_id>418600</office_id>
    <date_sampled>2011-12-30</date_sampled>
    <date_reported>2011-12-30</date_reported>
    <eight_hour_twa_calc>N</eight_hour_twa_calc>
    <instrument_type>TBD</instrument_type>
    <lab_number>L13645</lab_number>
    <field_number>S1</field_number>
    <sample_type>B</sample_type>
    <blank_used>N</blank_used>
    <time_sampled></time_sampled>
    <air_volume_sampled></air_volume_sampled>
    <sample_weight></sample_weight>
    <imis_substance_code>S777</imis_substance_code>
    <substance>Soil</substance>
    <sample_result>0</sample_result>
    <unit_of_measurement>AAAAA</unit_of_measurement>
    <qualifier></qualifier>
  </DATA_RECORD>
  <DATA_RECORD>
    <inspection_number>315516757</inspection_number>
    <establishment_name>MARGUERITE CONCRETE CO.</establishment_name>
    <city>WORCESTER</city>
    <state>MA</state>
    <zip_code>1608</zip_code>
    <sic_code>1771</sic_code>
    <naics_code>238110</naics_code>
    <sampling_number>423259902</sampling_number>
    <office_id>112600</office_id>
    <date_sampled>2011-12-30</date_sampled>
    <date_reported>2011-12-30</date_reported>
    <eight_hour_twa_calc>N</eight_hour_twa_calc>
    <instrument_type>GRAV</instrument_type>
    <lab_number>L13355</lab_number>
    <field_number>9831B</field_number>
    <sample_type>P</sample_type>
    <blank_used>N</blank_used>
    <time_sampled>184</time_sampled>
    <air_volume_sampled>340.4</air_volume_sampled>
    <sample_weight>.06</sample_weight>
    <imis_substance_code>9135</imis_substance_code>
    <substance>Particulates not otherwise regulated (Total Dust)</substance>
    <sample_result>0.176</sample_result>
    <unit_of_measurement>M</unit_of_measurement>
    <qualifier></qualifier>
  </DATA_RECORD></main>'''

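import csv  # would normally sit with the imports at the top of the script

root = ET.fromstring(xml)
records = root.findall('.//DATA_RECORD')
# newline='' lets the csv module control the line endings
with open('out.csv', 'w', newline='') as out:
    writer = csv.writer(out)  # quotes values that contain commas
    writer.writerow(FIELDS)
    for record in records:
        # findtext() returns '' for empty elements such as <qualifier></qualifier>
        writer.writerow([record.findtext(f, default='') for f in FIELDS])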

out.csv

lab_number,instrument_type
L13645,TBD
L13355,GRAV
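Since the question mentions a very large file, it may be worth streaming the records instead of loading the whole tree. The following is a minimal sketch using ET.iterparse, assuming the file has the DATA_RECORD structure shown above; the function name xml_to_csv and its parameters are hypothetical, and the in-memory demo data stands in for the real file.

```python
import csv
import io
import xml.etree.ElementTree as ET

FIELDS = ['lab_number', 'instrument_type']  # extend as needed


def xml_to_csv(source, out, record_tag='DATA_RECORD', fields=FIELDS):
    """Stream records from source (path or binary file object) into out as CSV.

    Returns the number of data rows written.
    """
    writer = csv.writer(out)
    writer.writerow(fields)
    rows = 0
    # By default iterparse reports "end" events, i.e. an element is
    # yielded once it and all of its children have been read.
    for _event, elem in ET.iterparse(source):
        if elem.tag == record_tag:
            # findtext() gives '' for an empty element; default covers a missing one.
            writer.writerow([elem.findtext(f, default='') for f in fields])
            rows += 1
            # Drop the record's children; the emptied element itself
            # stays attached to the root.
            elem.clear()
    return rows


# Tiny in-memory demonstration. For the real file, pass its path as source
# and an output opened with open('out.csv', 'w', newline='').
demo = io.BytesIO(
    b'<main>'
    b'<DATA_RECORD><lab_number>L13645</lab_number>'
    b'<instrument_type>TBD</instrument_type></DATA_RECORD>'
    b'<DATA_RECORD><lab_number>L13355</lab_number>'
    b'<instrument_type></instrument_type></DATA_RECORD>'
    b'</main>'
)
buf = io.StringIO()
count = xml_to_csv(demo, buf)
print(count)
print(buf.getvalue())
```

Clearing each record after it is written keeps only empty shells in memory rather than every field of every record, which is usually enough for files that do not fit comfortably as a full tree.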
