Parse xml w/ xsd to CSV with Python?

Question

I am trying to parse a very large XML file that I downloaded from OSHA's website and convert it to CSV so I can load it into a SQLite database along with some other spreadsheets. I would just use an online converter, but the OSHA file is apparently too big for all of them.

I wrote a script in Python which looks like this:

import csv
import xml.etree.cElementTree as ET
tree = ET.parse('data.xml')
root = tree.getroot()

xml_data_to_csv =open('Out.csv', 'w')

list_head=[]

Csv_writer=csv.writer(xml_data_to_csv)

count=0
for element in root.findall('data'): 
    List_nodes =[]

    if count== 0:
        inspection_number = element.find('inspection_number').tag
        list_head.append(inspection_number)
        
        establishment_name = element.find('establishment_name').tag
        list_head.append(establishment_name)
        
        city = element.find('city')
        list_head.append(city)

        state = element.find('state')
        list_head.append(state)
        
        zip_code = element.find('zip_code')
        list_head.append(zip_code)
        
        sic_code = element.find('sic_code')
        list_head.append(sic_code)
        
        naics_code = element.find('naics_code')
        list_head.append(naics_code)
        
        sampling_number = element.find('sampling_number')
        list_head.append(sampling_number)
        
        office_id = element.find('office_id')
        list_head.append(office_id)
        
        date_sampled = element.find('date_sampled')
        list_head.append(date_sampled)
        
        date_reported = element.find('date_reported')
        list_head.append(date_reported)
        
        eight_hour_twa_calc = element.find('eight_hour_twa_calc')
        list_head.append(eight_hour_twa_calc)
        
        instrument_type = element.find('instrument_type')
        list_head.append(instrument_type)
        
        lab_number = element.find('lab_number')
        list_head.append(lab_number)
        
        field_number = element.find('field_number')
        list_head.append(field_number)
        
        sample_type = element.find('sample_type')
        list_head.append(sample_type)
        
        blank_used = element.find('blank_used')
        list_head.append(blank_used)
        
        time_sampled = element.find('time_sampled')
        list_head.append(time_sampled)
        
        air_volume_sampled = element.find('air_volume_sampled')
        list_head.append(air_volume_sampled)
        
        sample_weight = element.find('sample_weight')
        list_head.append(sample_weight)
        
        imis_substance_code = element.find('imis_substance_code')
        list_head.append(imis_substance_code)
        
        substance = element.find('substance')
        list_head.append(substance)
        
        sample_result = element.find('sample_result')
        list_head.append(sample_result)
        
        unit_of_measurement = element.find('unit_of_measurement')
        list_head.append(unit_of_measurement)
        
        qualifier = element.find('qualifier')
        list_head.append(qualifier)

        Csv_writer.writerow(list_head)
        count = +1

    inspection_number = element.find('inspection_number').text
    List_nodes.append(inspection_number)

    establishment_name = element.find('establishment_name').text
    List_nodes.append(establishment_name)

    city = element.find('city').text
    List_nodes.append(city)

    state = element.find('state').text
    List_nodes.append(state)

    zip_code = element.find('zip_code').text
    List_nodes.append(zip_code)    

    sic_code = element.find('sic_code').text
    List_nodes.append(sic_code)

    naics_code = element.find('naics_code').text
    List_nodes.append(naics_code)

    sampling_number = element.find('sampling_number').text
    List_nodes.append(sampling_number)

    office_id = element.find('office_id').text
    List_nodes.append(office_id)

    date_sampled = element.find('date_sampled').text
    List_nodes.append(date_sampled)

    date_reported = element.find('date_reported').text
    List_nodes.append(date_reported)

    eight_hour_twa_calc = element.find('eight_hour_twa_calc').text
    List_nodes.append(eight_hour_twa_calc)    
    
    instrument_type = element.find('instrument_type').text
    List_nodes.append(instrument_type)

    lab_number = element.find('lab_number').text
    List_nodes.append(lab_number)

    field_number = element.find('field_number').text
    List_nodes.append(field_number)

    sample_type = element.find('sample_type').text
    List_nodes.append(sample_type)

    blank_used = element.find('blank_used').text
    List_nodes.append()

    time_sampled = element.find('time_sampled').text
    List_nodes.append(time_sampled)

    air_volume_sampled = element.find('air_volume_sampled').text
    List_nodes.append(air_volume_sampled)    
    
    sample_weight = element.find('sample_weight').text
    List_nodes.append(sample_weight)

    imis_substance_code = element.find('imis_substance_code').text
    List_nodes.append(imis_substance_code)

    substance = element.find('substance').text
    List_nodes.append(substance)

    sample_result = element.find('sample_result').text
    List_nodes.append(sample_result)

    unit_of_measurement = element.find('unit_of_measurement').text 
    List_nodes.append(unit_of_measurement)

    qualifier= element.find('qualifier').text
    List_nodes.append(qualifier)

    Csv_writer.writerow(List_nodes)

xml_data_to_csv.close()

But when I run the code I get a CSV with nothing in it. I suspect this may have something to do with the XSD file associated with the XML, but I'm not totally sure.

Does anyone know what the issue is here?

Answer 1

The code below is a 'compact' version of your code.

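It assumes that the XML is structured like the string in the script's xml variable (based on https://www.osha.gov/opengov/sample_data_2011.zip ). In that structure each record is a DATA_RECORD element, so root.findall('data') in your script matches nothing, the loop body never runs, and the CSV stays empty. The XSD is not involved: ElementTree does not read it or validate against it.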
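A quick way to check that assumption is to look at the root tag and count the tags of its direct children. The sketch below does this on a tiny stand-in string; summarize_children and sample are names made up for illustration, and the real file would be read from disk instead.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def summarize_children(xml_text):
    """Return the root tag and a count of each direct child tag."""
    root = ET.fromstring(xml_text)
    return root.tag, Counter(child.tag for child in root)


# Tiny stand-in document with the same shape as the OSHA sample (assumption).
sample = (
    "<main>"
    "<DATA_RECORD><city>EUFAULA</city></DATA_RECORD>"
    "<DATA_RECORD><city>WORCESTER</city></DATA_RECORD>"
    "</main>"
)

root_tag, child_counts = summarize_children(sample)
print(root_tag, dict(child_counts))  # main {'DATA_RECORD': 2}

# No child is named 'data', so a loop over this result never executes.
print(len(ET.fromstring(sample).findall('data')))  # 0
```

If the child tag printed for your file is something other than DATA_RECORD, use that name in findall instead.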

The main difference between my script and yours is that I define the fields I want to collect once (see FIELDS ) and use that definition throughout the script.

import csv
import xml.etree.ElementTree as ET

FIELDS = ['lab_number', 'instrument_type']  # TODO add more fields

xml = '''<main xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="health_sample_data.xsd">
  <DATA_RECORD>
    <inspection_number>316180165</inspection_number>
    <establishment_name>PROFESSIONAL ENGINEERING SERVICES, LLC.</establishment_name>
    <city>EUFAULA</city>
    <state>AL</state>
    <zip_code>36027</zip_code>
    <sic_code>1799</sic_code>
    <naics_code>238990</naics_code>
    <sampling_number>434866166</sampling_number>
    <office_id>418600</office_id>
    <date_sampled>2011-12-30</date_sampled>
    <date_reported>2011-12-30</date_reported>
    <eight_hour_twa_calc>N</eight_hour_twa_calc>
    <instrument_type>TBD</instrument_type>
    <lab_number>L13645</lab_number>
    <field_number>S1</field_number>
    <sample_type>B</sample_type>
    <blank_used>N</blank_used>
    <time_sampled></time_sampled>
    <air_volume_sampled></air_volume_sampled>
    <sample_weight></sample_weight>
    <imis_substance_code>S777</imis_substance_code>
    <substance>Soil</substance>
    <sample_result>0</sample_result>
    <unit_of_measurement>AAAAA</unit_of_measurement>
    <qualifier></qualifier>
  </DATA_RECORD>
  <DATA_RECORD>
    <inspection_number>315516757</inspection_number>
    <establishment_name>MARGUERITE CONCRETE CO.</establishment_name>
    <city>WORCESTER</city>
    <state>MA</state>
    <zip_code>1608</zip_code>
    <sic_code>1771</sic_code>
    <naics_code>238110</naics_code>
    <sampling_number>423259902</sampling_number>
    <office_id>112600</office_id>
    <date_sampled>2011-12-30</date_sampled>
    <date_reported>2011-12-30</date_reported>
    <eight_hour_twa_calc>N</eight_hour_twa_calc>
    <instrument_type>GRAV</instrument_type>
    <lab_number>L13355</lab_number>
    <field_number>9831B</field_number>
    <sample_type>P</sample_type>
    <blank_used>N</blank_used>
    <time_sampled>184</time_sampled>
    <air_volume_sampled>340.4</air_volume_sampled>
    <sample_weight>.06</sample_weight>
    <imis_substance_code>9135</imis_substance_code>
    <substance>Particulates not otherwise regulated (Total Dust)</substance>
    <sample_result>0.176</sample_result>
    <unit_of_measurement>M</unit_of_measurement>
    <qualifier></qualifier>
  </DATA_RECORD></main>'''

root = ET.fromstring(xml)
records = root.findall('.//DATA_RECORD')
# newline='' lets the csv module control the line endings
with open('out.csv', 'w', newline='') as out:
    # csv.writer quotes values that contain commas, e.g. establishment_name
    writer = csv.writer(out)
    writer.writerow(FIELDS)
    for record in records:
        # findtext gives '' for empty or missing elements instead of None
        writer.writerow([record.findtext(f, default='') for f in FIELDS])

out.csv

lab_number,instrument_type
L13645,TBD
L13355,GRAV
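Since the real file is very large, loading the whole tree into memory may not be practical. A minimal sketch of a streaming variant follows, assuming the same DATA_RECORD structure as above; xml_to_csv and demo_xml are hypothetical names used only for this illustration. It uses ET.iterparse to handle one record at a time and clears each record after writing it.

```python
import csv
import io
import xml.etree.ElementTree as ET

FIELDS = ['inspection_number', 'establishment_name', 'city', 'state']  # extend as needed


def xml_to_csv(source, out, record_tag='DATA_RECORD', fields=FIELDS):
    """Stream records from source (a path or binary file object) to CSV on out.

    Returns the number of records written.
    """
    writer = csv.writer(out)
    writer.writerow(fields)
    written = 0
    # iterparse yields each element once its closing tag has been read
    for _event, elem in ET.iterparse(source, events=('end',)):
        if elem.tag != record_tag:
            continue
        writer.writerow([elem.findtext(f, default='') for f in fields])
        written += 1
        elem.clear()  # drop the record's children to keep memory use low
    return written


# Small in-memory demo; the structure mirrors the sample above (assumption).
demo_xml = b"""<main>
  <DATA_RECORD>
    <inspection_number>316180165</inspection_number>
    <establishment_name>PROFESSIONAL ENGINEERING SERVICES, LLC.</establishment_name>
    <city>EUFAULA</city>
    <state>AL</state>
  </DATA_RECORD>
</main>"""

buffer = io.StringIO(newline='')
count = xml_to_csv(io.BytesIO(demo_xml), buffer)
print(count)
print(buffer.getvalue())
```

For the real file, the call would look something like xml_to_csv('data.xml', out) inside a with open('out.csv', 'w', newline='') as out: block.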

solution1
0 2019-12-04 08:45:41
