
Multiple XML files to CSV using Python

I am trying to extract specific tags from XML files and convert them to a CSV file. I was able to do this for a single XML file, extracting every identifier tag in the file. My questions are: 1) how do I extract from multiple XML files into a single CSV file, and 2) the required tag appears more than once in each record of the given XML file, so how do I extract only the first identifier tag from each record tag?

I am using Python 3.7.

The required output is:

<identifier>oai:union.ndltd.org:upv.es/oai:riunet.upv.es:10251/31652</identifier>
<identifier>oai:union.ndltd.org:upv.es/oai:riunet.upv.es:10251/32667</identifier>
<identifier>oai:union.ndltd.org:upv.es/oai:riunet.upv.es:10251/32953</identifier>
<identifier>oai:union.ndltd.org:upv.es/oai:riunet.upv.es:10251/56906</identifier>
<identifier>oai:union.ndltd.org:upv.es/oai:riunet.upv.es:10251/57282</identifier>

Note: I am not a programmer! I appreciate your kind help.

from bs4 import BeautifulSoup as b
import itertools
import os
import csv
import pandas as pd


os.chdir(r"C:*test")

with open("aaaaahbc.xml", "r", encoding="utf-8") as f: # opening xml file
    content = f.read()

soup = b(content, 'lxml')
identifier =  [ values.text for values in soup.findAll("identifier")]

# For python-3.x use `zip_longest` method
# For python-2.x use 'izip_longest method

data = [item for item in itertools.zip_longest(identifier)] 
df  = pd.DataFrame(data=data)
df.to_csv("aaaaahbc.csv",index=True, header=False)

XML file example:

<?xml version="1.0" encoding="UTF-8"?>
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/
         http://www.openarchives.org/OAI/2.0/OAI-PMH.xsd">
 <responseDate>2020-06-12T05:26:49Z</responseDate>
 <request verb="ListRecords" resumptionToken="2020-05-23T03:32:50Z!2037-01-01T00:00:00Z!!oai_dc!7334186!7353566!oai:union.ndltd.org:upv.es/oai:riunet.upv.es:10251/31648">
    http://union.ndltd.org:8080/union.OAI-PMH/</request>
 <ListRecords>
  <record>
<header>
<identifier>oai:union.ndltd.org:upv.es/oai:riunet.upv.es:10251/31652</identifier>
<datestamp>2020-05-23T03:32:50Z</datestamp>
<setSpec>upv.es</setSpec>
</header>
<metadata>
<oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/" xmlns:doc="http://www.lyncode.com/xoai" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:dc="http://purl.org/dc/elements/1.1/" xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai_dc/ http://www.openarchives.org/OAI/2.0/oai_dc.xsd">
<dc:title>Influencia de la grasa en las propiedades físicas y sensoriales de galletas. Alternativas para la mejora del perfil de acidos grasos</dc:title>
<dc:creator>Tarancón Serrano, Paula Isabel</dc:creator>
<dc:contributor>Salvador Alcaraz, Ana</dc:contributor>
<dc:contributor>Sanz Taberner, Teresa</dc:contributor>
<dc:contributor>Tarrega Guillem, Amparo</dc:contributor>
<dc:contributor>Universitat Politècnica de València. Escuela Técnica Superior del Medio Rural y Enología - Escola Tècnica Superior del Medi Rural i Enologia</dc:contributor>
<dc:contributor>Universitat Politècnica de València. Instituto Universitario de Ingeniería de Alimentos para el Desarrollo - Institut Universitari d'Enginyeria d'Aliments per al Desenvolupament</dc:contributor>
<dc:subject>Galletas</dc:subject>
<dc:subject>Grasa</dc:subject>
<dc:subject>Propiedades sensoriales</dc:subject>
<dc:subject>Propiedades físicas</dc:subject>
<dc:subject>Mejora del perfil de ácidos grasos</dc:subject>
<dc:date>2013-09-02</dc:date>
<dc:type>info:eu-repo/semantics/doctoralThesis</dc:type>
<dc:type>info:eu-repo/semantics/acceptedVersion</dc:type>
<dc:identifier>http://hdl.handle.net/10251/31652</dc:identifier>
<dc:identifier>10.4995/Thesis/10251/31652</dc:identifier>
<dc:language>spa</dc:language>
<dc:rights>Reserva de todos los derechos</dc:rights>
<dc:rights>info:eu-repo/semantics/openAccess</dc:rights>
<dc:source>Riunet</dc:source>
</oai_dc:dc>

</metadata>
<about>
<provenance
xmlns="http://www.openarchives.org/OAI/2.0/provenance"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/provenance 
http://www.openarchives.org/OAI/2.0/provenance.xsd">
 <originDescription harvestDate="2020-05-23T03:32:50Z" altered="false">
  <baseURL>https://riunet.upv.es/oai/request</baseURL>
  <identifier>oai:riunet.upv.es:10251/31652</identifier>
  <datestamp>2020-05-22T09:32:33Z</datestamp>
  <metadataNamespace>http://www.openarchives.org/OAI/2.0/oai_dc/</metadataNamespace>
 </originDescription>
</provenance>

</about></record>
  <record>
<header>
<identifier>oai:union.ndltd.org:upv.es/oai:riunet.upv.es:10251/32667</identifier>
<datestamp>2020-05-23T03:32:50Z</datestamp>
<setSpec>upv.es</setSpec>
</header>
<metadata>
<oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/" xmlns:doc="http://www.lyncode.com/xoai" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:dc="http://purl.org/dc/elements/1.1/" xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai_dc/ http://www.openarchives.org/OAI/2.0/oai_dc.xsd">
<dc:title>Sensores químicos cromogénicos y fluorogénicos para la detección de cationes y aniones</dc:title>
<dc:creator>Ábalos Aguado, Tatiana</dc:creator>
<dc:contributor>Martínez Mañez, Ramón</dc:contributor>
<dc:contributor>Sancenón Galarza, Félix</dc:contributor>
<dc:contributor>Universitat Politècnica de València. Departamento de Química - Departament de Química</dc:contributor>
<dc:subject>Sensores cromogénicos</dc:subject>
<dc:subject>Sensores fluorogénicos</dc:subject>
<dc:subject>Cationes</dc:subject>
<dc:subject>Aniones</dc:subject>
<dc:subject>Química supramolecular</dc:subject>
<dc:subject>QUIMICA INORGANICA</dc:subject>
<dc:subject>QUIMICA ORGANICA</dc:subject>
<dc:date>2013-10-07</dc:date>
<dc:type>info:eu-repo/semantics/doctoralThesis</dc:type>
<dc:type>info:eu-repo/semantics/acceptedVersion</dc:type>
<dc:identifier>http://hdl.handle.net/10251/32667</dc:identifier>
<dc:identifier>10.4995/Thesis/10251/32667</dc:identifier>
<dc:language>spa</dc:language>
<dc:rights>Reserva de todos los derechos</dc:rights>
<dc:rights>info:eu-repo/semantics/openAccess</dc:rights>
<dc:source>Riunet</dc:source>
</oai_dc:dc>

</metadata>
<about>
<provenance
xmlns="http://www.openarchives.org/OAI/2.0/provenance"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/provenance 
http://www.openarchives.org/OAI/2.0/provenance.xsd">
 <originDescription harvestDate="2020-05-23T03:32:50Z" altered="false">
  <baseURL>https://riunet.upv.es/oai/request</baseURL>
  <identifier>oai:riunet.upv.es:10251/32667</identifier>
  <datestamp>2020-05-22T10:52:59Z</datestamp>
  <metadataNamespace>http://www.openarchives.org/OAI/2.0/oai_dc/</metadataNamespace>
 </originDescription>
</provenance>

</about></record>
  <record>
<header>
<identifier>oai:union.ndltd.org:upv.es/oai:riunet.upv.es:10251/32953</identifier>
<datestamp>2020-05-23T03:32:50Z</datestamp>
<setSpec>upv.es</setSpec>
</header>
<metadata>
<oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/" xmlns:doc="http://www.lyncode.com/xoai" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:dc="http://purl.org/dc/elements/1.1/" xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai_dc/ http://www.openarchives.org/OAI/2.0/oai_dc.xsd">
<dc:title>Comparison of vacuum treatments and traditional cooking in vegetables using instrumental and sensory analysis</dc:title>
<dc:creator>Iborra Bernad, María del Consuelo</dc:creator>
<dc:contributor>García Segovia, Purificación</dc:contributor>
<dc:contributor>Martínez Monzó, Javier</dc:contributor>
<dc:contributor>Universitat Politècnica de València. Departamento de Tecnología de Alimentos - Departament de Tecnologia d'Aliments</dc:contributor>
<dc:subject>Instrumental texture</dc:subject>
<dc:subject>Puncture test</dc:subject>
<dc:subject>Kramer cell test</dc:subject>
<dc:subject>Texture Profile Analysis</dc:subject>
<dc:subject>Color</dc:subject>
<dc:subject>Antioxidants</dc:subject>
<dc:subject>Anthocyanins</dc:subject>
<dc:subject>Carotenes</dc:subject>
<dc:subject>Ascorbic acid</dc:subject>
<dc:subject>Microstructure</dc:subject>
<dc:subject>Cooking treatment</dc:subject>
<dc:subject>Response Surface Methodology</dc:subject>
<dc:subject>Optimization</dc:subject>
<dc:subject>Sensory Analysis</dc:subject>
<dc:subject>Ranking test</dc:subject>
<dc:subject>Paired test</dc:subject>
<dc:subject>Just About Right</dc:subject>
<dc:subject>Flash Profile</dc:subject>
<dc:subject>Vacuum cooking</dc:subject>
<dc:subject>Sous-vide</dc:subject>
<dc:subject>Cook-vide</dc:subject>
<dc:subject>Vegetables</dc:subject>
<dc:subject>Purple-flesh potatoes</dc:subject>
<dc:subject>Carrots</dc:subject>
<dc:subject>Green beans</dc:subject>
<dc:subject>Red cabbage.</dc:subject>
<dc:subject>TECNOLOGIA DE ALIMENTOS</dc:subject>
<dc:description>Alfresco</dc:description>
<dc:date>2013-10-21</dc:date>
<dc:type>info:eu-repo/semantics/doctoralThesis</dc:type>
<dc:type>info:eu-repo/semantics/acceptedVersion</dc:type>
<dc:identifier>http://hdl.handle.net/10251/32953</dc:identifier>
<dc:identifier>10.4995/Thesis/10251/32953</dc:identifier>
<dc:language>eng</dc:language>
<dc:rights>Reserva de todos los derechos</dc:rights>
<dc:rights>info:eu-repo/semantics/openAccess</dc:rights>
<dc:source>Riunet</dc:source>
</oai_dc:dc>

</metadata>
<about>
<provenance
xmlns="http://www.openarchives.org/OAI/2.0/provenance"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/provenance 
http://www.openarchives.org/OAI/2.0/provenance.xsd">
 <originDescription harvestDate="2020-05-23T03:32:50Z" altered="false">
  <baseURL>https://riunet.upv.es/oai/request</baseURL>
  <identifier>oai:riunet.upv.es:10251/32953</identifier>
  <datestamp>2020-05-22T09:18:49Z</datestamp>
  <metadataNamespace>http://www.openarchives.org/OAI/2.0/oai_dc/</metadataNamespace>
 </originDescription>
</provenance>

</about></record>
  <record>
<header>
<identifier>oai:union.ndltd.org:upv.es/oai:riunet.upv.es:10251/56906</identifier>
<datestamp>2020-05-23T03:32:50Z</datestamp>
<setSpec>upv.es</setSpec>
</header>
<metadata>
<oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/" xmlns:doc="http://www.lyncode.com/xoai" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:dc="http://purl.org/dc/elements/1.1/" xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai_dc/ http://www.openarchives.org/OAI/2.0/oai_dc.xsd">
<dc:title>Anàlisi del discurs de la informàtica: aplicació a l'estudi de la descripció</dc:title>
<dc:creator>Montesinos López, Anna Isabel</dc:creator>
<dc:contributor>SALVADOR LIERN, VICENT MANUEL</dc:contributor>
<dc:contributor>Universitat Politècnica de València. Departamento de Lingüística Aplicada - Departament de Lingüística Aplicada</dc:contributor>
<dc:subject>Discurso</dc:subject>
<dc:subject>Informática</dc:subject>
<dc:subject>FILOLOGIA CATALANA</dc:subject>
<dc:date>2015-11-03</dc:date>
<dc:type>info:eu-repo/semantics/doctoralThesis</dc:type>
<dc:identifier>http://hdl.handle.net/10251/56906</dc:identifier>
<dc:identifier>10.4995/Thesis/10251/56906</dc:identifier>
<dc:language>cat</dc:language>
<dc:rights>Reserva de todos los derechos</dc:rights>
<dc:rights>info:eu-repo/semantics/openAccess</dc:rights>
<dc:source>Riunet</dc:source>
</oai_dc:dc>

</metadata>
<about>
<provenance
xmlns="http://www.openarchives.org/OAI/2.0/provenance"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/provenance 
http://www.openarchives.org/OAI/2.0/provenance.xsd">
 <originDescription harvestDate="2020-05-23T03:32:50Z" altered="false">
  <baseURL>https://riunet.upv.es/oai/request</baseURL>
  <identifier>oai:riunet.upv.es:10251/56906</identifier>
  <datestamp>2020-05-22T07:41:11Z</datestamp>
  <metadataNamespace>http://www.openarchives.org/OAI/2.0/oai_dc/</metadataNamespace>
 </originDescription>
</provenance>

</about></record>
  <record>
<header>
<identifier>oai:union.ndltd.org:upv.es/oai:riunet.upv.es:10251/57282</identifier>
<datestamp>2020-05-23T03:32:50Z</datestamp>
<setSpec>upv.es</setSpec>
</header>
<metadata>
<oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/" xmlns:doc="http://www.lyncode.com/xoai" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:dc="http://purl.org/dc/elements/1.1/" xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai_dc/ http://www.openarchives.org/OAI/2.0/oai_dc.xsd">
<dc:title>Herramientas para la generación y evaluación ex-ante de modelos de negocio.</dc:title>
<dc:creator>Mateu Céspedes, José María</dc:creator>
<dc:contributor>March Chordà, Isidre</dc:contributor>
<dc:contributor>Universitat Politècnica de València. Departamento de Ingeniería e Infraestructura de los Transportes - Departament d'Enginyeria i Infraestructura dels Transports</dc:contributor>
<dc:subject>Modelos de negocio</dc:subject>
<dc:subject>Evaluación ex-ante</dc:subject>
<dc:subject>INGENIERIA E INFRAESTRUCTURA DE LOS TRANSPORTES</dc:subject>
<dc:date>2015-11-10</dc:date>
<dc:type>info:eu-repo/semantics/doctoralThesis</dc:type>
<dc:identifier>http://hdl.handle.net/10251/57282</dc:identifier>
<dc:identifier>10.4995/Thesis/10251/57282</dc:identifier>
<dc:language>spa</dc:language>
<dc:rights>Reserva de todos los derechos</dc:rights>
<dc:rights>info:eu-repo/semantics/openAccess</dc:rights>
<dc:source>Riunet</dc:source>
</oai_dc:dc>

</metadata>
<about>
<provenance
xmlns="http://www.openarchives.org/OAI/2.0/provenance"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/provenance 
http://www.openarchives.org/OAI/2.0/provenance.xsd">
 <originDescription harvestDate="2020-05-23T03:32:50Z" altered="false">
  <baseURL>https://riunet.upv.es/oai/request</baseURL>
  <identifier>oai:riunet.upv.es:10251/57282</identifier>
  <datestamp>2020-05-22T10:29:52Z</datestamp>
  <metadataNamespace>http://www.openarchives.org/OAI/2.0/oai_dc/</metadataNamespace>
 </originDescription>
</provenance>

</about></record>
<resumptionToken completeListSize="7353566" cursor="7334186">2020-05-29T15:07:21Z!2037-01-01T00:00:00Z!!oai_dc!7335298!7353566!oai:union.ndltd.org:DRESDEN/oai:qucosa:de:qucosa:34876</resumptionToken> </ListRecords>
</OAI-PMH>

This script goes through every XML file in the directory ( *.xml ) and extracts the first <identifier> under each <record> tag:

import csv
import glob
from bs4 import BeautifulSoup

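all_data = []
for filename in glob.glob(r'*.xml'):
    # the example file declares UTF-8, so read it as UTF-8 on every platform
    with open(filename, 'r', encoding='utf-8') as f_in:
        soup = BeautifulSoup(f_in.read(), 'html.parser')
    print(filename)
    # in the example, only the <identifier> inside <header> is a first child
    for i in soup.select('record identifier:nth-child(1)'):
        print(i)
        all_data.append([filename, i.get_text(strip=True)])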

# write to csv file:
with open('data.csv', 'w', newline='') as csvfile:
    csv_writer = csv.writer(csvfile, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
    for row in all_data:
        csv_writer.writerow(row)
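
If installing BeautifulSoup is not an option, the same idea can probably be expressed with the standard library alone. The sketch below is an alternative, not the method used in the script above. It assumes every file uses the OAI-PMH namespace shown in the example, and the names OAI_NS, header_identifiers, collect_rows and SAMPLE are made up for illustration. It looks up header/identifier by path inside each record, so it does not rely on the identifier being the first child of its parent.

```python
import glob
import xml.etree.ElementTree as ET

# Namespace taken from the xmlns attribute of <OAI-PMH> in the example file.
OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}


def header_identifiers(root):
    """Return the header identifier text of every <record> below root."""
    found = []
    for record in root.iterfind(".//oai:record", OAI_NS):
        ident = record.find("oai:header/oai:identifier", OAI_NS)
        if ident is not None and ident.text:
            found.append(ident.text.strip())
    return found


def collect_rows(pattern="*.xml"):
    """Build [filename, identifier] rows for every file matching pattern."""
    rows = []
    for filename in sorted(glob.glob(pattern)):
        root = ET.parse(filename).getroot()
        rows.extend([filename, ident] for ident in header_identifiers(root))
    return rows


# Hypothetical, shortened stand-in for the example file.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
 <ListRecords>
  <record>
   <header><identifier>oai:example:1</identifier></header>
   <about>
    <provenance xmlns="http://www.openarchives.org/OAI/2.0/provenance">
     <originDescription><identifier>oai:other:1</identifier></originDescription>
    </provenance>
   </about>
  </record>
  <record>
   <header><identifier>oai:example:2</identifier></header>
  </record>
 </ListRecords>
</OAI-PMH>"""

sample_ids = header_identifiers(ET.fromstring(SAMPLE))
print(sample_ids)
```

The identifier inside <provenance> is skipped because it lives in a different namespace. The rows returned by collect_rows() have the same [filename, identifier] shape as all_data, so they could be passed to the same csv.writer loop.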

Prints (for example):

a1.xml
<identifier>oai:union.ndltd.org:upv.es/oai:riunet.upv.es:10251/31652</identifier>
<identifier>oai:union.ndltd.org:upv.es/oai:riunet.upv.es:10251/32667</identifier>
<identifier>oai:union.ndltd.org:upv.es/oai:riunet.upv.es:10251/32953</identifier>
<identifier>oai:union.ndltd.org:upv.es/oai:riunet.upv.es:10251/56906</identifier>
<identifier>oai:union.ndltd.org:upv.es/oai:riunet.upv.es:10251/57282</identifier>
a2.xml
<identifier>oai:union.ndltd.org:upv.es/oai:riunet.upv.es:10251/31652xxx</identifier>
<identifier>oai:union.ndltd.org:upv.es/oai:riunet.upv.es:10251/32667xxx</identifier>
<identifier>oai:union.ndltd.org:upv.es/oai:riunet.upv.es:10251/32953</identifier>
<identifier>oai:union.ndltd.org:upv.es/oai:riunet.upv.es:10251/56906</identifier>
<identifier>oai:union.ndltd.org:upv.es/oai:riunet.upv.es:10251/57282</identifier>

And saves data.csv, with one filename and identifier per row.
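
The row layout can also be checked without a spreadsheet. The following is a minimal sketch, assuming the same csv.writer settings as the script above. It writes two of the printed example rows to an in-memory buffer instead of data.csv, then reads them back and groups the identifiers by file. The names rows, buffer, csv_text and by_file are chosen here for illustration.

```python
import csv
import io
from collections import defaultdict

# Two rows in the [filename, identifier] layout, values taken from the printed example.
rows = [
    ["a1.xml", "oai:union.ndltd.org:upv.es/oai:riunet.upv.es:10251/31652"],
    ["a2.xml", "oai:union.ndltd.org:upv.es/oai:riunet.upv.es:10251/31652xxx"],
]

# Write to an in-memory buffer instead of data.csv so nothing touches the disk.
buffer = io.StringIO(newline="")
writer = csv.writer(buffer, delimiter=",", quotechar='"', quoting=csv.QUOTE_MINIMAL)
writer.writerows(rows)
csv_text = buffer.getvalue()

# Read the text back and group the identifiers by source file.
by_file = defaultdict(list)
for filename, identifier in csv.reader(io.StringIO(csv_text, newline="")):
    by_file[filename].append(identifier)

for filename, identifiers in by_file.items():
    print(filename, len(identifiers))
```

To inspect the real output, the second half should work unchanged on a file opened with open('data.csv', newline='') in place of the in-memory buffer.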
