
Python: reading a large XML file and saving it to a CSV file

I have a large XML file with the structure below:

<?xml version="1.0"?>
  <products xmlns="http://data-vocabulary.org/product/">
   <channel>
   <title>Online Store</title>
   <link>https://www.clienturl.com/</link>   
   <product>
   <identifier>DI035AT12JNR</identifier>
   <quantity>1</quantity>
   <fn>Button Fastening Mid Rise Boyfriend Jeans</fn>
   <description>Button Fastening Mid Rise Boyfriend Jeans</description>
  <category>women-clothing &gt; women-clothing-jeans &gt; women-clothing-jeans-straight_jeans</category>
  <currency>SAR</currency>
  <photo>http://clienturl/product/78/6014/v1/1-zoom.jpg</photo>
  <brand>Diesel</brand>
  <url>https://eclient-product-url.html</url>
  <price>1450</price>
  <google_product_category>Apparel &amp; Accessories &gt; Clothing &gt; Pants</google_product_category>
</product>
<product>
  <identifier>DI035AT12JNR</identifier>
  <quantity>1</quantity>
  <fn>Button Fastening Mid Rise Boyfriend Jeans</fn>
  <description>Button Fastening Mid Rise Boyfriend Jeans</description>
  <category>women-clothing &gt; women-clothing-jeans &gt; women-clothing-jeans-straight_jeans</category>
  <currency>SAR</currency>
  <photo>http://clienturl/product/78/6014/v1/1-zoom.jpg</photo>
  <brand>Diesel</brand>
  <url>https://eclient-product-url.html</url>
  <price>1450</price>
  <google_product_category>Apparel &amp; Accessories &gt; Clothing &gt; Pants</google_product_category>
  </product>
  </channel>
  </products>

and here is my Python code:

    import xml.etree.ElementTree as etree

    xmlfile = 'en-sa.xml'

    def iterate_xml(xmlfile):
        doc = etree.iterparse(xmlfile, events=('start', 'end'))
        _, root = next(doc)  # the first event is the start of the root element
        start_tag = None
        for event, element in doc:
            # remember the first tag opened below the root
            if event == 'start' and start_tag is None:
                start_tag = element.tag
            # yield that element once it has been parsed completely
            if event == 'end' and element.tag == start_tag:
                yield element
                start_tag = None
                root.clear()

    count = 0
    for element in iterate_xml(xmlfile):
        for ele in element:
            print(ele)
        count = count + 1
        if count == 5:
            break

which prints output like this:

<Element '{http://data-vocabulary.org/product/}title' at 0x7efd046f7a10>
<Element '{http://data-vocabulary.org/product/}link' at 0x7efd046f7ad0>
<Element '{http://data-vocabulary.org/product/}product' at 0x7efd046f7d10>
<Element '{http://data-vocabulary.org/product/}product' at 0x7efd04703050>

I want to convert this XML into a CSV file with the following column headers:

identifier:quantity:fn:description:category:currency:photo:brand:url:price:google_product_category

but I have no idea how to proceed. Can someone help me here? Thanks in advance.

I would suggest using lxml.etree to extract all of the text. In this case, the XPath query //text() returns a list of strings containing every text node and tail in the document.

import lxml.etree
text = """<?xml version="1.0"?>
  <products xmlns="http://data-vocabulary.org/product/">
   <channel>
   <title>Online Store</title>
   <link>https://www.clienturl.com/</link>   
   <product>
   <identifier>DI035AT12JNR</identifier>
   <quantity>1</quantity>
   <fn>Button Fastening Mid Rise Boyfriend Jeans</fn>
   <description>Button Fastening Mid Rise Boyfriend Jeans</description>
  <category>women-clothing &gt; women-clothing-jeans &gt; women-clothing-jeans-straight_jeans</category>
  <currency>SAR</currency>
  <photo>http://clienturl/product/78/6014/v1/1-zoom.jpg</photo>
  <brand>Diesel</brand>
  <url>https://eclient-product-url.html</url>
  <price>1450</price>
  <google_product_category>Apparel &amp; Accessories &gt; Clothing &gt; Pants</google_product_category>
</product>
<product>
  <identifier>DI035AT12JNR</identifier>
  <quantity>1</quantity>
  <fn>Button Fastening Mid Rise Boyfriend Jeans</fn>
  <description>Button Fastening Mid Rise Boyfriend Jeans</description>
  <category>women-clothing &gt; women-clothing-jeans &gt; women-clothing-jeans-straight_jeans</category>
  <currency>SAR</currency>
  <photo>http://clienturl/product/78/6014/v1/1-zoom.jpg</photo>
  <brand>Diesel</brand>
  <url>https://eclient-product-url.html</url>
  <price>1450</price>
  <google_product_category>Apparel &amp; Accessories &gt; Clothing &gt; Pants</google_product_category>
  </product>
  </channel>
  </products>""".encode('utf-8')  # pass bytes to lxml; not needed when parsing from a file
doc = lxml.etree.fromstring(text)
print(doc.xpath('//text()'))

This outputs all of the text in the XML as a list of strings:

['\n   ', '\n   ', 'Online Store', '\n   ', 'https://www.clienturl.com/', '   \n   ', '\n   ', 'DI035AT12JNR', '\n   ', '1', '\n   ', 'Button Fastening Mid Rise Boyfriend Jeans', '\n   ', 'Button Fastening Mid Rise Boyfriend Jeans', '\n  ', 'women-clothing > women-clothing-jeans > women-clothing-jeans-straight_jeans', '\n  ', 'SAR', '\n  ', 'http://clienturl/product/78/6014/v1/1-zoom.jpg', '\n  ', 'Diesel', '\n  ', 'https://eclient-product-url.html', '\n  ', '1450', '\n  ', 'Apparel & Accessories > Clothing > Pants', '\n', '\n', '\n  ', 'DI035AT12JNR', '\n  ', '1', '\n  ', 'Button Fastening Mid Rise Boyfriend Jeans', '\n  ', 'Button Fastening Mid Rise Boyfriend Jeans', '\n  ', 'women-clothing > women-clothing-jeans > women-clothing-jeans-straight_jeans', '\n  ', 'SAR', '\n  ', 'http://clienturl/product/78/6014/v1/1-zoom.jpg', '\n  ', 'Diesel', '\n  ', 'https://eclient-product-url.html', '\n  ', '1450', '\n  ', 'Apparel & Accessories > Clothing > Pants', '\n  ', '\n  ', '\n  ']
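The flat list above mixes whitespace-only strings with the values and loses the link between a value and its tag. As an alternative to picking values by position, here is a minimal sketch that looks each field up by name inside every product element. It uses the standard library's xml.etree.ElementTree (as the question does) rather than lxml. The helper name product_rows and the shortened SAMPLE document are assumptions made for illustration; only the namespace URI and the tag names come from the question.

```python
import xml.etree.ElementTree as ET

# Namespace URI copied from the xmlns attribute of the question's document.
NS = {"p": "http://data-vocabulary.org/product/"}

# Shortened, hypothetical sample with the same structure as the question's feed.
SAMPLE = """<?xml version="1.0"?>
<products xmlns="http://data-vocabulary.org/product/">
  <channel>
    <title>Online Store</title>
    <product>
      <identifier>DI035AT12JNR</identifier>
      <brand>Diesel</brand>
      <price>1450</price>
    </product>
  </channel>
</products>"""


def product_rows(xml_text, fields):
    """Return one dict per <product>, keyed by the requested field names."""
    root = ET.fromstring(xml_text)
    rows = []
    for product in root.iterfind(".//p:product", NS):
        row = {}
        for field in fields:
            # findtext returns the default ("") when the child tag is absent.
            text = product.findtext("p:" + field, default="", namespaces=NS)
            row[field] = (text or "").strip()
        rows.append(row)
    return rows


rows = product_rows(SAMPLE, ["identifier", "brand", "price", "currency"])
print(rows)
```

Looking fields up by name means a product with a missing tag gets an empty cell instead of shifting every later value one column to the left, which is the main risk of index-based selection.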

I can't guarantee that this works across the entire XML file, because you only gave one example. But if every product has the same set of fields, you can iterate by product and select the desired indices to add to another list. Once you have a list of rows containing (identifier:quantity:fn:description:category:currency:photo:brand:url:price:google_product_category), it should be easy enough to build a pandas DataFrame from it and export it to CSV with df.to_csv(r'Path where you want to store the exported CSV file\File Name.csv'). Note that pandas.DataFrame.append, which was originally suggested here, was removed in pandas 2.0; pass the list of rows to the pandas.DataFrame constructor instead.
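Because the file in the question is large, loading it whole (or building a full DataFrame) may not be practical. The following is a rough sketch, not a definitive implementation, of a streaming alternative that uses only the standard library: xml.etree.ElementTree.iterparse reads one product at a time and csv.DictWriter writes it out immediately. The function name xml_to_csv, its parameters, and the comma default for the delimiter are assumptions; the field list is the header from the question.

```python
import csv
import xml.etree.ElementTree as ET

# Namespace from the question's document, in ElementTree's {uri}tag form.
NS = "{http://data-vocabulary.org/product/}"

# Column headers requested in the question.
FIELDS = [
    "identifier", "quantity", "fn", "description", "category", "currency",
    "photo", "brand", "url", "price", "google_product_category",
]


def xml_to_csv(xml_path, csv_path, fields=FIELDS, delimiter=","):
    """Stream <product> elements from xml_path into csv_path; return the row count."""
    count = 0
    with open(csv_path, "w", newline="", encoding="utf-8") as out:
        writer = csv.DictWriter(out, fieldnames=fields, delimiter=delimiter)
        writer.writeheader()
        # An "end" event fires once an element and all its children are parsed.
        for _, elem in ET.iterparse(xml_path, events=("end",)):
            if elem.tag != NS + "product":
                continue
            # Missing child tags become empty cells.
            row = {f: (elem.findtext(NS + f) or "").strip() for f in fields}
            writer.writerow(row)
            count += 1
            # Drop this product's children and text so memory use stays low.
            elem.clear()
    return count
```

With the file name from the question this would be called as xml_to_csv('en-sa.xml', 'products.csv'), where products.csv is a placeholder output name; pass delimiter=':' if the colon-separated header in the question is meant literally. One caveat: elem.clear() empties each product, but the emptied elements stay attached to their parent, so extremely large feeds may still need extra cleanup.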
