
Parsing an XML CDATA section and converting it to CSV using ElementTree in Python

I want to convert XML files into a CSV file. My XML files contain many different tags, and I select only the ones that are useful for my work. I want to access only the text content between the TEXT tags. My problem is that I don't know how to access the CDATA content. Because TEXT in some DOCs has an IMAGE child, my code only parses the IMAGE tag, and I see NaN when I read the resulting CSV file with pandas. I searched for information about CDATA, but I can't find any way to tell the parser to skip the IMAGE tag and extract only the content of the CDATA section. I also tried deleting the IMAGE tags from TEXT, but that deleted all of the TEXT content, including the CDATA section.

My XML pattern is as follows:

<root>
<DOC>
<TEXT>
<IMAGE>/1379/791012/p18-1.jpg</IMAGE>
<![CDATA[The section I want to access to]]>
</TEXT>
</DOC>
</root>

And here is my parsing code:

import os
import re
import xml.etree.ElementTree as ET

def make_csv(folderpath, xmlfilename, csvwriter, csv_file):

  # Parse XML file
  tree = ET.parse(os.path.join(folderpath, xmlfilename))
  root = tree.getroot()

  for elem in root.findall("DOC"):
    rows = []

    sentence = elem.find("TEXT")
    if sentence is not None:
      # .text only holds the text that comes before the first child element
      sentence = re.sub('\n', '', sentence.text or '')
    rows.append(sentence)

    csvwriter.writerow(rows)
  csv_file.close()

I appreciate any help.

My problem is that I don't know how to access CDATA content. Because TEXT in some DOCs has an IMAGE child

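The code below seems to work. ElementTree does not expose CDATA as a separate node; its content becomes ordinary character data. When the CDATA section follows a child element such as IMAGE, that text is stored in the child's tail attribute; otherwise it is in the text attribute of TEXT itself. This is also likely why removing IMAGE dropped the CDATA content: the tail goes away with the removed element. The code handles both cases: TEXT with an IMAGE child and TEXT without one.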

import xml.etree.ElementTree as ET

xml = '''<?xml version="1.0" encoding="UTF-8"?>
<root>
   <DOC>
      <TEXT>
         <IMAGE>/1379/791012/p18-1.jpg</IMAGE>
         <![CDATA[The section I want to access to]]>
      </TEXT>
      <TEXT>
         <![CDATA[more text]]>
      </TEXT>
   </DOC></root>'''

root = ET.fromstring(xml)
texts = root.findall('.//TEXT')
for idx, text in enumerate(texts, start=1):
    data = list(text)[0].tail.strip() if list(text) else text.text.strip()
    print(f'{idx}) {data}')

Output:

1) The section I want to access to
2) more text
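
To connect this back to the CSV goal in the question, here is a minimal sketch of how the same text/tail idea could feed a CSV writer. The helper names `text_without_children` and `docs_to_csv` are hypothetical and not part of the original post, and the sketch assumes one TEXT element per DOC and a single output column named `text`. It joins the element's own text with the tail of every child, so it should not matter whether IMAGE comes before or after the CDATA section.

```python
import csv
import io
import xml.etree.ElementTree as ET


def text_without_children(elem):
    """Return elem's own character data, skipping text inside child elements."""
    parts = [elem.text or ""]
    # Text that follows a child element (e.g. CDATA after IMAGE) lives in child.tail.
    parts.extend(child.tail or "" for child in elem)
    # Collapse newlines and repeated whitespace into single spaces.
    return " ".join("".join(parts).split())


def docs_to_csv(xml_source, csv_file):
    """Write one CSV row per DOC, containing the text of its first TEXT element."""
    root = ET.fromstring(xml_source)
    writer = csv.writer(csv_file)
    writer.writerow(["text"])  # assumed header name
    for doc in root.findall("DOC"):
        text_elem = doc.find("TEXT")
        value = text_without_children(text_elem) if text_elem is not None else ""
        writer.writerow([value])


sample = """<root>
<DOC>
<TEXT>
<IMAGE>/1379/791012/p18-1.jpg</IMAGE>
<![CDATA[The section I want to access to]]>
</TEXT>
</DOC>
<DOC>
<TEXT><![CDATA[more text]]></TEXT>
</DOC>
</root>"""

buffer = io.StringIO()
docs_to_csv(sample, buffer)
print(buffer.getvalue())
```

For real files, `ET.parse(path).getroot()` could replace `ET.fromstring`, and the output file would typically be opened with `newline=""` as the csv module recommends.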
