Parsing an XML CDATA section and converting it to CSV using ElementTree in Python

Question

I want to convert XML files into a CSV file. My XML files contain many different tags, and I select the ones that are useful for my work. I only want the text content between the TEXT tags, but I don't know how to access the CDATA content. In some DOCs, TEXT has an IMAGE child; when I run my code, it only parses the IMAGE tag, and I get NaN when I read the resulting CSV file with pandas. I searched for information about CDATA, but I couldn't find any tag or option that tells the parser to skip the IMAGE tag and extract only the content of the CDATA section. I also tried deleting the IMAGE tags from TEXT, but that removed all of the TEXT content, including the CDATA section.

My XML pattern is as follows:

<root>
<DOC>
<TEXT>
<IMAGE>/1379/791012/p18-1.jpg</IMAGE>
<![CDATA[The section I want to access to]]>
</TEXT>
</DOC>
</root>

Here is my parsing code:

import os
import re
import xml.etree.ElementTree as ET

def make_csv(folderpath, xmlfilename, csvwriter, csv_file):

  # Parse XML file
  tree = ET.parse(os.path.join(folderpath, xmlfilename))
  root = tree.getroot()

  for elem in root.findall("DOC"):
    rows = []

    sentence = elem.find("TEXT")
    if sentence is not None:
        # Strip newlines from the TEXT element's text
        sentence = re.sub('\n', '', sentence.text)
    rows.append(sentence)

    csvwriter.writerow(rows)
  csv_file.close()

I appreciate any help.

Answer 1

My problem is that I don't know how to access CDATA content. Because TEXT in some DOCs has an IMAGE child

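The code below seems to work. It handles both cases: TEXT with an IMAGE child and TEXT without one. ElementTree does not expose CDATA as a separate node; its content is merged into ordinary text, so when the CDATA section follows the IMAGE child, it ends up in that child's tail attribute rather than in the text attribute of TEXT.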

import xml.etree.ElementTree as ET

xml = '''<?xml version="1.0" encoding="UTF-8"?>
<root>
   <DOC>
      <TEXT>
         <IMAGE>/1379/791012/p18-1.jpg</IMAGE>
         <![CDATA[The section I want to access to]]>
      </TEXT>
      <TEXT>
         <![CDATA[more text]]>
      </TEXT>
   </DOC></root>'''

root = ET.fromstring(xml)
texts = root.findall('.//TEXT')
for idx, text in enumerate(texts, start=1):
    data = list(text)[0].tail.strip() if list(text) else text.text.strip()
    print(f'{idx}) {data}')

output

1) The section I want to access to
2) more text
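
To connect this back to the CSV goal in the question, here is a minimal sketch of how the same idea could feed a csv writer. The helper names text_without_children and docs_to_rows are made up for illustration, and the sample XML assumes one TEXT per DOC as in the question. Rather than looking only at the first child's tail, it joins the element's own text with the tail of every child, so it should also cope with text on either side of IMAGE.

```python
import csv
import io
import xml.etree.ElementTree as ET


def text_without_children(elem):
    """Return elem's own character data, skipping text inside child tags.

    ElementTree merges CDATA into ordinary text: content before the first
    child is in elem.text, content after a child is in that child's .tail.
    (Hypothetical helper, not part of ElementTree.)
    """
    parts = [elem.text or ""]
    parts.extend(child.tail or "" for child in elem)
    # Join the pieces and collapse newlines/indentation into single spaces.
    return " ".join(" ".join(parts).split())


def docs_to_rows(root):
    """Build one single-column row per DOC; a DOC without TEXT gets ''."""
    rows = []
    for doc in root.findall("DOC"):
        text_elem = doc.find("TEXT")
        value = text_without_children(text_elem) if text_elem is not None else ""
        rows.append([value])
    return rows


sample = """<root>
<DOC>
<TEXT>
<IMAGE>/1379/791012/p18-1.jpg</IMAGE>
<![CDATA[The section I want to access to]]>
</TEXT>
</DOC>
<DOC>
<TEXT>
<![CDATA[more text]]>
</TEXT>
</DOC>
</root>"""

rows = docs_to_rows(ET.fromstring(sample))

# Write to an in-memory buffer here; a real script would open a file
# with open(path, "w", newline="") and pass it to csv.writer instead.
buffer = io.StringIO()
csv.writer(buffer).writerows(rows)
print(buffer.getvalue())
```

Writing an empty string instead of None for a missing TEXT is a design choice; note that pandas may still read empty fields back as NaN unless read_csv is called with keep_default_na=False.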
