Converting a weirdly formatted XML file to CSV using Python

I have an odd XML document that contains phone number details, and I need to export it to a CSV document. The problem is that it is not structured properly: every element sits inside <string> tags, and some "Name" fields are repeated, but not in exactly the same way (as in the example below, most repeated lines contain extra spaces or commas). All the "Number" fields are indented relative to the "Name" fields.

        <string>example1</string>
            <string>014584111</string>

        <string>example2</string>
            <string>04561212123</string>

        <string>example3</string>
            <string>+1 156151561</string>

        <string>example4</string>
            <string>564513212</string>
        
        <string>example3, </string>
        <string>example4  </string>

How can I convert this into CSV format, without the repeated content, using Python? Here is an example of the output I want:

FullName  PhoneNumber

example1  014584111
example2  04561212123
example3  +1 156151561
example4  564513212

Of course this can be done. If you can describe the process in plain language, you can also program it.

Example:

  • read the file (line by line, or does the whole file fit into memory?)
  • strip off <string> and </string>
    • is the line indented? --> No --> it is a key
    • is the line indented? --> Yes --> it is the value for the last key
  • add the results to a dict
  • write the dict to a .csv file
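
The steps above could be sketched roughly as follows. This is a minimal illustration under some assumptions, not a definitive implementation: the function name parse_phone_lines and the regular expression are made up for this example, and any line indented more deeply than the first <string> line is treated as a number. Names are normalised by stripping spaces and a trailing comma, so the repeated entries collapse onto the key that already exists.

```python
import re

# Hypothetical pattern: captures the leading whitespace and the tag content.
STRING_RE = re.compile(r"^(\s*)<string>(.*?)</string>\s*$")


def parse_phone_lines(lines):
    """Return a dict mapping each name to the indented number that follows it."""
    records = {}
    base_indent = None  # indentation of the first <string> line (assumed to be a name)
    last_key = None
    for line in lines:
        match = STRING_RE.match(line)
        if match is None:
            continue  # blank or unrelated line
        indent = len(match.group(1))
        text = match.group(2).strip().rstrip(",").strip()
        if base_indent is None:
            base_indent = indent
        if indent > base_indent:
            # Indented: a value for the last key seen.
            if last_key is not None:
                records[last_key] = text
        else:
            # Not indented: a key. Keep an existing number for repeated names.
            last_key = text
            records.setdefault(text, "")
    return records


sample = """\
        <string>example1</string>
            <string>014584111</string>

        <string>example3</string>
            <string>+1 156151561</string>

        <string>example3, </string>
"""
print(parse_phone_lines(sample.splitlines()))
# {'example1': '014584111', 'example3': '+1 156151561'}
```

Because the function accepts any iterable of lines, an open file object can be passed in directly instead of a list.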

So you now need to make some decisions, such as:

Is the input file huge? Then it will probably not fit into memory, and we need to process it line by line. Or will it fit in memory?

Will this program be needed many times, or is it just a one-time conversion?
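
If the file turns out to be too large for memory, one option is to stream it and write each row as soon as a name/number pair is complete. The sketch below assumes the same layout as the question (numbers indented more deeply than names); stream_to_csv is a hypothetical name, and the set of names already written still grows with the number of distinct names.

```python
import csv
import io
import re

STRING_RE = re.compile(r"^(\s*)<string>(.*?)</string>\s*$")


def stream_to_csv(lines, out_file):
    """Read lines lazily and write one CSV row per new name. Returns the row count."""
    writer = csv.writer(out_file)
    writer.writerow(["FullName", "PhoneNumber"])
    seen = set()         # names already written, used to skip repeats
    base_indent = None   # indentation of the first <string> line (assumed to be a name)
    pending_name = None
    rows = 0
    for line in lines:
        match = STRING_RE.match(line)
        if match is None:
            continue
        indent = len(match.group(1))
        text = match.group(2).strip().rstrip(",").strip()
        if base_indent is None:
            base_indent = indent
        if indent <= base_indent:
            pending_name = text
        elif pending_name is not None and pending_name not in seen:
            writer.writerow([pending_name, text])
            seen.add(pending_name)
            rows += 1
    return rows


# Demo with in-memory text; with real files, pass the open file objects instead.
demo_input = io.StringIO(
    "  <string>example1</string>\n"
    "    <string>014584111</string>\n"
    "  <string>example1, </string>\n"
)
demo_output = io.StringIO()
written = stream_to_csv(demo_input, demo_output)
print(written)
print(demo_output.getvalue())
```

For a real conversion the output file should be opened with newline="" so that the csv module controls the line endings.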

Then you can divide the problem into smaller sub-problems and write some tests for each one.

You should also consider other circumstances, such as the file size, whether it is a one-time script, and whether there should be error checking (what if there are two indented lines in a row?).
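
As an illustration of small, testable sub-problems with error checking, here is one possible split. The helper names clean_name and pair_up are hypothetical, and raising ValueError for two indented lines in a row is just one policy among several (the extra line could equally be ignored or logged). The test functions are plain asserts in the style that pytest collects.

```python
def clean_name(raw):
    """Normalise a name: drop surrounding spaces and a trailing comma."""
    return raw.strip().rstrip(",").strip()


def pair_up(entries):
    """Turn (is_indented, text) entries into a list of (name, number) tuples.

    Raises ValueError when a number has no name or a name gets two numbers.
    """
    pairs = []
    name = None
    has_number = False
    for is_indented, text in entries:
        if not is_indented:
            name, has_number = clean_name(text), False
        elif name is None:
            raise ValueError(f"number {text!r} has no preceding name")
        elif has_number:
            raise ValueError(f"{name!r} has more than one number")
        else:
            pairs.append((name, text.strip()))
            has_number = True
    return pairs


def test_clean_name():
    assert clean_name("example3, ") == "example3"
    assert clean_name("example4  ") == "example4"


def test_pair_up_rejects_two_numbers():
    entries = [(False, "example1"), (True, "014584111"), (True, "999")]
    try:
        pair_up(entries)
    except ValueError:
        pass
    else:
        raise AssertionError("expected ValueError")
```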

The code below collects the name/number pairs; do whatever you need with the resulting data.

import xml.etree.ElementTree as ET

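def is_phone_number(value):
    # A phone number is a non-empty string made up only of digits, spaces and '+'.
    # Empty or missing text (None) is not a phone number.
    return bool(value) and all(ch in '+ ' or ch.isdigit() for ch in value)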
    
xml = '''<r> <string>example1</string>
            <string>014584111</string>

        <string>example2</string>
            <string>04561212123</string>

        <string>example3</string>
            <string>+1 156151561</string>

        <string>example4</string>
            <string>564513212</string>
        
        <string>example3, </string>
        <string>example4  </string></r>'''
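data = []
root = ET.fromstring(xml)
strings = root.findall('.//string')
# Remember the most recent name and pair it with the next phone number.
# Repeated names that are not followed by a number are simply dropped.
name = None
for element in strings:
    text = (element.text or '').strip()
    if text and is_phone_number(text):
        if name is not None:
            data.append({'key': name, 'value': text})
            name = None
    else:
        name = text

print(data)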

Output:

[{'key': 'example1', 'value': '014584111'}, {'key': 'example2', 'value': '04561212123'}, {'key': 'example3', 'value': '+1 156151561'}, {'key': 'example4', 'value': '564513212'}]
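
To get from this list to the CSV file the question asks for, something along these lines could work. It is a sketch rather than part of the answer above: write_csv is a hypothetical helper, the column names are taken from the question's example output, and names are normalised again so that any repeated entries would be skipped.

```python
import csv
import io

# The list printed above, copied in so the example is self-contained.
data = [
    {'key': 'example1', 'value': '014584111'},
    {'key': 'example2', 'value': '04561212123'},
    {'key': 'example3', 'value': '+1 156151561'},
    {'key': 'example4', 'value': '564513212'},
]


def write_csv(records, out_file):
    """Write one row per distinct name, with a header row first."""
    writer = csv.writer(out_file)
    writer.writerow(["FullName", "PhoneNumber"])
    seen = set()
    for record in records:
        name = record["key"].strip().rstrip(",").strip()
        if name in seen:
            continue  # skip repeated names
        seen.add(name)
        writer.writerow([name, record["value"].strip()])


# Written to an in-memory buffer here; for a real file, use
# open("phones.csv", "w", newline="", encoding="utf-8") instead (file name is an assumption).
buffer = io.StringIO()
write_csv(data, buffer)
print(buffer.getvalue())
```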


 