
python .xml and .csv files manipulation

I converted an .xml file to .csv. In the .xml file, some values of the txtDescricao column look like this: "Logistics, Search and Support". Because of this, when I read the file, pandas interprets the comma after Logistics as a column separator and pushes the rest of the text into the following columns. I am trying to work around this with the following code:

in_file = 'dados_limpos_2018.csv'
out_file = 'dados_2018.csv'
output = open(out_file, 'w')
with open(in_file, 'r') as source:
    for line in source:
    # split by semicolon
        data = line.strip().split(';')             
    # remove all quotes found
        data = [t.replace('"','') for t in data]
        for item in data[:-1]:
            item.replace(',', '')
            output.write(''.join(['', item, '',',']))
            # write the last item separately, without the trailing ';'
        output.write(''.join(['"', item, '"']))
        output.write('\n')
output.close()

However, within each line Python already treats the comma as a separator and turns it into a semicolon. So I would like to know: is there any way to handle this in the .csv file, or do I have to do it during the .xml to .csv conversion? Example .csv file:

name, number, sgUF, txtDescricao, year
Romario, 15, RJ, Consultoria, 2018
Ronaldo, 9, RJ, Logistics, Search and Support, 2018

Example .xml file:

<?xml version="1.0" encoding="UTF-8"?>
<xml>
    <dados>
          <despesa>
                  <name>Romario</name>
                  <number>15</number>
                  <sgUF>RJ</sgUF>
                  <txtDescricao>Consultoria</txtDescricao>
                  <year>2018</year>
           </despesa>

           <despesa>
                  <name>Ronaldo</name>
                  <number>9</number>
                  <sgUF>RJ</sgUF>
                  <txtDescricao>Logistics, Search and Support</txtDescricao>
                  <year>2018</year>
           </despesa>
     </dados>
</xml>

Note: The original file is too large to open in a spreadsheet editor.

It would help if you shared your XML file.

Based on the information supplied:

If your XML data contains , as part of a value, use a different separator (semicolon, tab, space) to build your CSV file. Or simply replace , with an empty string while the data is still in the XML file, then convert.

In both cases, you should handle this while converting from XML to CSV. A CSV -> CSV fix is hard to implement, because the number of , per line is unpredictable.
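As a rough sketch of the first suggestion, Python's standard csv module can write the file with a semicolon delimiter, so commas inside txtDescricao remain ordinary text. The rows and the in-memory buffer below are illustrative assumptions, not the real data.

```python
import csv
import io

# Illustrative rows; in practice these would come from the XML conversion.
rows = [
    ["name", "number", "sgUF", "txtDescricao", "year"],
    ["Romario", "15", "RJ", "Consultoria", "2018"],
    ["Ronaldo", "9", "RJ", "Logistics, Search and Support", "2018"],
]

# Write with ';' as the delimiter, so ',' is just ordinary text.
buffer = io.StringIO()
csv.writer(buffer, delimiter=";", lineterminator="\n").writerows(rows)
semicolon_csv = buffer.getvalue()

# Read it back with the same delimiter to confirm the columns line up.
parsed = list(csv.reader(io.StringIO(semicolon_csv), delimiter=";"))
```

A file written this way should load in pandas with pd.read_csv(path, sep=';').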

EDIT 1:

I suggest using objectify from lxml. Don't forget to delete <?xml version="1.0" encoding="UTF-8"?> from your XML: when the file is read as text, lxml refuses to parse a string that carries an encoding declaration. The solution is below.

from lxml import objectify
import csv

# Read the XML as text. The <?xml ...?> declaration must already be removed:
# objectify.fromstring() rejects a str that carries an encoding declaration.
with open('d:\\path\\to\\xml.xml', 'r') as file_xml:
    xml_object = objectify.fromstring(file_xml.read())

with open('converted.csv', 'w', newline='') as converted_csv_file:
    csvwriter = csv.writer(converted_csv_file, delimiter=',', lineterminator='\n')
    for count, row in enumerate(xml_object.dados.despesa):
        if count == 0:
            # Header row, taken from the tag names of the first record
            csvwriter.writerow([row.name.tag, row.number.tag, row.sgUF.tag, row.txtDescricao.tag, row.year.tag])
        # Data row; commas are stripped from txtDescricao
        csvwriter.writerow([row.name.text, row.number.text, row.sgUF.text, row.txtDescricao.text.replace(',', ''), row.year.text])

You can install lxml with:

pip install lxml
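Since the original file is described as too large for a spreadsheet editor, a streaming conversion may be worth considering. The sketch below is an alternative, not the method above: it swaps lxml for the standard library's xml.etree.ElementTree.iterparse and keeps the commas, relying on csv.writer to quote any field that contains the delimiter. The function name xml_to_csv and the FIELDS list are assumptions based on the sample XML in the question.

```python
import csv
import xml.etree.ElementTree as ET

# Column names assumed from the sample XML in the question.
FIELDS = ["name", "number", "sgUF", "txtDescricao", "year"]

def xml_to_csv(xml_source, csv_target):
    """Stream <despesa> elements from xml_source into csv_target."""
    writer = csv.writer(csv_target, lineterminator="\n")
    writer.writerow(FIELDS)
    # iterparse yields each element once its closing tag has been read.
    for _event, elem in ET.iterparse(xml_source, events=("end",)):
        if elem.tag == "despesa":
            # findtext returns None for a missing child; write '' instead.
            writer.writerow([elem.findtext(f) or "" for f in FIELDS])
            # Drop this record's children to limit memory use.
            elem.clear()

# Hypothetical usage with files on disk:
# with open("dados.xml", "rb") as src, open("dados_2018.csv", "w", newline="") as dst:
#     xml_to_csv(src, dst)
```

With the sample data, the Ronaldo row would be written with txtDescricao in double quotes, which pandas.read_csv parses as a single column by default.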

I modified your function to deal with those cases in the txtDescricao column.

ncols = 5   # expected number of columns
index = 3   # position of the txtDescricao column
in_file = 'dados_limpos_2018.csv'
out_file = 'dados_2018.csv'
with open(in_file, 'r') as source, open(out_file, 'w') as output:
    for line in source:
        # split by comma
        data = line.strip().split(',')
        # number of surplus pieces caused by commas inside txtDescricao
        extra = len(data) - ncols
        if extra > 0:
            # merge the surplus pieces back into txtDescricao, dropping the commas
            data[index:index + 1 + extra] = [''.join(data[index:index + 1 + extra])]
        # write columns
        output.write(','.join(data[:ncols]))
        output.write('\n')

Input file:

name, number, sgUF, txtDescricao, year
Romario, 15, RJ, Consultoria, 2018
Ronaldo, 9, RJ, Logistics, Search and Support, 2018

Output file:

name, number, sgUF, txtDescricao, year
Romario, 15, RJ, Consultoria, 2018
Ronaldo, 9, RJ, Logistics Search and Support, 2018

Note: I am assuming that this problem only occurs in the txtDescricao column.
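If the commas in txtDescricao should be kept rather than dropped, one possible variation is to rejoin the surplus pieces and let csv.writer quote the field. This is only a sketch under the same single-column assumption; the helper name repair_row is hypothetical, and rejoining with ", " assumes each comma in the original text was followed by exactly one space.

```python
import csv
import io

def repair_row(line, ncols=5, index=3):
    """Split a raw line on ',' and fold surplus pieces back into one column."""
    data = [piece.strip() for piece in line.strip().split(",")]
    extra = len(data) - ncols
    if extra > 0:
        # Assumption: the original text used ', ' between the pieces.
        data[index:index + 1 + extra] = [", ".join(data[index:index + 1 + extra])]
    return data

# Sample input copied from the question.
raw = (
    "name, number, sgUF, txtDescricao, year\n"
    "Romario, 15, RJ, Consultoria, 2018\n"
    "Ronaldo, 9, RJ, Logistics, Search and Support, 2018\n"
)

# csv.writer quotes any field that contains the delimiter.
fixed = io.StringIO()
writer = csv.writer(fixed, lineterminator="\n")
for line in raw.splitlines():
    writer.writerow(repair_row(line))
repaired_csv = fixed.getvalue()
```

The repaired text keeps "Logistics, Search and Support" as one quoted field, so a standard CSV reader sees five columns on every row.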
