Python: working with .xml and .csv files

Question

I converted an .xml file to .csv. In the .xml file, some values in the txtDescricao column look like this: "Logistics, Search and Support". Because of this, when I read the file, pandas interprets the comma after Logistics as a column separator and shifts the rest of the text into the following columns. I am trying to work around this with the following code:

in_file = 'dados_limpos_2018.csv'
out_file = 'dados_2018.csv'
output = open(out_file, 'w')
with open(in_file, 'r') as source:
    for line in source:
        # split by semicolon
        data = line.strip().split(';')
        # remove all quotes found
        data = [t.replace('"','') for t in data]
        for item in data[:-1]:
            item.replace(',', '')
            output.write(''.join(['', item, '',',']))
        # write the last item separately, without the trailing ';'
        output.write(''.join(['"', item, '"']))
        output.write('\n')
output.close()

However, within each line Python already interprets the comma as a separator and turns it into a semicolon. What I would like to know: is there any way to handle this in the .csv file, or do I have to handle it in the .xml to .csv conversion?

Example of the .csv file:

name, number, sgUF, txtDescricao, year
Romario, 15, RJ, Consultoria, 2018
Ronaldo, 9, RJ, Logistics, Search and Support, 2018

Example of the .xml file:

<?xml version="1.0" encoding="UTF-8"?>
<xml>
    <dados>
          <despesa>
                  <name>Romario</name>
                  <number>15</number>
                  <sgUF>RJ</sgUF>
                  <txtDescricao>Consultoria</txtDescricao>
                  <year>2018</year>
           </despesa>

           <despesa>
                  <name>Ronaldo</name>
                  <number>9</number>
                  <sgUF>RJ</sgUF>
                  <txtDescricao>Logistics, Search and Support</txtDescricao>
                  <year>2018</year>
           </despesa>
     </dados>
</xml>

Note: The original file is too large to open in a spreadsheet editor.

Answer 1

It would help if you shared your XML file.

Based on the information supplied:

If the data in your XML file contains commas as part of the values, use a different separator (semicolon, tab, space) when building the CSV file. Alternatively, strip the commas from the values while they are still in XML, then convert.

In both cases, you should handle this during the XML to CSV conversion. A CSV -> CSV fix is hard to implement, because the number of commas in each line is unpredictable.
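As a minimal sketch of the first option, assuming the layout of the sample XML in the question (`despesa` records inside a `dados` element), the standard-library `xml.etree.ElementTree` and `csv` modules can write a semicolon-separated file. The function name `xml_to_csv` and the `FIELDS` list are illustrative and do not come from the original code.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Column order assumed from the sample XML in the question.
FIELDS = ["name", "number", "sgUF", "txtDescricao", "year"]

def xml_to_csv(xml_bytes, out, delimiter=";"):
    """Write one CSV row per <despesa> element to the text stream `out`."""
    root = ET.fromstring(xml_bytes)
    writer = csv.writer(out, delimiter=delimiter, lineterminator="\n")
    writer.writerow(FIELDS)
    for despesa in root.iter("despesa"):
        # findtext falls back to "" when a child element is missing.
        writer.writerow([despesa.findtext(tag, default="") for tag in FIELDS])

sample = b"""<?xml version="1.0" encoding="UTF-8"?>
<xml><dados>
  <despesa><name>Ronaldo</name><number>9</number><sgUF>RJ</sgUF>
    <txtDescricao>Logistics, Search and Support</txtDescricao><year>2018</year></despesa>
</dados></xml>"""

buffer = io.StringIO()
xml_to_csv(sample, buffer)
print(buffer.getvalue())
```

A file written this way can be read with `pandas.read_csv(path, sep=';')`. Keeping the comma as delimiter also works, because `csv.writer` wraps any field that contains the delimiter in double quotes, which pandas understands by default.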

EDIT 1:

I suggest using objectify from lxml. Don't forget to delete <?xml version="1.0" encoding="UTF-8"?> from your XML: lxml raises a ValueError when asked to parse a str that carries an encoding declaration (passing bytes instead avoids this). The solution is below.

from lxml import objectify
import csv

# Reading bytes lets lxml accept input that still has its XML declaration.
with open('d:\\path\\to\\xml.xml', 'rb') as file_xml:
    xml_object = objectify.fromstring(file_xml.read())

# newline='' leaves line endings to the csv module.
with open('converted.csv', 'w', newline='') as converted_csv_file:
    csvwriter = csv.writer(converted_csv_file, delimiter=',', lineterminator='\n')
    for count, row in enumerate(xml_object.dados.despesa):
        if count == 0:
            # Header row: tag names taken from the first record
            csvwriter.writerow([row.name.tag, row.number.tag, row.sgUF.tag, row.txtDescricao.tag, row.year.tag])
        # Remove commas from txtDescricao so the field cannot be split later
        csvwriter.writerow([row.name.text, row.number.text, row.sgUF.text, row.txtDescricao.text.replace(',', ''), row.year.text])

You can install lxml with:

pip install lxml
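Since the question notes that the original file is very large, reading it with a single `read()` call may use a lot of memory. One possible streaming variant is sketched here with the standard library's `xml.etree.ElementTree.iterparse` (lxml offers a similar `etree.iterparse`). The function name `stream_xml_to_csv` and the `FIELDS` list are assumptions made for illustration, and the comma delimiter is kept, relying on `csv.writer` to quote fields that contain it.

```python
import csv
import io
import xml.etree.ElementTree as ET

FIELDS = ["name", "number", "sgUF", "txtDescricao", "year"]  # assumed column order

def stream_xml_to_csv(source, out):
    """Convert <despesa> records to CSV rows without building the full tree."""
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(FIELDS)
    # An "end" event fires once an element and all its children are parsed.
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "despesa":
            writer.writerow([elem.findtext(tag, default="") for tag in FIELDS])
            elem.clear()  # release the children of the record just written

sample = io.BytesIO(b"""<xml><dados>
<despesa><name>Romario</name><number>15</number><sgUF>RJ</sgUF>
<txtDescricao>Consultoria</txtDescricao><year>2018</year></despesa>
<despesa><name>Ronaldo</name><number>9</number><sgUF>RJ</sgUF>
<txtDescricao>Logistics, Search and Support</txtDescricao><year>2018</year></despesa>
</dados></xml>""")

out = io.StringIO()
stream_xml_to_csv(sample, out)
print(out.getvalue())
```

In real use, `source` would be the path of the XML file and `out` a file opened with `open(path, 'w', newline='')`.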

Answer 2

I modified your code to handle those cases in the txtDescricao column.

ncols = 5   # expected number of columns
index = 3   # position of the txtDescricao column
in_file = 'dados_limpos_2018.csv'
out_file = 'dados_2018.csv'
with open(in_file, 'r') as source, open(out_file, 'w') as output:
    for line in source:
        # split by comma
        data = line.strip().split(',')
        data_len = len(data)
        if data_len > ncols:
            # Merge the surplus pieces back into the txtDescricao field
            # (the stray commas themselves are dropped)
            data[index] = ''.join(data[index:index + 1 + (data_len - ncols)])
            # Shift the remaining columns back into place
            data[index + 1:] = data[index + 1 + data_len - ncols:]
        # Write columns
        output.write(','.join(data[:ncols]))
        output.write('\n')

Input file:

name, number, sgUF, txtDescricao, year
Romario, 15, RJ, Consultoria, 2018
Ronaldo, 9, RJ, Logistics, Search and Support, 2018

Output file:

name, number, sgUF, txtDescricao, year
Romario, 15, RJ, Consultoria, 2018
Ronaldo, 9, RJ, Logistics Search and Support, 2018

Note: I am assuming that this problem only occurs in the txtDescricao column.
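If the comma inside the description should be kept rather than dropped, a possible variant of the same idea is to merge the surplus pieces with the comma restored and let `csv.writer` quote the field. This is only a sketch under the same assumption (stray commas occur in one known column); `repair_row` is a hypothetical helper name.

```python
import csv
import io

def repair_row(fields, ncols=5, index=3):
    """Merge surplus fields back into the column at `index`.

    Assumes that stray commas occur only in that one column.
    """
    extra = len(fields) - ncols
    if extra > 0:
        merged = ",".join(fields[index:index + extra + 1])
        fields = fields[:index] + [merged] + fields[index + extra + 1:]
    return fields

broken = "Ronaldo, 9, RJ, Logistics, Search and Support, 2018"
fixed = [field.strip() for field in repair_row(broken.split(","))]

out = io.StringIO()
# csv.writer quotes the description because it contains the delimiter.
csv.writer(out, lineterminator="\n").writerow(fixed)
print(out.getvalue())
```

A file written this way keeps five columns per row, and `pandas.read_csv` reads the quoted description as a single value.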

2 answers

Answer 1
1 2019-05-30 13:28:23

Answer 2
1 Accepted 2019-05-30 13:43:59
