Python .xml and .csv files manipulation
I converted an .xml file to .csv. In the .xml file, some values in the txtDescricao column look like this: "Logistics, Search and Support."
Because of this, when I read the file, pandas interprets the comma after Logistics as a column separator and pushes the rest of the text into the following columns. I am trying to work around this with the following code:
in_file = 'dados_limpos_2018.csv'
out_file = 'dados_2018.csv'
with open(in_file, 'r') as source, open(out_file, 'w') as output:
    for line in source:
        # split by semicolon
        data = line.strip().split(';')
        # remove all quotes found
        data = [t.replace('"', '') for t in data]
        for item in data[:-1]:
            # str.replace returns a new string, so the result must be kept
            item = item.replace(',', '')
            output.write(item + ',')
        # write the last item separately, without the trailing ','
        output.write('"' + data[-1] + '"')
        output.write('\n')
However, within each line Python has already interpreted the comma as a separator and turned it into a semicolon.
Here I would like to know: is there any way I can handle this in the .csv file, or would I have to do it in the .xml to .csv conversion?
Example of the .csv file:
name, number, sgUF, txtDescricao, year
Romario, 15, RJ, Consultoria, 2018
Ronaldo, 9, RJ, Logistics, Search and Support, 2018
Example .xml file:
<?xml version="1.0" encoding="UTF-8"?>
<xml>
  <dados>
    <despesa>
      <name>Romario</name>
      <number>15</number>
      <sgUF>RJ</sgUF>
      <txtDescricao>Consultoria</txtDescricao>
      <year>2018</year>
    </despesa>
    <despesa>
      <name>Ronaldo</name>
      <number>9</number>
      <sgUF>RJ</sgUF>
      <txtDescricao>Logistics, Search and Support</txtDescricao>
      <year>2018</year>
    </despesa>
  </dados>
</xml>
Note: The original file is too large to open in a spreadsheet editor.
It would be nice if you shared your xml file.
Based on the supplied info:
If the data in your xml file contains , as part of a value, use a different separator (semicolon, tab, space) to form your csv file. Or just replace , with an empty string while it is still in the XML file, then convert.
In both situations, you should handle this while converting from xml to csv. Doing it csv -> csv will be hard to implement, and the number of , characters per line will be unpredictable.
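As a rough sketch of handling this during the conversion itself, the block below uses only the standard library (xml.etree.ElementTree and csv) rather than any particular converter. The function name xml_to_csv and the FIELDS list are illustrative assumptions based on the example in the question. Note that csv.writer quotes any field containing the delimiter, so the comma can stay in the text, and passing delimiter=";" covers the "different separator" option.

```python
import csv
import xml.etree.ElementTree as ET

# Column order assumed from the example in the question.
FIELDS = ["name", "number", "sgUF", "txtDescricao", "year"]


def xml_to_csv(xml_path, csv_path, delimiter=","):
    """Stream <despesa> records from xml_path into a CSV file."""
    with open(csv_path, "w", newline="", encoding="utf-8") as out:
        # The default quoting (QUOTE_MINIMAL) wraps a field in quotes only
        # when it contains the delimiter, a quote character or a line break.
        writer = csv.writer(out, delimiter=delimiter, lineterminator="\n")
        writer.writerow(FIELDS)
        # iterparse yields each element once its closing tag has been read,
        # so the whole document never has to be loaded at once.
        for _event, elem in ET.iterparse(xml_path, events=("end",)):
            if elem.tag == "despesa":
                writer.writerow([elem.findtext(f, default="") for f in FIELDS])
                elem.clear()  # drop the record's children to limit memory use
```

With the example data, the second record should come out as Ronaldo,9,RJ,"Logistics, Search and Support",2018, which pandas.read_csv reads as five columns with its default settings.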
EDIT 1:
I suggest using objectify from lxml. Don't forget to delete <?xml version="1.0" encoding="UTF-8"?> from your xml: the code reads the file as text, and lxml does not accept a Unicode string that carries an encoding declaration. The solution is below.
from lxml import objectify
import csv
# read the xml as text (with the <?xml ...?> declaration already removed)
with open('d:\\path\\to\\xml.xml', 'r') as file_xml:
    xml_string = file_xml.read()
xml_object = objectify.fromstring(xml_string)
with open("converted.csv", "w", newline='') as converted_csv_file:
    csvwriter = csv.writer(converted_csv_file, delimiter=',', lineterminator='\n')
    for count, row in enumerate(xml_object.dados.despesa):
        if count == 0:
            # header row, taken from the tag names of the first record
            csvwriter.writerow([row.name.tag, row.number.tag, row.sgUF.tag, row.txtDescricao.tag, row.year.tag])
        # data row; commas are removed from txtDescricao
        csvwriter.writerow([row.name.text, row.number.text, row.sgUF.text, row.txtDescricao.text.replace(',', ''), row.year.text])
You can install lxml with:
pip install lxml
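Since the original file is said to be too large for a spreadsheet editor, one way to check a converted file is to count the fields per row with the csv module. This is only a sketch: the helper name field_count_report is hypothetical, and UTF-8 encoding is an assumption.

```python
import csv
from collections import Counter


def field_count_report(csv_path, delimiter=","):
    """Return (header length, Counter of field counts over the data rows)."""
    with open(csv_path, "r", newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle, delimiter=delimiter)
        header = next(reader)  # raises StopIteration on an empty file
        counts = Counter(len(row) for row in reader)
    return len(header), counts
```

If every key in the returned Counter equals the header length, no row was split by a stray comma.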
I modified your function to deal with those cases in the txtDescricao column.
ncols = 5
index = 3
in_file = 'dados_limpos_2018.csv'
out_file = 'dados_2018.csv'
with open(in_file, 'r') as source, open(out_file, 'w') as output:
    for line in source:
        # split by comma
        data = line.strip().split(',')
        data_len = len(data)
        if data_len > ncols:
            # the extra commas belong to the column at `index` (txtDescricao):
            # join its pieces back together, dropping the commas
            data[index] = ''.join(data[index:index + 1 + (data_len - ncols)])
            # move the remaining columns back into place
            data[index + 1:] = data[index + 1 + data_len - ncols:]
        # write columns
        output.write(','.join(data[:ncols]))
        output.write('\n')
Input file:
name, number, sgUF, txtDescricao, year
Romario, 15, RJ, Consultoria, 2018
Ronaldo, 9, RJ, Logistics, Search and Support, 2018
Output file:
name, number, sgUF, txtDescricao, year
Romario, 15, RJ, Consultoria, 2018
Ronaldo, 9, RJ, Logistics Search and Support, 2018
Note: I am assuming that this problem only occurs in the txtDescricao column.
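If the comma should be kept in the text rather than dropped, a variant of the same idea is to merge the surplus pieces with "," and let csv.writer quote the result. This is a sketch under the same assumption (only the column at index can contain commas). The names merge_extra_fields and repair_csv are hypothetical, UTF-8 is assumed, and the spaces after each separator are stripped, which differs from the input layout.

```python
import csv


def merge_extra_fields(fields, ncols=5, index=3):
    """Fold surplus fields back into the column at `index`, keeping commas."""
    extra = len(fields) - ncols
    if extra > 0:
        merged = ",".join(fields[index:index + extra + 1])
        fields = fields[:index] + [merged] + fields[index + extra + 1:]
    return fields


def repair_csv(in_path, out_path, ncols=5, index=3):
    """Rewrite in_path so that every row has ncols properly quoted fields."""
    with open(in_path, "r", encoding="utf-8") as src, \
            open(out_path, "w", newline="", encoding="utf-8") as dst:
        writer = csv.writer(dst, lineterminator="\n")
        for line in src:
            fields = merge_extra_fields(line.rstrip("\n").split(","), ncols, index)
            # strip the blanks that followed each separator in the input
            writer.writerow([f.strip() for f in fields])
```

For the example input, the repaired row would be Ronaldo,9,RJ,"Logistics, Search and Support",2018.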