How to read multiple XML files then output to multiple CSV files with the same XML filenames

I am trying to parse multiple XML files and output each one to a CSV file with the proper rows and columns.

I was able to do this one file at a time by hard-coding the input filename and writing to a fixed output filename:

File.open('H:/output/xmloutput.csv','w')

I would like to write to multiple files, each named after its source XML file, without hard-coding the names. I have tried several approaches but have had no luck so far.

Sample XML:

<?xml version="1.0" encoding="UTF-8"?>
<record:root>
<record:Dataload_Request>
    <record:name>Bob Chuck</record:name>
    <record:Address_Data>
        <record:Street_Address>123 Main St</record:Street_Address>
        <record:Postal_Code>12345</record:Postal_Code>
    </record:Address_Data>
    <record:Age>45</record:Age>
</record:Dataload_Request>
</record:root>

Here is what I've tried:

require 'nokogiri'
require 'set'

files = ''
input_folder = "H:/input"
output_folder = "H:/output"

if input_folder[input_folder.length-1,1] == '/'
   input_folder = input_folder[0,input_folder.length-1]
end

if output_folder[output_folder.length-1,1] != '/'
   output_folder = output_folder + '/'
end


files   = Dir[input_folder + '/*.xml'].sort_by{ |f| File.mtime(f)}
file    = File.read(input_folder + '/' + files)
doc     = Nokogiri::XML(file)
record  = {} # hashes
keys    = Set.new
records = [] # array
csv     = ""

doc.traverse do |node| 
  value = node.text.gsub(/\n +/, '')
    if node.name != "text" # skip these nodes: if class isnt text then skip
      if value.length > 0 # skip empty nodes
        key = node.name.gsub(/wd:/,'').to_sym
        if key == :Dataload_Request && !record.empty?
          records << record
          record = {}
        elsif key[/^root$|^document$/]
          # neglect these keys
        else
          key = node.name.gsub(/wd:/,'').to_sym
          # in case our value is html instead of text
          record[key] = Nokogiri::HTML.parse(value).text
          # add to our key set only if not already in the set
          keys << key
        end
      end
    end
  end

# build our csv
File.open('H:/output/.*csv', 'w') do |file|
  file.puts %Q{"#{keys.to_a.join('","')}"}
  records.each do |record|
    keys.each do |key|
      file.write %Q{"#{record[key]}",}
    end
    file.write "\n"
  end
  print ''
  print 'output files ready!'
  print ''
end

I have been getting 'read memory': no implicit conversion of Array into String (TypeError) and other errors.

Here's a quick peer review of your code, something like you'd get in a corporate environment...

Instead of writing:

input_folder = "H:/input"

input_folder[input_folder.length-1,1] == '/' # => false

Consider using the -1 offset from the end of the string to access the last character:

input_folder[-1] # => "t"

That simplifies your logic and makes it more readable because it removes unnecessary visual noise:

input_folder[-1] == '/' # => false

See [] and []= in the String documentation.
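
Going one step further, the trailing-slash checks can be avoided altogether. The following is a minimal sketch, assuming the two folder names from the question; normalize_folder is a hypothetical helper name, while String#end_with?, String#chomp and File.join are core Ruby methods:

```ruby
# Hypothetical helper: drop a single trailing slash, if there is one.
def normalize_folder(path)
  path.chomp('/')
end

input_folder  = normalize_folder('H:/input/')  # "H:/input"
output_folder = normalize_folder('H:/output')  # "H:/output"

# end_with? reads more clearly than indexing the last character.
puts 'H:/input/'.end_with?('/') # true

# File.join puts exactly one separator between the parts.
puts File.join(output_folder, 'report.csv') # H:/output/report.csv
```

With File.join doing the joining, neither folder needs to be massaged into a particular shape before use.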


This looks like a bug to me:

files   = Dir[input_folder + '/*.xml'].sort_by{ |f| File.mtime(f)}
file    = File.read(input_folder + '/' + files)

files is an array of filenames. input_folder + '/' + files tries to append an array to a string:

foo = ['1', '2'] # => ["1", "2"]
'/parent/' + foo # => 
# ~> -:9:in `+': no implicit conversion of Array into String (TypeError)
# ~>  from -:9:in `<main>'

How you want to deal with that is left as an exercise for the programmer.
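
One possible way to approach that exercise is sketched here, under the assumption that each XML file should produce a CSV file with the same base name in the output folder. csv_path_for is a hypothetical helper, and the folder names are the ones from the question:

```ruby
# Hypothetical helper: map an XML path to a CSV path in output_folder,
# keeping the same base name.
def csv_path_for(xml_path, output_folder)
  base = File.basename(xml_path, '.xml') # "H:/input/a.xml" -> "a"
  File.join(output_folder, "#{base}.csv")
end

input_folder  = 'H:/input'
output_folder = 'H:/output'

# Dir.glob returns an array, so handle one filename at a time
# instead of concatenating the whole array onto a string.
xml_paths = Dir.glob(File.join(input_folder, '*.xml')).sort_by { |f| File.mtime(f) }
xml_paths.each do |xml_path|
  puts "#{xml_path} -> #{csv_path_for(xml_path, output_folder)}"
  # parse xml_path and write its CSV here
end
```

Deriving the output name from each input name is what removes the need for a hard-coded CSV filename.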


doc.traverse do |node|

is icky because it sidesteps Nokogiri's ability to search for a particular tag using accessors. Very rarely do we need to iterate over a document tag by tag, usually only when we're peeking at its structure and layout. traverse is slower, so use it as a very last resort.


length is nice but isn't needed when checking whether a string has content:

value = 'foo'
value.length > 0 # => true
value > '' # => true

value = ''
value.length > 0 # => false
value > '' # => false

Programmers coming from Java like to use the accessors but I like being lazy, probably because of my C and Perl backgrounds.
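
For readers who prefer to state the intent explicitly, String#empty? is another option. This is a small sketch; content? is a hypothetical helper, and treating whitespace-only strings as having no content is an assumption to decide per use case:

```ruby
# Hypothetical helper: true when the string has non-whitespace content.
def content?(value)
  !value.strip.empty?
end

puts content?('foo')   # true
puts content?('')      # false

# A whitespace-only string is not empty until it has been stripped.
puts "\n   ".empty?    # false
puts content?("\n   ") # false
```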


Be careful with sub and gsub as they don't do what you're thinking they do. Both expect a regular expression, but will also take a string, which they escape before beginning their scan.

You're passing in a regular expression, which is OK in this case, but it could cause unexpected problems if you don't remember all the rules for pattern matching, or that gsub scans until the end of the string:

foo = 'wd:barwd:' # => "wd:barwd:"
key = foo.gsub(/wd:/,'') # => "bar"

In general I recommend people think a couple times before using regular expressions. I've seen some gaping holes opened up in logic written by fairly advanced programmers because they didn't know what the engine was going to do. They're wonderfully powerful, but need to be used surgically, not as a universal solution.

The same thing happens with a string, because gsub doesn't know when to quit:

key = foo.gsub('wd:','') # => "bar"

So, if you're looking to change just the first instance, use sub:

key = foo.sub('wd:','') # => "barwd:"

I'd do it a little differently though.

foo = 'wd:bar'

I can check to see what the first three characters are:

foo[0,3] # => "wd:"

Or I can replace them with something else using string indexing:

foo[0,3] = '' 
foo # => "bar"
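
One caveat with the indexing approach is that foo[0,3] = '' modifies the string in place and removes the first three characters whatever they are. A hedged alternative, assuming Ruby 2.5 or later, is String#delete_prefix, which returns a new string and only removes the prefix when it is present. An anchored regular expression behaves similarly on older versions:

```ruby
foo = 'wd:barwd:'

puts foo.delete_prefix('wd:')   # barwd:
puts 'bar'.delete_prefix('wd:') # bar (unchanged, no prefix to remove)

# \A anchors the match to the very start of the string.
puts foo.sub(/\Awd:/, '')       # barwd:

puts foo                        # wd:barwd: (the original is untouched)
```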

There's more but I think that's enough for now.

You should use Ruby's CSV class. Also, you don't need to do any string matching or regex stuff. Use Nokogiri to target elements. If you know the node names in the XML will be consistent, it should be pretty simple. I'm not exactly sure if this is the output you want, but this should get you in the right direction:

require 'nokogiri'
require 'csv'

def xml_to_csv(filename)
  xml_str = File.read(filename)
  xml_str.gsub!('record:', '') # remove the record: namespace prefix
  doc = Nokogiri::XML(xml_str)
  # swap only the trailing .xml extension for .csv
  csv_filename = filename.sub(/\.xml\z/, '.csv')

  CSV.open(csv_filename, 'wb') do |csv|
    csv << ['name', 'street_address', 'postal_code', 'age'] # header row
    csv << [
      doc.xpath('//name').text,
      doc.xpath('//Street_Address').text,
      doc.xpath('//Postal_Code').text,
      doc.xpath('//Age').text,
    ]
  end
end

# iterate over all xml files
Dir.glob('*.xml').each { |filename| xml_to_csv(filename) }
