简体   繁体   English

Shell脚本将CSV解析为XML查询?

[英]Shell script to parse CSV to an XML query?

I have a list of citations in a csv file that I would like to use to fill out the XML based query form at CrossRef 我在csv文件中有一个引用列表,我想用它在CrossRef上填写基于XML的查询表单

CrossRef provides an XML template (below, with unused fields removed), and I would like to parse the columns of the csv file to fill out repeated fields within the query tag : CrossRef提供了一个XML模板(下面,删除了未使用的字段),我想解析csv文件的列以填写query标记中的重复字段:

 <?xml version = "1.0" encoding="UTF-8"?>
<query_batch xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" version="2.0" xmlns="http://www.crossref.org/qschema/2.0"
  xsi:schemaLocation="http://www.crossref.org/qschema/2.0 http://www.crossref.org/qschema/crossref_query_input2.0.xsd">
<head>
   <email_address>test@crossref.org</email_address>
   <doi_batch_id>test</doi_batch_id>
</head>
<body>
  <query enable-multiple-hits="true"
            list-components="false"
            expanded-results="false" key="key">
    <article_title match="fuzzy"></article_title>
    <author search-all-authors="false"></author>
    <volume></volume>
    <year></year>
    <first_page></first_page>
    <journal_title></journal_title>
  </query>
</body>
</query_batch>

How can this be done in a shell script? 如何在Shell脚本中完成此操作?

sample input: 样本输入:

author,year,article_title,journal_title,volume,first_page
Adler,2006,"Biomass yield and biofuel quality of switchgrass harvested in fall or spring","Agronomy Journal",98,1518
Alexopolou,2008,"Biomass yields for upland and lowland switchgrass varieties grown in the Mediterranean region","Biomass and Bioenergy",32,926
Balasko,1984,"Yield and Quality of Switchgrass Grown without Soil Amendments.","Agronomy Journal",76,204

desired output: 所需的输出:

<?xml version = "1.0" encoding="UTF-8"?>
<query_batch xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" version="2.0" xmlns="http://www.crossref.org/qschema/2.0"
  xsi:schemaLocation="http://www.crossref.org/qschema/2.0 http://www.crossref.org/qschema/crossref_query_input2.0.xsd">
<head>
   <email_address>test@crossref.org</email_address>
   <doi_batch_id>test</doi_batch_id>
</head>
<body>
 <query>
  <author>Adler</author >
  <year>2006</year >
  <article_title>Biomass yield and biofuel quality of switchgrass harvested in fall or spring</article_title >
  <journal_title>Agronomy Journal</journal_title >
  <volume>98</volume >
  <first_page>1518</first_page >
 </query>
 <query>
  <author>Alexopolou</author >
  <year>2008</year >
  <article_title>Biomass yields for upland and lowland switchgrass varieties grown in the Mediterranean region</article_title >
  <journal_title>Biomass and Bioenergy</journal_title >
  <volume>32</volume >
  <first_page>926</first_page >
 </query>
 <query>
  <author>Balasko</author >
  <year>1984</year >
  <article_title>Yield and Quality of Switchgrass Grown without Soil Amendments.</article_title >
  <journal_title>Agronomy Journal</journal_title >
  <volume>76</volume >
  <first_page>204</first_page >
 </query>
</body>

Other questions provide some help on doing this in C# and Java 其他问题为在C#Java中执行此操作提供了一些帮助

#!/usr/bin/awk -f
# XML Attributes Must be Quoted. Attribute values must always be quoted. Either single or double quotes can be used.

BEGIN{
    FS=","
    print "<?xml version = '1.0' encoding='UTF-8'?>"
    print "<query_batch xmlns:xsi='http://www.w3.org/2001/XMLSchema-instance' version='2.0' xmlns='http://www.crossref.org/qschema/2.0'"
    print "  xsi:schemaLocation='http://www.crossref.org/qschema/2.0 http://www.crossref.org/qschema/crossref_query_input2.0.xsd'>"
    print "<head>"
    print "   <email_address>test@crossref.org</email_address>"
    print "   <doi_batch_id>test</doi_batch_id>"
    print "</head>"
    print "<body>"
}

NR>1{
    print "  <query enable-multiple-hits='true'"
    print "            list-components='false'"
    print "            expanded-results='false' key='key'>"
    print "    <article_title match='fuzzy'>" $3 "</article_title>"
    print "    <author search-all-authors='false'>" $1 "</author>"
    print "    <volume>" $5 "</volume>"
    print "    <year>" $2 "</year>"
    print "    <first_page>" $6 "</first_page>"
    print "    <journal_title>" $4 "</journal_title>"
    print "  </query>"
}

END{
    print "</body>"
    print "</query_batch>"
}

$ awk -f script.awk input.csv

Unlike the approaches using text substitution (ie. awk), this one is guaranteed to always emit a well-formed XML document, with content properly escaped. 与使用文本替换(即awk)的方法不同,可以确保此方法始终发出格式正确的XML文档,并且内容正确地转义。 It's ugly, but it's far more correct. 这很丑陋,但是更正确。 Note that this requires a 3rd-party tool; 注意,这需要第三方工具。 nothing included with the shell proper is capable of safely editing XML. Shell附带的所有内容都不能安全地编辑XML。

First, put a document with no body in template.xml : 首先,将没有body的文档放在template.xml

<query_batch xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" version="2.0" xmlns="http://www.crossref.org/qschema/2.0"
  xsi:schemaLocation="http://www.crossref.org/qschema/2.0 http://www.crossref.org/qschema/crossref_query_input2.0.xsd">
<head>
   <email_address>test@crossref.org</email_address>
   <doi_batch_id>test</doi_batch_id>
</head>
<body/>
</query_batch>

Second, build an XMLStarlet command line describing the edits desired, and invoke it: 其次,构建描述所需编辑的XMLStarlet命令行,并调用它:

#!/bin/bash
xmlstarlet_command=( )
read_header=0
while IFS=, read author year article_title journal_title volume first_page; do
  if (( read_header == 0 )); then read_header=1; continue; fi
  xmlstarlet_command+=( -s /qs:query_batch/qs:body -t elem -n query -v '' )
  xmlstarlet_command+=( -i '/qs:query_batch/qs:body/*[last()]' -t attr -n enable-multiple-hits -v true )
  xmlstarlet_command+=( -i '/qs:query_batch/qs:body/*[last()]' -t attr -n list-components -v false )
  xmlstarlet_command+=( -i '/qs:query_batch/qs:body/*[last()]' -t attr -n expanded-results -v false )
  xmlstarlet_command+=( -i '/qs:query_batch/qs:body/*[last()]' -t attr -n key -v key )
  xmlstarlet_command+=( -i '/qs:query_batch/qs:body/*[last()]' -t attr -n enable-multiple-hits -v true )
  xmlstarlet_command+=( -i '/qs:query_batch/qs:body/*[last()]' -t elem -n article_title -v "$article_title" )
  xmlstarlet_command+=( -i '/qs:query_batch/qs:body/*[last()]/article-title' -t attr -n match -v fuzzy )
  xmlstarlet_command+=( -i '/qs:query_batch/qs:body/*[last()]' -t elem -n author -v "$author" )
  xmlstarlet_command+=( -i '/qs:query_batch/qs:body/*[last()]/author' -t attr -n search-all-authors -v false )
  xmlstarlet_command+=( -i '/qs:query_batch/qs:body/*[last()]' -t elem -n volume -v "$volume" )
  xmlstarlet_command+=( -i '/qs:query_batch/qs:body/*[last()]' -t elem -n year -v "$year" )
  xmlstarlet_command+=( -i '/qs:query_batch/qs:body/*[last()]' -t elem -n first_page -v "$first_page" )
  xmlstarlet_command+=( -i '/qs:query_batch/qs:body/*[last()]' -t elem -n journal_title -v "$journal_title" )
done <in.csv
xmlstarlet ed -N qs=http://www.crossref.org/qschema/2.0 "${xmlstarlet_command[@]}" <template.xml

Note that, like other solutions given here, this doesn't strip the double quotes from the beginning and end of the CSV elements; 请注意,就像这里提供的其他解决方案一样,这不会从CSV元素的开头和结尾去除双引号; like other aspects of advanced CSV parsing, this is better left to something like the Python CSV module, which actually knows how to recognize escaped quotes, text fields containing newlines, and all the other little oddities that can happen inside valid CSV files. 像高级CSV解析的其他方面一样,最好留给Python CSV模块之类的东西,它实际上知道如何识别转义的引号,包含换行符的文本字段以及有效CSV文件中可能发生的所有其他小问题。

As an aside -- be aware that older versions of XMLStarlet have a limit on the number of operations per invocation fixed in the latest release. 顺便说一句 -请注意,较旧的XMLStarlet版本对最新版本中固定的每次调用操作数限制 I have a workaround for this (which also allows edit lists longer than the ~32K or so maximum command line length), but it probably deserves to be its own question. 我对此有一个解决方法(它也允许编辑列表的长度超过〜32K左右最大命令行长度),但它可能应该成为自己的问题。

声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM