
Shell script to parse CSV to an XML query?

I have a list of references in a CSV file, and I would like to use it to fill in an XML-based query form at CrossRef.

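CrossRef provides an XML template (below, with unused fields removed), and I would like to parse the columns of the CSV file to fill in the repeated fields inside the query tag: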

<?xml version = "1.0" encoding="UTF-8"?>
<query_batch xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" version="2.0" xmlns="http://www.crossref.org/qschema/2.0"
  xsi:schemaLocation="http://www.crossref.org/qschema/2.0 http://www.crossref.org/qschema/crossref_query_input2.0.xsd">
<head>
   <email_address>test@crossref.org</email_address>
   <doi_batch_id>test</doi_batch_id>
</head>
<body>
  <query enable-multiple-hits="true"
            list-components="false"
            expanded-results="false" key="key">
    <article_title match="fuzzy"></article_title>
    <author search-all-authors="false"></author>
    <volume></volume>
    <year></year>
    <first_page></first_page>
    <journal_title></journal_title>
  </query>
</body>
</query_batch>

How can I do this in a shell script?

Sample input:

author,year,article_title,journal_title,volume,first_page
Adler,2006,"Biomass yield and biofuel quality of switchgrass harvested in fall or spring","Agronomy Journal",98,1518
Alexopolou,2008,"Biomass yields for upland and lowland switchgrass varieties grown in the Mediterranean region","Biomass and Bioenergy",32,926
Balasko,1984,"Yield and Quality of Switchgrass Grown without Soil Amendments.","Agronomy Journal",76,204

Desired output:

<?xml version = "1.0" encoding="UTF-8"?>
<query_batch xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" version="2.0" xmlns="http://www.crossref.org/qschema/2.0"
  xsi:schemaLocation="http://www.crossref.org/qschema/2.0 http://www.crossref.org/qschema/crossref_query_input2.0.xsd">
<head>
   <email_address>test@crossref.org</email_address>
   <doi_batch_id>test</doi_batch_id>
</head>
<body>
 <query>
  <author>Adler</author >
  <year>2006</year >
  <article_title>Biomass yield and biofuel quality of switchgrass harvested in fall or spring</article_title >
  <journal_title>Agronomy Journal</journal_title >
  <volume>98</volume >
  <first_page>1518</first_page >
 </query>
 <query>
  <author>Alexopolou</author >
  <year>2008</year >
  <article_title>Biomass yields for upland and lowland switchgrass varieties grown in the Mediterranean region</article_title >
  <journal_title>Biomass and Bioenergy</journal_title >
  <volume>32</volume >
  <first_page>926</first_page >
 </query>
 <query>
  <author>Balasko</author >
  <year>1984</year >
  <article_title>Yield and Quality of Switchgrass Grown without Soil Amendments.</article_title >
  <journal_title>Agronomy Journal</journal_title >
  <volume>76</volume >
  <first_page>204</first_page >
 </query>
</body>
</query_batch>

Other questions provide some help with doing this in C# and Java.

#!/usr/bin/awk -f
# XML Attributes Must be Quoted. Attribute values must always be quoted. Either single or double quotes can be used.

BEGIN{
    FS=","
    print "<?xml version = '1.0' encoding='UTF-8'?>"
    print "<query_batch xmlns:xsi='http://www.w3.org/2001/XMLSchema-instance' version='2.0' xmlns='http://www.crossref.org/qschema/2.0'"
    print "  xsi:schemaLocation='http://www.crossref.org/qschema/2.0 http://www.crossref.org/qschema/crossref_query_input2.0.xsd'>"
    print "<head>"
    print "   <email_address>test@crossref.org</email_address>"
    print "   <doi_batch_id>test</doi_batch_id>"
    print "</head>"
    print "<body>"
}

NR>1{
    print "  <query enable-multiple-hits='true'"
    print "            list-components='false'"
    print "            expanded-results='false' key='key'>"
    print "    <article_title match='fuzzy'>" $3 "</article_title>"
    print "    <author search-all-authors='false'>" $1 "</author>"
    print "    <volume>" $5 "</volume>"
    print "    <year>" $2 "</year>"
    print "    <first_page>" $6 "</first_page>"
    print "    <journal_title>" $4 "</journal_title>"
    print "  </query>"
}

END{
    print "</body>"
    print "</query_batch>"
}

$ awk -f script.awk input.csv

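Unlike approaches based on text substitution (such as awk), the XMLStarlet approach below is guaranteed to always emit a well-formed XML document with correctly escaped content. It is ugly, but it is more correct. Note that it requires a third-party tool: nothing that ships with the shell can edit XML safely.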
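To illustrate the escaping concern, here is a minimal Python sketch (standard library only). The title string is a made-up example, not a row from the question's data, and `is_well_formed` is a helper name introduced here for illustration. It shows that pasting raw text between tags, as the awk script does, can yield a document that no longer parses, while escaping the text first keeps it well-formed.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Hypothetical title containing characters that are special in XML.
title = "Yield & quality of <i>Panicum virgatum</i>"

# Plain text substitution: the bare "&" makes the result ill-formed.
naive = "<article_title>" + title + "</article_title>"

# Escaping &, < and > before substitution keeps the element well-formed.
safe = "<article_title>" + escape(title) + "</article_title>"


def is_well_formed(xml_text):
    """Return True if xml_text parses as XML, False otherwise."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


if __name__ == "__main__":
    print("naive:", is_well_formed(naive))
    print("safe: ", is_well_formed(safe))
```

A tool that builds the document as a tree, rather than as text, performs this escaping automatically.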

First, put the document, without its body content, in template.xml:

<query_batch xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" version="2.0" xmlns="http://www.crossref.org/qschema/2.0"
  xsi:schemaLocation="http://www.crossref.org/qschema/2.0 http://www.crossref.org/qschema/crossref_query_input2.0.xsd">
<head>
   <email_address>test@crossref.org</email_address>
   <doi_batch_id>test</doi_batch_id>
</head>
<body/>
</query_batch>

Second, build an XMLStarlet command line describing the required edits, and invoke it:

#!/bin/bash
# Collect every edit operation in an array, then apply them all in one xmlstarlet call.
xmlstarlet_command=( )
read_header=0
# read -r keeps backslashes in the CSV fields intact
while IFS=, read -r author year article_title journal_title volume first_page; do
  # skip the CSV header row
  if (( read_header == 0 )); then read_header=1; continue; fi
  # append a new, empty query element to body
  xmlstarlet_command+=( -s /qs:query_batch/qs:body -t elem -n query -v '' )
  # -i with -t attr adds an attribute to the selected element (the query just appended)
  xmlstarlet_command+=( -i '/qs:query_batch/qs:body/*[last()]' -t attr -n enable-multiple-hits -v true )
  xmlstarlet_command+=( -i '/qs:query_batch/qs:body/*[last()]' -t attr -n list-components -v false )
  xmlstarlet_command+=( -i '/qs:query_batch/qs:body/*[last()]' -t attr -n expanded-results -v false )
  xmlstarlet_command+=( -i '/qs:query_batch/qs:body/*[last()]' -t attr -n key -v key )
  # -s appends child elements to the query; -i with -t elem would insert siblings before it
  xmlstarlet_command+=( -s '/qs:query_batch/qs:body/*[last()]' -t elem -n article_title -v "$article_title" )
  xmlstarlet_command+=( -i '/qs:query_batch/qs:body/*[last()]/article_title' -t attr -n match -v fuzzy )
  xmlstarlet_command+=( -s '/qs:query_batch/qs:body/*[last()]' -t elem -n author -v "$author" )
  xmlstarlet_command+=( -i '/qs:query_batch/qs:body/*[last()]/author' -t attr -n search-all-authors -v false )
  xmlstarlet_command+=( -s '/qs:query_batch/qs:body/*[last()]' -t elem -n volume -v "$volume" )
  xmlstarlet_command+=( -s '/qs:query_batch/qs:body/*[last()]' -t elem -n year -v "$year" )
  xmlstarlet_command+=( -s '/qs:query_batch/qs:body/*[last()]' -t elem -n first_page -v "$first_page" )
  xmlstarlet_command+=( -s '/qs:query_batch/qs:body/*[last()]' -t elem -n journal_title -v "$journal_title" )
done <in.csv
xmlstarlet ed -N qs=http://www.crossref.org/qschema/2.0 "${xmlstarlet_command[@]}" <template.xml

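Note that, like the other solution offered here, this does not strip the double quotes from the start and end of CSV fields. Like other aspects of advanced CSV parsing, that is best left to something like the Python csv module, which actually knows how to recognize escaped quotes, text fields containing newlines, and all the other small quirks that can occur in a valid CSV file.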
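As a rough sketch of that alternative, the following Python script uses only the standard library (`csv` and `xml.etree.ElementTree`) to build the same document. It is not part of the original answers. The helper names `q` and `build_query_batch` are made up for illustration, and the sketch assumes the CSV has the header row shown in the question. The e-mail address and batch id are the placeholder values from the template.

```python
import csv
import io
import xml.etree.ElementTree as ET

NS = "http://www.crossref.org/qschema/2.0"
XSI = "http://www.w3.org/2001/XMLSchema-instance"
SCHEMA = NS + " http://www.crossref.org/qschema/crossref_query_input2.0.xsd"

# Serialize NS as the default namespace and XSI under the "xsi" prefix.
ET.register_namespace("", NS)
ET.register_namespace("xsi", XSI)


def q(name):
    """Qualify an element name with the CrossRef query namespace."""
    return "{%s}%s" % (NS, name)


def build_query_batch(csv_file):
    """Build the query_batch tree from an open CSV file with a header row."""
    root = ET.Element(q("query_batch"),
                      {"version": "2.0", "{%s}schemaLocation" % XSI: SCHEMA})
    head = ET.SubElement(root, q("head"))
    ET.SubElement(head, q("email_address")).text = "test@crossref.org"
    ET.SubElement(head, q("doi_batch_id")).text = "test"
    body = ET.SubElement(root, q("body"))
    # DictReader strips the quoting and handles embedded commas and quotes.
    for row in csv.DictReader(csv_file):
        query = ET.SubElement(body, q("query"), {
            "enable-multiple-hits": "true",
            "list-components": "false",
            "expanded-results": "false",
            "key": "key",
        })
        title = ET.SubElement(query, q("article_title"), {"match": "fuzzy"})
        title.text = row["article_title"]
        author = ET.SubElement(query, q("author"), {"search-all-authors": "false"})
        author.text = row["author"]
        for field in ("volume", "year", "first_page", "journal_title"):
            ET.SubElement(query, q(field)).text = row[field]
    return root


if __name__ == "__main__":
    # One row from the question's sample input, inlined to keep the demo self-contained.
    sample = io.StringIO(
        "author,year,article_title,journal_title,volume,first_page\n"
        'Adler,2006,"Biomass yield and biofuel quality of switchgrass '
        'harvested in fall or spring","Agronomy Journal",98,1518\n'
    )
    print(ET.tostring(build_query_batch(sample), encoding="unicode"))
```

To process a real file, open it with `open(path, newline="")`, as the csv module documentation recommends, and pass the file object to `build_query_batch`. Because the text is assigned to tree nodes rather than pasted into a string, the serializer escapes characters such as `&` and `<`.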

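As an aside, note that older XMLStarlet versions have a limit on the number of operands per invocation, which is fixed in the latest versions. I have a workaround for this (which also allows edit lists longer than the maximum command-line length of roughly 32K), but it should probably be its own question.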
