Grab blocks from an XML file using data from a source file

Question

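I have revised this question since reading up a bit on XML.

I have a source file that contains a list of AuthNumbers: 111222 111333 111444 etc.

I need to search for the numbers in that list and find them in the corresponding XML file. In the XML file, the line is formatted like this: <trpcAuthCode>111222</trpcAuthCode>

This is very easy to do with grep, but I need the entire block that contains the transaction.

The block starts with: <trans type="network sale" recalled="false"> or <trans type="network sale" recalled="false" rollback="true"> and/or some other variations. Actually, <trans*> would be best, if that is possible.

The block ends with </trans>

It doesn't need to be elegant or efficient. I just need it to work. I suspect some transactions are dropping out, and I need a quick way to review the ones that are not being processed.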

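If it helps, here is a link to the original (sanitized) XML: https://www.dropbox.com/s/cftn23tnz8uc9t8/main.xml?dl=0

What I want to extract: https://www.dropbox.com/s/b2bl053nom4brkk/transaction_results.xml?dl=0

The size of each result will vary, because transactions differ greatly in length depending on the number of products purchased. In the results XML, you can see that I extracted the XML I need based on the trpcAuthCode list 111222, 111333, 111444.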

Answer 1

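Concerning questions about XML and awk, you will often find comments from the gurus (those with a k in their reputation) that XML processing in awk is complicated or insufficient. As I understand it, the script is needed for personal and/or debugging purposes. For this, my solution should be sufficient, but please keep in mind that it will not work on every legal XML file.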

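Based on your description, a sketch of the script is:

If <trans*> is matched, start recording.
If <trpcAuthCode> is found, get its content and compare it with the list. If it matches, remember to output the block.
If </trans> is matched, stop recording. If output has been enabled, print the recorded block; otherwise discard it.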

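Because I did something similar in SO: Shell scripting - split xml into multiple files, this should not be too hard to implement.

However, one additional feature is needed: feeding the AuthNumbers array into the script. By a surprising coincidence, I learned the answer this morning in SO: How to access an array in an awk, which is declared in a different awk in shell? (thanks to the comment of jas).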
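
The hand-over of the list can be tried in isolation. What follows is only a minimal sketch, assuming a POSIX awk; the function name has_code is made up for illustration and is not part of the script below. Unlike the wrapper script, it joins the codes with spaces before passing them with -v, so that the value contains no literal newline.

```shell
#!/bin/sh
# has_code CODE: reads a list of auth codes (one per line) from stdin and
# prints "yes" if CODE is in the list, else "no".
# (hypothetical helper, for illustration only)
has_code() {
  # join the input lines with spaces so that the -v value has no newline
  joined=$(tr '\n' ' ')
  awk -v authCodes="$joined" -v code="$1" 'BEGIN {
    # split() stores the values under the numeric keys 1..n
    n = split(authCodes, list, " ")
    # make the values themselves the keys, so that "in" can look them up
    for (i = 1; i <= n; i++) keys[list[i]]
    # concatenating "" makes sure the subscript is used as a string
    if ((code "") in keys) print "yes"; else print "no"
  }'
}

printf '111222\n111333\n111444\n' | has_code 111333   # a code from the list
printf '111222\n111333\n111444\n' | has_code 532524   # a code not in the list
```

The loop is the step that matters: after split() alone, the codes are values under the keys 1, 2, 3, and the in operator would only find those numbers.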

So, putting it all together in a script filter-xml-trpcAuthCode.awk:

BEGIN {
  record = 0 # state for recording
  buffer = "" # buffer for recording
  found = 0 # state for found auth code
  # build temp. array from authCodes which has to be pre-defined
  split(authCodes, list, "\n")
  # build final array where values become keys
  for (i in list) authCodeList[list[i]]
  # for debugging: output of authCodeList
  print "<!-- authCodeList:"
  for (authCode in authCodeList) {
    print authCode
  }
  print "-->"
}

/<trans( [^>]*)?>/ {
  record = 1 # start recording
  buffer = "" # clear buffer
  found = 0 # reset state for found auth code
}

record {
  buffer = buffer"\n"$0 # record line (if recording is enabled)
}

record && /<trpcAuthCode>/ {
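  # extract auth code (gensub() is a gawk extension)
  authCode = gensub(/^.*>([^<]*)<\/trpcAuthCode.*$/, "\\1", "g")
  # check whether auth code is in authCodeList; a transaction may contain
  # several auth codes, so an earlier match must not be overwritten
  if (authCode in authCodeList) found = 1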
}

/<\/trans>/ {
  record = 0 # stop recording
  # print buffer if auth code has been found
  if (found) {
    print buffer
  }
}

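Notes:

Initially, I just applied split() to authCodes in BEGIN. This produces an array in which the split values are stored under enumerated keys. Hence, I looked for a solution to make the values themselves the keys of the array. (Otherwise, the in operator cannot be used for the search.) I found an elegant solution in the accepted answer of SO: Check if an awk array contains a value.
I implemented the proposed pattern <trans*> as /<trans( [^>]*)?>/ which even matches <trans> (although <trans> seems never to occur without attributes) but not <transSet>.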
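The line
buffer = buffer"\n"$0
appends the current line to the previous content. $0 contains the line without the newline character. Thus, it has to be re-inserted. The way I did it, the buffer starts with a newline, but the last line ends without one. Considering that print buffer adds a newline at the end of the text, this is fine for me. Alternatively, the above snippet could be replaced by
buffer = buffer $0 "\n"
or even
buffer = (buffer != "" ? buffer"\n" : "") $0
(This is a matter of taste.)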
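The filtered file is simply printed to the standard output channel. It can be redirected to a file. With this in mind, I formatted the additional/debugging output as an XML comment.
If you are a little familiar with awk, you might notice that there is no next statement in my script. This is intentional. In other words, the order of the rules is chosen carefully so that all rules can process/affect one line consecutively. (I tested a corner case:
<trans><trpcAuthCode>111222</trpcAuthCode></trans>
and even this is handled correctly.)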
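
The behaviour of the start pattern described in the notes can be checked in isolation. A minimal sketch, assuming a POSIX awk; the function name opens_trans and the input lines are made up for illustration:

```shell
#!/bin/sh
# opens_trans: print every input line that the start rule of the script
# would match (hypothetical helper, for illustration only)
opens_trans() {
  awk '/<trans( [^>]*)?>/'
}

printf '%s\n' \
  '<trans>' \
  '<trans type="network sale" recalled="false" rollback="true">' \
  '<transSet periodID="1">' \
  '</trans>' | opens_trans
```

Only the first two lines should be printed. The optional group has to start with a space, so the S that follows <trans in <transSet prevents a match, and </trans> does not contain <trans at all.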

To simplify testing, I added a wrapper bash script filter-xml-trpcAuthCode.sh:

#!/usr/bin/bash
# uncomment next line for debugging
#set -x
# check command line arguments
if [[ $# -ne 2 ]]; then
  echo "ERROR: Illegal number of command line arguments!"
  echo ""
  echo "Usage:"
  echo "$(basename "$0") XML_FILE AUTH_CODES"
  exit 1
fi
# call awk script
awk -v authCodes="$(cat < "$2")" -f filter-xml-trpcAuthCode.awk "$1"

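I tested the script against your sample file main.xml (on Windows 10 with bash in cygwin) and got four matching blocks. I was a little concerned about the output, because your sample output transaction_results.xml contains only three matching blocks. But a visual check of my output suggests it is appropriate. (All four matches contain a matching <trpcAuthCode> element.)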
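
A quick cross-check of such a result is to count the blocks and to list the auth codes they contain. The following is only a sketch, assuming (as in the sample) that each <trpcAuthCode> element sits on a line of its own; the function names count_blocks and list_codes are made up for illustration.

```shell
#!/bin/sh
# Both helpers read filter output from stdin (hypothetical names).

# count_blocks: print the number of lines containing a closing </trans> tag
count_blocks() {
  awk '/<\/trans>/ { n++ } END { print n + 0 }'
}

# list_codes: print the content of every <trpcAuthCode> element, one per line
list_codes() {
  sed -n 's/.*<trpcAuthCode>\([^<]*\)<\/trpcAuthCode>.*/\1/p'
}

# tiny stand-in for real filter output
demo='<trans type="network sale">
<trpcAuthCode>111222</trpcAuthCode>
</trans>
<trans type="network sale">
<trpcAuthCode>111222</trpcAuthCode>
</trans>'

printf '%s\n' "$demo" | count_blocks
# count how often each auth code occurs
printf '%s\n' "$demo" | list_codes | sort | uniq -c
```

Applied to real filter output, an auth code that shows up more than once in the list_codes result would be one possible reason for getting more blocks than there are codes in the list.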

For demonstration, I reduced your sample input a bit to sample.xml:

<?xml version="1.0"?>
<transSet periodID="1" periodname="Shift" longId="2017-04-27" shortId="052" site="12345">
  <trans type="periodClose">
    <trHeader>
    </trHeader>
  </trans>
  <printCashier>
    <cashier sysid="7" empNum="07" posNum="101" period="11">A.Dude</cashier>
  </printCashier>
  <trans type="printCashier">
    <trHeader>
      <cashier sysid="7" empNum="07" posNum="101" period="11">A.Dude</cashier>
      <posNum>101</posNum>
    </trHeader>
  </trans>
  <trans type="journal">
    <trHeader>
    </trHeader>
  </trans>
  <trans type="network sale" recalled="false">
    <trHeader>
      <termMsgSN type="FINANCIAL" term="908">31054</termMsgSN>
    </trHeader>
    <trPaylines>
      <trPayline type="sale" sysid="1" locale="DOLLAR">
        <trpCardInfo>
          <trpcAccount>1234567890123456</trpcAccount>
          <trpcAuthCode>532524</trpcAuthCode>
       </trpCardInfo>
      </trPayline>
    </trPaylines>
  </trans>
  <trans type="network sale" recalled="false">
    <trHeader>
      <termMsgSN type="FINANCIAL" term="908">31054</termMsgSN>
    </trHeader>
    <trPaylines>
      <trPayline type="sale" sysid="1" locale="DOLLAR">
        <trpPaycode mop="3" cat="1" nacstendercode="generic" nacstendersubcode="generic">CREDIT</trpPaycode>
        <trpAmt>61.77</trpAmt>
        <trpCardInfo>
          <trpcAccount>2345678901234567</trpcAccount>
          <trpcAuthCode>111222</trpcAuthCode>
        </trpCardInfo>
      </trPayline>
    </trPaylines>
  </trans>
  <trans type="periodClose">
    <trHeader>
      <date>2017-04-27T23:50:17-04:00</date>
    </trHeader>
  </trans>
  <endTotals>
    <insideSales>445938.63</insideSales>
  </endTotals>
</transSet>

For the other sample input, I just copied the text into a file authCodes.txt:

111222
111333
111444

Using both input files in a sample session:

$ ./filter-xml-trpcAuthCode.sh
ERROR: Illegal number of command line arguments!

Usage:
filter-xml-trpcAuthCode.sh XML_FILE AUTH_CODES

$ ./filter-xml-trpcAuthCode.sh sample.xml authCodes.txt
<!-- authCodeList:
111222
111333
111444
-->

  <trans type="network sale" recalled="false">
    <trHeader>
      <termMsgSN type="FINANCIAL" term="908">31054</termMsgSN>
    </trHeader>
    <trPaylines>
      <trPayline type="sale" sysid="1" locale="DOLLAR">
        <trpPaycode mop="3" cat="1" nacstendercode="generic" nacstendersubcode="generic">CREDIT</trpPaycode>
        <trpAmt>61.77</trpAmt>
        <trpCardInfo>
          <trpcAccount>2345678901234567</trpcAccount>
          <trpcAuthCode>111222</trpcAuthCode>
        </trpCardInfo>
      </trPayline>
    </trPaylines>
  </trans>

$ ./filter-xml-trpcAuthCode.sh main.xml authCodes.txt >output.txt

$

The last command redirects the output to the file output.txt, which can then be inspected or processed.
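
Since the goal of the question is to find transactions that went missing, the reverse check may help as well: which codes from the list never occur in the XML file at all? A minimal sketch, assuming a POSIX awk, one code per line in the list, at most one <trpcAuthCode> element per line, and a non-empty XML file; the function name missing_codes is made up for illustration.

```shell
#!/bin/sh
# missing_codes XML_FILE AUTH_CODES  (hypothetical helper)
# Prints every code of AUTH_CODES without a <trpcAuthCode> element in XML_FILE.
missing_codes() {
  awk '
    NR == FNR {   # true only while the first file (the XML) is read
      if (match($0, /<trpcAuthCode>[^<]*<\/trpcAuthCode>/)) {
        # cut off the opening tag (14 characters) and the closing tag (15)
        seen[substr($0, RSTART + 14, RLENGTH - 29)]
      }
      next
    }
    # second file: print the codes that were never seen
    NF && !(($1 "") in seen) { print $1 }
  ' "$1" "$2"
}

# demo with two throw-away files
dir=$(mktemp -d)
printf '<trans>\n<trpcAuthCode>111222</trpcAuthCode>\n</trans>\n' > "$dir/demo.xml"
printf '111222\n111333\n' > "$dir/codes.txt"
missing_codes "$dir/demo.xml" "$dir/codes.txt"
rm -r "$dir"
```

With the files of the session above, the call would be missing_codes main.xml authCodes.txt.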

Solution 1
0 Accepted 2017-04-30 08:51:25
