Parsing a JSON-like configuration file using R or AWK

Question

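I need your help: I last worked with AWK many years ago, and my knowledge is very rusty. I refreshed my memory by reading some guides, but I am sure my code still contains errors. Most of the related questions I have read on SO deal with parsing standard JSON, so that advice does not apply to my case. The only answer close to what I am looking for is the accepted answer to the SO question about using awk and sed to parse and update a Puppet file. However, I am trying to implement two-pass parsing, which I do not see in that answer (or do not understand well enough).

After considering other options (from R itself to m4 and the various template engines in between), I thought about implementing the solution in R with just the jsonlite and stringr packages, but that is not elegant. I decided to write a short AWK script that parses my R project's data collection configuration files before my R code reads them. These files are mostly JSON, with some additions:

1) They contain embedded variables: parameters that refer to the values of JSON elements in the same file (for simplicity, I decided to place those elements at the root of the JSON tree);

2) A parameter is marked by an asterisk (*) immediately before the name of the corresponding element.

Originally I planned two types of embedded variables, both of which you can see here: internal (references to JSON elements in the same file, format: ${var}) and external (user-supplied, format: %{var}). However, the mechanism for passing values to external parameters, and its benefit, are still unclear to me, so for now I am focusing only on parsing configuration files with internal variables. Please disregard external variables for the time being.

Configuration file example: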

{
   "*source":"SourceForge",
   "*action":"import",
   "*schema":"sf0314",
   "data":[
      {
         "indicatorName":"test1",
         "indicatorDescription":"Test Indicator 1",
         "indicatorType":"numeric",
         "resultType":"numeric",
         "requestSQL":"SELECT * FROM sf0305.users WHERE user_id < 100"
      },
      {
         "indicatorName":"test2",
         "indicatorDescription":"Test Indicator 2",
         "indicatorType":"numeric",
         "resultType":"numeric",
         "requestSQL":"SELECT * 
                       FROM sf1104.users a, sf1104.artifact b 
                       WHERE a.user_id = b.submitted_by AND b.artifact_id = 304727"
      },
      {
         "indicatorName":"totalProjects",
         "indicatorDescription":"Total number of unique projects",
         "indicatorType":"numeric",
         "resultType":"numeric",
         "requestSQL":"SELECT COUNT(DISTINCT group_id) FROM ${schema}.user_group"
      },
      {
         "indicatorName":"totalDevs",
         "indicatorDescription":"Total number of developers per project",
         "indicatorType":"numeric",
         "resultType":"data.frame",
         "requestSQL":"SELECT COUNT(*) FROM ${schema}.user_group WHERE group_id = %{group_id}"
      }
   ]
}

AWK script:

#!/usr/bin/awk -f

BEGIN {
  first_pass = true;
  param = "\"\*[a-zA-Z^0-9]+?\"";
  regex = "\$\{[a-zA-Z^0-9]+?\}";
  params[""] = 0;
}

{
  if (first_pass)
    if (match($0, param)) {
      print(substr($0, RSTART, RLENGTH));
      params[param] = substr($0, RSTART, RLENGTH);
    }
  else
      gsub(regex, params[regex], $0);
}

END {
  if (first_pass) {
    ARGC++;
    ARGV[ARGIND++] = FILENAME;
    first_pass = false;
    nextfile;
  }
}

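Any help is much appreciated! Thank you!

UPDATE (based on G. Grothendieck's answer)

The following code (wrapped in a function and slightly modified from the original answer) behaves incorrectly: it unexpectedly outputs the values of all marked (with "_") configuration keys instead of only the value of the referenced key: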

generateConfig <- function(configTemplate, configFile) {

  suppressPackageStartupMessages(suppressWarnings(library(tcltk)))
  if (!require(gsubfn)) install.packages('gsubfn')
  library(gsubfn)

  regexKeyValue <- '"_([^"]*)":"([^"]*)"'
  regexVariable <- "[$]{([[:alpha:]][[:alnum:].]*)}"

  cfgTmpl <- readLines(configTemplate)

  defns <- strapplyc(cfgTmpl, regexKeyValue, simplify = rbind)
  dict <- setNames(defns[, 2], defns[, 1])
  config <- gsubfn(regexVariable, dict, cfgTmpl)

  writeLines(config, con = configFile)
}

The function is called as follows:

if (updateNeeded()) {
  <...>
  generateConfig(SRDA_TEMPLATE, SRDA_CONFIG)
}

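UPDATE 2 (at G. Grothendieck's request)

The function updateNeeded() checks the existence and modification times of both files and, based on that logic, determines whether the configuration file needs to be (re)generated (it returns a boolean).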
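The body of updateNeeded() is not shown in the post, so the exact logic is unknown. Purely as an illustration of this kind of check, here is a minimal shell sketch under stated assumptions: the name update_needed is hypothetical, regeneration is assumed to be needed when the config is missing or older than the template, and a missing template is assumed to mean "nothing to do". It relies on the -nt file test, which bash, dash and ksh support.

```shell
# Hypothetical shell counterpart of updateNeeded(); the real R function is not shown.
# Succeeds (exit status 0) when the config should be (re)generated from the template.
update_needed() {
  template=$1
  config=$2
  [ -e "$template" ] || return 1   # assumption: no template means nothing to generate
  [ ! -e "$config" ] && return 0   # config missing: generate it
  [ "$template" -nt "$config" ]    # regenerate only if the template is newer
}
```

It could be used as `update_needed ./SourceForge.cfg.tmpl ./SourceForge.cfg.json && echo "regenerate"`.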

Here are the contents of the template configuration file (SRDA_TEMPLATE <- "./SourceForge.cfg.tmpl"):

{
   "_source":"SourceForge",
   "_action":"import",
   "_schema":"sf0314",
   "data":[
      {
         "indicatorName":"test1",
         "indicatorDescription":"Test Indicator 1",
         "indicatorType":"numeric",
         "resultType":"numeric",
         "requestSQL":"SELECT * FROM sf0305.users WHERE user_id < 100"
      },
      {
         "indicatorName":"test2",
         "indicatorDescription":"Test Indicator 2",
         "indicatorType":"numeric",
         "resultType":"numeric",
         "requestSQL":"SELECT * 
                       FROM sf1104.users a, sf1104.artifact b 
                       WHERE a.user_id = b.submitted_by AND b.artifact_id = 304727"
      },
      {
         "indicatorName":"totalProjects",
         "indicatorDescription":"Total number of unique projects",
         "indicatorType":"numeric",
         "resultType":"numeric",
         "requestSQL":"SELECT COUNT(DISTINCT group_id) FROM ${schema}.user_group"
      },
      {
         "indicatorName":"totalDevs",
         "indicatorDescription":"Total number of developers per project",
         "indicatorType":"numeric",
         "resultType":"data.frame",
         "requestSQL":"SELECT COUNT(*) FROM ${schema}.user_group WHERE group_id = 78745"
      }
   ]
}

Here are the contents of the automatically generated configuration file (SRDA_CONFIG <- "./SourceForge.cfg.json"):

{
   "_source":"SourceForge",
   "_action":"import",
   "_schema":"sf0314",
   "data":[
      {
         "indicatorName":"test1",
         "indicatorDescription":"Test Indicator 1",
         "indicatorType":"numeric",
         "resultType":"numeric",
         "requestSQL":"SELECT * FROM sf0305.users WHERE user_id < 100"
      },
      {
         "indicatorName":"test2",
         "indicatorDescription":"Test Indicator 2",
         "indicatorType":"numeric",
         "resultType":"numeric",
         "requestSQL":"SELECT * 
                       FROM sf1104.users a, sf1104.artifact b 
                       WHERE a.user_id = b.submitted_by AND b.artifact_id = 304727"
      },
      {
         "indicatorName":"totalProjects",
         "indicatorDescription":"Total number of unique projects",
         "indicatorType":"numeric",
         "resultType":"numeric",
         "requestSQL":"SELECT COUNT(DISTINCT group_id) FROM SourceForge import sf0314.user_group"
      },
      {
         "indicatorName":"totalDevs",
         "indicatorDescription":"Total number of developers per project",
         "indicatorType":"numeric",
         "resultType":"data.frame",
         "requestSQL":"SELECT COUNT(*) FROM SourceForge import sf0314.user_group WHERE group_id = 78745"
      }
   ]
}

Note the SourceForge and import values unexpectedly inserted before sf0314.

Help from the answer's author would be much appreciated!

Answer 1

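I assume the goal is to replace each occurrence of ${...} with the definition given on the starred lines. The post indicates you are looking at awk because the R solution was not elegant, but I think that may be due to the approach taken in R, and I assume an R solution is still acceptable if a different approach yields a reasonably compact one.

Here config.json is the name of the input JSON file, and config.out.json is the output file with the definitions substituted.

We read in the file and use strapplyc to extract the definitions into a 2-column matrix, defns. We rework that into a list, dict, whose values are the values of the variables and whose names are the names of the variables. Then we use gsubfn to insert the definitions using the dict list. Finally, we write the result back out.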

library(gsubfn)

Lines <- readLines("config.json")

defns <- strapplyc(Lines, '"\\*([^"]*)":"([^"]*)"', simplify = rbind)
dict <- setNames(as.list(defns[, 2]), defns[, 1])
Lines.out <- gsubfn("[$]{([[:alpha:]][[:alnum:].]*)}", dict, Lines)

writeLines(Lines.out, con = "config.out.json")

REVISED: dict should be a list, not a named character vector.

Answer 2

I believe this:

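#!/usr/bin/awk -f
# Requires gawk: the three-argument match() is a GNU extension.

BEGIN {
  # "*name":"value" definitions; captures the name and the value
  param = "\"\\*([a-zA-Z]+)\":\"([^\"]*)\"";
  # ${name} references; captures the name
  regex = "\\$\\{([a-zA-Z]+)\\}";
}

# First pass over the file: collect the definitions
NR == FNR {
    if (match($0, param, a)) {
      params[a[1]] = a[2]
    }
    next
}

# Second pass: replace references one at a time, so that each ${name}
# gets its own value; unknown names are left untouched
{
  out = ""
  while (match($0, regex, a)) {
    out = out substr($0, 1, RSTART - 1) ((a[1] in params) ? params[a[1]] : a[0])
    $0 = substr($0, RSTART + RLENGTH)
  }
  print out $0
}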

does what you want for the given input (when run as awk -f file.awk input.conf input.conf).
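The three-argument match() used above is a gawk extension. For systems where awk is mawk or another POSIX awk, the same two-pass idea (the file is named twice, and NR == FNR is true only during the first read) can be sketched with match(), RSTART, RLENGTH and substr() alone. This is a sketch under stated assumptions, not a tested replacement: the wrapper name expand_two_pass is hypothetical, each definition is assumed to sit on one line in the form "*name":"value" as in the sample file, and references to undefined names are left as they are.

```shell
# Hypothetical helper: two-pass expansion of ${name} using only POSIX awk features.
# Usage: expand_two_pass input.conf > output.conf
expand_two_pass() {
  awk '
    # Pass 1 (NR == FNR holds only while reading the first copy of the file):
    # remember every "*name":"value" definition.
    NR == FNR {
      if (match($0, /"\*[A-Za-z]+":"[^"]*"/)) {
        def = substr($0, RSTART + 2, RLENGTH - 3)    # name":"value
        sep = index(def, "\":\"")
        params[substr(def, 1, sep - 1)] = substr(def, sep + 3)
      }
      next
    }
    # Pass 2: rebuild each line, replacing every ${name} with its own value.
    {
      out = ""
      while (match($0, /\$\{[A-Za-z]+\}/)) {
        ref = substr($0, RSTART, RLENGTH)            # the whole ${name} token
        name = substr(ref, 3, RLENGTH - 3)           # the name without ${ and }
        out = out substr($0, 1, RSTART - 1) ((name in params) ? params[name] : ref)
        $0 = substr($0, RSTART + RLENGTH)
      }
      print out $0
    }
  ' "$1" "$1"
}
```

Building the output line by concatenation, rather than with gsub(), also means that a value containing & is inserted literally instead of being treated as "the matched text".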

Answer 1: score 4, accepted, 2014-04-25 13:54:45

Answer 2: score 2, 2014-04-25 14:53:58
