Bash/*NIX: split a file into multiple files on a substring

Variants of this question have been asked and answered before, but I find that my sed/grep/awk skills are far too rudimentary to work from those to a custom solution since I hardly ever work in shell scripts.

I have a rather large (100K+ lines) text file in which each line defines a GeoJSON object, each such object including a property called "county" (there are, all told, 100 different counties). Here's a snippet:

{"type": "Feature", "properties": {"county":"ALAMANCE", "vBLA": 0, "vWHI": 4, "vDEM": 0, "vREP": 2, "vUNA": 2, "vTOT": 4}, "geometry": {"type":"Polygon","coordinates":[[[-79.537429,35.843303],[-79.542428,35.843303],[-79.542428,35.848302],[-79.537429,35.848302],[-79.537429,35.843303]]]}},
{"type": "Feature", "properties": {"county":"NEW HANOVER", "vBLA": 0, "vWHI": 0, "vDEM": 0, "vREP": 0, "vUNA": 0, "vTOT": 0}, "geometry": {"type":"Polygon","coordinates":[[[-79.532429,35.843303],[-79.537428,35.843303],[-79.537428,35.848302],[-79.532429,35.848302],[-79.532429,35.843303]]]}},
{"type": "Feature", "properties": {"county":"ALAMANCE", "vBLA": 0, "vWHI": 0, "vDEM": 0, "vREP": 0, "vUNA": 0, "vTOT": 0}, "geometry": {"type":"Polygon","coordinates":[[[-79.527429,35.843303],[-79.532428,35.843303],[-79.532428,35.848302],[-79.527429,35.848302],[-79.527429,35.843303]]]}},

I need to split this into 100 separate files, each containing one county's GeoJSONs, and each named xxxx_bins_2016.json (where xxxx is the county's name). I'd also like the final character (comma) at the end of each such file to go away.

I'm doing this in Mac OSX, if that matters. I hope to learn a lot by studying any solutions you could suggest, so if you feel like taking the time to explain the 'why' as well as the 'what' that would be fantastic. Thanks!

EDITED to make clear that there are different county names, some of them two-word names.

jq can kind of do this; it can group the input and output one line of text per group. The shell then takes care of writing each line to an appropriately named file. jq itself doesn't really have the ability to open files for writing that would allow you to do this in a single process.

jq -Rn -c '[inputs[:-1]|fromjson] | group_by(.properties.county)[]' tmp.json |
  while IFS= read -r line; do
    county=$(jq -r '.[0].properties.county' <<< "$line")
    jq -r '.[]' <<< "$line" > "$county.txt"
done

[inputs[:-1]|fromjson] reads each line of your file as a string, strips the trailing comma, then parses the line as JSON and wraps the lines into a single array. The resulting array is sorted and grouped by county name, then written to standard output, one group per line.
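
For the three sample lines shown in the question, the grouped output of that first jq call would look roughly like this (heavily abbreviated here), one compact array per county:

[{"type":"Feature","properties":{"county":"ALAMANCE",...},"geometry":{...}},{"type":"Feature","properties":{"county":"ALAMANCE",...},"geometry":{...}}]
[{"type":"Feature","properties":{"county":"NEW HANOVER",...},"geometry":{...}}]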

The shell loop reads each line, extracts the county name from the first element of the group with a call to jq, then uses jq again to write each element of the group to the appropriate file, again one element per line.

(A quick look at https://github.com/stedolan/jq/issues doesn't appear to show any requests yet for an output function that would let you open and write to a file from inside a jq filter. I'm thinking of something like

jq -Rn '... | group_by(.properties.county) | output("\(.properties.county).txt")' tmp.json

without the need for the shell loop.)

If using string parsing rather than proper JSON parsing to extract the county name is acceptable - brittle in general, but would work in this simple case - consider Sam Tolton's GNU awk answer, which has the potential to be by far the simplest and fastest solution.

To complement chepner's excellent answer with a variation that focuses on performance:

jq -Rrn '[inputs[:-1]|fromjson] | .[] | .properties.county + "|" + (.|tostring)' file |
  awk -F'|' '{ print $2 > ($1 "_bins_2016.json") }'

Shell loops are avoided altogether, which should speed up the operation.

The general idea is:

  • Use jq to trim the trailing , from each input line, interpret the trimmed string as JSON, extract the county name, then output the trimmed JSON strings prepended with the county name and a distinct separator, |.

  • Use an awk command to split each line into the prepended county name and the trimmed JSON string, which allows awk to easily construct the output filename and write the JSON string to it (see the abbreviated example below).
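
With the sample input from the question, the intermediate lines handed to awk would look something like this (abbreviated):

ALAMANCE|{"type":"Feature","properties":{"county":"ALAMANCE",...},"geometry":{...}}
NEW HANOVER|{"type":"Feature","properties":{"county":"NEW HANOVER",...},"geometry":{...}}
ALAMANCE|{"type":"Feature","properties":{"county":"ALAMANCE",...},"geometry":{...}}

awk then splits each line at the | separator: $1 is the county name used to build the filename, and $2 is the JSON text written to it.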

Note: The awk command keeps all output files open until the script has finished, which means that, in your case, 100 output files will be open simultaneously - a number that shouldn't be a problem, however.

In cases where it is a problem, you can use the following variation, in which jq first sorts the lines by county name, which then allows awk to immediately close the previous output file whenever the next county is reached in the input:

jq -Rrn '
  [inputs[:-1]|fromjson] | sort_by(.properties.county)[] | 
    .properties.county + "|" + (.|tostring)
' file | 
   awk -F'|' '
    prevCounty != $1 { if (outFile) close(outFile); outFile = $1 "_bins_2016.json" }
    { print $2 > outFile; prevCounty = $1  }
  '

A simpler version of chepner's answer:

while IFS= read -r line
do 
    countyName=$(jq --raw-output '.properties.county' <<<"${line: : -1}")
    jq '.' <<< "${line: : -1}" >> "$countyName"_bins_2016.json
done<file

The idea is to filter out the county name using a jq filter after stripping the trailing , from each line of your input file. Then the line is passed to jq as a plain stream to produce a JSON file in prettified format.

If you are on a relatively older version of bash (< 4.0), use "${line%?}" instead of "${line: : -1}".

For example, with the change above, one of your county files becomes:

cat ALAMANCE_bins_2016.json
{
  "type": "Feature",
  "properties": {
    "county": "ALAMANCE",
    "vBLA": 0,
    "vWHI": 0,
    "vDEM": 0,
    "vREP": 0,
    "vUNA": 0,
    "vTOT": 0
  },
  "geometry": {
    "type": "Polygon",
    "coordinates": [
      [
        [
          -79.527429,
          35.843303
        ],
        [
          -79.532428,
          35.843303
        ],
        [
          -79.532428,
          35.848302
        ],
        [
          -79.527429,
          35.848302
        ],
        [
          -79.527429,
          35.843303
        ]
      ]
    ]
  }
}

Note: The current solution could be performance intensive, as reading a file line by line is an expensive operation, as is invoking jq once for each of the lines.

This will do what you want minus getting rid of the last comma:-

gawk 'match($0, /"county":"([^"]+)/, array){ print > (array[1] "_bins_2016.json") }' INPUT_FILE

This will output files in the current path with a filename in the format COUNTY NAME_bins_2016.json.

The script goes line by line and uses a regex to match the exact term "county":" followed by 1 or more characters that aren't a ". It captures the characters within the quotes and then uses them as part of the filename to append the current line to.
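
For instance, with the three sample lines from the question, the command would leave two files in the current directory (note that the space in the second county name carries through into the filename):

ALAMANCE_bins_2016.json
NEW HANOVER_bins_2016.json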

To remove the trailing comma from all .json files in the current path you could use:-

sed -i '$ s/,$//' *.json
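
One note for the asker's platform: the BSD sed that ships with Mac OSX requires an explicit (possibly empty) backup-suffix argument after -i, so there the command would need to be written as:

sed -i '' '$ s/,$//' *.json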

If you were certain that the last char was always a comma, a faster solution would be to use truncate:-

truncate -s-1 *.json

Last part taken from this answer: https://stackoverflow.com/a/40568723/1453798

Here is a quickie script that will do the job. It has the virtue of working on most systems without having to install any other tools.

IFS=$'\n'
counties=( $( sed 's/^.*"county":"//;s/".*$//' counties.txt ) )
unset IFS

for i in "${!counties[@]}"
do
  county="${counties[$i]}"
  filename="$county".out.txt
  echo "'$filename'"
  grep "\"$county\"" counties.txt > "$filename"
done

The setting of IFS to $'\n' allows the array elements to contain spaces. The sed command strips off all the text up to the start of the county name and all the text after it. The for loop is the form that allows iterating over the array. Finally, the grep command needs to have double quotes around the search string so that counties that are substrings of other counties don't accidentally get put into the wrong file.
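
One thing to keep in mind: the counties array ends up with one entry per input line, so the loop runs 100K+ times and regenerates each county's file over and over. If that matters, a small variation (a sketch, not part of the original script) deduplicates the names first so the loop only runs once per county:

IFS=$'\n'
# sort -u collapses the 100K+ extracted names down to the ~100 distinct counties
counties=( $( sed 's/^.*"county":"//;s/".*$//' counties.txt | sort -u ) )
unset IFS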

See this section of the GNU BASH Reference Manual for more info.
