Bash/*NIX: split a file into multiple files on a substring
Variants of this question have been asked and answered before, but my sed/grep/awk skills are too rudimentary to adapt those answers into a custom solution, since I hardly ever work with shell scripts.
I have a rather large (100K+ lines) text file in which each line defines a GeoJSON object, and each object includes a property called "county" (there are 100 different counties in all). Here's a snippet:
{"type": "Feature", "properties": {"county":"ALAMANCE", "vBLA": 0, "vWHI": 4, "vDEM": 0, "vREP": 2, "vUNA": 2, "vTOT": 4}, "geometry": {"type":"Polygon","coordinates":[[[-79.537429,35.843303],[-79.542428,35.843303],[-79.542428,35.848302],[-79.537429,35.848302],[-79.537429,35.843303]]]}},
{"type": "Feature", "properties": {"county":"NEW HANOVER", "vBLA": 0, "vWHI": 0, "vDEM": 0, "vREP": 0, "vUNA": 0, "vTOT": 0}, "geometry": {"type":"Polygon","coordinates":[[[-79.532429,35.843303],[-79.537428,35.843303],[-79.537428,35.848302],[-79.532429,35.848302],[-79.532429,35.843303]]]}},
{"type": "Feature", "properties": {"county":"ALAMANCE", "vBLA": 0, "vWHI": 0, "vDEM": 0, "vREP": 0, "vUNA": 0, "vTOT": 0}, "geometry": {"type":"Polygon","coordinates":[[[-79.527429,35.843303],[-79.532428,35.843303],[-79.532428,35.848302],[-79.527429,35.848302],[-79.527429,35.843303]]]}},
I need to split this into 100 separate files, each containing one county's GeoJSON objects and each named xxxx_bins_2016.json (where xxxx is the county's name). I'd also like the final character (a comma) at the end of each such file to go away.
I'm doing this on Mac OS X, if that matters. I hope to learn a lot by studying any solutions you suggest, so if you feel like taking the time to explain the 'why' as well as the 'what', that would be fantastic. Thanks!
EDITED to make clear that there are different county names, some of them two-word names.
jq can kind of do this: it can group the input and output one line of text per group. The shell then takes care of writing each line to an appropriately named file. jq itself doesn't have a way to open files for writing, which is what you would need to do this in a single process.
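jq -Rn -c '[inputs[:-1]|fromjson] | group_by(.properties.county)[]' tmp.json |
while IFS= read -r line; do
  # Every element of a group has the same county, so read it from the first one.
  county=$(jq -r '.[0].properties.county' <<< "$line")
  # -c keeps each element on a single line in the output file.
  jq -c '.[]' <<< "$line" > "$county.txt"
done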
[inputs[:-1]|fromjson] reads each line of your file as a string, strips the trailing comma, parses the result as JSON, and collects the parsed objects into a single array. group_by then sorts and groups that array by county name, and each group is written to standard output, one group per line.
The shell loop reads each line, extracts the county name from the first element of the group with a call to jq, then uses jq again to write each element of the group to the appropriate file, again one element per line.
(A quick look at https://github.com/stedolan/jq/issues doesn't appear to show any requests yet for an output function that would let you open and write to a file from inside a jq filter. I'm thinking of something like
jq -Rn '... | group_by(.properties.county) | output("\(.properties.county).txt")' tmp.json
without the need for the shell loop.)
If using string parsing rather than proper JSON parsing to extract the county name is acceptable (brittle in general, but workable in this simple case), consider Sam Tolton's GNU awk answer, which has the potential to be by far the simplest and fastest solution.
To complement chepner's excellent answer with a variation that focuses on performance:
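# No [...] around the inputs here: each parsed object is processed on its own,
# so .properties.county is applied to an object rather than to an array.
jq -Rrn 'inputs[:-1] | fromjson | .properties.county + "|" + (.|tostring)' file |
awk -F'|' '{ print $2 > ($1 "_bins_2016.json") }'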
Shell loops are avoided altogether, which should speed up the operation.
The general idea is:
Use jq to trim the trailing , from each input line, interpret the trimmed string as JSON, extract the county name, then output each JSON string prefixed with the county name and a distinct separator, |.
Use an awk command to split each line into the county name and the JSON string, which allows awk to easily construct the output filename and write the JSON string to it.
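The awk half of this idea can be tried on its own, without jq. The following is a minimal sketch that feeds awk a few hand-written lines in the "county|JSON" shape described above; the demo directory name and the {"id":N} payloads are made up for illustration and are not part of the original data.

```shell
#!/usr/bin/env bash
# Hypothetical scratch directory for the demo (not from the original answer).
demo="${TMPDIR:-/tmp}/county_split_demo"
rm -rf "$demo"
mkdir -p "$demo"

(
  cd "$demo" || exit 1
  # Made-up sample lines: county name, "|" separator, then a JSON string.
  printf '%s\n' \
    'ALAMANCE|{"id":1}' \
    'NEW HANOVER|{"id":2}' \
    'ALAMANCE|{"id":3}' |
  awk -F'|' '{
    # $1 is the county name, $2 the JSON string.
    # The parentheses keep the concatenated filename unambiguous after ">".
    print $2 > ($1 "_bins_2016.json")
  }'
)

# List the files that were created, one per county.
ls "$demo"
```

Within a single awk run, > truncates a file only the first time it is opened, so later lines for the same county are added to the same file rather than overwriting it.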
Note: The awk command keeps all output files open until the script has finished, which means that, in your case, 100 output files will be open simultaneously. That number shouldn't be a problem, however.
In cases where it is a problem, you can use the following variation, in which jq first sorts the lines by county name, which then allows awk to close the previous output file as soon as the next county is reached in the input:
jq -Rrn '
[inputs[:-1]|fromjson] | sort_by(.properties.county)[] |
.properties.county + "|" + (.|tostring)
' file |
awk -F'|' '
prevCounty != $1 { if (outFile) close(outFile); outFile = $1 "_bins_2016.json" }
{ print $2 > outFile; prevCounty = $1 }
'
A simpler version of chepner's answer:
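while IFS= read -r line
do
  # "${line: : -1}" is the line without its last character (the trailing comma).
  countyName=$(jq --raw-output '.properties.county' <<< "${line: : -1}")
  # The explicit '.' filter passes the object through and pretty-prints it.
  jq '.' <<< "${line: : -1}" >> "${countyName}_bins_2016.json"
done < file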
The idea is to strip the trailing , from each line of your input file and then extract the county name with a jq filter. The line is then passed through jq unchanged to produce a JSON file in prettified format.
If you are on a relatively old version of bash (< 4.2, which includes the bash 3.2 that ships with macOS), use "${line%?}" instead of "${line: : -1}".
For example, with the change above, one of your county files looks like this:
cat ALAMANCE_bins_2016.json
{
"type": "Feature",
"properties": {
"county": "ALAMANCE",
"vBLA": 0,
"vWHI": 0,
"vDEM": 0,
"vREP": 0,
"vUNA": 0,
"vTOT": 0
},
"geometry": {
"type": "Polygon",
"coordinates": [
[
[
-79.527429,
35.843303
],
[
-79.532428,
35.843303
],
[
-79.532428,
35.848302
],
[
-79.527429,
35.848302
],
[
-79.527429,
35.843303
]
]
]
}
}
Note: This solution can be slow, because the shell reads the file line by line and invokes jq twice for each line.
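The ways of dropping the last character mentioned above can be compared directly. This is a small sketch using standard suffix-removal expansions, which should also work in older shells; the sample string and variable names are invented for the demo. The ${line%,} form is an addition not in the answer: it removes the last character only when it is a comma.

```shell
#!/usr/bin/env bash
# Made-up sample line ending in a comma, as in the question's data.
line='{"county":"ALAMANCE"},'

# ${line%?} removes the shortest suffix matching "?" (any single character).
trimmed_any=${line%?}

# ${line%,} removes the last character only if it is a comma,
# so a line that has no trailing comma is left intact.
trimmed_comma=${line%,}
no_comma='{"county":"ALAMANCE"}'
unchanged=${no_comma%,}

printf '%s\n' "$trimmed_any" "$trimmed_comma" "$unchanged"
```

All three printed lines are the same object without a trailing comma. The comma-specific form is the safer choice if the final line of the input might not end in a comma.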
This will do what you want, apart from getting rid of the last comma:
gawk 'match($0, /"county":"([^"]+)/, array) { print > (array[1] "_bins_2016.json") }' INPUT_FILE
This writes files to the current directory with filenames in the format COUNTY NAME_bins_2016.json.
The script goes line by line and uses a regex to match the exact string "county":" followed by one or more characters that aren't a ". It captures the characters within the quotes and uses them as part of the name of the file to which the current line is appended. The three-argument form of match() is a gawk extension, so this needs GNU awk rather than the awk that ships with macOS.
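If gawk is not installed, the same matching idea can be expressed with the two-argument match() and the RSTART/RLENGTH variables that POSIX awk provides. This is a sketch under that assumption, not the answer's own command; the demo directory and the shortened sample records are made up.

```shell
#!/usr/bin/env bash
# Hypothetical scratch directory and sample data for the demo.
demo="${TMPDIR:-/tmp}/county_match_demo"
rm -rf "$demo"
mkdir -p "$demo"

cat > "$demo/input.json" <<'EOF'
{"type": "Feature", "properties": {"county":"ALAMANCE", "vTOT": 4}},
{"type": "Feature", "properties": {"county":"NEW HANOVER", "vTOT": 0}},
{"type": "Feature", "properties": {"county":"ALAMANCE", "vTOT": 0}},
EOF

(
  cd "$demo" || exit 1
  awk '
    match($0, /"county":"[^"]+/) {
      # The match covers the 10-character prefix "county":" plus the name,
      # so skip the prefix to keep only the name.
      name = substr($0, RSTART + 10, RLENGTH - 10)
      print > (name "_bins_2016.json")
    }
  ' input.json
)

ls "$demo"
```

As with the gawk version, the lines are copied unchanged, so the trailing commas are still present in the output files.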
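To remove the trailing comma from all .json files in the current directory, you could use the following (this is GNU sed syntax; the BSD sed that ships with macOS requires an explicit backup-suffix argument, as in sed -i '' ...):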
sed -i '$ s/,$//' *.json
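If you were certain that the last character of each file was always a comma, a faster solution would be to use truncate. Note that truncate removes the last byte, so if a file ends with a newline, it is the newline that gets removed, not the comma: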
truncate -s-1 *.json
Last part taken from this answer: https://stackoverflow.com/a/40568723/1453798
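Because the in-place flag differs between GNU and BSD sed, one way to sidestep the difference is to write to a temporary file and move it back. The following is a sketch of that approach, not part of the linked answer; the demo directory and file contents are invented.

```shell
#!/usr/bin/env bash
# Hypothetical scratch directory and a made-up two-line county file.
demo="${TMPDIR:-/tmp}/county_comma_demo"
rm -rf "$demo"
mkdir -p "$demo"
printf '%s\n' '{"id":1},' '{"id":2},' > "$demo/ALAMANCE_bins_2016.json"

for f in "$demo"/*_bins_2016.json; do
  # The "$" address selects the last line only; s/,$// drops a comma at its end.
  sed '$ s/,$//' "$f" > "$f.tmp" && mv "$f.tmp" "$f"
done

cat "$demo/ALAMANCE_bins_2016.json"
```

Only the final line loses its comma; the commas that separate earlier records are kept.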
Here is a quick script that will do the job. It has the virtue of working on most systems without having to install any other tools.
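IFS=$'\n'
# sort -u keeps one copy of each county name, so each file is written only once.
counties=( $( sed 's/^.*"county":"//;s/".*$//' counties.txt | sort -u ) )
unset IFS
# "${!counties[@]}" expands to the array indices, so loop over i and look up the name.
for i in "${!counties[@]}"
do
  county="${counties[$i]}"
  filename="$county".out.txt
  echo "'$filename'"
  grep "\"$county\"" counties.txt > "$filename"
done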
Setting IFS to \n allows the array elements to contain spaces. The sed command strips off all the text up to the start of the county name and all the text after it. The for loop iterates over the indices of the array (that is what "${!counties[@]}" expands to). Finally, the grep command needs to have double quotes around the search string so that counties whose names are substrings of other county names don't accidentally get put into the wrong file.
See the relevant section of the GNU Bash Reference Manual for more info.
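The effect of the double quotes in the grep pattern can be seen with a small made-up example. The county name HANOVER below is hypothetical and chosen only because it is a substring of NEW HANOVER; the demo directory is likewise an assumption.

```shell
#!/usr/bin/env bash
# Hypothetical scratch directory and sample data for the demo.
demo="${TMPDIR:-/tmp}/county_grep_demo"
rm -rf "$demo"
mkdir -p "$demo"
printf '%s\n' \
  '{"properties": {"county":"HANOVER"}},' \
  '{"properties": {"county":"NEW HANOVER"}},' > "$demo/counties.txt"

county='HANOVER'
# Without the surrounding quotes the pattern also matches NEW HANOVER.
unquoted_hits=$(grep -c "$county" "$demo/counties.txt")
# With the quotes only the exact county name matches.
quoted_hits=$(grep -c "\"$county\"" "$demo/counties.txt")

echo "without quotes: $unquoted_hits, with quotes: $quoted_hits"
```

The unquoted pattern counts both lines, while the quoted pattern counts only the line whose county is exactly HANOVER.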