
Bash: Loop Read N lines at a time from CSV

I have a CSV file of 100,000 IDs

wef7efwe1fwe8
wef7efwe1fwe3
ewefwefwfwgrwergrgr

that are being transformed into a JSON object using jq:

output=$(jq -Rsn '
{"id":
  [inputs
    | . / "\n"
    | (.[] | select(length > 0) | . / ";") as $input
    | $input[0]]}
' <$FILE)

Output:

{
  "id": [
         "wef7efwe1fwe8",
         "wef7efwe1fwe3",
         ....
   ]
}

Currently, I need to manually split the file into smaller 10,000-line files, because the API call has a limit.

I would like a way to automatically loop through the large file, using only 10,000 lines at a time as $FILE, until the end of the list.

I would use the split command and write a little shell script around it:

#!/bin/bash
input_file=ids.txt
temp_dir=splits
api_limit=10000

# Make sure that there are no leftovers from previous runs
rm -rf "${temp_dir}"
# Create temporary folder for splitting the file
mkdir "${temp_dir}"
# Split the input file based on the api limit
split --lines "${api_limit}" "${input_file}" "${temp_dir}/"

# Iterate through splits and make an api call per split
for split in "${temp_dir}"/* ; do
    jq -Rsn '
        {"id":
          [inputs
            | . / "\n"
            | (.[] | select(length > 0) | . / ";") as $input
            | $input[0]]
        }' "${split}" > api_payload.json

    # now do something ...
    # curl -d @api_payload.json http://...

    rm -f api_payload.json
done

# Clean up
rm -rf "${temp_dir}"
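For reference, here is a minimal sketch (using seq to fabricate input) of how split chunks and names its output files; the file names below assume GNU split's default two-letter suffixes:

```shell
#!/bin/sh
# Minimal sketch: split 25 fabricated IDs into chunks of 10.
seq 25 > ids.txt
mkdir -p splits
split --lines 10 ids.txt splits/
# GNU split produces splits/aa (10 lines), splits/ab (10), splits/ac (5).
wc -l splits/*
```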

Here's a simple and efficient solution that at its core just uses jq. It takes advantage of the -c command-line option. I've used xargs printf ... for illustration, mainly to show how easy it is to set up a shell pipeline.

< data.txt jq -Rnc '
  def batch($n; stream):
    def b: [limit($n; stream)]
    | select(length > 0)
    | (., b);
    b;

  {id: batch(10000; inputs | select(length>0) | (. / ";")[0])}
' | xargs printf "%s\n"
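To make the batching concrete, here is a hypothetical small-scale run of the same program with a batch size of 2 on three fabricated `id;field` lines (the file name data.txt and the field values are made up for illustration):

```shell
#!/bin/sh
# Fabricated three-line input in the same id;extra format as the question.
printf 'a;x\nb;y\nc;z\n' > data.txt

# Same batch filter as above, but with a batch size of 2:
jq -Rnc '
  def batch($n; stream):
    def b: [limit($n; stream)]
    | select(length > 0)
    | (., b);
    b;
  {id: batch(2; inputs | select(length>0) | (. / ";")[0])}
' < data.txt
# Emits two JSON objects: {"id":["a","b"]} and then {"id":["c"]}.
```

Each call to b re-evaluates limit($n; …), and since inputs is stateful, every batch consumes the next $n lines of input.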

Parameterizing batch size

It might make sense to set things up so that the batch size is specified outside the jq program. This could be done in numerous ways, e.g. by invoking jq along the lines of:

jq --argjson n 10000 ....

and of course using $n instead of 10000 in the jq program.
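Putting that together, a parameterized invocation might look like the following sketch (batch_size is an assumed shell variable, and the sample lines written to data.txt are fabricated):

```shell
#!/bin/sh
# Fabricated sample input.
printf 'wef7efwe1fwe8\nwef7efwe1fwe3\n' > data.txt

# Assumed shell variable holding the API limit.
batch_size=10000

# Pass the batch size in from the shell via --argjson;
# inside batch, the parameter $n shadows the global $n.
< data.txt jq --argjson n "$batch_size" -Rnc '
  def batch($n; stream):
    def b: [limit($n; stream)]
    | select(length > 0)
    | (., b);
    b;
  {id: batch($n; inputs | select(length>0) | (. / ";")[0])}
'
```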

Why “def b:”?

For efficiency: jq's TCO (tail call optimization) only works for arity-0 filters.

Note on -s

In the Q as originally posted, the command-line options -sn are used in conjunction with inputs. Using -s with inputs defeats the whole purpose of inputs, which is to make it possible to process input in a stream-oriented way (i.e. one line of input, or one JSON entity, at a time).
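A quick way to see the difference, on a hypothetical two-line input: with -s, inputs yields the whole input as a single slurped string, whereas without it each line arrives separately:

```shell
#!/bin/sh
# With -Rsn, `inputs` yields the entire input as one slurped string:
printf 'a\nb\n' | jq -Rsn '[inputs] | length'   # prints 1
# With -Rn, `inputs` yields one line at a time:
printf 'a\nb\n' | jq -Rn '[inputs] | length'    # prints 2
```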
