Bash: read N lines at a time from a CSV in a loop

Question

I have a CSV file containing 100000 IDs:

wef7efwe1fwe8
wef7efwe1fwe3
ewefwefwfwgrwergrgr

I am converting it to a JSON object using jq:

output=$(jq -Rsn '
{"id":
  [inputs
    | . / "\n"
    | (.[] | select(length > 0) | . / ";") as $input
    | $input[0]]}
' <$FILE)

Output:

{
  "id": [
         "wef7efwe1fwe8",
         "wef7efwe1fwe3",
         ....
   ]
}

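Currently, I have to split the file manually into smaller files of 10000 lines each, because the API call has a limit.

I would like a way to loop over the large file automatically, using only 10000 lines at a time as $FILE, until the end of the list.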

Answer 1

I would use the split command and write a small shell script around it:

#!/bin/bash
input_file=ids.txt
temp_dir=splits
api_limit=10000

# Make sure that there are no leftovers from previous runs
rm -rf "${temp_dir}"
# Create temporary folder for splitting the file
mkdir "${temp_dir}"
# Split the input file based on the api limit
split --lines "${api_limit}" "${input_file}" "${temp_dir}/"

# Iterate through splits and make an api call per split
for split in "${temp_dir}"/* ; do
    jq -Rsn '
        {"id":
          [inputs
            | . / "\n"
            | (.[] | select(length > 0) | . / ";") as $input
            | $input[0]]
        }' "${split}" > api_payload.json

    # now do something ...
    # curl -d @api_payload.json http://...

    rm -f api_payload.json
done

# Clean up
rm -rf "${temp_dir}"
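
If the temporary files are not wanted, roughly the same loop can be written with bash's mapfile builtin, which can read a fixed number of lines per call. This is only a minimal sketch, assuming bash 4 or later; the demo input, the batch size of 10, and the process_batch function are hypothetical stand-ins. In real use, process_batch would feed its lines to the jq program and make the API call.

```shell
#!/bin/bash
# Sketch: read a file N lines at a time with mapfile (bash 4+), no temp files.

# Hypothetical demo input: 25 IDs, one per line.
input_file=$(mktemp)
seq -f 'id%g' 1 25 > "${input_file}"

batch_size=10   # would be 10000 for the API limit in the question

# Hypothetical placeholder: receives one batch as positional arguments.
# In real use, pass the lines to jq and send the result to the API.
process_batch() {
    echo "batch of $# lines, first=$1, last=${!#}"
}

batch_log=()
# mapfile reads at most batch_size lines per call; an empty array means EOF.
while mapfile -t -n "${batch_size}" lines && (( ${#lines[@]} > 0 )); do
    batch_log+=("$(process_batch "${lines[@]}")")
done < "${input_file}"

printf '%s\n' "${batch_log[@]}"
rm -f "${input_file}"
```

The last batch is simply shorter than the others, so no special handling is needed when the line count is not a multiple of the batch size.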

Answer 2

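Here is a simple and efficient solution that, at its core, uses only jq. It takes advantage of the -c command-line option. I have used xargs printf ... for illustration, mainly to show how easy it is to set up a shell pipeline.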

< data.txt jq -Rnc '
  def batch($n; stream):
    def b: [limit($n; stream)]
    | select(length > 0)
    | (., b);
    b;

  {id: batch(10000; inputs | select(length>0) | (. / ";")[0])}
' | xargs printf "%s\n"
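
Because -c prints each batch as one compact JSON object per line, the output can also be consumed line by line in a shell loop. The following is a minimal sketch, assuming bash and jq are available; the six-line demo input, the batch size of 4, and the commented-out curl call are illustrative only.

```shell
#!/bin/bash
# Sketch: one API payload per line of jq -c output.

payloads=()
while IFS= read -r payload; do
    payloads+=("${payload}")
    # curl -d "${payload}" http://...   # the API call would go here
done < <(
    # Hypothetical demo input in "id;other" form.
    printf '%s;x\n' a b c d e f |
    jq -Rnc '
      def batch($n; stream):
        def b: [limit($n; stream)]
        | select(length > 0)
        | (., b);
        b;

      {id: batch(4; inputs | select(length > 0) | (. / ";")[0])}
    '
)

printf '%s\n' "${payloads[@]}"
```

With six input lines and a batch size of 4, the loop body runs twice: once for the first four IDs and once for the remaining two.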

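Parameterizing the batch size

It might make sense to set things up so that the batch size is specified outside the jq program. This can be done in several ways, e.g. by invoking jq along the following lines: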

jq --argjson n 10000 ....

and, of course, using $n instead of 10000 in the jq program.
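
Put together, a full invocation might look like the sketch below. It assumes bash and jq; the shell variable batch_size and the three-line demo input are hypothetical. Note that --argjson parses its value as JSON, so $n is a number inside the program, whereas --arg would bind it as a string.

```shell
#!/bin/bash
# Sketch: batch size supplied from the shell via --argjson.
batch_size=2   # hypothetical value for the demo

result=$(
    printf '%s;x\n' a b c |
    jq -Rnc --argjson n "${batch_size}" '
      def batch($n; stream):
        def b: [limit($n; stream)]
        | select(length > 0)
        | (., b);
        b;

      {id: batch($n; inputs | select(length > 0) | (. / ";")[0])}
    '
)
echo "${result}"
```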

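Why "def b:"?

For efficiency. jq's TCO (tail-call optimization) only works for arity-0 filters.

A note on -s

In the question as originally posted, the command-line options -sn are used in conjunction with inputs. Using -s with inputs defeats the whole purpose of inputs, which is to make it possible to process the input in a stream-oriented way (i.e. one line of input, or one JSON entity, at a time).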
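
The difference is easy to see on a two-line input. This is a small sketch, assuming bash and jq; the variable names are arbitrary.

```shell
#!/bin/bash
# Sketch: what inputs yields with and without -s in raw-input mode.
slurped=$(printf 'a\nb\n' | jq -Rsnc '[inputs]')
streamed=$(printf 'a\nb\n' | jq -Rnc '[inputs]')

echo "with -s:    ${slurped}"
echo "without -s: ${streamed}"
```

With -s, inputs produces a single string holding the whole file, which is why the original program has to split it on "\n" itself. Without -s, inputs produces one string per line.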

Answer 1
1 Accepted 2020-08-29 17:35:35

Answer 2
1 2020-08-30 01:49:59
