
Bash: Loop to read N lines at a time from a CSV

I have a CSV file of 100,000 IDs:

wef7efwe1fwe8
wef7efwe1fwe3
ewefwefwfwgrwergrgr

that are being transformed into a JSON object using jq:

output=$(jq -Rsn '
{"id":
  [inputs
    | . / "\n"
    | (.[] | select(length > 0) | . / ";") as $input
    | $input[0]]}
' <$FILE)

Output:

{
  "id": [
         "wef7efwe1fwe8",
         "wef7efwe1fwe3",
         ....
   ]
}

Currently I have to manually split the file into smaller 10,000-line files, because the API call has a limit.

I would like a way to automatically loop through the large file and use only 10,000 lines at a time as $FILE, up until the end of the list.

I would use the split command and write a little shell script around it:

#!/bin/bash
input_file=ids.txt
temp_dir=splits
api_limit=10000

# Make sure that there are no leftovers from previous runs
rm -rf "${temp_dir}"
# Create temporary folder for splitting the file
mkdir "${temp_dir}"
# Split the input file based on the api limit
split --lines "${api_limit}" "${input_file}" "${temp_dir}/"

# Iterate through splits and make an api call per split
for split in "${temp_dir}"/* ; do
    jq -Rsn '
        {"id":
          [inputs
            | . / "\n"
            | (.[] | select(length > 0) | . / ";") as $input
            | $input[0]]
        }' "${split}" > api_payload.json

    # now do something ...
    # curl -d @api_payload.json http://...

    rm -f api_payload.json
done

# Clean up
rm -rf "${temp_dir}"
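
For the "now do something" step, a typical curl call that reads the generated payload from the file could look like the following sketch (the URL is a placeholder; substitute the real endpoint and any required auth headers):

# hypothetical endpoint; replace with the real API URL and headers
curl -H "Content-Type: application/json" \
     -d @api_payload.json \
     "https://api.example.com/v1/ids"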

Here's a simple and efficient solution that at its core just uses jq. It takes advantage of the -c command-line option. I've used xargs printf ... for illustration - mainly to show how easy it is to set up a shell pipeline.

< data.txt jq -Rnc '
  def batch($n; stream):
    def b: [limit($n; stream)]
    | select(length > 0)
    | (., b);
    b;

  {id: batch(10000; inputs | select(length>0) | (. / ";")[0])}
' | xargs printf "%s\n"
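
To see the batching in action, here is the same program run on the three sample IDs from the question with a batch size of 2, and without the xargs stage so the JSON is shown verbatim:

printf 'wef7efwe1fwe8\nwef7efwe1fwe3\newefwefwfwgrwergrgr\n' | jq -Rnc '
  def batch($n; stream):
    def b: [limit($n; stream)] | select(length > 0) | (., b);
    b;
  {id: batch(2; inputs | select(length>0) | (. / ";")[0])}
'

which prints one compact JSON object per batch:

{"id":["wef7efwe1fwe8","wef7efwe1fwe3"]}
{"id":["ewefwefwfwgrwergrgr"]}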

Parameterizing the batch size

It might make sense to set things up so that the batch size is specified outside the jq program. This could be done in numerous ways, e.g. by invoking jq along the lines of:

jq --argjson n 10000 ....

and of course using $n instead of 10000 in the jq program.
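
A sketch of the full invocation with the batch size passed in from the shell (the program body is unchanged apart from the literal 10000 becoming $n):

< data.txt jq --argjson n 10000 -Rnc '
  def batch($n; stream):
    def b: [limit($n; stream)] | select(length > 0) | (., b);
    b;

  {id: batch($n; inputs | select(length>0) | (. / ";")[0])}
'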

Why “def b:”?

For efficiency. jq's TCO (tail call optimization) only works for arity-0 filters.
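
To illustrate (a sketch for comparison, not part of the answer's code): a version of batch that recurses on itself is functionally equivalent but not eligible for TCO, because the recursive filter has arity 2; routing the recursion through the arity-0 inner filter b keeps the stack flat on large inputs.

# recursing on batch/2 itself: works, but arity-2 recursion is not tail-call optimized
def batch($n; stream):
  [limit($n; stream)] | select(length > 0) | (., batch($n; stream));

# recursing via the arity-0 inner filter b: eligible for jq's TCO
def batch($n; stream):
  def b: [limit($n; stream)] | select(length > 0) | (., b);
  b;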

Note on -s

In the Q as originally posted, the command-line options -sn are used in conjunction with inputs. Using -s with inputs defeats the whole purpose of inputs, which is to make it possible to process input in a stream-oriented way (i.e. one line of input or one JSON entity at a time).
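
A quick way to see the difference (a small demonstration, not from the original post): with -s the raw input is slurped into a single string, so inputs yields just one value, whereas without -s it yields one line at a time.

printf 'a\nb\n' | jq -Rsn '[inputs] | length'   # prints 1: the whole file is one string
printf 'a\nb\n' | jq -Rn  '[inputs] | length'   # prints 2: one string per line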
