Basic bash-only: Insert/update .csv file with another .csv file via primary-key with unknown schema

Currently we need to merge massive CSV files (multiple GB each). Right now I use Java code I wrote to dynamically update the data, but its performance is extremely poor. Due to our given architecture we are forced to use either that or a bash script. We can't install any packages, so csvtools is out of reach as well; we can only use what ships with Ubuntu.

What we have: two CSV files with an unknown schema. We only know that the first column is always the ID, which is the primary key. Both compared files always share the same schema. Rows whose ID already exists must be updated from the second file, and rows whose ID does not exist yet must be inserted.

Example 1:

example1.csv

id|kv1|kv2|kv3|kv4|kv5|kv6|kv7|kv8|kv9|kv10|kv11
519712|3|57181|5|"another value"|"yay string"|||"another string"|2020-08-11 00:00:00.0000000||2000-11-15 14:16:51.0000000
519713|3|57281|5|"another value"|"yay string"|||"another string"|2020-08-11 00:00:00.0000000||2000-11-15 14:16:51.0000000
519714|3|57481|5|"another value"|"yay string"|||"another string2"|2020-08-11 00:00:00.0000000||2000-11-15 14:14:51.0000000

example2.csv

id|kv1|kv2|kv3|kv4|kv5|kv6|kv7|kv8|kv9|kv10|kv11
519712|3|57181|5|"I changed value"|"yay string"|||"another string"|2020-08-11 00:00:00.0000000||2000-11-15 14:16:51.0000000
519715|3|57281|5|"another value"|"yay string"|||"another string"|2020-08-11 00:00:00.0000000||2000-11-15 14:16:51.0000000
519716|3|57481|5|"another value"|"yay string"|||"another string2"|2020-08-11 00:00:00.0000000||2000-11-15 14:14:51.0000000

Expected output:

id|kv1|kv2|kv3|kv4|kv5|kv6|kv7|kv8|kv9|kv10|kv11
519712|3|57181|5|"I changed value"|"yay string"|||"another string"|2020-08-11 00:00:00.0000000||2000-11-15 14:16:51.0000000
519713|3|57281|5|"another value"|"yay string"|||"another string"|2020-08-11 00:00:00.0000000||2000-11-15 14:16:51.0000000
519714|3|57481|5|"another value"|"yay string"|||"another string2"|2020-08-11 00:00:00.0000000||2000-11-15 14:14:51.0000000
519715|3|57281|5|"another value"|"yay string"|||"another string"|2020-08-11 00:00:00.0000000||2000-11-15 14:16:51.0000000
519716|3|57481|5|"another value"|"yay string"|||"another string2"|2020-08-11 00:00:00.0000000||2000-11-15 14:14:51.0000000

Example 2:

example1.csv

id|kv1|kv2|kv3|kv4|kv5|kv6|kv7|kv8|kv9|kv10
519712|3|57181|5|"another value"|"yay string"|||"another string"|2020-08-11 00:00:00.0000000|2000-11-15 14:16:51.0000000
519713|3|57281|5|"another value"|"yay string"|||"another string"|2020-08-11 00:00:00.0000000|2000-11-15 14:16:51.0000000
519714|3|57481|5|"another value"|"yay string"|||"another string2"|2020-08-11 00:00:00.0000000|2000-11-15 14:14:51.0000000

example2.csv

id|kv1|kv2|kv3|kv4|kv5|kv6|kv7|kv8|kv9|kv10
519712|3|57181|5|"I changed value"|"yay string"|||"another string"|2020-08-11 00:00:00.0000000|2000-11-15 14:16:51.0000000
519715|3|57281|5|"another value"|"yay string"|||"another string"|2020-08-11 00:00:00.0000000|2000-11-15 14:16:51.0000000
519716|3|57481|5|"another value"|"yay string"|||"another string2"|2020-08-11 00:00:00.0000000|2000-11-15 14:14:51.0000000

Expected output:

id|kv1|kv2|kv3|kv4|kv5|kv6|kv7|kv8|kv9|kv10
519712|3|57181|5|"I changed value"|"yay string"|||"another string"|2020-08-11 00:00:00.0000000|2000-11-15 14:16:51.0000000
519713|3|57281|5|"another value"|"yay string"|||"another string"|2020-08-11 00:00:00.0000000|2000-11-15 14:16:51.0000000
519714|3|57481|5|"another value"|"yay string"|||"another string2"|2020-08-11 00:00:00.0000000|2000-11-15 14:14:51.0000000
519715|3|57281|5|"another value"|"yay string"|||"another string"|2020-08-11 00:00:00.0000000|2000-11-15 14:16:51.0000000
519716|3|57481|5|"another value"|"yay string"|||"another string2"|2020-08-11 00:00:00.0000000|2000-11-15 14:14:51.0000000

I've already tried to adapt the approaches from related topics.

My attempts, for instance:

join -a1 -j 1 -t"|" <(sort example1.csv | sort -b) <(sort example2.csv | sort -b) | sort -n

But this is bugging out: it handles the primary key as a string instead of a numeric value (that seems to be the only way join works).
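
If I understand it correctly, join only needs its inputs sorted lexicographically on the join field, and the numeric order can be restored on the result afterwards. A rough, untested sketch of that sorting pattern with the headers stripped first (the final numeric re-sort is my assumption):

join -j 1 -t'|' <(tail -n +2 example1.csv | sort -t'|' -k1,1) <(tail -n +2 example2.csv | sort -t'|' -k1,1) | sort -t'|' -k1,1n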

join --header --nocheck-order -j 1 -t"|" example1.csv example2.csv

This was the most convenient approach, since it leaves the header alone and simply passes it through to the output file. Sadly, zero results.
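
I suspect --nocheck-order only suppresses the ordering check; join still assumes sorted input, and on unsorted files the output is generally incomplete, which presumably explains the zero results. Keeping --header but sorting the bodies first should look roughly like this (untested sketch):

join --header -j 1 -t'|' <(head -1 example1.csv; tail -n +2 example1.csv | sort -t'|' -k1,1) <(head -1 example2.csv; tail -n +2 example2.csv | sort -t'|' -k1,1)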

join -j 1 -t"|" <(sed 1,1d example1.csv |  sort --field-separator='|' -k 1b,1) <(sed 1,1d example2.csv |  sort --field-separator='|' -k 1b,1)

Even when I explicitly specify that everything has to be handled numerically, it doesn't work either.

Any idea?

Best regards

I resolved it myself. Since the ID is always the first column, I use it as a blacklist: rows of the original file whose ID also shows up in the merge file get filtered out and replaced by the merge file's rows.

#!/usr/bin/env bash
# INPUTFILE, MERGEFILE and OUTPUTFILE are assumed to come in as positional parameters.
INPUTFILE=$1
MERGEFILE=$2
OUTPUTFILE=$3
FILTER_TEMP_FILE=/PATH/blaa.temp
trap "{ rm -f $FILTER_TEMP_FILE; }" EXIT
# Number of columns = number of '|' separators in the header + 1 (wc -c also counts the newline).
NUM_COLUMNS=$(head -1 ${INPUTFILE} | sed 's/[^|]//g' | wc -c)
echo "Creating update-filter in ${FILTER_TEMP_FILE}"
# Build the -o layout 1.1,1.2,...,1.N so join emits every column of the first file.
OUTPUT_LAYOUT=1.1
for (( i=2; i <= $NUM_COLUMNS; ++i ))
do
    OUTPUT_LAYOUT=$OUTPUT_LAYOUT',1.'$i
done
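
For the 12-column schema from example 1, this produces OUTPUT_LAYOUT=1.1,1.2,1.3,1.4,1.5,1.6,1.7,1.8,1.9,1.10,1.11,1.12, i.e. join takes every field from the first file.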

After that I use that layout for the join command.

# Blacklist: every row of INPUTFILE whose ID also exists in MERGEFILE.
join -j 1 -t '|' -o $OUTPUT_LAYOUT <(sed 1,1d ${INPUTFILE} | sort --temporary-directory=/tmp/ --field-separator='|' -k 1b,1) <(sed 1,1d ${MERGEFILE} | sort --temporary-directory=/tmp/ --field-separator='|' -k 1b,1) > ${FILTER_TEMP_FILE}
# Keep only the rows of INPUTFILE whose ID (field 1) is not blacklisted.
awk -F'|' '(NR==FNR){a[$1];next}!($1 in a)' ${FILTER_TEMP_FILE} ${INPUTFILE} > ${OUTPUTFILE}
# Append all rows of MERGEFILE (updates and inserts) and restore numeric ID order.
sed 1,1d ${MERGEFILE} >> ${OUTPUTFILE}
sort --temporary-directory=/tmp/ -n ${OUTPUTFILE} -o ${OUTPUTFILE}

Note that for large files it can get problematic. I will need to adjust it so it only uses RAM (we have ~250 GB of RAM on our server, but barely any tmp space... quite the irony).
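
One way to do that, assuming the merge file fits into RAM and that '|' never occurs inside quoted fields, would be a single awk pass that loads the merge file into an array keyed by ID and then streams the base file once, with no temp files at all. A rough, untested sketch (the merged.csv name and the final re-sort are placeholders):

# In-memory upsert sketch: $1 = base file, $2 = merge file.
awk -F'|' '
    NR == FNR {                      # first file on the command line: the merge file
        if (FNR > 1) new[$1] = $0    # remember every row by its ID (skip the header)
        next
    }
    FNR == 1 { print; next }         # second file: emit the header unchanged
    $1 in new { print new[$1]; delete new[$1]; next }   # updated row wins
    { print }                        # ID only in the base file: keep the old row
    END { for (id in new) print new[id] }               # rows left over are inserts
' "$2" "$1" > merged.csv
# The inserts at the end come out in arbitrary order; re-sort numerically if needed:
# { head -1 merged.csv; tail -n +2 merged.csv | sort -t'|' -k1,1n; } > merged.sorted.csv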
