I want to combine the data of, say, 3 files that have the same columns and datatypes into a single file, which I can then use for further processing.
Currently I have to process the files one after the other, so I am looking for a solution I can write in a script to combine all the files into one.
For example:
File 1:
mike,sweden,2015
tom,USA,1522
raj,india,455
File 2:
a,xyz,155
b,pqr,3215
c,lmn,3252
Expected combined file 3:
mike,sweden,2015
tom,USA,1522
raj,india,455
a,xyz,155
b,pqr,3215
c,lmn,3252
Kindly help me with this.
A script like:
#!/bin/sh
sort "$1" "$2" | uniq > "$3"
should do the trick. sort sorts the concatenation of the two files (the first two arguments of the script) and passes the result to uniq,
which removes adjacent identical lines and writes the result to the third file (the third argument of the script).
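A quick way to try the pipeline with the sample data from the question (the file names here are illustrative; note that the combined output comes out sorted, not in the original input order):

```shell
# Recreate the sample input files from the question.
printf 'mike,sweden,2015\ntom,USA,1522\nraj,india,455\n' > file1
printf 'a,xyz,155\nb,pqr,3215\nc,lmn,3252\n' > file2

# Same pipeline as the script: sort both files together, drop duplicate lines.
sort file1 file2 | uniq > file3
cat file3
```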
Answer to the original form of the question: as @Lars states in a comment on the question, it looks like a simple concatenation of the input files is desired, which is precisely what cat is for (and is even named for):
cat file1 file2 > file3
To fulfill the requirements you added later:
#!/bin/sh
# Concatenate the input files and sort them with duplicates removed
# and save to output file.
cat "$1" "$2" | sort -u > "$3"
Note, however, that you can combine the concatenation and sorting into a single step, as demonstrated by Jean-Baptiste Yunès's answer:
# Sort the input files directly with duplicates removed and save to output file.
sort -u "$1" "$2" > "$3"
Note that using sort is the simplest way to eliminate duplicates. If you don't want sorting, you'll have to use a different, more complex approach, e.g. with awk:
#!/bin/sh
# Process the combined input and only
# output the first occurrence in a set of duplicates to the output file.
awk '!seen[$0]++' "$1" "$2" > "$3"
!seen[$0]++ is a common awk idiom to print only the first line in a set of duplicates:
seen is an associative array that is filled with each input line ($0) as the key (index), with each element created on demand.
This implies that all lines from a set of duplicates (even if not adjacent) refer to the same array element.
In a numeric context, awk's variable values and array elements are implicitly 0, so when a given input line is seen for the first time and the post-increment (++) is applied, the resulting value of the element is 1.
The net effect is that for any given input line, !seen[$0]++ returns true when the line is seen for the first time, and false for each of its duplicates, if any. Note that ++, being a post-increment, is only applied after !seen[$0] is evaluated.
! negates the value of seen[$0], causing a value of 0 (which is false in a Boolean context) to yield true, and any nonzero value (encountered for duplicates) to yield false.
!seen[$0]++ is an instance of a so-called pattern in awk: a condition evaluated against the input line that determines whether the associated action (a block of code) should be executed. Here there is no action, in which case awk implicitly prints the input line if !seen[$0]++ evaluates to true.
The overall effect is: Lines are printed in input order, but for lines with duplicates only the first instance is printed, effectively eliminating duplicates.
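A small demonstration of the order-preserving behavior (the sample lines are made up for illustration):

```shell
# Duplicates are dropped, but the surviving lines keep their input order.
printf 'b\na\nb\nc\na\n' | awk '!seen[$0]++'
# prints:
# b
# a
# c
```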
Note that this approach can be problematic with large input files with few duplicates, because most of the data must then be held in memory.
If your files follow a common naming convention (say file1, file2, file3, ..., fileN), then you can use this to combine them all:
cat file* > combined_file
Edit: a script to do the same, assuming you pass the file names as parameters:
#!/bin/sh
cat "$1" "$2" "$3" | uniq > combined_file
Now you can display combined_file if you want, or access it directly.
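One caveat with the script above: uniq only removes adjacent duplicate lines, so without a preceding sort, repeated lines that are not next to each other survive. A minimal illustration:

```shell
# uniq alone only collapses adjacent duplicates...
printf 'x\ny\nx\n' | uniq          # prints: x y x  (second x survives)

# ...so sort first if all duplicates should go.
printf 'x\ny\nx\n' | sort | uniq   # prints: x y
```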