
Combine multiple files into single file in unix shell scripting

I want to combine the data of, say, 3 files that have the same columns and data types into a single file, which I can then use for further processing.
Currently I have to process the files one after the other, so I am looking for a solution that I can write in a script to combine all the files into a single file.
For ex:
File 1:

mike,sweden,2015
tom,USA,1522
raj,india,455

File 2:

a,xyz,155
b,pqr,3215
c,lmn,3252

Expected combined file 3:

mike,sweden,2015
tom,USA,1522
raj,india,455
a,xyz,155
b,pqr,3215
c,lmn,3252

Kindly help me with this.

A script like:

#!/bin/sh
sort "$1" "$2" | uniq > "$3"

should do the trick. sort sorts the concatenation of the two files (the first two arguments of the script) and passes the result to uniq, which removes adjacent identical lines and writes the result to the third file (the third argument of the script).
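
For example, if the script above were saved as combine.sh (a hypothetical name) and made executable, you could invoke it as shown below; file3 then receives the sorted, de-duplicated contents of file1 and file2:

./combine.sh file1 file2 file3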

Answer to the original form of the question: As @Lars states in a comment on the question, it looks like a simple concatenation of the input files is desired, which is precisely what cat is for (and even named for):

cat file1 file2 > file3

To fulfill the requirements you added later:

#!/bin/sh

# Concatenate the input files and sort them with duplicates removed
# and save to output file.
cat "$1" "$2" | sort -u  > "$3"

Note, however, that you can combine the concatenation and sorting into a single step, as demonstrated by Jean-Baptiste Yunès's answer:

# Sort the input files directly with duplicates removed and save to output file.
sort -u "$1" "$2" > "$3"

Note that using sort is the simplest way to eliminate duplicates. If you don't want sorting, you'll have to use a different, more complex approach, e.g. with awk:

#!/bin/sh

# Process the combined input and only 
# output the first occurrence in a set of duplicates to the output file.
awk '!seen[$0]++' "$1" "$2" > "$3"

!seen[$0]++ is a common awk idiom to only print the first in a set of duplicates:

  • seen is an associative array that is filled with each input line ($0) as the key (index), with each element created on demand.

  • This implies that all lines from a set of duplicates (even if not adjacent) refer to the same array element.

  • In a numerical context, awk's variable values and array elements are implicitly 0, so when a given input line is seen for the first time and the post-increment (++) is applied, the resulting value of the element is 1.

  • Whenever a duplicate of that line is later encountered, the value of the array element is incremented.
  • The net effect is that, for any given input line, !seen[$0]++ returns true if the line is seen for the first time, and false for each of its duplicates, if any. Note that ++, being a post-increment, is only applied after !seen[$0] is evaluated.

    • ! negates the value of seen[$0], causing a value of 0 (which is false in a Boolean context) to return true, and any nonzero value (encountered for duplicates) to return false.
  • !seen[$0]++ is an instance of a so-called pattern in awk, a condition evaluated against the input line that determines whether the associated action (a block of code) should be executed. Here there is no action, in which case awk simply prints the input line if !seen[$0]++ evaluates to true.

  • The overall effect is: Lines are printed in input order, but for lines with duplicates only the first instance is printed, effectively eliminating duplicates.

Note that this approach can be problematic with large input files with few duplicates, because most of the data must then be held in memory.
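
As a quick illustration with a hypothetical input file sample.txt that contains a non-adjacent duplicate line, the idiom keeps only the first occurrence of each line while preserving input order:

$ cat sample.txt
a,xyz,155
b,pqr,3215
a,xyz,155
$ awk '!seen[$0]++' sample.txt
a,xyz,155
b,pqr,3215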

If your files follow the same naming convention (say file1, file2, file3, ..., fileN), then you can use this to combine them all.

cat file* > combined_file

Edit: A script to do the same, assuming you are passing the file names as parameters:

#!/bin/sh
# Concatenate the three files; uniq removes adjacent duplicate lines
cat "$1" "$2" "$3" | uniq > combined_file

Now you can display combined_file if you want, or access it directly.
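
For example:

cat combined_file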
