
Difference between two versions of a file in S3

I have a bucket in S3 with versioning enabled. A file is uploaded periodically, replacing the previous contents. Each record in the file has a unique identifier, and sometimes a record present in the existing file is missing from the new upload, even though it needs to be retained. My goal is to produce a file that contains everything in the new file plus every record from the old file that is no longer present.

I have a small Python script which does the job, and I could run it from an S3 trigger as well, but is there any native AWS implementation for this? For example, an S3 -> XXXX service pipeline that computes the differences between the two files (not a line-by-line diff) and writes out a new merged file.

My Python code looks something like this:

    import pandas as pd

    old_file = 'file1.1.txt'
    new_file = 'file1.2.txt'
    output_file = 'output_pd.txt'

    # Read both tab-separated, headerless files into Pandas dataframes
    old_df = pd.read_csv(old_file, sep="\t", header=None)
    new_df = pd.read_csv(new_file, sep="\t", header=None)

    # Find the rows whose first-column key is present in the old file
    # but missing from the new file
    missing_values = old_df[~old_df.iloc[:, 0].isin(new_df.iloc[:, 0])]

    # Append the missing rows to the new file's contents
    # (DataFrame.append was removed in pandas 2.0; use pd.concat instead)
    final_df = pd.concat([new_df, missing_values], ignore_index=True)

    # Write the final dataframe out, keeping the tab separator of the inputs
    final_df.to_csv(output_file, sep="\t", index=False, header=None)

But I am looking for a native AWS solution or best practice.

but is there any AWS implementation for this issue?

No, there is no native AWS service for comparing files' contents. You have to implement that yourself, as you did right now. You can host your code as a Lambda function that is automatically triggered by S3 uploads (an s3:ObjectCreated event notification on the bucket).
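As a rough illustration of that approach, here is a minimal sketch of a Lambda handler that reacts to an S3 upload, fetches the two most recent versions of the object from the versioned bucket, merges them with the same logic as the question's script, and writes the result back. The tab-separated, headerless layout is taken from the question; the `merged/` output prefix is a hypothetical choice for this sketch, not an AWS convention.

```python
import io

import pandas as pd


def merge_keep_missing(old_df: pd.DataFrame, new_df: pd.DataFrame) -> pd.DataFrame:
    """Return new_df plus any rows of old_df whose first-column key is absent."""
    missing = old_df[~old_df.iloc[:, 0].isin(new_df.iloc[:, 0])]
    return pd.concat([new_df, missing], ignore_index=True)


def lambda_handler(event, context):
    # boto3 is available in the Lambda runtime; imported here so the merge
    # logic above can be exercised locally without AWS credentials.
    import boto3

    s3 = boto3.client("s3")
    record = event["Records"][0]["s3"]
    bucket, key = record["bucket"]["name"], record["object"]["key"]

    # list_object_versions returns versions newest-first for each key,
    # so [0] is the just-uploaded file and [1] is the previous one.
    versions = s3.list_object_versions(Bucket=bucket, Prefix=key)["Versions"]
    new_ver, old_ver = versions[0]["VersionId"], versions[1]["VersionId"]

    def read(version_id):
        body = s3.get_object(Bucket=bucket, Key=key, VersionId=version_id)["Body"]
        return pd.read_csv(io.BytesIO(body.read()), sep="\t", header=None)

    merged = merge_keep_missing(read(old_ver), read(new_ver))

    # Write the merged file under a separate prefix (hypothetical layout).
    out = merged.to_csv(sep="\t", index=False, header=False)
    s3.put_object(Bucket=bucket, Key=f"merged/{key}", Body=out.encode("utf-8"))
```

Keeping the merge in its own function means the row-retention logic can be unit-tested in isolation, while the handler stays a thin wrapper around the S3 calls.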
