How to manipulate files stored in S3 without saving them to the server?
I have the following Python script that downloads two files from an S3-compatible service, merges them, and uploads the output to another bucket.
```python
import time

import boto3
import pandas as pd

timestamp = int(time.time())

conn = boto3.client('s3')
conn.download_file('segment', 'segment.csv', 'segment.csv')
conn.download_file('payment', 'payments.csv', 'payments.csv')

paymentsfile = 'payments.csv'
segmentsfile = 'segment.csv'
outputfile = 'payments_merged_' + str(timestamp) + '.csv'

csv_payments = pd.read_csv(paymentsfile, dtype={'ID': float})
csv_segments = pd.read_csv(segmentsfile, dtype={'ID': float})
csv_payments = csv_payments.merge(csv_segments, on='ID')

csv_payments.to_csv(outputfile)
conn.upload_file(outputfile, 'backup', outputfile)
```
However, if I execute the script it stores the files in the folder of my script. For security reasons I would like to prevent this from happening. I could delete the files after the script has run, but let's assume my script is located in the folder /app/script/. This means that for a short time, while the script is being executed, someone could open the URL example.com/app/script/payments.csv and download the file. What is a good solution for that?
The simplest way would be to modify the configuration of your web server so that it does not serve the directory you are writing to, or to write to a directory that isn't served. For example, a common practice is to use /scr for this type of thing. You would need to modify the permissions for the user your web server runs under to ensure it has access to /scr.
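A minimal sketch of the "write to a directory that isn't served" idea, using Python's tempfile module (the filename is hypothetical, and the actual upload call is omitted since the bucket names are placeholders). The temporary directory lives outside the web root and is deleted automatically when the block exits, so no CSV lingers on disk:

```python
import os
import tempfile

# Scratch directory created outside the web root; it and its contents
# are removed automatically when the with-block exits.
with tempfile.TemporaryDirectory() as workdir:
    outputfile = os.path.join(workdir, 'payments_merged.csv')
    with open(outputfile, 'w') as f:
        f.write('ID,amount\n1,10\n')
    # conn.upload_file(outputfile, 'backup', 'payments_merged.csv')
    print(os.path.exists(outputfile))  # file exists only inside the block
print(os.path.exists(workdir))         # directory and file are gone afterwards
```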
To restrict web server access to the directory you write to, you can use the following in Nginx:
https://serverfault.com/questions/137907/how-to-restrict-access-to-directory-and-subdirs

For Apache you can use this example:
https://serverfault.com/questions/174708/apache2-how-do-i-restrict-access-to-a-directory-but-allow-access-to-one-file-w
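For reference, the Nginx approach from the first link boils down to a location block like the following (a sketch, assuming /app/script/ is the directory the script writes to):

```nginx
# Block all HTTP access to the scratch directory
location /app/script/ {
    deny all;
}
```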
In fact, pandas.read_csv lets you read from a buffer or bytes object, so you can do everything in memory. Either put this script on an instance or, even better, run it as an AWS Lambda function if the files are small.
```python
import io
import time

import boto3
import pandas as pd

timestamp = int(time.time())
paymentsfile = 'payments.csv'
segmentsfile = 'segment.csv'
outputfile = 'payments_merged_' + str(timestamp) + '.csv'

s3 = boto3.client('s3')
payment_obj = s3.get_object(Bucket='payment', Key=paymentsfile)
segment_obj = s3.get_object(Bucket='segment', Key=segmentsfile)

# read_csv accepts the streaming response bodies directly,
# so the input files never touch the local disk
csv_payments = pd.read_csv(payment_obj['Body'], dtype={'ID': float})
csv_segments = pd.read_csv(segment_obj['Body'], dtype={'ID': float})
csv_merge = csv_payments.merge(csv_segments, on='ID')

# serialize the result into an in-memory buffer and upload from there
buffer = io.BytesIO(csv_merge.to_csv(index=False).encode('utf-8'))
s3.upload_fileobj(buffer, 'bucket_name', outputfile)
```
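As a variant of the same in-memory approach (a sketch; the bucket and key names are placeholders), the buffer can be skipped entirely by serializing the merged frame to a string and passing it to put_object:

```python
import pandas as pd

# Toy frame standing in for the merged result
csv_merge = pd.DataFrame({'ID': [1.0, 2.0], 'amount': [10, 20]})

# Serialize to a CSV string in memory; nothing touches the local filesystem
body = csv_merge.to_csv(index=False)
print(body.splitlines()[0])  # → ID,amount

# Hypothetical bucket/key names; with boto3 this would be:
# import boto3
# s3 = boto3.client('s3')
# s3.put_object(Bucket='bucket_name', Key='payments_merged.csv',
#               Body=body.encode('utf-8'))
```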