
Best approach to handle large files on a Django website

Good morning all.

I have a general question about the best approach to handling large files with Django.

I created a Python project where the user can read a binary file (usually between 30 and 100 MB in size). Once the file is read, the program processes it and shows relevant metrics to the user: essentially the max, min, average, and standard deviation of the data.

At the moment the project can only be run from the command line. I'm trying to create a user interface so that anyone can use it, so I decided to build a web page with Django. The page is very simple: the user uploads files, selects which one he wants to process, and the page shows him the metrics.

Working on my local machine I was able to implement it: I upload the files (they are saved on the local machine and then processed). I then created an S3 account, and now the files are all uploaded to S3. The problem I'm having is that when I try to read a file with smart_open ( https://pypi.org/project/smart-open/ ), it is really slow (a 30 MB file takes about 300 s), but if I download the file first and then read it, it only takes 8 s.
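For reference, the read path looked roughly like this; the bucket and key names are placeholders:

from smart_open import open as s3_open

# Open the object on S3 as a file-like object and parse it record by record
with s3_open('s3://YOURBUCKETNAME/YOURKEY', 'rb') as f:
    name = f.read(30)
    unit = f.read(5)
    # ... many more small f.read() calls follow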

My question is: what is the best approach to retrieve files from S3 and process them? I'm thinking of simply downloading the file to my server, processing it, and then deleting it. I've tried this on my localhost and it's fast: downloading from S3 takes 5 s and processing takes 4 s.

Would this be a good approach? I'm a bit worried that if, for instance, I have 10 users at the same time and each one creates a report, I'll need 10 * 30 MB = 300 MB of space on the server. Is this practical, or will I fill up the server?

Thank you for your time!

Edit: To give a bit more context, what's making it slow is the f.read() line. Due to the format of the binary file, I have to read it in the following way:

name = f.read(30)
unit = f.read(5)
# The 2-byte length field is raw bytes and has to be converted to an int
# (the byte order here is an assumption about the file format)
data_length = int.from_bytes(f.read(2), 'little')
data = f.read(data_length)  # <- This is the part that takes a lot of time
                            #    when I read directly from S3. If I download
                            #    the file first, this is super fast.

All,

After some experimenting, I found a solution that works for me.

import os
import boto3

s3 = boto3.client('s3')

# Download the whole object to a local temp file in one request
with open('temp_file_name', 'wb') as data:
    s3.download_fileobj(Bucket='YOURBUCKETNAME', Key='YOURKEY', Fileobj=data)

read_file('temp_file_name')   # process the local copy
os.remove('temp_file_name')   # clean up afterwards

I don't know if this is the best approach, or what the possible downsides of it are. I'll use it for now and come back to this post if I end up switching to a different solution.

The problem with my previous approach was that f.read() was taking too long. The issue seems to be that every time I read another chunk, the program has to make a new request to S3 (or something along those lines), and that adds up. What ended up working for me was downloading the file directly to my server, reading it, and then deleting it once I was done. With this solution I got the same speeds I was getting when working locally (reading directly from my laptop).
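For completeness, if you would rather avoid touching the disk at all for files of this size, a variation that should have the same effect is to fetch the whole object into memory with a single request and parse it from an in-memory buffer. This is just a sketch (bucket/key are placeholders) and I haven't benchmarked it:

import io
import boto3

s3 = boto3.client('s3')

# One GET request for the whole object; all the small f.read() calls
# afterwards hit memory instead of the network
obj = s3.get_object(Bucket='YOURBUCKETNAME', Key='YOURKEY')
f = io.BytesIO(obj['Body'].read())

name = f.read(30)  # same record-by-record parsing as before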

If you are working with medium-sized files (30-50 MB), this approach seems to work. My only concern is whether the server will run out of disk space if someone tries to download a really large file.
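If concurrent users or leftover files become a problem in practice, one refinement (again just a sketch, using the same placeholder bucket/key and my read_file function) would be to give each download a unique temporary file and make the cleanup unconditional:

import os
import tempfile
import boto3

s3 = boto3.client('s3')

# A unique temp file per request, so simultaneous users can't overwrite
# each other's downloads
with tempfile.NamedTemporaryFile(delete=False) as data:
    s3.download_fileobj(Bucket='YOURBUCKETNAME', Key='YOURKEY', Fileobj=data)
    temp_path = data.name

try:
    read_file(temp_path)
finally:
    os.remove(temp_path)  # always delete, even if processing raises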
