Best way to store large amount of data of users

I store each user's files in a directory named after the user, like this:

/username/file01.jpg
/username/file02.mp4
/username/file03.mp3

But as more users arrive and upload more files, this becomes a problem, because some or many users will have to be migrated to another drive. I chose the per-username directory layout at first because I don't want filenames to get mixed up, and I don't want to change the filenames either. Also, if files are stored under their original names, two users uploading a file with the same name would cause a conflict.

What would be the best way to do this? I have one solution in mind, but I want to ask the community whether it is the best approach.

I will use sequential folders, hash each filename into something unique, and store the file under that hash. In the database I will store the file's original name, the username, and the hashed filename that is used on disk.

When anyone wants to access a file, I will read it through PHP and either replace the name or do something else at that point so that the file is downloaded with its original filename.

This is the only solution I have in mind. Do you have anything better?

Edit:

I use a folder system too, and for the second approach I will possibly use virtual folders. My database is MongoDB.

All your answers were awesome and really helpful. I wanted to give the bounty to everyone, which is why I left it to be awarded automatically by the community. Thanks, all, for your answers. I really appreciate it.

Could you create relational MySQL tables? For example, a users table and a files table.

Your users table would keep track of everything you are (I assume) already tracking:

id, name, email, etc.

Then the files table would store something like:

id, fileExtension, fileSize, userID <---- userID would be the foreign key pointing to the id field in the users table.

Then, when you save a file, you could save it as its id.fileExtension and use a query to pull the user associated with that file, or all files associated with a user.

For example:

SELECT users.name, files.id, files.fileExtension
FROM `users`
INNER JOIN `files` ON users.id = files.userID;

I handle file metadata in the database and retrieve the files with a UUID. What I do is:

  1. Content-based identification
    1. MD5 of the file's content.
    2. Namespaced UUID (v5) to generate a unique identifier based on the user's UUID and the file's MD5.
    3. Custom function to generate the path based on 'realname'.
    4. Save to the database: uuid, originalname (the uploaded name), realname (the generated name), filesize, and mime (optionally dateAdded and md5).
  2. File retrieval
    1. Use the UUID to retrieve the metadata.
    2. Regenerate the file path based on realname.
    3. originalname is used to show a familiar name to the user who downloads the file.

I process the file's name, assigning it a namespaced UUID as the database primary key, and generate the path based on the user and filename. The precondition is that each user has a UUID assigned. The following code will help you avoid ID collisions in the database and help you identify files by their contents (in case you ever need a way to spot duplicate content and not necessarily duplicate filenames).

$fileInfo = pathinfo($_FILES['file']['name']);
$extension = isset($fileInfo['extension']) ? "." . $fileInfo['extension'] : "";

// Hash the uploaded file's contents (you could use another hash algorithm if you are so inclined).
$md5Name = md5_file($_FILES['file']['tmp_name']);

// UUID::v5(namespace, value): the user's UUID is the namespace, the content hash is the value.
$realName = UUID::v5($user->uuid, $md5Name) . $extension;
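
UUID::v5 is not built into PHP, so the snippet above assumes some helper class that provides it. A minimal sketch of such a function, following the name-based SHA-1 layout from RFC 4122, might look like this (the function name uuidV5 and the sample values are my own assumptions, not part of the original answer):

```php
<?php
// Hypothetical helper: name-based (version 5) UUID from a namespace UUID and a name.
function uuidV5(string $namespace, string $name): string
{
    // Convert the namespace UUID to its 16 raw bytes.
    $nsHex = str_replace('-', '', $namespace);
    if (!preg_match('/^[0-9a-fA-F]{32}$/', $nsHex)) {
        throw new InvalidArgumentException('Namespace must be a UUID');
    }

    // SHA-1 over namespace bytes followed by the name; only the first 32 hex chars are used.
    $hash = sha1(hex2bin($nsHex) . $name);

    // Set the version (5) and the RFC 4122 variant bits.
    $timeHiAndVersion = (hexdec(substr($hash, 12, 4)) & 0x0fff) | 0x5000;
    $clockSeq = (hexdec(substr($hash, 16, 4)) & 0x3fff) | 0x8000;

    return sprintf(
        '%s-%s-%04x-%04x-%s',
        substr($hash, 0, 8),
        substr($hash, 8, 4),
        $timeHiAndVersion,
        $clockSeq,
        substr($hash, 20, 12)
    );
}

// Example: a made-up user UUID as the namespace and a made-up content hash as the name.
$userUuid = '6ba7b810-9dad-11d1-80b4-00c04fd430c8';
echo uuidV5($userUuid, 'd41d8cd98f00b204e9800998ecf8427e') . '.jpg', PHP_EOL;
```

Because the result depends only on the namespace and the name, the same user uploading the same content always gets the same identifier, which is what makes duplicate detection possible.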

I use a function to generate the file path based on some custom parameters; you could use $username and $realname. This is helpful if you implement a distributed folder structure, which you might partition by file naming scheme or by any custom scheme.

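function generateBasePath($realname, array $customArgs = []) {
    // Process the arguments according to your requirements.
    // As an example, the mount point is taken from the arguments (the default
    // below is only a placeholder) and the subdirectory is the first three
    // characters of $realname. It might as well be a checksum that helps you
    // decide which drive/volume/mount point to use, e.g. some files on the
    // local disk and others on an Amazon S3 mount point.
    $mountpoint = rtrim($customArgs['mountpoint'] ?? '/var/uploads', '/');
    $generatedPath = substr($realname, 0, 3);
    return $mountpoint . '/' . $generatedPath;
}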
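
The "checksum that helps you decide which drive/volume/mountpoint to use" idea could be sketched roughly as follows. This is only an illustration: the function names pickMountPoint and buildStoragePath and the /mnt/disk paths are assumptions of mine.

```php
<?php
// Hypothetical helper: choose a storage root for a stored file name.
function pickMountPoint(string $realName, array $mountPoints): string
{
    if ($mountPoints === []) {
        throw new InvalidArgumentException('At least one mount point is required');
    }
    // crc32() is deterministic, so the same name always maps to the same volume.
    // The mask keeps the value non-negative on 32-bit builds.
    $index = (crc32($realName) & 0x7fffffff) % count($mountPoints);
    return rtrim($mountPoints[$index], '/');
}

// Hypothetical helper: full path = mount point / first three characters / name.
function buildStoragePath(string $realName, array $mountPoints): string
{
    // A short prefix as a subdirectory keeps the number of files per folder down.
    $shard = substr($realName, 0, 3);
    return pickMountPoint($realName, $mountPoints) . '/' . $shard . '/' . $realName;
}

echo buildStoragePath(
    '886313e1-3b8a-5372-9b90-0c9aee199e5d.jpg',
    ['/mnt/disk1', '/mnt/disk2/']
), PHP_EOL;
```

Note that adding a mount point changes the mapping for names that already exist, so in practice you would probably also record the chosen location in the file's database record.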

As an added bonus, this also:

  1. helps you maintain a versioned file repository, if you add an attribute to the file's record stating which file (uuid) it has replaced.
  2. lets you create an application-level access control list, if you add 'owner' and/or 'group' attributes.
  3. works with a single-folder structure as well.

Note: I used PHP's $_FILES as an example of the file source, based on this question's tags. The file can come from any source, including generated content.

Since you already use MongoDB, I would suggest checking out GridFS. It's a specification that allows you to store files (even ones larger than 16 MB) in MongoDB collections.

It is scalable, so you'll have no problems if you add another server. It also stores metadata, lets you read files in chunks, and has built-in backup functions.

I would generate a GUID for the filename based on a hash of the original filename, the date and time of the upload, and the username, and save those values, along with the path to the file, in a database for later use. If you generate such a GUID, the filenames cannot be guessed.

As an example, say user Daniel Steiner (me) uploads a file called resume.doc to your server on 23 April 2013 at 12:37 a.m. This would give a base value of Daniel_Steiner+2013/23/04+00:37+resume.doc, which as an MD5 hash would be 05c2d2f501e738b930885d991d136f1e. To ensure that the file is opened in the right program, we then append the original file extension and get something like http://link.to/your/site/05c2d2f501e738b930885d991d136f1e.doc If your user accounts already have a user ID, you could add it to the URL; for example, if my user ID were 123145, the URL would be http://link.to/your/site/123145/05c2d2f501e738b930885d991d136f1e.doc

If you save the original filename in the database, you can later also offer a download script that serves the file under its original filename, even though it has a different filename on your server.

If you can use symbolic links, relocating the files to another hard disk shouldn't be a problem either.

If you want, I could come up with a PHP example as well. It shouldn't be too much code.
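
A rough sketch of the naming scheme described in this answer might look like the following. The function name buildStoredName is an assumption of mine, and the base string simply mirrors the username+date+time+filename layout from the example.

```php
<?php
// Hypothetical helper: derive a stored file name from the upload details.
function buildStoredName(string $username, DateTimeInterface $uploadedAt, string $originalName): string
{
    // Same layout as the example: username+Y/d/m+H:i+filename
    $base = $username . '+' . $uploadedAt->format('Y/d/m+H:i') . '+' . $originalName;
    $hash = md5($base);

    // Keep the original extension so the file opens in the right program.
    $extension = strtolower(pathinfo($originalName, PATHINFO_EXTENSION));
    return $extension === '' ? $hash : $hash . '.' . $extension;
}

$stored = buildStoredName(
    'Daniel_Steiner',
    new DateTimeImmutable('2013-04-23 00:37'),
    'resume.doc'
);
echo $stored, PHP_EOL; // 32 hex characters followed by ".doc"
```

The original filename, the username and the generated name would then be saved together in the database. If guessability is a real concern, mixing some random bytes into the base string would make the name harder to reconstruct from known inputs.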

Since a filesystem is a tree, not a graph (faceted classification), it's hard to make it represent multiple entities easily, such as users, media types, dates, events, image crop types, etc. That's why using a relational database is easier: it can be converted to a graph.

But since it's another level of abstraction, you need to write the functions that do the low-level synchronization yourself, including avoiding name collisions, long path names, and large file counts per folder, and handling ease of transfer per entity, horizontal scaling, etc. So it depends on how complex your application needs to be.

Another tactic is to create a two-level structure where the first level of directories is the first 2 characters of the username and the second level is the remaining characters (similar to how Git stores its SHA-1 object IDs). For example:

/files/jr/andomuser/456.jpg

for user 'jrandomuser'.

Please note that since usernames will likely not be distributed as uniformly as SHA-1 values, you may need to add another level later on. I doubt it, though.
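
As a small sketch of that layout (the function name userDirectory, the /files root and the _short bucket for very short names are assumptions of mine, and usernames are assumed to be validated already):

```php
<?php
// Hypothetical helper: map a username to a two-level directory.
function userDirectory(string $username, string $root = '/files'): string
{
    if (strlen($username) < 3) {
        // Too short to split into prefix + remainder; use a separate bucket.
        return $root . '/_short/' . $username;
    }
    // First two characters form the first level, the rest the second level.
    return $root . '/' . substr($username, 0, 2) . '/' . substr($username, 2);
}

echo userDirectory('jrandomuser') . '/456.jpg', PHP_EOL; // /files/jr/andomuser/456.jpg
```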

I suggest using the following database structure:

[image]

where the File table has at least:

[image]

IDFile is an auto_increment column and the primary key. UserID is a nullable foreign key.

For FK_File_User I suggest:

ON UPDATE NO ACTION -- IDUser is auto_increment too. No changes need to be tracked.
ON DELETE SET NULL  -- If user deleted, then File is not owned. Might be deleted
                    -- with CRON job or something else.

Other columns might still be added to the File table:

  1. Actual upload date and time
  2. Actual MIME type
  3. Actual storage place (for distributed storage systems)
  4. Download count (another table might be a better solution)

etc.

Some benefits:

  1. You don't need to calculate the file size, hash, extension or any other file metadata, because you can obtain it with one database operation.
  2. You can obtain per-user statistics (file count, space used, or anything else you wrote to the File table) with a single SELECT ... GROUP BY ... WITH ROLLUP statement, which is faster than analyzing the actual files, which may be spread across multiple storage devices.
  3. You can apply file access permissions for different users. It will not require significant changes to the database table structure.

I don't consider storing files under their original filenames an option, for two reasons:

  1. A file may have a name that is not correctly supported by the server OS filesystem, such as a Cyrillic one.
  2. Two different files may have identical names, so one of them might be overwritten by the other.

So, there is a solution:

1) When a file is uploaded, rename it to the IDFile obtained from the INSERT into the File table. It's safe and there are no duplicates.

2) Restore the name of the file when it's needed / downloaded, like:

// perform a query on the "File" table by the given ID

list($name, $ext, $size, $md5) = $result->fetch_row();

$result->free();

header('Content-Length: ' . $size);
header('Content-MD5: ' . $md5);
header('Accept-Ranges: bytes');
header('Connection: close');
header('Content-Type: application/force-download');
header('Content-Disposition: attachment; filename="' . $name . '.' . $ext . '"');

// flush file content

3) The actual files may be stored within a single directory (because IDFile is safe) or within an IDUser-named subdirectory, depending on the situation.

4) As IDFile is a plain sequence, if some files go missing, you can find their database metadata by looking for gaps in the sequence of actual filenames. Then you may inform the owners, delete the file metadata, or both.
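
Finding the missing segments from point 4 could be sketched like this, assuming the IDs have already been read from the filenames on disk (the function name findMissingIds is my own):

```php
<?php
// Hypothetical helper: which IDs in 1..$maxId have no file on disk?
function findMissingIds(array $idsOnDisk, int $maxId): array
{
    if ($maxId < 1) {
        return [];
    }
    // array_diff() keeps the original keys, so reindex the result.
    return array_values(array_diff(range(1, $maxId), $idsOnDisk));
}

print_r(findMissingIds([1, 2, 4, 7], 7)); // IDs 3, 5 and 6 are missing
```

For a very large table, range() would build a big array in memory, so you would probably walk the sequence in batches instead.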


I'm against the idea of storing large actual files in the DBMS itself as binary content.

A DBMS is about data and analysis; it's not a filesystem and should never be used as one, in my humble opinion.

You can install an LDAP server. LDAP lookups are very fast, since LDAP is highly optimized for read-heavy workloads. You can even query for data.

LDAP organizes data in a tree-like fashion.

You can organize the data as in the following example: "user->IP address->folder->file name". This way, files can be physically/geographically spread out and you can still fetch their location very quickly.

You can also use standard LDAP queries, e.g. to get the list of all files for a particular user, or the list of files in a folder.

  1. Use MongoDB to store the actual filename (e.g. myImage.jpg) and other attributes (e.g. MIME type), plus $random-text.jpg from 2. and 3. below.

  2. Generate some $random-text, e.g. base_convert(mt_rand(), 10, 36) or uniqid($username, true);

  3. Physically store the file as $random-text.jpg. It's always good to keep the same extension.

  4. NOTE: Use filter_var() to ensure the input filename doesn't pose a security risk to MongoDB.
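
Steps 2 and 3 could be sketched as below. Note that I have swapped in random_bytes() in place of mt_rand()/uniqid(), since it is designed to be unpredictable; the function name randomStoredName is an assumption of mine.

```php
<?php
// Hypothetical helper: random on-disk name that keeps the original extension.
function randomStoredName(string $originalName): string
{
    // 16 random bytes rendered as 32 hex characters.
    $token = bin2hex(random_bytes(16));

    $extension = strtolower(pathinfo($originalName, PATHINFO_EXTENSION));
    // Only keep plain alphanumeric extensions; drop anything unusual.
    if ($extension === '' || !ctype_alnum($extension)) {
        return $token;
    }
    return $token . '.' . $extension;
}

echo randomStoredName('myImage.jpg'), PHP_EOL; // 32 hex characters followed by ".jpg"
```

The original name (myImage.jpg) and the generated name would then be stored together in the MongoDB document.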

Amazon S3 is reliable and cheap; be aware of "eventual consistency" with S3.

Assuming users have a unique ID (primary key) in the database, if a user with ID 73 uploads a file, save it like this:

"uploads/{$userid}_{$filename}.{$ext}"

For example, 73_resume.doc, 73_myphoto.jpg

Now, when fetching files, use this code:

// Braces are needed: "$userid_" would be read as a variable named $userid_.
foreach (glob("uploads/{$userid}_*.*") as $filename) {
    echo $filename;
}

This can be combined with the hashing solutions (with the hash stored in the DB), so that a user who gets 73_photo.jpg as a download path cannot simply try 74_photo.jpg in the browser address bar.
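
To see the prefix lookup in action, here is a small self-contained sketch that uses a throwaway temp directory in place of uploads/ (the function name listUserFiles and the sample filenames are assumptions of mine):

```php
<?php
// Hypothetical helper: list the files that belong to one user ID.
function listUserFiles(string $uploadDir, int $userId): array
{
    // The underscore after the ID keeps user 7 from matching user 73's files.
    $matches = glob("{$uploadDir}/{$userId}_*.*");
    return $matches === false ? [] : array_map('basename', $matches);
}

// Build a temporary directory with a few sample uploads.
$dir = sys_get_temp_dir() . '/uploads_demo_' . bin2hex(random_bytes(4));
mkdir($dir);
foreach (['73_resume.doc', '73_myphoto.jpg', '7_other.txt'] as $name) {
    touch($dir . '/' . $name);
}

print_r(listUserFiles($dir, 73)); // the two files prefixed with 73_
```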
