
What is the best way to determine the MIME type of an HTTP file upload?

Assume you have an HTML form with an input tag of type 'file'. When the file is posted to the server it will be stored locally, along with relevant metadata.

I can think of three ways to determine the MIME type:

  • Use the MIME type supplied in the 'multipart/form-data' payload.
  • Use the file name supplied in the 'multipart/form-data' payload and look up the MIME type based on the file extension.
  • Scan the raw file data and use a MIME type guessing library.

None of these solutions is perfect.

Which is the most accurate solution?
Is there another, better option?

If you are using PHP then you can use

http://pecl.php.net/package/Fileinfo

It will inspect many aspects of the file. For Python you can use

http://pypi.python.org/pypi/python-magic/0.1

This provides bindings for libmagic on Linux/Unix (and possibly Windows) systems. See:

man magic
man libmagic

on Linux. It uses magic number tests to try to determine the MIME types of files.
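For the Python route, a minimal sketch might look like the following. It assumes a recent release of the third-party python-magic package (the module-level `from_buffer` helper may not exist in the 0.1 release linked above) plus an installed libmagic. `sniff_with_libmagic` is a hypothetical helper name, and it simply returns None when the library is not available.

```python
from typing import Optional


def sniff_with_libmagic(data: bytes) -> Optional[str]:
    """Ask libmagic for a MIME type, or return None if it is unavailable.

    Assumption: the third-party python-magic package is installed and
    exposes magic.from_buffer(data, mime=True).
    """
    try:
        import magic  # third-party: python-magic (needs libmagic on the system)
    except ImportError:
        return None

    # A different package also installs a module called "magic";
    # bail out if the helper we expect is not there.
    from_buffer = getattr(magic, "from_buffer", None)
    if from_buffer is None:
        return None

    # Only the leading bytes are needed for magic number tests.
    return from_buffer(data[:2048], mime=True)


if __name__ == "__main__":
    print(sniff_with_libmagic(b"%PDF-1.4\n"))
```

Passing only the first couple of kilobytes keeps the check cheap even for large uploads.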

I like the magic number method, because it can catch wrong extensions and a lot of trickery if you are handling uploaded files on a web server. These tests are generally one-offs, so the performance hit of reading through the file is negligible.

I don't think you can rely on any one of these as being the definitive "I am MIME type x". The problem with the first two is that the content type supplied may be incorrect, either because of issues with the client (browser or otherwise) or because the request is deliberately misleading (various hack attempts etc.).
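To make that concrete, here is a small sketch using only the Python standard library. It parses one hand-written part of a 'multipart/form-data' body (the field name, file name and payload are made up for illustration; in practice your web framework does this parsing) and shows that the declared type and the extension lookup can agree with each other while both being wrong about the content.

```python
import mimetypes
from email.parser import BytesParser

# Hypothetical part of a multipart/form-data request body.
# Everything in these headers is chosen by the client.
part_headers = (
    b'Content-Disposition: form-data; name="upload"; filename="holiday.jpg"\r\n'
    b"Content-Type: image/jpeg\r\n"
    b"\r\n"
)
body = b"<?php echo 'not an image'; ?>"

part = BytesParser().parsebytes(part_headers + body)

declared_type = part.get_content_type()              # option 1: supplied MIME type
filename = part.get_filename()                       # supplied file name
by_extension, _ = mimetypes.guess_type(filename)     # option 2: extension lookup

print(declared_type, filename, by_extension)
# Both answers say JPEG, yet the payload does not start with
# the JPEG signature bytes FF D8 FF.
print(body.startswith(b"\xff\xd8\xff"))
```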

So you should probably try to combine information from each source and work out some sort of confidence level. If the file extension says .doc and the MIME type is application/msword then there's a pretty good chance it's a Word document, but run it through a MIME type detection utility just to make sure.
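One way to sketch that idea in Python is shown below. The function name `resolve_mime`, the three confidence labels, and the rule that the content-based result wins are all assumptions made for illustration, not an established algorithm. The `sniffed` argument stands for whatever your detection utility returned (or None if you have none).

```python
import mimetypes
from typing import Optional, Tuple


def resolve_mime(declared: Optional[str],
                 filename: Optional[str],
                 sniffed: Optional[str]) -> Tuple[str, str]:
    """Combine the three signals into (mime_type, confidence).

    Hypothetical policy: trust the content-based result most and count
    how many of the client-supplied signals agree with it.
    """
    by_extension, _ = mimetypes.guess_type(filename or "")
    # Drop parameters such as "; charset=utf-8" and normalise case.
    declared = (declared or "").split(";")[0].strip().lower() or None

    if sniffed is None:
        # No content check available: the best we can do is see whether
        # the two client-supplied values at least agree with each other.
        if declared and declared == by_extension:
            return declared, "medium"
        return declared or by_extension or "application/octet-stream", "low"

    agreeing = sum(1 for signal in (declared, by_extension) if signal == sniffed)
    confidence = {2: "high", 1: "medium", 0: "low"}[agreeing]
    return sniffed, confidence


if __name__ == "__main__":
    print(resolve_mime("application/msword", "report.doc", "application/msword"))
    print(resolve_mime("image/jpeg", "holiday.jpg", "text/x-php"))
```

A "low" result is probably the case worth logging or rejecting, since it means the client's claims and the file contents disagree.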

There should be a solution available for MIME magic detection in the language you're using, though you didn't mention which one. They all generally work by looking at the first few bytes/characters of the file and matching them against a lookup table of MIME types. Some also remove the BOM from the file to help with this. Often they fall back to plain text if the MIME type can't be detected.
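A toy version of that mechanism, in Python with only the standard library, could look like this. The signature table is deliberately tiny and the name `sniff_mime` is hypothetical; real libraries ship far larger tables and more elaborate tests. Falling back to application/octet-stream when the data is not valid UTF-8 is an extra assumption of this sketch.

```python
import codecs

# Tiny illustrative lookup table: leading bytes -> MIME type.
SIGNATURES = [
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"\xff\xd8\xff", "image/jpeg"),
    (b"GIF87a", "image/gif"),
    (b"GIF89a", "image/gif"),
    (b"%PDF-", "application/pdf"),
    (b"PK\x03\x04", "application/zip"),
]


def sniff_mime(data: bytes) -> str:
    """Guess a MIME type from the first few bytes of the data."""
    # Skip a UTF-8 byte order mark so it does not hide the signature.
    if data.startswith(codecs.BOM_UTF8):
        data = data[len(codecs.BOM_UTF8):]

    head = data[:16]
    for magic_bytes, mime in SIGNATURES:
        if head.startswith(magic_bytes):
            return mime

    # Nothing matched: fall back to plain text if the sample decodes as
    # UTF-8. The incremental decoder tolerates a multi-byte character
    # that is cut off at the end of the sample.
    try:
        codecs.getincrementaldecoder("utf-8")().decode(data[:1024])
    except UnicodeDecodeError:
        return "application/octet-stream"
    return "text/plain"


if __name__ == "__main__":
    print(sniff_mime(b"\x89PNG\r\n\x1a\n" + b"\x00" * 8))
    print(sniff_mime(codecs.BOM_UTF8 + b"%PDF-1.7"))
    print(sniff_mime(b"hello world"))
```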

If you want a platform-independent approach to this then take a look at the various Java libraries that exist:


 