
How to determine the file extension from fileBytes

My application allows users to download files. While creating the headers, I use Tika to set the extension, as shown below. This works fine for PDF files but fails for Word and Excel files.

private HttpHeaders getHeaderData(byte[] fileBytes) throws IOException, MimeTypeException {
        final HttpHeaders headers = new HttpHeaders();

        TikaInputStream tikaStream = TikaInputStream.get(fileBytes);
        Tika tika = new Tika();
        String mimeType = tika.detect(tikaStream);
        headers.setContentType(MediaType.valueOf(mimeType));

        MimeTypes defaultMimeTypes = MimeTypes.getDefaultMimeTypes();
        String extension = defaultMimeTypes.forName(mimeType).getExtension();
        headers.add("file-ext", extension);

        return headers;
    }

I see that the mimeType resolves to "application/pdf" for PDF files, but to "application/x-tika-ooxml" for Excel and Word files, which is the problem. How can I get the Word (.docx) and Excel (.xls, .xlsx) formats if I only have the file as bytes?

Why does this work for PDF?

Summary

The short answer is: You have to use Tika's detector with its MediaType class - not MimeTypes.

The slightly longer answer is: Even that will not get you all the way, because of how older MS-Office files are structured. For those you also have to parse the files and inspect their metadata.

The term "media type" has replaced the term "MIME type", as the following quote explains:

[RFC2046] specifies that Media Types (formerly known as MIME types) and Media Subtypes will be assigned and listed by the IANA.

Office 97-2003

When Tika inspects Excel and Word 97-2003 files using its detector, it will return a media type of application/x-tika-msoffice. I assume (perhaps incorrectly) that this is its way of handling a file-type group, where the detector cannot determine the specific flavor of MS-Office 97-2003 file, based on its analysis. This is similar to the application/x-tika-ooxml in your question.
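To illustrate why a detector that only looks at the leading bytes ends up with a group type, here is a minimal sketch in plain Java (no Tika involved). The class name MagicSniffer and the Container enum are hypothetical, introduced only for this illustration. It relies on three well-known signatures: PDF files start with "%PDF-", Office 97-2003 files share the OLE2 compound-file header, and OOXML files are ordinary ZIP archives.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper: classifies a file by its first bytes only.
final class MagicSniffer {

    // Illustrative result type; these are not Tika media types.
    enum Container { PDF, OLE2, ZIP, UNKNOWN }

    // OLE2 compound-file signature shared by .doc, .xls and .ppt.
    private static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };
    // ZIP local-file-header signature shared by .docx, .xlsx and any other ZIP.
    private static final byte[] ZIP_MAGIC = {'P', 'K', 0x03, 0x04};
    private static final byte[] PDF_MAGIC = "%PDF-".getBytes(StandardCharsets.US_ASCII);

    static Container sniff(byte[] data) {
        if (startsWith(data, PDF_MAGIC)) return Container.PDF;
        if (startsWith(data, OLE2_MAGIC)) return Container.OLE2;
        if (startsWith(data, ZIP_MAGIC)) return Container.ZIP;
        return Container.UNKNOWN;
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) return false;
        return Arrays.equals(Arrays.copyOf(data, prefix.length), prefix);
    }
}
```

A .doc and an .xls both come back as OLE2 here, and a .docx and an .xlsx both come back as ZIP, so telling them apart requires looking inside the container. A PDF has no such ambiguity, which is consistent with the PDF case working in the question.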

Expected Results

Based on the IANA list and a Mozilla list, these are the media types we expect to get for the following file types:

  • .pdf :: application/pdf
  • .xls :: application/vnd.ms-excel
  • .doc :: application/msword
  • .xlsx :: application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
  • .docx :: application/vnd.openxmlformats-officedocument.wordprocessingml.document
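Once a specific media type is known, turning it into an extension can be as simple as a lookup over the list above. The following is a sketch in plain Java; the class name ExtensionLookup and the method extensionFor are hypothetical, and the table holds only the five types listed here.

```java
import java.util.Map;
import java.util.Optional;

// Hypothetical helper: maps the media types listed above to file extensions.
final class ExtensionLookup {

    private static final Map<String, String> EXTENSIONS = Map.of(
        "application/pdf", ".pdf",
        "application/vnd.ms-excel", ".xls",
        "application/msword", ".doc",
        "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet", ".xlsx",
        "application/vnd.openxmlformats-officedocument.wordprocessingml.document", ".docx"
    );

    // Returns an empty Optional for null or for types missing from the table,
    // including group types such as application/x-tika-ooxml.
    static Optional<String> extensionFor(String mediaType) {
        if (mediaType == null) {
            return Optional.empty();
        }
        return Optional.ofNullable(EXTENSIONS.get(mediaType));
    }
}
```

Returning an empty Optional for unknown or group types makes the caller decide on a fallback explicitly, rather than writing a wrong or empty extension into a header.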

The Program

The program shown below uses the following Maven dependencies:

    <dependencies>
        <dependency>
            <groupId>org.apache.tika</groupId>
            <artifactId>tika-core</artifactId>
            <version>1.23</version>
        </dependency>
        <dependency>
            <groupId>org.apache.tika</groupId>
            <artifactId>tika-parsers</artifactId>
            <version>1.23</version>
        </dependency>
        <dependency>
            <groupId>javax.ws.rs</groupId>
            <artifactId>javax.ws.rs-api</artifactId>
            <version>2.1.1</version>
        </dependency>
    </dependencies>

The program (just for this demo - not production ready) is shown below. Specifically, look at the tikaDetect() and tikaParse() methods.

import java.io.IOException;
import java.io.File;
import java.io.FileInputStream;
import java.io.BufferedInputStream;
import java.util.Set;
import java.util.HashSet;
import org.apache.tika.mime.MediaType;
import org.apache.tika.mime.MimeTypeException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.detect.Detector;
import org.apache.tika.detect.DefaultDetector;
import org.apache.tika.exception.TikaException;
import org.apache.tika.parser.Parser;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.sax.BodyContentHandler;
import org.apache.tika.parser.ParseContext;
import org.xml.sax.SAXException;
import org.xml.sax.ContentHandler;

public class Main {

    private final Set<File> msOfficeFiles = new HashSet<>();

    public static void main(String[] args) throws IOException, MimeTypeException,
            SAXException, TikaException {
        Main main = new Main();
        main.doFileDetection();
    }

    private void doFileDetection() throws IOException, MimeTypeException, SAXException, TikaException {
        File file1 = new File("C:/tmp/foo.pdf");
        File file2 = new File("C:/tmp/baz.xlsx");
        File file3 = new File("C:/tmp/bat.docx");
        // Excel 97-2003 format:
        File file4 = new File("C:/tmp/bar.xls");
        // Word 97-2003 format:
        File file5 = new File("C:/tmp/daz.doc");
        Set<File> files = new HashSet<>();
        files.add(file1);
        files.add(file2);
        files.add(file3);
        files.add(file4);
        files.add(file5);

        for (File file : files) {
            try (BufferedInputStream bis = new BufferedInputStream(
                    new FileInputStream(file))) {
                tikaDetect(file, bis);
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
        for (File file : msOfficeFiles) {
            tikaParse(file);
        }
    }

    private void tikaDetect(File file, BufferedInputStream bis)
            throws IOException, SAXException, TikaException {
        Detector detector = new DefaultDetector();
        Metadata metadata = new Metadata();
        MediaType mediaType = detector.detect(bis, metadata);
        if (mediaType.toString().equals("application/x-tika-msoffice")) {
            msOfficeFiles.add(file);
        } else {
            System.out.println("Media Type for " + file.getName()
                    + " is: " + mediaType.toString());
        }
    }

    private void tikaParse(File file) throws SAXException, TikaException {
        Parser parser = new AutoDetectParser();
        ContentHandler handler = new BodyContentHandler();
        Metadata metadata = new Metadata();
        ParseContext context = new ParseContext();
        try (BufferedInputStream bis = new BufferedInputStream(
                new FileInputStream(file))) {
            // Parsing fills the metadata with the specific Content-Type.
            parser.parse(bis, handler, metadata, context);
        } catch (IOException e) {
            e.printStackTrace();
        }
        System.out.println("Media Type for " + file.getName()
                + " is: " + metadata.get("Content-Type"));
    }
}

Actual Results

The program generates some warnings and information messages. If we ignore these for this exercise, we get the following print statements:

Media Type for bat.docx is: application/vnd.openxmlformats-officedocument.wordprocessingml.document
Media Type for baz.xlsx is: application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
Media Type for foo.pdf is: application/pdf
Media Type for bar.xls is: application/vnd.ms-excel
Media Type for daz.doc is: application/msword

These match the expected official media (MIME) types.

Tika official usage: https://tika.apache.org/1.26/detection.html

Tika supported formats: https://tika.apache.org/1.26/formats.html

You can find the answers in the two pages above. Here are some key quotes:

Microsoft Office and some related applications produce documents in the generic OLE 2 Compound Document and Office Open XML (OOXML) formats. The older OLE 2 format was introduced in Microsoft Office version 97 and was the default format until Office version 2007 and the new XML-based OOXML format. The OfficeParser and OOXMLParser classes use Apache POI libraries to support text and metadata extraction from both OLE2 and OOXML documents.

That means you also need to include the Apache POI jars or Maven dependencies for MS Office files.

Tika provides a wrapping detector in the form of org.apache.tika.detect.DefaultDetector. This uses the service loader to discover all available detectors, including any available container aware ones, and tries them in turn. For container aware detection, include the Tika Parsers jar and its dependencies in your project, then use DefaultDetector along with a TikaInputStream.

That means you need to include the Tika Parsers jar or Maven dependencies.

Then use new DefaultDetector().detect(TikaInputStream.get(file), new Metadata());
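To give a feel for what "container aware" means for OOXML files, here is a rough sketch in plain Java. It is not Tika's implementation, which is more thorough. The class name OoxmlFlavor and the method extensionOf are hypothetical. It assumes the conventional package layout, in which a Word document keeps its parts under word/, an Excel workbook under xl/, and a PowerPoint presentation under ppt/.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.util.Optional;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

// Hypothetical helper: guesses the OOXML flavor from the ZIP entry names.
final class OoxmlFlavor {

    // Simplification: macro-enabled and template variants (.docm, .xltx, ...)
    // use the same folders and are not distinguished here.
    static Optional<String> extensionOf(byte[] fileBytes) throws IOException {
        try (ZipInputStream zip = new ZipInputStream(new ByteArrayInputStream(fileBytes))) {
            for (ZipEntry entry = zip.getNextEntry(); entry != null; entry = zip.getNextEntry()) {
                String name = entry.getName();
                if (name.startsWith("word/")) return Optional.of(".docx");
                if (name.startsWith("xl/")) return Optional.of(".xlsx");
                if (name.startsWith("ppt/")) return Optional.of(".pptx");
            }
        }
        // Not a ZIP archive, or a ZIP without any recognized OOXML part.
        return Optional.empty();
    }
}
```

The point of the sketch is that the distinguishing information lives inside the archive, which is why the detector needs the whole stream and the parser modules rather than just the first few bytes.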
