Is there a way to extract text from a PDF inside of a Dropbox file system?

I am working on a project where I need to iterate through a file system, extract the text from each PDF, and scan through that text. Previously, the file system was an N drive (which acts as a local file system), so I could access each PDF file with the Java File API. I would then extract the text with this method:

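// Imports assume iText 5 package names; adjust them for the iText version in use.
import java.io.File;
import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.pdf.parser.PdfTextExtractor;

public static String returnStringOfPDFiText(File file)
    {
        try {
            PdfReader reader = new PdfReader(file.getPath());
            try {
                int n = reader.getNumberOfPages();
                // A StringBuilder avoids the literal "null" prefix that
                // appending to a null String with += would produce.
                StringBuilder pdfText = new StringBuilder();
                for (int i = 1; i <= n; i++)
                {
                    // Pass the loop index i (pages are 1-based), not n,
                    // so every page is read instead of the last page n times.
                    pdfText.append(PdfTextExtractor.getTextFromPage(reader, i));
                }
                System.out.println(pdfText);
                return pdfText.toString();
            } finally {
                // Close the reader even if extraction throws.
                reader.close();
            }
        }
        catch (Exception e)
        {
            System.out.print(e);
            return null;
        }
    }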

From here, I could scan through the text.
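
The scanning step itself is not shown here. As a minimal sketch, assuming "scan" means a case-insensitive search for a term in the extracted text, it could look like the following. The class name `TextScan`, the method `findAll`, and the search term are hypothetical and only serve as illustration.

```java
import java.util.ArrayList;
import java.util.List;

class TextScan {

    /**
     * Returns the start index of every case-insensitive occurrence of term
     * in text, including overlapping ones. A null text (which the extraction
     * method returns on failure) yields an empty list.
     */
    static List<Integer> findAll(String text, String term) {
        List<Integer> hits = new ArrayList<>();
        if (text == null || term == null || term.isEmpty()) {
            return hits;
        }
        for (int i = 0; i + term.length() <= text.length(); i++) {
            // regionMatches with ignoreCase=true compares in place,
            // so the indices refer to the original text
            if (text.regionMatches(true, i, term, 0, term.length())) {
                hits.add(i);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Hypothetical sample text standing in for extracted PDF text
        System.out.println(findAll("Invoice total. INVOICE date.", "invoice"));
    }
}
```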

I now need to do the same thing, but against a Dropbox file system. However, I can only find a way to get the metadata of each file, not the actual file that I need in order to extract the text.

Is there a way to get the file so I can call this method on it to extract the text, or to extract the text directly from the Dropbox file?

Edit: I am already working with the Dropbox API (though I might be missing some methods, since I haven't read through much of the documentation). I am aware of the download method, but I don't want to use it: we will be working with around 1 GB of PDFs, and downloading all of that would be very inefficient.

Dropbox does offer an API you can use for listing, uploading, and downloading files, among other operations. You can find everything you need to get started with the Dropbox API, including documentation, tutorials, and SDKs here:

https://www.dropbox.com/developers

For Java specifically, we recommend you use the official Dropbox Java SDK:

https://github.com/dropbox/dropbox-sdk-java

To download a file's contents using that, you can use the download method:

https://dropbox.github.io/dropbox-sdk-java/api-docs/v5.2.0/com/dropbox/core/v2/files/DbxUserFilesRequests.html#download(java.lang.String)

You can find an example of that here:

https://github.com/dropbox/dropbox-sdk-java/blob/e52fc828c7c753e04c3fa9d47ab6de7e85d000c4/examples/tutorial/src/main/java/com/dropbox/core/examples/tutorial/Main.java#L54
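
On the concern about handling around 1 GB of PDFs: a download does not have to be written to disk. The sketch below processes one file at a time from a stream, so only one file's stream is open at any moment. It is an illustration under stated assumptions, not a definitive implementation. `FileSource` and `TextExtractor` are hypothetical interfaces introduced here. In real code the first would likely wrap `client.files().download(path).getInputStream()` from the SDK, and the second would pass the stream to a PDF library such as iText (whose `PdfReader` also accepts an `InputStream`). The `main` method uses in-memory stand-ins so the sketch runs without either library.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class StreamingExtraction {

    /** Hypothetical: opens a stream for a remote path (e.g. wrapping the SDK download call). */
    interface FileSource {
        InputStream open(String path) throws IOException;
    }

    /** Hypothetical: turns a PDF byte stream into text (e.g. wrapping a PDF library). */
    interface TextExtractor {
        String extract(InputStream in) throws IOException;
    }

    /** Extracts text file by file; each stream is closed before the next one is opened. */
    static Map<String, String> extractAll(List<String> paths, FileSource source,
                                          TextExtractor extractor) throws IOException {
        Map<String, String> textByPath = new LinkedHashMap<>();
        for (String path : paths) {
            // try-with-resources closes the stream even if extraction throws
            try (InputStream in = source.open(path)) {
                textByPath.put(path, extractor.extract(in));
            }
        }
        return textByPath;
    }

    public static void main(String[] args) throws IOException {
        // In-memory stand-ins; the paths and contents are made up for the demo
        FileSource fakeSource = path ->
                new ByteArrayInputStream(("contents of " + path).getBytes(StandardCharsets.UTF_8));
        TextExtractor fakeExtractor = in ->
                new String(in.readAllBytes(), StandardCharsets.UTF_8);

        Map<String, String> result =
                extractAll(List.of("/a.pdf", "/b.pdf"), fakeSource, fakeExtractor);
        result.forEach((path, text) -> System.out.println(path + " -> " + text));
    }
}
```

Keeping the download and the extraction behind small interfaces also makes it easy to swap in a different PDF library or to test the loop without network access.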
