
Extract text from a PDF file on an FTP server using iTextSharp

I am working on a document management project and I want to extract text from PDF files. How can I achieve this? I am already using iTextSharp to extract text from PDFs on the local system.

This is the function I am using for this purpose. Here, path is an FTP server path.

public static string ExtractTextFromPdf(string path)
{
    using (PdfReader reader = new PdfReader(path))
    {
        StringBuilder text = new StringBuilder();

        for (int i = 1; i <= reader.NumberOfPages; i++)
        {
            text.Append(PdfTextExtractor.GetTextFromPage(reader, i));
        }

        return text.ToString();
    }
}

It throws an exception:

'ftp:\\###\index\500199.pdf not found as file or resource.'

[### is my ftp server]

PdfReader has a bunch of constructor overloads, but most of them rely on RandomAccessSourceFactory to convert whatever is passed in into a stream. When you pass in a string, iText first checks whether it is a file on disk; if not, it checks whether it can be converted to a Uri as a file:/, http:// or https:// link. This is your first point of failure: none of these checks handle the ftp protocol, so you ultimately end up at a local resource loader, which doesn't work for you.

You could try converting your string to an explicit Uri but that actually won't work, either:

//This won't work
new PdfReader(new Uri(path))

This won't work because iText tells .NET to use CredentialCache.DefaultCredentials when loading remote resources, but that concept doesn't exist in the FTP world.

Long story short, when using FTP you'll want to download the files on your own. Depending on their size, you'll want to download them either to disk or into a byte array. Below is a sample of the latter:

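byte[] bytes = null;
// Only ftp:// URLs are handled here; for any other path, bytes stays null
if (path.StartsWith("ftp://", StringComparison.OrdinalIgnoreCase)) {
    // WebRequest.Create returns an FtpWebRequest for ftp:// URLs
    var request = WebRequest.Create(path);
    using (var response = request.GetResponse()) {
        using (var responseStream = response.GetResponseStream()) {
            // Read the entire response stream into memory
            bytes = iTextSharp.text.io.StreamUtil.InputStreamToArray(responseStream);
        }
    }
}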
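
For larger files, the download-to-disk option could look roughly like the sketch below. The class name FtpDownload, the method name DownloadToTempFile and the optional credentials parameter are hypothetical and only serve as illustration. The sketch assumes the classic WebRequest API, which newer .NET versions mark as obsolete, and it leaves cleanup of the temporary file to the caller.

```csharp
using System;
using System.IO;
using System.Net;

public static class FtpDownload
{
    // Hypothetical helper: copies the remote file into a local temp file
    // and returns the local path. The caller should delete it when done.
    public static string DownloadToTempFile(string url, ICredentials credentials = null)
    {
        // For ftp:// URLs this creates an FtpWebRequest, whose default
        // method is a file download.
        WebRequest request = WebRequest.Create(url);
        if (credentials != null)
        {
            request.Credentials = credentials;
        }

        string localPath = Path.GetTempFileName();
        using (WebResponse response = request.GetResponse())
        using (Stream remote = response.GetResponseStream())
        using (FileStream local = File.Create(localPath))
        {
            // Stream the content to disk instead of buffering it in memory.
            remote.CopyTo(local);
        }
        return localPath;
    }
}
```

If the server requires a login, a NetworkCredential with the user name and password can be passed as the second argument.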

You can then pass either the local file or the byte array to the PdfReader constructor.
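
Putting the pieces together, the function from the question might be adapted along these lines. This is a minimal sketch, assuming iTextSharp 5.x and a path written as a proper ftp:// URL with forward slashes (the backslash form shown in the error message would not match the prefix check). The class name PdfText, the helper IsFtpPath and the optional credentials parameter are hypothetical additions.

```csharp
using System;
using System.Net;
using System.Text;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

public static class PdfText
{
    // Hypothetical helper: true when the path uses the ftp:// scheme.
    public static bool IsFtpPath(string path)
    {
        return path != null
            && path.StartsWith("ftp://", StringComparison.OrdinalIgnoreCase);
    }

    public static string ExtractTextFromPdf(string path, ICredentials credentials = null)
    {
        PdfReader reader;
        if (IsFtpPath(path))
        {
            // Download the file ourselves, since PdfReader cannot open ftp:// paths.
            WebRequest request = WebRequest.Create(path);
            if (credentials != null)
            {
                request.Credentials = credentials;
            }

            byte[] bytes;
            using (WebResponse response = request.GetResponse())
            using (var stream = response.GetResponseStream())
            {
                bytes = iTextSharp.text.io.StreamUtil.InputStreamToArray(stream);
            }
            reader = new PdfReader(bytes);
        }
        else
        {
            // Local files (and http/https links) go through the string overload.
            reader = new PdfReader(path);
        }

        using (reader)
        {
            var text = new StringBuilder();
            for (int i = 1; i <= reader.NumberOfPages; i++)
            {
                text.Append(PdfTextExtractor.GetTextFromPage(reader, i));
            }
            return text.ToString();
        }
    }
}
```

A call could then look like PdfText.ExtractTextFromPdf("ftp://###/index/500199.pdf", new NetworkCredential(user, password)), where ### stands for the FTP server as in the question and user and password are placeholders.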


 