C#: How can I get the text of a PDF from its page URL?

For example, a web page hosts a PDF document, and I want to read all of the text from it.

PDFBox is a Java PDF library that you can also use from C#.

Follow these steps:

1. Unzip the package "PDFBox.zip" to get:

IKVM.GNU.Classpath.dll
PDFBox-0.7.3.dll
FontBox-0.1.0-dev.dll
IKVM.Runtime.dll

2. Add references to these DLLs in your C# project, then import the namespaces:

using org.pdfbox.pdmodel;
using org.pdfbox.util;

3. Write code along these lines:

using System.IO;
using System.Text;
using org.pdfbox.pdmodel;
using org.pdfbox.util;

namespace PDFReader
{
    class Program
    {
        // Extracts all text from pdffile and writes it to txtfile.
        public static void pdf2txt(FileInfo pdffile, FileInfo txtfile)
        {
            PDDocument doc = PDDocument.load(pdffile.FullName);
            try
            {
                PDFTextStripper pdfStripper = new PDFTextStripper();
                string text = pdfStripper.getText(doc);

                // gb2312 suits Simplified Chinese text; consider Encoding.UTF8 for other content.
                using (StreamWriter swPdfChange = new StreamWriter(txtfile.FullName, false, Encoding.GetEncoding("gb2312")))
                {
                    swPdfChange.Write(text);
                }
            }
            finally
            {
                // Release the resources held by the PDF document.
                doc.close();
            }
        }

        static void Main(string[] args)
        {
            pdf2txt(new FileInfo(@"C:/Users/yourpdf.pdf"), new FileInfo(@"C:/Users/yourcontent.txt"));
        }
    }
}

Hope this can help you.

// First, pass the URL of the PDF (for example, one hosted on www.abc.com) to this method to download it as a byte array.

// Requires: using System.Net;
public byte[] GetByteArray(string sourcePath)
{
    // No try/catch here: a download failure propagates with its original stack trace
    // (the earlier "throw ex;" would have reset it).
    using (WebClient wc = new WebClient())
    {
        return wc.DownloadData(sourcePath);
    }
}
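
// A URL that looks like a PDF link can still return an HTML page (a viewer, a login form, an error page), and a PDF parser will then fail on the downloaded bytes. The sketch below checks for the "%PDF-" marker that PDF files start with before any parsing is attempted. The class name PdfSniffer and the method LooksLikePdf are hypothetical helpers, not part of PDFBox or iTextSharp, and the 1024-byte scan window is an assumption based on how tolerant readers treat leading junk.

```csharp
using System;
using System.Text;

// Hypothetical helper: not part of any PDF library.
static class PdfSniffer
{
    // "%PDF-" in ASCII; PDF files begin with this marker.
    private static readonly byte[] Marker = Encoding.ASCII.GetBytes("%PDF-");

    // Returns true if the marker occurs within the first 1024 bytes of data.
    public static bool LooksLikePdf(byte[] data)
    {
        if (data == null) return false;

        // Last offset at which the whole marker still fits inside the scan window.
        int limit = Math.Min(data.Length, 1024) - Marker.Length;
        for (int start = 0; start <= limit; start++)
        {
            int k = 0;
            while (k < Marker.Length && data[start + k] == Marker[k]) k++;
            if (k == Marker.Length) return true;
        }
        return false;
    }
}
```

// Usage: call PdfSniffer.LooksLikePdf(outBytes) on the downloaded byte array and only hand it to a PDF parser when the result is true; otherwise the URL most likely points at a web page rather than the PDF file itself.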

// The method above returns a byte array. Pass that array to the method below, which uses iTextSharp.dll to extract the text.
// The library can be downloaded from https://sourceforge.net/projects/itextsharp/

// Requires: using System.Text; using iTextSharp.text.pdf; using iTextSharp.text.pdf.parser;
public string[] GetLines(byte[] outBytes)
{
    StringBuilder resultPdfText = new StringBuilder();
    using (PdfReader pdfr = new PdfReader(outBytes))
    {
        // iTextSharp page numbers start at 1.
        for (int i = 1; i <= pdfr.NumberOfPages; i++)
        {
            resultPdfText.Append(PdfTextExtractor.GetTextFromPage(pdfr, i, new LocationTextExtractionStrategy()));
            // Separate pages so the last line of one page does not merge with the first line of the next.
            if (i < pdfr.NumberOfPages)
            {
                resultPdfText.Append('\n');
            }
        }
    }
    return resultPdfText.ToString().Split('\n');
}

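If you want to load the PDF directly from an online source with PDFBox, add the namespaces below. Note that the org.apache.pdfbox names belong to newer Apache PDFBox builds (the 0.7.3 build in the earlier answer uses org.pdfbox), and java.net is provided by the IKVM class library: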

using System.IO;
using org.apache.pdfbox.pdmodel;
using org.apache.pdfbox.util;
using System.Text;
using java.net;

Then, in the code, replace the file-based load call with one that takes a new URL(), like this:

        PDDocument doc = PDDocument.load((new URL("http://www.pdf995.com/samples/pdf.pdf")));

        PDFTextStripper pdfStripper = new PDFTextStripper();

        string text = pdfStripper.getText(doc);
