
When I read a PDF from a URL in C#, English text is extracted correctly but text in other languages is garbled

This code extracts text correctly only from English PDFs. Text in any other language comes out garbled. How can I solve this problem?

Below is my code:

using System;
using System.Net;
using System.Text;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

private string PDFReader(string url)
{
    StringBuilder text = new StringBuilder();
    PdfReader pdfReader;

    try
    {
        ServicePointManager.Expect100Continue = true;
        ServicePointManager.SecurityProtocol = SecurityProtocolType.Tls12;
        url = "http://www.openprocurement.al/tenders/shpallje/29357.pdf";
        pdfReader = new PdfReader(url);
        for (int page = 1; page <= pdfReader.NumberOfPages; page++)
        {
            ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
            string currentText = PdfTextExtractor.GetTextFromPage(pdfReader, page, strategy);
            if (currentText.Contains("Page " + page.ToString()))
            {
                currentText = currentText.Replace("Page " + page.ToString(), "♥♥");
            }
            currentText = Encoding.UTF8.GetString(Encoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.UTF8.GetBytes(currentText)));
            text.Append("\n----------------------------------------------------------------------\n");
            text.Append(currentText);
        }
        pdfReader.Close();
    }
    catch (Exception ex)
    {

    }

    return text.Replace("‘", "‘").Replace("’", "’").Replace("–", "–").ToString();
}

.NET strings are Unicode, specifically UTF-16. The text returned by the extractor doesn't need any kind of conversion.

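The problems are caused by the Encoding.Convert line. It encodes the string as UTF-8 bytes, then converts those bytes with Encoding.Default as the source encoding, which treats them as if they were in the local machine's code page (which they aren't, they are UTF-8). That's what produces the â€ strings too: each multi-byte UTF-8 sequence is read as several single-byte characters in the local code page (most likely a Western European one such as Windows-1252).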
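To make the effect concrete, here is a minimal sketch that repeats the same round trip on a short string. The class name MojibakeDemo, the method RoundTrip and the sample word are hypothetical and only for illustration. ISO-8859-1 is used as a stand-in for Encoding.Default, on the assumption that the machine uses a Western European code page.

```csharp
using System;
using System.Text;

static class MojibakeDemo
{
    // Repeats the conversion from the question. 'assumedLocal' plays the
    // role of Encoding.Default (an assumption for this sketch).
    public static string RoundTrip(string input, Encoding assumedLocal)
    {
        // Step 1: the correct Unicode string becomes UTF-8 bytes.
        byte[] utf8Bytes = Encoding.UTF8.GetBytes(input);

        // Step 2: Encoding.Convert decodes those bytes with 'assumedLocal',
        // the wrong source encoding, and re-encodes the result as UTF-8.
        byte[] converted = Encoding.Convert(assumedLocal, Encoding.UTF8, utf8Bytes);

        // Step 3: decoding as UTF-8 now yields the damaged text.
        return Encoding.UTF8.GetString(converted);
    }

    static void Main()
    {
        Encoding latin1 = Encoding.GetEncoding("iso-8859-1");
        string original = "Shërbim";   // contains ë (U+00EB, UTF-8 bytes C3 AB)

        string garbled = RoundTrip(original, latin1);

        // garbled is "ShÃ«rbim": the two bytes of ë were read as two characters.
        Console.WriteLine(original.Length);     // 7
        Console.WriteLine(garbled.Length);      // 8
        Console.WriteLine(original == garbled); // False
    }
}
```

Plain ASCII letters survive the round trip because their UTF-8 bytes are the same single bytes in ISO-8859-1, which is why English-only PDFs looked fine.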

This code extracts the text without any conversion issues:

using System.Text;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

static string GetPdfText(string url)
{
    var separator = "\n----------------------------------------------------------------------\n";
    var text = new StringBuilder();

    using (var pdfReader = new PdfReader(url))
    {
        for (int page = 1; page <= pdfReader.NumberOfPages; page++)
        {
            // Use a fresh strategy for each page: SimpleTextExtractionStrategy
            // accumulates text, so a shared instance would repeat earlier pages.
            var strategy = new SimpleTextExtractionStrategy();
            var currentText = PdfTextExtractor.GetTextFromPage(pdfReader, page, strategy);
            text.Append(separator);
            text.Append(currentText);
        }
    }
    return text.ToString();
}
