When I read a PDF from a URL in C#, English text is extracted correctly but text in other languages is garbled

Question

This code converts only English PDF content to readable text. I want it to handle PDFs in any other language as well. How can I solve this problem?

Below is my code:

using System;
using System.Net;
using System.Text;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

private string PDFReader(string url)
{
    StringBuilder text = new StringBuilder();
    PdfReader pdfReader;

    try
    {
        ServicePointManager.Expect100Continue = true;
        ServicePointManager.SecurityProtocol = SecurityProtocolType.Tls12;
        url = "http://www.openprocurement.al/tenders/shpallje/29357.pdf";
        pdfReader = new PdfReader(url);
        for (int page = 1; page <= pdfReader.NumberOfPages; page++)
        {
            ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
            string currentText = PdfTextExtractor.GetTextFromPage(pdfReader, page, strategy);
            if (currentText.Contains("Page " + page.ToString()))
            {
                currentText = currentText.Replace("Page " + page.ToString(), "♥♥");
            }
            currentText = Encoding.UTF8.GetString(Encoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.UTF8.GetBytes(currentText)));
            text.Append("\n----------------------------------------------------------------------\n");
            text.Append(currentText);
        }
        pdfReader.Close();
    }
    catch (Exception ex)
    {
    }

    return text.Replace("â€˜", "‘").Replace("â€™", "’").Replace("â€“", "–").ToString();
}

Answer 1

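.NET strings are Unicode, specifically UTF-16. They don't need any kind of conversion.

The problems are caused by the conversion line: it encodes the text as UTF-8 bytes, then tells Encoding.Convert that those bytes are in the local machine's code page (Encoding.Default), which they aren't, and decodes the result. Every multi-byte UTF-8 sequence is therefore read as several single-byte characters. That's what produces the â€ strings too: the three UTF-8 bytes of a curly quote are interpreted as separate characters in the machine's ANSI code page (most likely Windows-1252, Western European).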
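To see the effect in isolation, here is a minimal sketch that repeats the question's round trip on a short string. It is an illustration, not part of the original code: the class and method names are made up, and ISO-8859-1 is used as a stand-in for Encoding.Default so the result is the same on every machine.

```csharp
using System;
using System.Text;

public static class MojibakeDemo
{
    // Repeats the question's conversion line, with ISO-8859-1 standing in
    // for Encoding.Default (an assumption made to keep the demo deterministic).
    public static string Garble(string text)
    {
        Encoding ansi = Encoding.GetEncoding("ISO-8859-1");
        byte[] utf8Bytes = Encoding.UTF8.GetBytes(text);

        // The bytes really are UTF-8, but Convert is told they are "ansi".
        byte[] converted = Encoding.Convert(ansi, Encoding.UTF8, utf8Bytes);
        return Encoding.UTF8.GetString(converted);
    }

    public static void Main()
    {
        Console.OutputEncoding = Encoding.UTF8;
        Console.WriteLine(Garble("tender")); // ASCII-only text is unchanged
        Console.WriteLine(Garble("ë"));      // one character becomes two: Ã«
    }
}
```

ASCII characters occupy a single byte in both encodings, which is why the English text looked fine while accented letters such as the Albanian ë did not.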

This code extracts the text without any conversion issues:

using System.Text;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

static string GetPdfText(string url)
{
    var separator = "\n----------------------------------------------------------------------\n";
    var text = new StringBuilder();

    using (var pdfReader = new PdfReader(url))
    {
        for (int page = 1; page <= pdfReader.NumberOfPages; page++)
        {
            // Create a fresh strategy per page: SimpleTextExtractionStrategy
            // accumulates text, so reusing one instance repeats earlier pages.
            var strategy = new SimpleTextExtractionStrategy();
            var currentText = PdfTextExtractor.GetTextFromPage(pdfReader, page, strategy);
            text.Append(separator);
            text.Append(currentText);
        }
    }
    return text.ToString();
}
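Because the extracted string is already Unicode, an encoding only has to be named when the text leaves memory, for example when it is saved to a file. The sketch below is an assumption about how the result might be used, not something from the question; the class name, method name and sample string are made up for illustration.

```csharp
using System;
using System.IO;
using System.Text;

public static class SaveTextDemo
{
    // Writes the text to a temporary file as UTF-8 and reads it back.
    public static string RoundTrip(string text)
    {
        string path = Path.GetTempFileName();
        try
        {
            // The encoding is stated once, at the boundary between string and bytes.
            File.WriteAllText(path, text, Encoding.UTF8);
            return File.ReadAllText(path, Encoding.UTF8);
        }
        finally
        {
            File.Delete(path);
        }
    }

    public static void Main()
    {
        string sample = "çmimi i ofertës"; // made-up sample with non-ASCII letters
        Console.WriteLine(RoundTrip(sample) == sample);
    }
}
```

The same rule applies to any other output, such as a network stream or a database column: pick the encoding where the bytes are produced and leave the in-memory string alone.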

Answer 2

Please try this:

Using the WhatsMate PDF-to-Text REST API

2 answers

solution1
0 2019-09-25 11:48:20

solution2
0 2019-09-28 10:23:15
