This code only extracts text correctly from English PDFs. I want it to handle PDFs in any other language as well. How can I solve this problem?
Below is my code:
using System;
using System.Net;
using System.Text;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

private string PDFReader(string url)
{
    StringBuilder text = new StringBuilder();
    PdfReader pdfReader;
    try
    {
        ServicePointManager.Expect100Continue = true;
        ServicePointManager.SecurityProtocol = SecurityProtocolType.Tls12;
        url = "http://www.openprocurement.al/tenders/shpallje/29357.pdf";
        pdfReader = new PdfReader(url);
        for (int page = 1; page <= pdfReader.NumberOfPages; page++)
        {
            ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
            string currentText = PdfTextExtractor.GetTextFromPage(pdfReader, page, strategy);
            if (currentText.Contains("Page " + page.ToString()))
            {
                currentText = currentText.Replace("Page " + page.ToString(), "♥♥");
            }
            currentText = Encoding.UTF8.GetString(Encoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.UTF8.GetBytes(currentText)));
            text.Append("\n----------------------------------------------------------------------\n");
            text.Append(currentText);
        }
        pdfReader.Close();
    }
    catch (Exception ex)
    {
    }
    return text.Replace("â€˜", "‘").Replace("â€™", "’").Replace("â€“", "–").ToString();
}
.NET strings are Unicode, specifically UTF-16. They don't need any kind of conversion.
The problems are caused by the Encoding.Convert line: it encodes the text as UTF-8, then decodes those bytes as if they were in the local machine's default code page (which they aren't, they're UTF-8) and re-encodes the result. That's what produces the â€ strings too: each byte of a multi-byte UTF-8 sequence is read as a separate single-byte character (most likely in a Western European code page).
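The effect can be reproduced without any PDF. The following is only a minimal sketch: it assumes ISO-8859-1 as a stand-in for the machine's default code page (the real one depends on the machine), and `MojibakeDemo` and `Garble` are hypothetical names used for illustration, not part of iTextSharp.

```csharp
using System;
using System.Text;

static class MojibakeDemo
{
    // Mirrors the conversion line from the question, with the code page
    // passed in explicitly instead of relying on Encoding.Default.
    public static string Garble(string input, Encoding assumedCodePage)
    {
        // Step 1: the UTF-16 string is encoded as UTF-8 bytes.
        byte[] utf8Bytes = Encoding.UTF8.GetBytes(input);

        // Step 2: those UTF-8 bytes are wrongly treated as single-byte
        // code page data and re-encoded as UTF-8.
        byte[] converted = Encoding.Convert(assumedCodePage, Encoding.UTF8, utf8Bytes);

        // Step 3: decoding gives one character per original UTF-8 byte.
        return Encoding.UTF8.GetString(converted);
    }

    static void Main()
    {
        // Assumption: ISO-8859-1 stands in for the local default code page.
        Encoding latin1 = Encoding.GetEncoding("iso-8859-1");

        string original = "Shqip\u00EBri";            // "Shqipëri"
        string garbled = Garble(original, latin1);

        // "ë" is two bytes in UTF-8 (C3 AB), so it turns into two characters.
        Console.WriteLine(garbled == "Shqip\u00C3\u00ABri");
        Console.WriteLine(garbled.Length - original.Length);
    }
}
```

Plain ASCII text passes through this round trip unchanged, which would explain why only the non-English PDFs appear broken.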
This code extracts the text without any conversion issues:
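static string GetPdfText(string url)
{
    var separator = "\n----------------------------------------------------------------------\n";
    var text = new StringBuilder();
    // PdfReader is disposed automatically at the end of the using block.
    using (var pdfReader = new PdfReader(url))
    {
        for (int page = 1; page <= pdfReader.NumberOfPages; page++)
        {
            // Use a fresh strategy per page: SimpleTextExtractionStrategy
            // accumulates text, so a shared instance would repeat earlier pages.
            var strategy = new SimpleTextExtractionStrategy();
            // The returned string is already Unicode; no re-encoding is needed.
            var currentText = PdfTextExtractor.GetTextFromPage(pdfReader, page, strategy);
            text.Append(separator);
            text.Append(currentText);
        }
    }
    return text.ToString();
}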
Please try this.