When I read a PDF from a URL in C#, English text is extracted correctly but text in other languages is garbled
This code only extracts English PDF text correctly. I want it to handle text in any other language as well. How can I solve this problem?
Below is my code:
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

private string PDFReader(string url)
{
    StringBuilder text = new StringBuilder();
    PdfReader pdfReader;
    try
    {
        ServicePointManager.Expect100Continue = true;
        ServicePointManager.SecurityProtocol = SecurityProtocolType.Tls12;
        url = "http://www.openprocurement.al/tenders/shpallje/29357.pdf";
        pdfReader = new PdfReader(url);
        for (int page = 1; page <= pdfReader.NumberOfPages; page++)
        {
            ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
            string currentText = PdfTextExtractor.GetTextFromPage(pdfReader, page, strategy);
            if (currentText.Contains("Page " + page.ToString()))
            {
                currentText = currentText.Replace("Page " + page.ToString(), "♥♥");
            }
            currentText = Encoding.UTF8.GetString(Encoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.UTF8.GetBytes(currentText)));
            text.Append("\n----------------------------------------------------------------------\n");
            text.Append(currentText);
        }
        pdfReader.Close();
    }
    catch (Exception ex)
    {
    }
    return text.Replace("‘", "‘").Replace("’", "’").Replace("–", "–").ToString();
}
.NET strings are Unicode, specifically UTF-16. They don't need any kind of conversion.
The problems are caused by the Encoding.Convert line. It encodes the extracted text as UTF-8 bytes, then reads those bytes back as if they were in the local machine's default code page (Encoding.Default), which they are not. That's what produces the strings starting with â€ too: each multi-byte UTF-8 sequence is decoded one byte at a time using a single-byte code page (most likely Western European), so one character turns into two or three wrong ones.
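The garbling described above can be reproduced without any PDF. The following is a minimal sketch, not code from the question: the class name MojibakeDemo and its method names are made up for illustration, and code page 1252 is an assumption standing in for Encoding.Default on a Western European machine. On .NET Core / .NET 5+, code page 1252 is only available after registering CodePagesEncodingProvider from the System.Text.Encoding.CodePages package.

```csharp
using System.Text;

// Hypothetical helper for illustration only.
static class MojibakeDemo
{
    // Assumption: Windows-1252 plays the role of Encoding.Default.
    private static Encoding Ansi()
    {
        return Encoding.GetEncoding(1252);
    }

    // Encode as UTF-8, then decode those bytes as code page 1252.
    public static string Garble(string s)
    {
        byte[] utf8Bytes = Encoding.UTF8.GetBytes(s);
        return Ansi().GetString(utf8Bytes);
    }

    // The same chain of calls as in the question, with Encoding.Default
    // replaced by the assumed code page. Encoding.Convert decodes the
    // bytes with the source encoding and re-encodes them with the
    // destination encoding, so the result equals Garble(s).
    public static string QuestionRoundTrip(string s)
    {
        byte[] converted = Encoding.Convert(Ansi(), Encoding.UTF8, Encoding.UTF8.GetBytes(s));
        return Encoding.UTF8.GetString(converted);
    }
}
```

Under these assumptions, MojibakeDemo.Garble("\u2018") (a left single quotation mark, three bytes in UTF-8) returns the three characters â€˜, while plain ASCII such as "Page 1" comes back unchanged. That would explain why English PDFs looked fine and other languages did not.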
This code extracts the text without any conversion issues:
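using System.Text;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

static string GetPdfText(string url)
{
    var separator = "\n----------------------------------------------------------------------\n";
    var text = new StringBuilder();
    using (var pdfReader = new PdfReader(url))
    {
        for (int page = 1; page <= pdfReader.NumberOfPages; page++)
        {
            // Use a fresh strategy for each page: SimpleTextExtractionStrategy
            // accumulates text, so a shared instance would repeat earlier pages.
            var strategy = new SimpleTextExtractionStrategy();
            // The returned string is already Unicode; no Encoding.Convert is needed.
            var currentText = PdfTextExtractor.GetTextFromPage(pdfReader, page, strategy);
            text.Append(separator);
            text.Append(currentText);
        }
    }
    return text.ToString();
}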
Please try this:
Using the WhatsMate PDF-to-Text REST API