When I read a PDF from a URL, English text is extracted correctly but text in other languages is garbled (C#)

This code only converts English PDF content into readable text. I want it to work for text in any other language as well. How can I solve this problem?

Below is my code:

using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

private string PDFReader(string url)
{
    StringBuilder text = new StringBuilder();
    PdfReader pdfReader;

    try
    {
        ServicePointManager.Expect100Continue = true;
        ServicePointManager.SecurityProtocol = SecurityProtocolType.Tls12;
        url = "http://www.openprocurement.al/tenders/shpallje/29357.pdf";
        pdfReader = new PdfReader(url);
        for (int page = 1; page <= pdfReader.NumberOfPages; page++)
        {
            ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
            string currentText = PdfTextExtractor.GetTextFromPage(pdfReader, page, strategy);
            if (currentText.Contains("Page " + page.ToString()))
            {
                currentText = currentText.Replace("Page " + page.ToString(), "♥♥");
            }
            currentText = Encoding.UTF8.GetString(Encoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.UTF8.GetBytes(currentText)));
            text.Append("\n----------------------------------------------------------------------\n");
            text.Append(currentText);
        }
        pdfReader.Close();
    }
    catch (Exception ex)
    {
    }

    return text.Replace("‘", "‘").Replace("’", "’").Replace("–", "–").ToString();
}

.NET strings are Unicode, specifically UTF-16. They don't need any kind of conversion.

The problems are caused by the `Encoding.Convert` line: it encodes the Unicode text as UTF-8 bytes and then decodes those bytes as if they were in the local machine's code page (`Encoding.Default`), which they are not. That's what produces the â€ strings too: each multi-byte UTF-8 sequence is read as several single-byte characters (most likely in a Western European code page).
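To make the effect concrete, here is a minimal sketch of the same round trip in isolation. It is not part of the original code: the class and method names (`MojibakeDemo`, `Garble`, `Repair`) are hypothetical, and ISO-8859-1 stands in for `Encoding.Default` as an assumption, so that the result does not depend on the machine's locale.

```csharp
using System;
using System.Text;

static class MojibakeDemo
{
    // Assumption: ISO-8859-1 plays the role of Encoding.Default (a single-byte,
    // Western European code page) so the demo behaves the same on every machine.
    static readonly Encoding SingleByte = Encoding.GetEncoding("iso-8859-1");

    // Same shape as the conversion in the question: UTF-8 bytes are passed to
    // Encoding.Convert with the single-byte code page declared as their source.
    public static string Garble(string text)
    {
        byte[] utf8Bytes = Encoding.UTF8.GetBytes(text);
        byte[] converted = Encoding.Convert(SingleByte, Encoding.UTF8, utf8Bytes);
        return Encoding.UTF8.GetString(converted);
    }

    // Undoes the damage by re-encoding with the single-byte code page and
    // decoding the resulting bytes as the UTF-8 they originally were.
    public static string Repair(string garbled)
    {
        return Encoding.UTF8.GetString(SingleByte.GetBytes(garbled));
    }

    static void Main()
    {
        string original = "Shqip\u00EBri";          // "Shqipëri"
        string garbled = Garble(original);

        // ë (U+00EB) is the UTF-8 bytes C3 AB, which read back as the two characters Ã and «.
        Console.WriteLine(garbled == "Shqip\u00C3\u00ABri"); // True
        Console.WriteLine(garbled.Length);                   // 9 (the original has 8 characters)
        Console.WriteLine(Repair(garbled) == original);      // True
    }
}
```

Plain ASCII text survives this round trip unchanged, because ASCII characters are single bytes in both encodings. That is why the English PDF looked fine while other languages did not.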

This code extracts the text without any conversion issues:

static string GetPdfText(string url)
{
    var separator = "\n----------------------------------------------------------------------\n";
    var text = new StringBuilder();

    using (var pdfReader = new PdfReader(url))
    {
        for (int page = 1; page <= pdfReader.NumberOfPages; page++)
        {
            // Use a fresh strategy for each page: SimpleTextExtractionStrategy
            // accumulates text, so a shared instance would repeat earlier pages.
            var strategy = new SimpleTextExtractionStrategy();

            // The returned string is already Unicode; no re-encoding is needed.
            var currentText = PdfTextExtractor.GetTextFromPage(pdfReader, page, strategy);
            text.Append(separator);
            text.Append(currentText);
        }
    }
    return text.ToString();
}

