Hebrew content not displayed when converting HTML to PDF using iTextSharp 5.5.8?

I am using the code below to convert an HTML file to PDF using iTextSharp:

    Document doc = new Document(iTextSharp.text.PageSize.A4, 10, 20, 5, 35);
    var writer = PdfWriter.GetInstance(doc, new FileStream(savePath, FileMode.Create));

    var xmlWorkerFontProvider = new XMLWorkerFontProvider();
    var cssAppliers = new CssAppliersImpl(new MyFontProvider());
    CssFilesImpl cssFiles = new CssFilesImpl();
    StyleAttrCSSResolver cssResolver = new StyleAttrCSSResolver(cssFiles);

    HtmlPipelineContext htmlContext = new HtmlPipelineContext(cssAppliers);
    htmlContext.SetTagFactory(Tags.GetHtmlTagProcessorFactory());
    htmlContext.SetImageProvider(new ITextImageHandler());

    IPipeline pipeline = new CssResolverPipeline(cssResolver, new HtmlPipeline(htmlContext, new PdfWriterPipeline(doc, writer)));
    XMLWorker worker = new XMLWorker(pipeline, true);
    XMLParser xmlParser = new XMLParser(true, worker, Encoding.Unicode);

    doc.Open();
    doc.NewPage();
    xmlParser.Parse(new StringReader(htmlString.ToString()));
    doc.Close();

For English content this works fine. But if the content is in Hebrew, the text is not displayed in the PDF.

I have checked other answers related to this on Stack Overflow, but they seem to use HtmlParser, which is deprecated, so I don't want to use that.

Please let me know if anything else is required. Thanks for your time.

Edit: After reading the comments I have tried setting the fonts as well, but still no luck. Below is the updated code.

    Document document = new Document();

    PdfWriter writer =
        PdfWriter.GetInstance(document, new FileStream(savePath, FileMode.Create));

    document.Open();

    var cssResolver = new StyleAttrCSSResolver();
    XMLWorkerFontProvider fontProvider =
        new XMLWorkerFontProvider(XMLWorkerFontProvider.DONTLOOKFORFONTS);
    fontProvider.Register(@"E:\fonts\NotoSansHebrew-Regular.ttf");

    CssAppliers cssAppliers = new CssAppliersImpl(fontProvider);
    HtmlPipelineContext htmlContext = new HtmlPipelineContext(cssAppliers);
    htmlContext.SetTagFactory(Tags.GetHtmlTagProcessorFactory());
    htmlContext.SetImageProvider(new ITextImageHandler());

    PdfWriterPipeline pdf = new PdfWriterPipeline(document, writer);
    HtmlPipeline html = new HtmlPipeline(htmlContext, pdf);
    CssResolverPipeline css = new CssResolverPipeline(cssResolver, html);

    XMLWorker worker = new XMLWorker(css, true);
    XMLParser p = new XMLParser(worker);

    p.Parse(new StringReader(htmlString.ToString()));

    document.Close();

Below is an adaptation of Bruno's code with some actual HTML. To run it, you just need to download the Noto Sans Hebrew font and place it on your desktop. Try running this code without any modifications (except possibly file paths); it works for me. (I tested this against 5.5.5, so 5.5.8 should absolutely work.)

var file = System.IO.Path.Combine(Environment.GetFolderPath(Environment.SpecialFolder.Desktop), "test.pdf");
var fontFile = System.IO.Path.Combine(Environment.GetFolderPath(Environment.SpecialFolder.Desktop), "NotoSansHebrew-Regular.ttf");
var htmlText = @"<div dir=""rtl"" style=""font-family: Noto Sans Hebrew;"">שלום עולם</div>";

using (var FS = new System.IO.FileStream(file, FileMode.Create, FileAccess.Write, FileShare.None)) {
    using (var document = new Document()) {
        using (var writer = PdfWriter.GetInstance(document, FS)) {
            document.Open();

            var cssResolver = new StyleAttrCSSResolver();
            var fontProvider = new XMLWorkerFontProvider(XMLWorkerFontProvider.DONTLOOKFORFONTS);
            fontProvider.Register(fontFile);
            var cssAppliers = new CssAppliersImpl(fontProvider);
            var htmlContext = new HtmlPipelineContext(cssAppliers);
            htmlContext.SetTagFactory(Tags.GetHtmlTagProcessorFactory());

            var pdf = new PdfWriterPipeline(document, writer);
            var html = new HtmlPipeline(htmlContext, pdf);
            var css = new CssResolverPipeline(cssResolver, html);


            var worker = new XMLWorker(css, true);
            var p = new XMLParser(worker);

            using (var ms = new System.IO.MemoryStream(System.Text.Encoding.UTF8.GetBytes(htmlText))) {
                using (var sr = new StreamReader(ms)) {
                    p.Parse(sr);
                }
            }

            document.Close();
        }
    }
}

The trick to this whole thing is to make the font name in your HTML exactly match the name in the font file. What's sometimes confusing is that a font can actually have a bunch of names inside it, and the older the font, the more likely it is to have them. If I remember correctly, iText has some heuristics for determining the font name, but if you want to play it safe you can also just use an alias and call it whatever you want. For instance, you can change the HTML to:

var htmlText = @"<div dir=""rtl"" style=""font-family: Gerp;"">שלום עולם</div>";

And everything will work just fine as long as you alias your font when registering it:

fontProvider.Register(fontFile, "Gerp");
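
If you would rather check which names iText actually picked up from the font file than guess, something like the sketch below may help. It assumes iTextSharp 5.x, where XMLWorkerFontProvider inherits the RegisteredFamilies and RegisteredFonts collections from FontFactoryImp; the class name ListFontNames is made up for illustration, and the font path is the same desktop location used above. I have not run this against 5.5.8 specifically, so treat it as a starting point rather than a guaranteed recipe.

```csharp
using System;
using iTextSharp.tool.xml;

// Hypothetical helper: prints the names the font provider knows about
// after registering a single font file.
class ListFontNames {
    static void Main() {
        // Assumed location, matching the example above.
        var fontFile = System.IO.Path.Combine(
            Environment.GetFolderPath(Environment.SpecialFolder.Desktop),
            "NotoSansHebrew-Regular.ttf");

        // Do not scan system font directories; register only our font.
        var fontProvider = new XMLWorkerFontProvider(XMLWorkerFontProvider.DONTLOOKFORFONTS);
        fontProvider.Register(fontFile);

        // Family names usable in a CSS font-family declaration.
        // The built-in standard fonts are expected to show up here as well.
        Console.WriteLine("Families:");
        foreach (var family in fontProvider.RegisteredFamilies) {
            Console.WriteLine("  " + family);
        }

        // Individual font names (full names and any aliases).
        Console.WriteLine("Fonts:");
        foreach (var name in fontProvider.RegisteredFonts) {
            Console.WriteLine("  " + name);
        }
    }
}
```

Whatever shows up in that list for your font is what the font-family value in the HTML has to match. If nothing recognizable appears, registering with an alias as shown above is probably the simpler route.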
