簡體   English   中英

使用iTextSharp從PDF刪除水印

[英]Removing Watermark from a PDF using iTextSharp

我使用Pdfstamper在pdf上添加了水印。 這是代碼:

for (int pageIndex = 1; pageIndex <= pageCount; pageIndex++)
{
    iTextSharp.text.Rectangle pageRectangle = reader.GetPageSizeWithRotation(pageIndex);
    PdfContentByte pdfData = stamper.GetUnderContent(pageIndex);
    pdfData.SetFontAndSize(BaseFont.CreateFont(BaseFont.HELVETICA, BaseFont.CP1252, 
        BaseFont.NOT_EMBEDDED), watermarkFontSize);
    PdfGState graphicsState = new PdfGState();
    graphicsState.FillOpacity = watermarkFontOpacity;
    pdfData.SetGState(graphicsState);
    pdfData.SetColorFill(iTextSharp.text.BaseColor.BLACK);
    pdfData.BeginText();
    pdfData.ShowTextAligned(PdfContentByte.ALIGN_CENTER, "LipikaChatterjee", 
        pageRectangle.Width / 2, pageRectangle.Height / 2, watermarkRotation);
    pdfData.EndText();
}

這很好。 現在,我想從pdf中刪除該水印。 我調查了iTextSharp,但無法獲得任何幫助。 我什至嘗試將水印添加為圖層,然后刪除該圖層,但無法從pdf中刪除圖層的內容。 我調查了iText的圖層刪除過程,發現了一個OCGRemover類,但無法在iTextsharp中獲得一個等效的類。

我將基於“我什至試圖將水印添加為圖層”的陳述,並假設您正在處理自己創建的內容,而不是嘗試對其他人的內容取消加水印,來為您帶來疑問的好處。

PDF使用可選內容組(OCG)將對象存儲為圖層。 如果將水印文本添加到圖層,則以后可以很輕松地將其刪除。

下面的代碼是一個針對iTextSharp 5.1.1.0的功能全面的C#2010 WinForms應用程序。 它使用基於此處找到的Bruno原始Java代碼的代碼 該代碼分為三個部分。 第1部分創建了供我們使用的樣本PDF。 第2節從頭開始創建新的PDF,並將水印應用於單獨圖層上的每個頁面。 第三部分從第二部分創建了最終的PDF,但是刪除了帶有水印文本的圖層。 有關其他詳細信息,請參見代碼注釋。

創建PdfLayer對象時,可以為其指定一個名稱,使其出現在PDF閱讀器中。 不幸的是,我找不到訪問此名稱的方法,因此下面的代碼在圖層中查找實際的水印文本。 如果您不使用其他PDF層,建議您在內容流中查找/OC ,而不要浪費時間查找實際的水印文本。 如果您找到一種按名稱查找/OC組的方法,請讓我知道!

using System;
using System.Windows.Forms;
using System.IO;
using iTextSharp.text;
using iTextSharp.text.pdf;

namespace WindowsFormsApplication1 {
    public partial class Form1 : Form {
        public Form1() {
            InitializeComponent();
        }

        private void Form1_Load(object sender, EventArgs e) {
            string workingFolder = Environment.GetFolderPath(Environment.SpecialFolder.Desktop);
            string startFile = Path.Combine(workingFolder, "StartFile.pdf");
            string watermarkedFile = Path.Combine(workingFolder, "Watermarked.pdf");
            string unwatermarkedFile = Path.Combine(workingFolder, "Un-watermarked.pdf");
            string watermarkText = "This is a test";

            //SECTION 1
            //Create a 5 page PDF, nothing special here
            using (FileStream fs = new FileStream(startFile, FileMode.Create, FileAccess.Write, FileShare.None)) {
                using (Document doc = new Document(PageSize.LETTER)) {
                    using (PdfWriter witier = PdfWriter.GetInstance(doc, fs)) {
                        doc.Open();

                        for (int i = 1; i <= 5; i++) {
                            doc.NewPage();
                            doc.Add(new Paragraph(String.Format("This is page {0}", i)));
                        }

                        doc.Close();
                    }
                }
            }

            //SECTION 2
            //Create our watermark on a separate layer. The only different here is that we are adding the watermark to a PdfLayer which is an OCG or Optional Content Group
            PdfReader reader1 = new PdfReader(startFile);
            using (FileStream fs = new FileStream(watermarkedFile, FileMode.Create, FileAccess.Write, FileShare.None)) {
                using (PdfStamper stamper = new PdfStamper(reader1, fs)) {
                    int pageCount1 = reader1.NumberOfPages;
                    //Create a new layer
                    PdfLayer layer = new PdfLayer("WatermarkLayer", stamper.Writer);
                    for (int i = 1; i <= pageCount1; i++) {
                        iTextSharp.text.Rectangle rect = reader1.GetPageSize(i);
                        //Get the ContentByte object
                        PdfContentByte cb = stamper.GetUnderContent(i);
                        //Tell the CB that the next commands should be "bound" to this new layer
                        cb.BeginLayer(layer);
                        cb.SetFontAndSize(BaseFont.CreateFont(BaseFont.HELVETICA, BaseFont.CP1252, BaseFont.NOT_EMBEDDED), 50);
                        PdfGState gState = new PdfGState();
                        gState.FillOpacity = 0.25f;
                        cb.SetGState(gState);
                        cb.SetColorFill(BaseColor.BLACK);
                        cb.BeginText();
                        cb.ShowTextAligned(PdfContentByte.ALIGN_CENTER, watermarkText, rect.Width / 2, rect.Height / 2, 45f);
                        cb.EndText();
                        //"Close" the layer
                        cb.EndLayer();
                    }
                }
            }

            //SECTION 3
            //Remove the layer created above
            //First we bind a reader to the watermarked file, then strip out a bunch of things, and finally use a simple stamper to write out the edited reader
            PdfReader reader2 = new PdfReader(watermarkedFile);

            //NOTE, This will destroy all layers in the document, only use if you don't have additional layers
            //Remove the OCG group completely from the document.
            //reader2.Catalog.Remove(PdfName.OCPROPERTIES);

            //Clean up the reader, optional
            reader2.RemoveUnusedObjects();

            //Placeholder variables
            PRStream stream;
            String content;
            PdfDictionary page;
            PdfArray contentarray;

            //Get the page count
            int pageCount2 = reader2.NumberOfPages;
            //Loop through each page
            for (int i = 1; i <= pageCount2; i++) {
                //Get the page
                page = reader2.GetPageN(i);
                //Get the raw content
                contentarray = page.GetAsArray(PdfName.CONTENTS);
                if (contentarray != null) {
                    //Loop through content
                    for (int j = 0; j < contentarray.Size; j++) {
                        //Get the raw byte stream
                        stream = (PRStream)contentarray.GetAsStream(j);
                        //Convert to a string. NOTE, you might need a different encoding here
                        content = System.Text.Encoding.ASCII.GetString(PdfReader.GetStreamBytes(stream));
                        //Look for the OCG token in the stream as well as our watermarked text
                        if (content.IndexOf("/OC") >= 0 && content.IndexOf(watermarkText) >= 0) {
                            //Remove it by giving it zero length and zero data
                            stream.Put(PdfName.LENGTH, new PdfNumber(0));
                            stream.SetData(new byte[0]);
                        }
                    }
                }
            }

            //Write the content out
            using (FileStream fs = new FileStream(unwatermarkedFile, FileMode.Create, FileAccess.Write, FileShare.None)) {
                using (PdfStamper stamper = new PdfStamper(reader2, fs)) {

                }
            }
            this.Close();
        }
    }
}

作為Chris答案的擴展,在本文的底部包含一個VB.Net類,用於刪除層,該類應該更精確一些。

  1. 它經歷層的PDF文件列表(存儲在OCGs在陣列OCProperties在文件的目錄字典)。 該數組包含對PDF文件中包含名稱的對象的間接引用
  2. 它遍歷頁面的屬性(也存儲在字典中)以找到指向圖層對象的屬性(通過間接引用)
  3. 它會對內容流進行實際解析,以找到模式/OC /{PagePropertyReference} BDC {Actual Content} EMC實例,因此可以適當地刪除這些段

然后,代碼將盡可能多地清理所有引用。 調用代碼可能如下所示工作:

Public Shared Sub RemoveWatermark(path As String, savePath As String)
  Using reader = New PdfReader(path)
    Using fs As New FileStream(savePath, FileMode.Create, FileAccess.Write, FileShare.None)
      Using stamper As New PdfStamper(reader, fs)
        Using remover As New PdfLayerRemover(reader)
          remover.RemoveByName("WatermarkLayer")
        End Using
      End Using
    End Using
  End Using
End Sub

全班:

Imports iTextSharp.text
Imports iTextSharp.text.io
Imports iTextSharp.text.pdf
Imports iTextSharp.text.pdf.parser

Public Class PdfLayerRemover
  Implements IDisposable

  Private _reader As PdfReader
  Private _layerNames As New List(Of String)

  Public Sub New(reader As PdfReader)
    _reader = reader
  End Sub

  Public Sub RemoveByName(name As String)
    _layerNames.Add(name)
  End Sub

  Private Sub RemoveLayers()
    Dim ocProps = _reader.Catalog.GetAsDict(PdfName.OCPROPERTIES)
    If ocProps Is Nothing Then Return
    Dim ocgs = ocProps.GetAsArray(PdfName.OCGS)
    If ocgs Is Nothing Then Return

    'Get a list of indirect references to the layer information
    Dim layerRefs = (From l In (From i In ocgs
                                Select Obj = DirectCast(PdfReader.GetPdfObject(i), PdfDictionary),
                                       Ref = DirectCast(i, PdfIndirectReference))
                     Where _layerNames.Contains(l.Obj.GetAsString(PdfName.NAME).ToString)
                     Select l.Ref).ToList
    'Get a list of numbers for these layer references
    Dim layerRefNumbers = (From l In layerRefs Select l.Number).ToList

    'Loop through the pages
    Dim page As PdfDictionary
    Dim propsToRemove As IEnumerable(Of PdfName)
    For i As Integer = 1 To _reader.NumberOfPages
      'Get the page
      page = _reader.GetPageN(i)

      'Get the page properties which reference the layers to remove
      Dim props = _reader.GetPageResources(i).GetAsDict(PdfName.PROPERTIES)
      propsToRemove = (From k In props.Keys Where layerRefNumbers.Contains(props.GetAsIndirectObject(k).Number) Select k).ToList

      'Get the raw content
      Dim contentarray = page.GetAsArray(PdfName.CONTENTS)
      If contentarray IsNot Nothing Then
        For j As Integer = 0 To contentarray.Size - 1
          'Parse the stream data looking for references to a property pointing to the layer.
          Dim stream = DirectCast(contentarray.GetAsStream(j), PRStream)
          Dim streamData = PdfReader.GetStreamBytes(stream)
          Dim newData = GetNewStream(streamData, (From p In propsToRemove Select p.ToString.Substring(1)))

          'Store data without the stream references in the stream
          If newData.Length <> streamData.Length Then
            stream.SetData(newData)
            stream.Put(PdfName.LENGTH, New PdfNumber(newData.Length))
          End If
        Next
      End If

      'Remove the properties from the page data
      For Each prop In propsToRemove
        props.Remove(prop)
      Next
    Next

    'Remove references to the layer in the master catalog
    RemoveIndirectReferences(ocProps, layerRefNumbers)

    'Clean up unused objects
    _reader.RemoveUnusedObjects()
  End Sub

  Private Shared Function GetNewStream(data As Byte(), propsToRemove As IEnumerable(Of String)) As Byte()
    Dim item As PdfLayer = Nothing
    Dim positions As New List(Of Integer)
    positions.Add(0)

    Dim pos As Integer
    Dim inGroup As Boolean = False
    Dim tokenizer As New PRTokeniser(New RandomAccessFileOrArray(New RandomAccessSourceFactory().CreateSource(data)))
    While tokenizer.NextToken
      If tokenizer.TokenType = PRTokeniser.TokType.NAME AndAlso tokenizer.StringValue = "OC" Then
        pos = CInt(tokenizer.FilePointer - 3)
        If tokenizer.NextToken() AndAlso tokenizer.TokenType = PRTokeniser.TokType.NAME Then
          If Not inGroup AndAlso propsToRemove.Contains(tokenizer.StringValue) Then
            inGroup = True
            positions.Add(pos)
          End If
        End If
      ElseIf tokenizer.TokenType = PRTokeniser.TokType.OTHER AndAlso tokenizer.StringValue = "EMC" AndAlso inGroup Then
        positions.Add(CInt(tokenizer.FilePointer))
        inGroup = False
      End If
    End While
    positions.Add(data.Length)

    If positions.Count > 2 Then
      Dim length As Integer = 0
      For i As Integer = 0 To positions.Count - 1 Step 2
        length += positions(i + 1) - positions(i)
      Next

      Dim newData(length) As Byte
      length = 0
      For i As Integer = 0 To positions.Count - 1 Step 2
        Array.Copy(data, positions(i), newData, length, positions(i + 1) - positions(i))
        length += positions(i + 1) - positions(i)
      Next

      Dim origStr = System.Text.Encoding.UTF8.GetString(data)
      Dim newStr = System.Text.Encoding.UTF8.GetString(newData)

      Return newData
    Else
      Return data
    End If
  End Function

  Private Shared Sub RemoveIndirectReferences(dict As PdfDictionary, refNumbers As IEnumerable(Of Integer))
    Dim newDict As PdfDictionary
    Dim arrayData As PdfArray
    Dim indirect As PdfIndirectReference
    Dim i As Integer

    For Each key In dict.Keys
      newDict = dict.GetAsDict(key)
      arrayData = dict.GetAsArray(key)
      If newDict IsNot Nothing Then
        RemoveIndirectReferences(newDict, refNumbers)
      ElseIf arrayData IsNot Nothing Then
        i = 0
        While i < arrayData.Size
          indirect = arrayData.GetAsIndirectObject(i)
          If refNumbers.Contains(indirect.Number) Then
            arrayData.Remove(i)
          Else
            i += 1
          End If
        End While
      End If
    Next
  End Sub

#Region "IDisposable Support"
  Private disposedValue As Boolean ' To detect redundant calls

  ' IDisposable
  Protected Overridable Sub Dispose(disposing As Boolean)
    If Not Me.disposedValue Then
      If disposing Then
        RemoveLayers()
      End If

      ' TODO: free unmanaged resources (unmanaged objects) and override Finalize() below.
      ' TODO: set large fields to null.
    End If
    Me.disposedValue = True
  End Sub

  ' TODO: override Finalize() only if Dispose(ByVal disposing As Boolean) above has code to free unmanaged resources.
  'Protected Overrides Sub Finalize()
  '    ' Do not change this code.  Put cleanup code in Dispose(ByVal disposing As Boolean) above.
  '    Dispose(False)
  '    MyBase.Finalize()
  'End Sub

  ' This code added by Visual Basic to correctly implement the disposable pattern.
  Public Sub Dispose() Implements IDisposable.Dispose
    ' Do not change this code.  Put cleanup code in Dispose(ByVal disposing As Boolean) above.
    Dispose(True)
    GC.SuppressFinalize(Me)
  End Sub
#End Region

End Class

暫無
暫無

聲明:本站的技術帖子網頁,遵循CC BY-SA 4.0協議,如果您需要轉載,請注明本站網址或者原文地址。任何問題請咨詢:yoyou2525@163.com.

 
粵ICP備18138465號  © 2020-2024 STACKOOM.COM