
String.Replace on data from iTextSharp

I'm using iTextSharp to read text from a PDF. The resulting string looks correct when inspected, but string.Replace fails to replace the text.

Therefore, I'm guessing this is some sort of encoding issue, but I'm failing to pin it down.

My code to import the text from the PDF should convert it to UTF-8:

    // Requires: using System.Text; using iTextSharp.text.pdf; using iTextSharp.text.pdf.parser;
    StringBuilder text = new StringBuilder();
    PdfReader pdfReader = new PdfReader("file.pdf");

    for (int page = 1; page <= pdfReader.NumberOfPages; page++)
    {
        ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
        string currentText = PdfTextExtractor.GetTextFromPage(pdfReader, page, strategy);

        // Round-trip the extracted text through the system default code page into UTF-8
        currentText = Encoding.UTF8.GetString(ASCIIEncoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(currentText)));
        text.AppendLine(currentText);
    }
    pdfReader.Close();

Then I am trying to replace two hyphens, a space and a hyphen (-- -) with three hyphens (---):

input = input.Replace("-- -­", "---");

Removing the UTF-8 conversion from the PDF import makes no difference (see the screenshot below, taken at a breakpoint after the Replace call; the text is still there):

[Screenshot: result of the string replace shown in the text visualizer]

EDIT:

Here is a link to a sample file. When opened in Notepad or Notepad++, it displays a series of spaces and hyphens (see the Notepad++ screenshot with whitespace rendering). However, when read in C#, the contents of this file are not interpreted as a Unicode hyphen and a Unicode space. [Screenshot: Notepad++ with whitespace rendering]

It turns out that either iTextSharp or the source PDF uses a soft hyphen (U+00AD) to represent a standard hyphen. Notepad, Notepad++ and the Visual Studio text visualizer all render the soft hyphen as a standard hyphen, but they are not the same character, which is why String.Replace does not perform any replacements.

From my understanding, a soft hyphen normally should not be rendered, which caused odd behavior when I tried to paste the character into a web browser or other programs such as charmap, or even Visual Studio itself.
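Because the two characters look identical in most viewers, one way to confirm which one a string really contains is to print its code points. The following is a minimal sketch; the class and method names (CodePointDump, Describe) are hypothetical and not part of iTextSharp or the original code.

```csharp
using System;
using System.Linq;

public static class CodePointDump
{
    // Returns each UTF-16 code unit of the input as "U+XXXX", separated by spaces.
    // (Hypothetical helper, for illustration only.)
    public static string Describe(string s)
    {
        return string.Join(" ", s.Select(c => "U+" + ((int)c).ToString("X4")));
    }
}
```

For example, CodePointDump.Describe("-\u00AD -") returns "U+002D U+00AD U+0020 U+002D", which makes the soft hyphen (U+00AD) visible next to the ordinary hyphen-minus (U+002D).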

This resulted in the following working code:

input = input.Replace("­­ ­", "---");

In Firefox, this renders as replacing a space with three hyphens; pasting it into Notepad, however, displays the following (which shows my real intention):

input = input.Replace("-- -", "---");

https://en.wikipedia.org/wiki/Soft_hyphen

Soft hyphen (U+00AD): http://www.fileformat.info/info/unicode/char/ad/index.htm

Hyphen (U+2010; the '-' typed on a keyboard is the hyphen-minus, U+002D): http://www.fileformat.info/info/unicode/char/2010/index.htm

My solution was to add the following line:

        input = input.Replace((char)173, '-');
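Put together, the fix can be sketched as a small helper that first maps soft hyphens to ordinary hyphens and then applies the replacement from the question. The names HyphenNormalizer and Normalize are hypothetical, chosen only for this illustration.

```csharp
using System;

public static class HyphenNormalizer
{
    // Soft hyphen, the same value as (char)173.
    private const char SoftHyphen = '\u00AD';

    // Replaces every soft hyphen with '-' (hyphen-minus, U+002D),
    // then collapses "-- -" into "---" as intended in the question.
    // (Hypothetical helper, for illustration only.)
    public static string Normalize(string input)
    {
        return input
            .Replace(SoftHyphen, '-')
            .Replace("-- -", "---");
    }
}
```

With this, HyphenNormalizer.Normalize("a \u00AD\u00AD \u00AD b") returns "a --- b", whereas calling Replace("-- -", "---") directly on the same input leaves it unchanged.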

tl;dr: Character encoding was absolutely fine, not all hyphens are equal.
