
C# iTextSharp Chinese character extraction

I am trying to extract Chinese text from a PDF.

The extraction itself works using iTextSharp, but the extracted text contains extra spaces between characters. This is a serious problem because it splits what should be a single string into several space-separated strings.

I have tried various encodings, and they all produce the same extra spaces. It usually happens with runs of three or more Chinese characters.

My goal is to extract the complete Chinese strings as shown below. Any and all help is appreciated.

Sample data in PDF: A请 B请请 C请请请 D请请请请 E请请请请请 F请请请请请请 1请 2请请 3请请请 4请请请请 5请请请请请 6请请请请请请 请 请请 请请请 请请请请 请请请请请 请请请请请请

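// Extract the text of one page with the simple (content-stream order) strategy.
ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
currentText = PdfTextExtractor.GetTextFromPage(pdfReader, page, strategy);

// Round-trip through UTF-16 to inspect the raw bytes and chars.
byte[] bytes = Encoding.Unicode.GetBytes(currentText);
// ScrubBytes is my own helper (not shown); its result is not used below.
byte[] cleanbytes = ScrubBytes(bytes);

char[] chars = Encoding.Unicode.GetChars(bytes);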

currentText="A 请 B 请请 C 请 请请 D 请 请请请 E 请 请请请 请 F 请 请请请请 请 \n1 请 2 请请 3 请请 请 4 请请 请请 5 请请 请请请 6 请请 请请请请 \n请 请请 请 请请 请请请 请 请请请请 请 请请 请请请 请 "

As you can see, one extra space has been inserted in the C and D groups, and two extra spaces in the E and F groups.

The character array for this content is listed below:

[0] 65 'A'  char
        [1] 32 ' '  char
        [2] 35831 '请'   char
        [3] 32 ' '  char
        [4] 66 'B'  char
        [5] 32 ' '  char
        [6] 35831 '请'   char
        [7] 35831 '请'   char
        [8] 32 ' '  char
        [9] 67 'C'  char
        [10]    32 ' '  char
        [11]    35831 '请'   char
        [12]    32 ' '  char
        [13]    35831 '请'   char
        [14]    35831 '请'   char
        [15]    32 ' '  char
        [16]    68 'D'  char
        [17]    32 ' '  char
        [18]    35831 '请'   char
        [19]    32 ' '  char
        [20]    35831 '请'   char
        [21]    35831 '请'   char
        [22]    35831 '请'   char
        [23]    32 ' '  char
        [24]    69 'E'  char
        [25]    32 ' '  char
        [26]    35831 '请'   char
        [27]    32 ' '  char
        [28]    35831 '请'   char
        [29]    35831 '请'   char
        [30]    35831 '请'   char
        [31]    32 ' '  char
        [32]    35831 '请'   char
        [33]    32 ' '  char
        [34]    70 'F'  char
        [35]    32 ' '  char
        [36]    35831 '请'   char
        [37]    32 ' '  char
        [38]    35831 '请'   char
        [39]    35831 '请'   char
        [40]    35831 '请'   char
        [41]    35831 '请'   char
        [42]    32 ' '  char
        [43]    35831 '请'   char
        [44]    32 ' '  char
        [45]    10 '\n' char
        [46]    49 '1'  char
        [47]    32 ' '  char
        [48]    35831 '请'   char
        [49]    32 ' '  char
        [50]    50 '2'  char
        [51]    32 ' '  char
        [52]    35831 '请'   char
        [53]    35831 '请'   char
        [54]    32 ' '  char
        [55]    51 '3'  char
        [56]    32 ' '  char
        [57]    35831 '请'   char
        [58]    35831 '请'   char
        [59]    32 ' '  char
        [60]    35831 '请'   char
        [61]    32 ' '  char
        [62]    52 '4'  char
        [63]    32 ' '  char
        [64]    35831 '请'   char
        [65]    35831 '请'   char
        [66]    32 ' '  char
        [67]    35831 '请'   char
        [68]    35831 '请'   char
        [69]    32 ' '  char
        [70]    53 '5'  char
        [71]    32 ' '  char
        [72]    35831 '请'   char
        [73]    35831 '请'   char
        [74]    32 ' '  char
        [75]    35831 '请'   char
        [76]    35831 '请'   char
        [77]    35831 '请'   char
        [78]    32 ' '  char
        [79]    54 '6'  char
        [80]    32 ' '  char
        [81]    35831 '请'   char
        [82]    35831 '请'   char
        [83]    32 ' '  char
        [84]    35831 '请'   char
        [85]    35831 '请'   char
        [86]    35831 '请'   char
        [87]    35831 '请'   char
        [88]    32 ' '  char
        [89]    10 '\n' char
        [90]    35831 '请'   char
        [91]    32 ' '  char
        [92]    35831 '请'   char
        [93]    35831 '请'   char
        [94]    32 ' '  char
        [95]    35831 '请'   char
        [96]    32 ' '  char
        [97]    35831 '请'   char
        [98]    35831 '请'   char
        [99]    32 ' '  char
        [100]   35831 '请'   char
        [101]   35831 '请'   char
        [102]   35831 '请'   char
        [103]   32 ' '  char
        [104]   35831 '请'   char
        [105]   32 ' '  char
        [106]   35831 '请'   char
        [107]   35831 '请'   char
        [108]   35831 '请'   char
        [109]   35831 '请'   char
        [110]   32 ' '  char
        [111]   35831 '请'   char
        [112]   32 ' '  char
        [113]   35831 '请'   char
        [114]   35831 '请'   char
        [115]   32 ' '  char
        [116]   35831 '请'   char
        [117]   35831 '请'   char
        [118]   35831 '请'   char
        [119]   32 ' '  char
        [120]   35831 '请'   char
        [121]   32 ' '  char

The bytes are listed below:

bytes   {byte[244]} byte[]
[0] 65  byte
[1] 0   byte
[2] 32  byte
[3] 0   byte
[4] 247 byte
[5] 139 byte
[6] 32  byte
[7] 0   byte
[8] 66  byte
[9] 0   byte
[10]    32  byte
[11]    0   byte
[12]    247 byte
[13]    139 byte
[14]    247 byte
[15]    139 byte
[16]    32  byte
[17]    0   byte
[18]    67  byte
[19]    0   byte
[20]    32  byte
[21]    0   byte
[22]    247 byte
[23]    139 byte
[24]    32  byte
[25]    0   byte
[26]    247 byte
[27]    139 byte
[28]    247 byte
[29]    139 byte
[30]    32  byte
[31]    0   byte
[32]    68  byte
[33]    0   byte
[34]    32  byte
[35]    0   byte
[36]    247 byte
[37]    139 byte
[38]    32  byte
[39]    0   byte
[40]    247 byte
[41]    139 byte
[42]    247 byte
[43]    139 byte
[44]    247 byte
[45]    139 byte
[46]    32  byte
[47]    0   byte
[48]    69  byte
[49]    0   byte
[50]    32  byte
[51]    0   byte
[52]    247 byte
[53]    139 byte
[54]    32  byte
[55]    0   byte
[56]    247 byte
[57]    139 byte
[58]    247 byte
[59]    139 byte
[60]    247 byte
[61]    139 byte
[62]    32  byte
[63]    0   byte
[64]    247 byte
[65]    139 byte
[66]    32  byte
[67]    0   byte
[68]    70  byte
[69]    0   byte
[70]    32  byte
[71]    0   byte
[72]    247 byte
[73]    139 byte
[74]    32  byte
[75]    0   byte
[76]    247 byte
[77]    139 byte
[78]    247 byte
[79]    139 byte
[80]    247 byte
[81]    139 byte
[82]    247 byte
[83]    139 byte
[84]    32  byte
[85]    0   byte
[86]    247 byte
[87]    139 byte
[88]    32  byte
[89]    0   byte
[90]    10  byte
[91]    0   byte
[92]    49  byte
[93]    0   byte
[94]    32  byte
[95]    0   byte
[96]    247 byte
[97]    139 byte
[98]    32  byte
[99]    0   byte
[100]   50  byte
[101]   0   byte
[102]   32  byte
[103]   0   byte
[104]   247 byte
[105]   139 byte
[106]   247 byte
[107]   139 byte
[108]   32  byte
[109]   0   byte
[110]   51  byte
[111]   0   byte
[112]   32  byte
[113]   0   byte
[114]   247 byte
[115]   139 byte
[116]   247 byte
[117]   139 byte
[118]   32  byte
[119]   0   byte
[120]   247 byte
[121]   139 byte
[122]   32  byte
[123]   0   byte
[124]   52  byte
[125]   0   byte
[126]   32  byte
[127]   0   byte
[128]   247 byte
[129]   139 byte
[130]   247 byte
[131]   139 byte
[132]   32  byte
[133]   0   byte
[134]   247 byte
[135]   139 byte
[136]   247 byte
[137]   139 byte
[138]   32  byte
[139]   0   byte
[140]   53  byte
[141]   0   byte
[142]   32  byte
[143]   0   byte
[144]   247 byte
[145]   139 byte
[146]   247 byte
[147]   139 byte
[148]   32  byte
[149]   0   byte
[150]   247 byte
[151]   139 byte
[152]   247 byte
[153]   139 byte
[154]   247 byte
[155]   139 byte
[156]   32  byte
[157]   0   byte
[158]   54  byte
[159]   0   byte
[160]   32  byte
[161]   0   byte
[162]   247 byte
[163]   139 byte
[164]   247 byte
[165]   139 byte
[166]   32  byte
[167]   0   byte
[168]   247 byte
[169]   139 byte
[170]   247 byte
[171]   139 byte
[172]   247 byte
[173]   139 byte
[174]   247 byte
[175]   139 byte
[176]   32  byte
[177]   0   byte
[178]   10  byte
[179]   0   byte
[180]   247 byte
[181]   139 byte
[182]   32  byte
[183]   0   byte
[184]   247 byte
[185]   139 byte
[186]   247 byte
[187]   139 byte
[188]   32  byte
[189]   0   byte
[190]   247 byte
[191]   139 byte
[192]   32  byte
[193]   0   byte
[194]   247 byte
[195]   139 byte
[196]   247 byte
[197]   139 byte
[198]   32  byte
[199]   0   byte
[200]   247 byte
[201]   139 byte
[202]   247 byte
[203]   139 byte
[204]   247 byte
[205]   139 byte
[206]   32  byte
[207]   0   byte
[208]   247 byte
[209]   139 byte
[210]   32  byte
[211]   0   byte
[212]   247 byte
[213]   139 byte
[214]   247 byte
[215]   139 byte
[216]   247 byte
[217]   139 byte
[218]   247 byte
[219]   139 byte
[220]   32  byte
[221]   0   byte
[222]   247 byte
[223]   139 byte
[224]   32  byte
[225]   0   byte
[226]   247 byte
[227]   139 byte
[228]   247 byte
[229]   139 byte
[230]   32  byte
[231]   0   byte
[232]   247 byte
[233]   139 byte
[234]   247 byte
[235]   139 byte
[236]   247 byte
[237]   139 byte
[238]   32  byte
[239]   0   byte
[240]   247 byte
[241]   139 byte
[242]   32  byte
[243]   0   byte
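
The dumps above suggest that every gap is an ordinary space (char 32, bytes 32 0) already present in the extracted string, which would explain why switching encodings makes no difference. As a minimal sketch for reproducing that kind of check on any extracted string, the following prints one line per UTF-16 code unit. The class name CharDump and the method Describe are hypothetical names chosen for this illustration, not part of iTextSharp.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class CharDump
{
    // Returns one line per UTF-16 code unit: index, numeric value, and the character.
    public static List<string> Describe(string text)
    {
        var lines = new List<string>();
        for (int i = 0; i < text.Length; i++)
        {
            char c = text[i];
            // Show a line feed as \n so it does not break the listing.
            string shown = c == '\n' ? "\\n" : c.ToString();
            lines.Add($"[{i}] {(int)c} '{shown}'");
        }
        return lines;
    }

    public static void Main()
    {
        foreach (string line in Describe("C 请 请请"))
            Console.WriteLine(line);

        // Encoding.Unicode is UTF-16 little-endian, so U+8BF7 is stored as F7 8B.
        byte[] bytes = Encoding.Unicode.GetBytes("请");
        Console.WriteLine(string.Join(" ", bytes)); // prints "247 139"
    }
}
```

Any index whose value is 32 and whose neighbours are both 35831 marks an inserted gap rather than a separator that was in the sample data.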

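Take the result and use Regex.Replace to strip out all whitespace (tabs, spaces, etc.), for example:

using System;
using System.Text.RegularExpressions;

string text = "A 请 B 请请 C 请 请请 D 请 请请请 E 请 请请请 请 F 请";

// Remove every run of whitespace, including the spaces between separate tokens.
Console.WriteLine(Regex.Replace(text, @"\s+", string.Empty));

// Result printed out
// A请B请请C请请请D请请请请E请请请请请F请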
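
Note that stripping all whitespace also joins tokens that were separate in the PDF (A请 and B请请 become A请B请请). If the separators matter, one possible variant is to remove spaces only where both neighbours are Chinese characters. The sketch below is an assumption-laden illustration, not the answerer's method: the names CjkSpaceFixer and RemoveGaps are hypothetical, it covers only the basic CJK Unified Ideographs block (not kana or extension blocks), and it cannot tell an inserted gap from a real space between two Chinese words, so it would also merge the unlabeled groups in the third line of the sample data.

```csharp
using System;
using System.Text.RegularExpressions;

public static class CjkSpaceFixer
{
    // Spaces or tabs that have a CJK ideograph on both sides.
    // Newlines are deliberately not matched so line breaks survive.
    private static readonly Regex GapBetweenIdeographs = new Regex(
        @"(?<=\p{IsCJKUnifiedIdeographs})[ \t]+(?=\p{IsCJKUnifiedIdeographs})");

    public static string RemoveGaps(string text)
    {
        return GapBetweenIdeographs.Replace(text, string.Empty);
    }

    public static void Main()
    {
        string text = "A 请 B 请请 C 请 请请 D 请 请请请";
        Console.WriteLine(RemoveGaps(text));
        // A 请 B 请请 C 请请请 D 请请请请
    }
}
```

The space after each Latin label is kept because the lookbehind requires an ideograph immediately before the gap.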

This question (and answer) are a bit old, but I'll add my solution here.

In the version of iTextSharp I'm using in 2021 (5.5.13.2), PdfTextExtractor.GetTextFromPage() didn't work at first. I was processing a Japanese PDF, and all it extracted was line breaks (\n) and bullets.

I was able to successfully extract text using this code (the same as the first 2 lines of code in the original question):

ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
currentText = PdfTextExtractor.GetTextFromPage(pdfReader, page, strategy);

The only catch was that I had to change the font used in the PDF to Meiryo UI. With Meiryo UI, iTextSharp's GetTextFromPage successfully extracted Chinese characters (as well as Japanese kanji, katakana and hiragana).
