
Read data from a PDF file using PHP

Original post: 2014-02-14 09:47:31 | Tags: php, pdf

I read a PDF file using the code from this link: http://webcheatsheet.com/php/reading_clean_text_from_pdf.php. My PDF file contains a table, but after reading the file all the data comes out in the wrong order, like this:

xxx 333 Exited from Country 26-01-2014 yyy 444 Entered to Country 26-01-2014 zzz 555 Exited from Country 26-01-2014

My PDF structure is:

visaNo  Name  date        Type

333     xxx   26-01-2014  Exited from country
444     yyy   26-01-2014  Entered to country
555     zzz   26-01-2014  Exited from country

1 answer

I'm afraid there's no easy fix for this. In the PDF source, the order in which the bits of text are written down is not important, because each of them can be given its own coordinates. So the order in which you read them back need not match the visual layout of the table. The two things you can do are (imho):

[In any case] Change your code so that it gives you the text streams separately (maybe by replacing Tj and TJ with a special character of your choice, instead of throwing them away).
If you are extremely sure that the software generating the PDF file behaves consistently, swap the columns by hand (the cells will be separated by your special character).
If not, I'm afraid you will have to change your code a bit more: keep track of the coordinates of every text stream and guess the row number and column number of each one from them.
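
To make the coordinate idea a bit more concrete, here is a rough sketch, not a finished parser. The function names `extractPositionedText` and `groupIntoRows`, the sample stream, and the 2-point row tolerance are all made up for illustration. It assumes the content stream is already decompressed, that every cell is positioned with `Tm` or `Td`, and that strings are plain literal strings without nested parentheses, octal escapes or custom font encodings. A real PDF will often break at least one of those assumptions.

```php
<?php
/**
 * Hypothetical helper: collect ['x', 'y', 'text'] entries from one content
 * stream that has ALREADY been decompressed. Only BT, Tm, Td/TD, Tj and TJ
 * are understood; fonts, encodings and glyph widths are ignored.
 */
function extractPositionedText(string $stream): array
{
    $num = '[-+]?(?:\d+\.?\d*|\.\d+)';
    $str = '\((?:\\\\.|[^\\\\()])*\)'; // literal string, no nested parentheses
    $pattern = '~(?<bt>\bBT\b)'
        . '|(?:' . $num . '\s+){4}(?<tmx>' . $num . ')\s+(?<tmy>' . $num . ')\s+Tm\b'
        . '|(?<tdx>' . $num . ')\s+(?<tdy>' . $num . ')\s+T[dD]\b'
        . '|(?<tj>' . $str . ')\s*Tj\b'
        . '|\[(?<tjarr>(?:' . $str . '|[^\]()])*)\]\s*TJ\b~';

    // Strip the outer parentheses and undo the \( \) \\ escapes only.
    $decode = fn(string $s) => preg_replace('~\\\\([()\\\\])~', '$1', substr($s, 1, -1));

    $items = [];
    $x = $y = 0.0;
    preg_match_all($pattern, $stream, $matches, PREG_SET_ORDER);
    foreach ($matches as $m) {
        if (($m['bt'] ?? '') !== '') {            // BT: position starts at the origin
            $x = $y = 0.0;
        } elseif (($m['tmx'] ?? '') !== '') {     // Tm: absolute position (e, f)
            $x = (float) $m['tmx'];
            $y = (float) $m['tmy'];
        } elseif (($m['tdx'] ?? '') !== '') {     // Td: offset, assuming no scaling
            $x += (float) $m['tdx'];
            $y += (float) $m['tdy'];
        } elseif (($m['tj'] ?? '') !== '') {      // (text) Tj
            $items[] = ['x' => $x, 'y' => $y, 'text' => $decode($m['tj'])];
        } elseif (($m['tjarr'] ?? '') !== '') {   // [(te) -20 (xt)] TJ: join the pieces
            preg_match_all('~' . $str . '~', $m['tjarr'], $parts);
            $text = implode('', array_map($decode, $parts[0]));
            $items[] = ['x' => $x, 'y' => $y, 'text' => $text];
        }
    }
    return $items;
}

/** Guess rows from y (top of page first) and columns from x (left to right). */
function groupIntoRows(array $items, float $tolerance = 2.0): array
{
    usort($items, fn($a, $b) => ($b['y'] <=> $a['y']) ?: ($a['x'] <=> $b['x']));
    $rows = [];
    $rowY = null;
    foreach ($items as $item) {
        if ($rowY === null || abs($rowY - $item['y']) > $tolerance) {
            $rows[] = [];                         // y moved far enough: new row
            $rowY = $item['y'];
        }
        $rows[count($rows) - 1][] = $item;
    }
    foreach ($rows as $i => $row) {
        usort($row, fn($a, $b) => $a['x'] <=> $b['x']);
        $rows[$i] = array_column($row, 'text');
    }
    return $rows;
}

// Made-up stream: the cells are written in the "wrong" order, as in the question.
$stream = <<<'PDF'
BT
/F1 10 Tf
1 0 0 1 120 700 Tm (xxx) Tj
1 0 0 1 50 700 Tm (333) Tj
1 0 0 1 260 700 Tm (Exited from Country) Tj
1 0 0 1 180 700 Tm (26-01-2014) Tj
1 0 0 1 120 685 Tm (yyy) Tj
1 0 0 1 50 685 Tm [(4) -20 (44)] TJ
1 0 0 1 260 685 Tm (Entered to Country) Tj
1 0 0 1 180 685 Tm (26-01-2014) Tj
ET
PDF;

$rows = groupIntoRows(extractPositionedText($stream));
foreach ($rows as $row) {
    echo implode(' | ', $row), "\n";              // '|' plays the "special character"
}
// 333 | xxx | 26-01-2014 | Exited from Country
// 444 | yyy | 26-01-2014 | Entered to Country
```

Note that after a `Tj` the real text position advances by the width of the text, which this sketch cannot know without font metrics. If the generator writes several cells in a row without repositioning, they will all end up with the same x here, and you would have to fall back on the "swap the columns by hand" approach.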

I really hope, for your sake, that there is already some PHP class that does this.