
PHP extracting data from text

I have an old Windows 95 program that exports data without account numbers, seasonal accounts, or whether an account contains a sub-account.

I am, however, able to print the customer information and notes, which contain the above information, to a PDF file and copy that text to Notepad. That text is what I would like to extract the data from.

The order of the data: 1) page headers (I do not need this data):

Company Name

Customer Information and Notes

Computed Monday, August 10 2015 Page 1

2) standard titles and 3) the data after titles:

Ser Name: Block, Sunny Route: 1

Address: 3354 ASPEN RD. Frequency: Monthly

Address: ST PETE, GA 33333 Week/Day: First Monday

City State Zip: data Sched Time (HH:MM): 10:00A

Ser Phone: 555-1212 Service: BASIC SERVICE

Bill to: BLOCK,SUNNY Rate ($): 24.00

Company Name

Customer Information and Notes

Computed Monday, August 10 2015 Page 2

Address: 1123 Sligh Terms: CASH

Address: Apt B

notes: Sunny has a mean dog

Do not enter unless dog is put up

Then it loops to the next customer's data, and so on.

The main titles never change (Ser Name, Route, Address, notes, Ser Phone, and so on), and there is a set number of titles in a fixed order. However, the notes: title can take 1 to 16 lines, and the page header appears at unpredictable points throughout the data. Also, although the titles are in order, Address appears four times: service address lines 1 and 2, then billing address lines 1 and 2.

I would like to assign these titles to variables and keep only what comes after each one, doing the extraction in PHP. Is there any way to do this?

I don't think a perfect solution is possible, but FWIW, maybe this is good enough for you.

Without a known, reliable delimiter between clients, I can't think of any good way to get the notes without the header stuff for the next company being included, unless you can do something involving a big lookup table of all client names.
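One partial workaround, if the page header really is always the same three lines as in the sample, would be to strip it from the text before matching. This is only a sketch under that assumption: `stripPageHeaders` is a hypothetical helper name, and the literal `Company Name` stands in for whatever the real export prints there.

```php
<?php
// Hypothetical helper: removes every three-line page header from the text.
// Assumption: each header is exactly "Company Name", "Customer Information
// and Notes", and "Computed <date> Page <n>", each on its own line.
function stripPageHeaders(string $text): string
{
    $header = '~^[ \t]*Company Name[ \t]*\R+'
            . '[ \t]*Customer Information and Notes[ \t]*\R+'
            . '[ \t]*Computed .*? Page \d+[ \t]*(?:\R+|\z)~m';

    // preg_replace() returns null on a PCRE error; fall back to the input.
    return preg_replace($header, '', $text) ?? $text;
}
```

It would be called once on the pasted text, for example `$content = stripPageHeaders($content);`. It does not solve the missing delimiter between clients; it only keeps the page header out of the captured values.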

I do have an (ugly) regex that may reliably help with the other stuff, though:

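$content='[the contents of your file]';
// The lookahead requires the trailing colon, so a value such as "BASIC SERVICE" is not cut short at the word "SERVICE" (the i flag would otherwise treat it as the "Service" title).
preg_match_all('~(Ser Name|Route|Address|Frequency|Week/Day|City State Zip|Sched Time \(HH:MM\)|Ser Phone|Service|Bill to|Rate \(\$\)|Terms|notes):\s*((?:(?!(?:Ser Name|Route|Address|Frequency|Week/Day|City State Zip|Sched Time \(HH:MM\)|Ser Phone|Service|Bill to|Rate \(\$\)|Terms|notes):).)+)~is',$content,$matches);
print_r($matches);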

So this basically looks for the "header" (the title before the colon) and puts it into the first captured group, and then matches everything up to the next "header" and puts that into the second captured group.
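Since the list of titles has to appear twice in the pattern, it may be easier to maintain if the pattern is built from an array. The sketch below is one way to do that, not the only one; `buildTitlePattern` is a hypothetical helper name. It produces a variant of the pattern in which the lookahead also requires the colon after a title, so that a value like `BASIC SERVICE` is not cut short by the case-insensitive `Service` title.

```php
<?php
// Titles taken from the sample data in the question.
$titles = ['Ser Name', 'Route', 'Address', 'Frequency', 'Week/Day',
           'City State Zip', 'Sched Time (HH:MM)', 'Ser Phone', 'Service',
           'Bill to', 'Rate ($)', 'Terms', 'notes'];

// Hypothetical helper: builds "title: value" pattern from a list of titles.
function buildTitlePattern(array $titles): string
{
    // preg_quote() escapes regex metacharacters and the "~" delimiter.
    $quoted = array_map(function ($title) {
        return preg_quote($title, '~');
    }, $titles);
    $alt = implode('|', $quoted);

    // Group 1: the title. Group 2: everything up to the next "title:".
    return '~(' . $alt . '):\s*((?:(?!(?:' . $alt . '):).)+)~is';
}

preg_match_all(
    buildTitlePattern($titles),
    "Ser Phone: 555-1212 Service: BASIC SERVICE",
    $matches
);
```

The result has the same shape as before: `$matches[1]` holds the titles and `$matches[2]` holds the values, which still carry trailing whitespace and can be passed through `trim()`.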

Perhaps this is good enough for you. TBH, I can't think of anything better you can do, unless you can get the export into a better format in the first place.

So your example data would output:

Array
(
    [0] => Array
        (
            [0] => Ser Name: Block, Sunny 
            [1] => Route: 1


            [2] => Address: 3354 ASPEN RD. 
            [3] => Frequency: Monthly


            [4] => Address: ST PETE, GA 33333 
            [5] => Week/Day: First Monday


            [6] => City State Zip: data 
            [7] => Sched Time (HH:MM): 10:00A


            [8] => Ser Phone: 555-1212 
            [9] => Service: BASIC SERVICE


            [10] => Bill to: BLOCK,SUNNY 
            [11] => Rate ($): 24.00

Company Name

Customer Information and Notes

Computed Monday, August 10 2015 Page 2


            [12] => Address: 1123 Sligh 
            [13] => Terms: CASH


            [14] => Address: Apt B


            [15] => notes: Sunny has a mean dog
        )

    [1] => Array
        (
            [0] => Ser Name
            [1] => Route
            [2] => Address
            [3] => Frequency
            [4] => Address
            [5] => Week/Day
            [6] => City State Zip
            [7] => Sched Time (HH:MM)
            [8] => Ser Phone
            [9] => Service
            [10] => Bill to
            [11] => Rate ($)
            [12] => Address
            [13] => Terms
            [14] => Address
            [15] => notes
        )

    [2] => Array
        (
            [0] => Block, Sunny 
            [1] => 1


            [2] => 3354 ASPEN RD. 
            [3] => Monthly


            [4] => ST PETE, GA 33333 
            [5] => First Monday


            [6] => data 
            [7] => 10:00A


            [8] => 555-1212 
            [9] => BASIC SERVICE


            [10] => BLOCK,SUNNY 
            [11] => 24.00

Company Name

Customer Information and Notes

Computed Monday, August 10 2015 Page 2


            [12] => 1123 Sligh 
            [13] => CASH


            [14] => Apt B


            [15] => Sunny has a mean dog
        )

)
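To get from those parallel arrays to one set of named values per customer, the titles and values can be paired up in a loop. The following is a rough sketch, not a tested solution for the real export. `groupRecords` and the four address key names are hypothetical, and it assumes that every customer starts with `Ser Name` and that the four `Address` titles always come in the order service line 1, service line 2, billing line 1, billing line 2, as described in the question. Page-header text caught inside a value is left untouched.

```php
<?php
// Hypothetical helper: turns the $matches array produced by preg_match_all()
// into one associative array per customer.
function groupRecords(array $matches): array
{
    // Assumption: the four "Address" titles always appear in this order.
    $addressKeys = ['service_address_1', 'service_address_2',
                    'billing_address_1', 'billing_address_2'];
    $records = [];
    $current = null;
    $addressIndex = 0;

    foreach ($matches[1] as $i => $title) {
        $value = trim($matches[2][$i]);

        // "Ser Name" is the first title of a customer, so it opens a new record.
        if (strcasecmp($title, 'Ser Name') === 0) {
            if ($current !== null) {
                $records[] = $current;
            }
            $current = [];
            $addressIndex = 0;
        }
        if ($current === null) {
            continue; // ignore anything before the first "Ser Name"
        }

        if (strcasecmp($title, 'Address') === 0) {
            // Fall back to a numbered key if more than four addresses show up.
            $key = $addressKeys[$addressIndex] ?? ('address_' . ($addressIndex + 1));
            $addressIndex++;
        } else {
            $key = $title;
        }
        $current[$key] = $value;
    }

    if ($current !== null) {
        $records[] = $current;
    }
    return $records;
}
```

Calling `$records = groupRecords($matches);` on the output above would give a single record in which, for example, `$records[0]['billing_address_1']` is `1123 Sligh`.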
