
How to extract text from a Word doc / docx and perform logic using PHP

My goal is to read an uploaded document and extract certain values from it, such as floats ("1.20", "3.9") and text. I have tried a few libraries, but none of them gets the job done.

Also, most of the files will contain table-like structures, and the extracted output includes the vertical border lines of those tables as well.

What comes to mind is some heavy regex parsing logic...

Does anyone have a solution to suggest?

For a docx file, you can try this:

    // note: the zip_* functions are deprecated as of PHP 8.0
    $content = '';
    $zip = zip_open($file);

    // zip_open() returns an integer error code on failure
    if (!$zip || is_numeric($zip)) return false;

    while ($zip_entry = zip_read($zip)) {
        // only word/document.xml holds the document body
        if (zip_entry_name($zip_entry) != "word/document.xml") continue;
        if (zip_entry_open($zip, $zip_entry) == FALSE) continue;

        $content .= zip_entry_read($zip_entry, zip_entry_filesize($zip_entry));
        zip_entry_close($zip_entry);
    } // end while

    zip_close($zip);

This reads word/document.xml, the part of the DOCX archive that holds the document body, so you still have to strip the XML tags to get plain text. You can't get the text with file_get_contents alone, because a .docx file is a ZIP archive.
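The same idea can be sketched with the ZipArchive class, which avoids the older zip_* functions. This is only a minimal sketch, assuming the zip extension is enabled; docxToText() is a hypothetical helper name, and turning each paragraph end into a newline is just one simple way to keep the text readable.

```php
<?php
// Hypothetical helper: returns the body text of a .docx file, or false on failure.
function docxToText(string $file)
{
    $zip = new ZipArchive();
    if ($zip->open($file) !== true) {
        return false; // not a readable ZIP archive
    }

    // word/document.xml is the part that holds the document body
    $xml = $zip->getFromName('word/document.xml');
    $zip->close();
    if ($xml === false) {
        return false; // the archive has no main document part
    }

    // End every paragraph with a newline so paragraphs do not run together
    $xml = str_replace('</w:p>', "\n", $xml);

    // Drop the remaining XML tags, then decode entities such as &amp;
    return html_entity_decode(strip_tags($xml), ENT_QUOTES | ENT_XML1, 'UTF-8');
}
```

Because only the text nodes survive, table borders do not show up in the result. Text inside table cells should come out as ordinary paragraphs, one per line, though that depends on how the document was authored.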

And for a DOC file:

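    if (($fh = fopen($file, 'rb')) !== false) {
        // the first 0xA00 bytes are treated as the header; the text is assumed to follow it
        $headers = fread($fh, 0xA00);

        // bytes 0x21C to 0x21F are combined, lowest byte first, into the text length

        // lowest byte: lengths from 0 to 255 characters
        $n1 = (ord($headers[0x21C]) - 1);

        // second byte: lengths from 256 to 63743 characters
        $n2 = ((ord($headers[0x21D]) - 8) * 256);

        // third byte: lengths from 63744 to 16775423 characters
        $n3 = ((ord($headers[0x21E]) * 256) * 256);

        // fourth byte: lengths from 16775424 to 4294965504 characters
        $n4 = (((ord($headers[0x21F]) * 256) * 256) * 256);

        // Total length of text in the document
        $textLength = ($n1 + $n2 + $n3 + $n4);
        ini_set('memory_limit', '-1'); // lifts the memory limit for large documents
        // fread() rejects a length below 1, so guard against an empty result
        $extracted_plaintext = $textLength > 0 ? fread($fh, $textLength) : '';
        fclose($fh);
    }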

Once you have converted the document to text, you can pick out the values you need with regular expressions.
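As a rough illustration of that last step, the sketch below pulls decimal numbers such as 1.20 and 3.9 out of extracted text with preg_match_all. The helper name extractFloats() and the pattern are assumptions made for this example; a real document may need a looser or stricter pattern (signs, thousands separators, decimal commas).

```php
<?php
// Hypothetical helper: returns every decimal number found in $text as a float.
function extractFloats(string $text): array
{
    // Digits, a dot, digits; the lookarounds skip pieces of longer
    // dotted sequences such as version strings like 1.2.3
    preg_match_all('/(?<![\d.])\d+\.\d+(?![\d.])/', $text, $matches);

    return array_map('floatval', $matches[0]);
}
```

For a table row such as "Width | 1.20 | Height | 3.9", this yields the two floats 1.2 and 3.9, and the border characters are simply ignored because they never match the pattern.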


 