How can I get the summary information from an MS Word (.DOC) file?

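I only need the page-count property, in PHP, using only built-in functions (no framework and no COM). The input is an "old" .doc file.

Here is what I know and what I have found on this topic; I hope it helps in solving the problem:

The SummaryInformation stream looks like this; it is encoded in binary inside the file: (image not preserved)

I found some C files that show how to extract this data, but I find them hard to understand.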

#include <stdlib.h>
#include <stdio.h>

#include "wv_Base.h"
#include "wv_Common.h"
#include "wv.h"

#include "glib.h"
#include "ms-ole.h"
#include "ms-ole-summary.h"


/*
 * This is a simple example that take an ole file and prints some
 * information from the summaryinformation stream
 */


int main(int argc, char *argv[])
    {
    char *str = NULL;
    int ret = 0;
    short s = 0;
    long l = 0;

    MsOle *ole = NULL;
    MsOleSummary *summary = NULL;

    if (argc < 2)
        {
        fprintf(stderr, "Usage: wvSummary oledocument\n");
        return(1);
        }

    ms_ole_open(&ole, argv[1]);
    if (!ole)
        {
        fprintf(stderr,"sorry problem with getting ole streams from %s\n",argv[1]);
        return 1;
        }

    summary = ms_ole_summary_open(ole);
    if (!summary)
        {
        fprintf(stderr, "Could not open summary stream\n");
        return 1;
        }

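    /* The page count is a long-integer property, so read it with get_long. */
    l = (long) ms_ole_summary_get_long(summary, MS_OLE_SUMMARY_PAGECOUNT, &ret);

    if (ret)
      printf("PageCount is %ld\n", l);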
    else
      printf("no pagecount\n");


    ms_ole_summary_close(summary);
    ms_ole_destroy(&ole);

    return 0;
    }

About MS_OLE_SUMMARY_TITLE:

/**
 * ms-ole-summary.h: MS Office OLE support
 *
 * Author:
 *    Michael Meeks (michael@imaginator.com)
 * From work by:
 *    Caolan McNamara (Caolan.McNamara@ul.ie)
 * Built on work by:
 *    Somar Software's CPPSUM (http://www.somar.com)
 *
 * Copyright 1998-2000 Helix Code, Inc., Frank Chiulli, and others.
 **/

#ifndef MS_OLE_SUMMARY_H
#define MS_OLE_SUMMARY_H

#include <time.h>
#include <libole2/ms-ole.h>

/*
 * MS Ole Property Set IDs
 * The SummaryInformation stream contains the SummaryInformation property set.
 * The DocumentSummaryInformation stream contains both the
 * DocumentSummaryInformation and the UserDefined property sets as sections.
 */
typedef enum {
    MS_OLE_PS_SUMMARY_INFO,
    MS_OLE_PS_DOCUMENT_SUMMARY_INFO,
    MS_OLE_PS_USER_DEFINED_SUMMARY_INFO
} MsOlePropertySetID;

typedef struct {
    guint8          class_id[16];
    GArray *        sections;
    GArray *        items;
    GList *         write_items;
    gboolean        read_mode;
    MsOleStream *       s;
    MsOlePropertySetID  ps_id;
} MsOleSummary;

/* Could store the FID, but why bother ? */
typedef struct {
    guint32         offset;
    guint32         props;
    guint32         bytes;
    MsOlePropertySetID  ps_id;
} MsOleSummarySection;

MsOleSummary *ms_ole_summary_open       (MsOle *f);
MsOleSummary *ms_ole_docsummary_open        (MsOle *f);
MsOleSummary *ms_ole_summary_open_stream    (MsOleStream *stream,
                         const MsOlePropertySetID psid);
MsOleSummary *ms_ole_summary_create     (MsOle *f);
MsOleSummary *ms_ole_docsummary_create      (MsOle *f);
MsOleSummary *ms_ole_summary_create_stream  (MsOleStream *s,
                         const MsOlePropertySetID psid);
GArray       *ms_ole_summary_get_properties (MsOleSummary *si);
void          ms_ole_summary_close      (MsOleSummary *si);


/*
 * Can be used to interrogate a summary item as to its type
 */
typedef enum {
    MS_OLE_SUMMARY_TYPE_STRING  = 0x10,
    MS_OLE_SUMMARY_TYPE_TIME    = 0x20,
    MS_OLE_SUMMARY_TYPE_LONG    = 0x30,
    MS_OLE_SUMMARY_TYPE_SHORT   = 0x40,
    MS_OLE_SUMMARY_TYPE_BOOLEAN = 0x50,
    MS_OLE_SUMMARY_TYPE_OTHER   = 0x60
} MsOleSummaryType;

#define MS_OLE_SUMMARY_TYPE(x) ((MsOleSummaryType)((x)>>8))

/* FIXME MS_OLE_SUMMARY_THUMBNAIL is Preview, no Security, isn't it? */
/*
 *  The MS byte specifies the type, the LS byte is the
 * 'standard' MS PID.
 */
typedef enum {
/* SummaryInformation Stream Properties */
/* String properties */
    MS_OLE_SUMMARY_TITLE          = 0x1002,
    MS_OLE_SUMMARY_SUBJECT        = 0x1003,
    MS_OLE_SUMMARY_AUTHOR         = 0x1004,
    MS_OLE_SUMMARY_KEYWORDS       = 0x1005,
    MS_OLE_SUMMARY_COMMENTS       = 0x1006,
    MS_OLE_SUMMARY_TEMPLATE       = 0x1007,
    MS_OLE_SUMMARY_LASTAUTHOR     = 0x1008,
    MS_OLE_SUMMARY_REVNUMBER      = 0x1009,
    MS_OLE_SUMMARY_APPNAME        = 0x1012,

/* Time properties */
    MS_OLE_SUMMARY_TOTAL_EDITTIME = 0x200A,
    MS_OLE_SUMMARY_LASTPRINTED    = 0x200B,
    MS_OLE_SUMMARY_CREATED        = 0x200C,
    MS_OLE_SUMMARY_LASTSAVED      = 0x200D,

/* Long integer properties */
    MS_OLE_SUMMARY_PAGECOUNT      = 0x300E,
    MS_OLE_SUMMARY_WORDCOUNT      = 0x300F,
    MS_OLE_SUMMARY_CHARCOUNT      = 0x3010,
    MS_OLE_SUMMARY_SECURITY       = 0x3013,

/* Short integer properties */
    MS_OLE_SUMMARY_CODEPAGE       = 0x4001,

/* Security */  
    MS_OLE_SUMMARY_THUMBNAIL      = 0x6011,


/* DocumentSummaryInformation Properties */
/* String properties */
    MS_OLE_SUMMARY_CATEGORY       = 0x1002,
    MS_OLE_SUMMARY_PRESFORMAT     = 0x1003,
    MS_OLE_SUMMARY_MANAGER        = 0x100E,
    MS_OLE_SUMMARY_COMPANY        = 0x100F,

/* Long integer properties */
    MS_OLE_SUMMARY_BYTECOUNT      = 0x3004,
    MS_OLE_SUMMARY_LINECOUNT      = 0x3005,
    MS_OLE_SUMMARY_PARCOUNT       = 0x3006,
    MS_OLE_SUMMARY_SLIDECOUNT     = 0x3007,
    MS_OLE_SUMMARY_NOTECOUNT      = 0x3008,
    MS_OLE_SUMMARY_HIDDENCOUNT    = 0x3009,
    MS_OLE_SUMMARY_MMCLIPCOUNT    = 0X300A,

/* Boolean properties */
    MS_OLE_SUMMARY_SCALE          = 0x500B,
    MS_OLE_SUMMARY_LINKSDIRTY     = 0x5010
} MsOleSummaryPID;


/* bit masks for security long integer */
#define MsOleSummaryAllSecurityFlagsEqNone        0x00
#define MsOleSummarySecurityPassworded            0x01
#define MsOleSummarySecurityRORecommended         0x02
#define MsOleSummarySecurityRO                    0x04
#define MsOleSummarySecurityLockedForAnnotations  0x08

typedef struct {
    GTimeVal time;
    GDate    date;
} MsOleSummaryTime;

typedef struct {
    guint32 len;
    guint8 *data;
} MsOleSummaryPreview;

gchar *         ms_ole_summary_get_string   (MsOleSummary *si,
                             MsOleSummaryPID id,
                             gboolean *available);
gboolean        ms_ole_summary_get_boolean  (MsOleSummary *si,
                             MsOleSummaryPID id,
                             gboolean *available);
guint16         ms_ole_summary_get_short    (MsOleSummary *si,
                             MsOleSummaryPID id,
                             gboolean *available);
guint32         ms_ole_summary_get_long     (MsOleSummary *si,
                             MsOleSummaryPID id,
                             gboolean *available);
GTimeVal        ms_ole_summary_get_time     (MsOleSummary *si,
                             MsOleSummaryPID id,
                             gboolean *available);
MsOleSummaryPreview ms_ole_summary_get_preview  (MsOleSummary *si,
                             MsOleSummaryPID id,
                             gboolean *available);
void            ms_ole_summary_preview_destroy  (MsOleSummaryPreview d);

/* FIXME The next comment isn't true, is it?
   Return TRUE if write is successful */
void            ms_ole_summary_set_string   (MsOleSummary *si,
                             MsOleSummaryPID id,
                             const gchar *str);
void            ms_ole_summary_set_boolean  (MsOleSummary *si,
                             MsOleSummaryPID id,
                             gboolean value);
void            ms_ole_summary_set_short    (MsOleSummary *si,
                             MsOleSummaryPID id,
                             guint16 i);
void            ms_ole_summary_set_long     (MsOleSummary *si,
                             MsOleSummaryPID id,
                             guint32 i);
void            ms_ole_summary_set_time     (MsOleSummary *si,
                             MsOleSummaryPID id,
                             GTimeVal time);
void            ms_ole_summary_set_preview  (MsOleSummary *si,
                             MsOleSummaryPID id,
                             const
                             MsOleSummaryPreview *
                             preview);

#endif  /* MS_OLE_SUMMARY_H */
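
The comment in the header says the most significant byte of each MsOleSummaryPID is a type tag and the least significant byte is the standard property ID. A minimal sketch of that split, assuming the values are copied from the enum above (the helper names are made up for illustration and are not part of libole2):

```c
#include <assert.h>

/* High byte: libole2's type tag, the same result as MS_OLE_SUMMARY_TYPE(x). */
static unsigned summary_type(unsigned id)
{
    assert(id <= 0xFFFFu);          /* the enum values fit in 16 bits */
    return (id >> 8) & 0xFFu;
}

/* Low byte: the property ID as it is stored in the stream. */
static unsigned summary_raw_pid(unsigned id)
{
    return id & 0xFFu;
}

/* Readable name for the type tag (hypothetical helper). */
static const char *summary_type_name(unsigned id)
{
    switch (summary_type(id)) {
    case 0x10: return "string";
    case 0x20: return "time";
    case 0x30: return "long";
    case 0x40: return "short";
    case 0x50: return "boolean";
    default:   return "other";
    }
}
```

For MS_OLE_SUMMARY_PAGECOUNT (0x300E) this gives type 0x30 (long) and raw ID 0x0E. Note that MS_OLE_SUMMARY_TITLE and MS_OLE_SUMMARY_CATEGORY share the value 0x1002, so the ID alone is ambiguous; which stream was opened decides which property is meant.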

The MsOle structure:

/**
 * Structure describing an OLE file
 **/
struct _MsOle {
    int               ref_count;
    gboolean          ole_mmap;
    guint8           *mem;
    guint32           length;
    MsOleSysWrappers *syswrap;

    char              mode;
    int               file_des;
    int               dirty;
    GArray           *bb;      /* Big  blocks status  */
    GArray           *sb;      /* Small block status  */
    GArray           *sbf;     /* The small block file */
    guint32           num_pps; /* Count of number of property sets */
    GList            *pps;     /* Property Storage -> struct _PPS, always 1 valid entry or NULL */
/* if memory mapped */
    GPtrArray        *bbattr;  /* Pointers to block structures */
/* end if memory mapped */
};

Other resources:

http://slackware.mirrors.pair.com/slackware-8.1/source/gnome/libole2/libole2-0.2.4.tar.bz2
ftp://ftp.ca.com/caproducts/Opal/jasmine064/framework/include/

Reference: http://wvware.sourceforge.net/libole2/libole2.html

I have tried it this way, but I did not find the page count:

<?php
echo("<pre>");
$file = "files/doctest.doc";
if (!is_file($file)) die("File not found.");

// open the file in binary mode
$handle = fopen($file, "rb");

// read the whole file into a string, then release the handle
$content = fread($handle, filesize($file));
fclose($handle);

$length = strlen($content);
for ($i = 0; $i < $length; $i++) {
    // current byte and its value 0-255 (2^8)
    $char = $content[$i];
    $decimal = ord($char);

    echo($char);

    // the value in decimal and in base 2
    echo sprintf(" %3d %08b", $decimal, $decimal);
    if ($i % 4 == 0) echo("*");

    // the 4 bytes starting here as an unsigned little-endian 32-bit int;
    // skipped near the end of the file, where fewer than 4 bytes remain
    if ($i + 4 <= $length) {
        $bit32 = unpack("V", substr($content, $i, 4));
        echo sprintf("<br><b>%d</b>", $bit32[1]);
    }

    echo("<br>");
}

Thanks for your help!

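A Word document is a file with a very complex format. The document itself lives in a stream contained inside a Windows Compound Binary File.

Working from the specification requires knowledge of binary data (little-endian byte order), of FAT (because the format uses a FAT internally), and of all kinds of other things.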
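
To make the little-endian point concrete, here is a small sketch that reads little-endian integers from a byte buffer and checks the first header fields of a compound file. The function names are made up for illustration; the 8-byte signature and the position of the sector-shift field (offset 30) are taken from Microsoft's compound file format, and everything past the header (FAT chains, directory entries, mini-stream) is deliberately left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Read an unsigned 16-bit little-endian integer from p. */
static uint16_t read_u16_le(const uint8_t *p)
{
    return (uint16_t)(p[0] | (p[1] << 8));
}

/* Read an unsigned 32-bit little-endian integer from p. */
static uint32_t read_u32_le(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Return 1 if buf starts with the compound-file signature, else 0. */
static int is_compound_file(const uint8_t *buf, size_t len)
{
    static const uint8_t magic[8] =
        { 0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1 };
    assert(buf != NULL);
    return len >= 8 && memcmp(buf, magic, 8) == 0;
}

/* Sector size in bytes, or 0 if the header is too short or implausible. */
static uint32_t sector_size(const uint8_t *hdr, size_t len)
{
    uint16_t shift;
    if (len < 32)
        return 0;
    shift = read_u16_le(hdr + 30);      /* sector shift field */
    if (shift > 16)
        return 0;
    return (uint32_t)1 << shift;        /* a shift of 9 means 512 bytes */
}
```

The same byte order applies to every multi-byte field in the file, which is why the bytes 0E 00 00 00 decode to 14 rather than to a very large number.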

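Without using COM

I assume you are not on Windows (otherwise you would already be using COM/OLE), so here is a program that can read and manipulate Windows CDF files. It is not a framework but a program that you can call with the built-in PHP function system("cdfprogram file.doc").

Another program that reads Word files

Same here: you install it and call it with system() or any of its equivalent siblings.

Why are the Microsoft Office file formats so complicated?

For the following reasons:

  1. They were designed to be fast on very old computers.
  2. They were designed to use libraries.
  3. They were not designed with interoperability in mind.
  4. They have to reflect all the complexity of the applications.
  5. They have to reflect the history of the applications.

Reference: http://www.joelonsoftware.com/items/2008/02/19.html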

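Conclusion

There is no easy way to get the page count from a Word file using only PHP built-in functions. You would have to read all of Microsoft's specifications and build the parser yourself. That is a project in its own right. I don't think anyone will do it for you for free.

Why has nobody tried? Probably because nobody wants to invest that much time when libraries and frameworks that do the job already exist. That is my opinion.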
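
To give a sense of scale, here is a sketch of only the last and easiest step of such a parser: looking up a 32-bit integer property once the bytes of the SummaryInformation stream are already in memory. It assumes the stream layout from Microsoft's property set specification (28-byte stream header, then the FMTID and offset of the first property set) and uses made-up names such as propset_get_i4. Locating and reassembling the stream inside the compound file is not shown, and that is where most of the work lies.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PIDSI_PAGECOUNT 0x0Eu   /* low byte of MS_OLE_SUMMARY_PAGECOUNT */
#define VT_I4           3u      /* property type: 32-bit signed integer */

static uint32_t u32le(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/*
 * Look up a VT_I4 property in the first property set of the stream s.
 * Returns 1 and stores the value in *out when found, 0 otherwise.
 */
static int propset_get_i4(const uint8_t *s, size_t len,
                          uint32_t pid, int32_t *out)
{
    uint32_t set_off, count, i;
    assert(s != NULL && out != NULL);

    /* Header: byte order (2), version (2), system id (4), CLSID (16),
       number of sets (4), then FMTID (16) and offset (4) of set 0. */
    if (len < 48 || s[0] != 0xFE || s[1] != 0xFF)
        return 0;
    if (u32le(s + 24) < 1)
        return 0;
    set_off = u32le(s + 44);
    if (set_off > len || len - set_off < 8)
        return 0;

    /* Property set: size (4), property count (4), then (id, offset) pairs. */
    count = u32le(s + set_off + 4);
    if (count > (len - set_off - 8) / 8)
        return 0;

    for (i = 0; i < count; i++) {
        const uint8_t *entry = s + set_off + 8 + 8 * (size_t)i;
        uint32_t prop_off;
        if (u32le(entry) != pid)
            continue;
        prop_off = u32le(entry + 4);    /* relative to the set start */
        if (prop_off > len - set_off || len - set_off - prop_off < 8)
            return 0;
        if ((u32le(s + set_off + prop_off) & 0xFFFFu) != VT_I4)
            return 0;
        *out = (int32_t)u32le(s + set_off + prop_off + 4);
        return 1;
    }
    return 0;
}
```

Calling propset_get_i4(stream, stream_len, PIDSI_PAGECOUNT, &pages) would then yield the page count that was stored when the document was last saved, provided the property is present at all.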

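Suggestion

How about creating a web service that runs on a Windows machine, where you have access to the COM library? Your main application simply POSTs the Word file to the Windows web service, and the web service returns the page count to your main application. With COM, it is that simple.

You can do this asynchronously so that your uploads are not slowed down; an upload can stay in a "pending validation" state while it waits for the web service to answer.

If you use a web service, it does not have to be on the same server as PHP itself.

The web service would do something like this: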

<?php

$word = new COM("word.application");
if (!$word) {
  echo ("Could not initialise MS Word object.\n"); 
  exit(1);
}
$word->Documents->Open(realpath("C:\\Test\\t.doc")); 

$pages = $word->ActiveDocument->BuiltInDocumentProperties(14); 
echo "Number of pages: " . $pages->value;

$word->ActiveDocument->Close(false); 
$word->Quit(); 
$word = null; 
unset($word);
