Programmatically access document properties for Word 2007 documents

Is there a way to programmatically access the document properties of a Word 2007 document?

I am open to using any language for this, but ideally it would be a PowerShell script.

My overall aim is to traverse the documents somewhere on a filesystem, parse some document properties from these documents, and then collate all of these properties into a new Word document.

I essentially want to automatically create a document that lists all documents beneath a certain folder of the filesystem. For each document, the list would contain such things as the Title, Abstract and Author document properties, the CreateDate field, and so on.

I needed to do this in PowerShell running on a server without MS Office applications installed. The trick, as suggested in another answer, is to peek inside the Office file and examine the XML files embedded within it.

Here's a script that runs like a cmdlet, meaning you can simply save it in your PowerShell scripts directory and call it from any other PowerShell script.

# DocumentOfficePropertiesGet
# Example usage
#   From a PowerShell script:
#       $props = & "c:\PowerShellScriptFolder\DocumentOfficePropertiesGet.ps1" -DocumentFullPathName "d:\documents\my excel doc.xlsx" -OfficeProperties "dcterms:created;dcterms:modified"

# Parameters

#    DocumentFullPathName -- full path and name of MS Office document
#    OfficeProperties -- semi-colon delimited string of property names as they
#              appear in the core.xml file. To see these names, rename any
#              MS Office document file to have the extension .zip, then look inside
#              the zip file. In the docProps folder open the core.xml file. The
#              core document properties are nodes under the cp:coreProperties node.

#         Example: dcterms:created;dcterms:modified;cp:lastModifiedBy

# Return value

#   The function returns a hashtable object -- in the above example, $props would contain
#   the name-value pairs for the requested MS Office document properties. In the calling script,
#   to get at the values:

#        $fooProperty = $props.'dcterms:created'
#        $barProperty = $props.'dcterms:modified'

[CmdletBinding()]
    [OutputType([System.Collections.Hashtable])]
    Param
    (
        [Parameter(Position=0,
            Mandatory=$false,
            HelpMessage="Enter the full path name of the document")]
            [ValidateNotNullOrEmpty()]
            [String] $DocumentFullPathName='e:\temp\supplier_List.xlsx',
        [Parameter(Position=1,
            Mandatory=$false,
            HelpMessage="Enter the Office properties semi-colon delimited")]
            [ValidateNotNullOrEmpty()]
            [String] $OfficeProperties='dcterms:created; dcterms:modified ;cp:lastModifiedBy;dc:creator'
    )
# We need the FileSystem assembly
Add-Type -AssemblyName System.IO.Compression.FileSystem

# This function unzips a zip file -- and it works on MS Office files directly: no need to
# rename them from foo.xlsx to foo.zip. It expects the full path name of the zip file
# and the path name for the unzipped files
function Unzip
{
    param([string]$zipfile, [string]$outpath)

    [System.IO.Compression.ZipFile]::ExtractToDirectory($zipfile, $outpath) *>$null
}

# Remove spaces from the OfficeProperties parameter
$OfficeProperties = $OfficeProperties.replace(' ','')

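# Compose the name of the folder where we will unzip files. A unique suffix keeps
# concurrent runs from deleting each other's files
$zipDirectoryName = Join-Path $env:TEMP ("TempZip_" + [guid]::NewGuid().ToString())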

# delete the zip directory if present
remove-item $zipDirectoryName -force -recurse -ErrorAction Ignore | out-null

# create the zip directory
New-Item -ItemType directory -Path $zipDirectoryName | out-null

# Unzip the files -- i.e. extract the xml files embedded within the MS Office document
unzip $DocumentFullPathName $zipDirectoryName

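# get the docProps\core.xml file as [xml]. Read it explicitly as UTF-8 so that
# non-ASCII values (author names, for instance) are not garbled
$coreXmlName = $zipDirectoryName + "\docProps\core.xml"
[xml]$coreXml = get-content -path $coreXmlName -Raw -Encoding UTF8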

# create an array of the requested properties
$requiredProperties = $OfficeProperties -split ";"

# create a hashtable to return the values
$docProperties = @{}

# Now look for each requested property
foreach($requiredProperty in $requiredProperties)
{
    # Skip empty entries, e.g. from a trailing semi-colon
    if(-not $requiredProperty) { continue }
    # We will be lazy and ignore the namespaces. We need the local name only,
    # i.e. the part after the prefix (or the whole name if there is no prefix)
    $localName = ($requiredProperty -split ":")[-1]
    # Use XPath to fetch the node for this property
    $thisNode = $coreXml.coreProperties.SelectSingleNode("*[local-name(.) = '$localName']")
    if($null -eq $thisNode)
    {
        # To the hashtable, add the requested property name and its value -- null in this case
        $docProperties[$requiredProperty] = $null
    }
    else
    {
        # To the hashtable, add the requested property name and its value
        $docProperties[$requiredProperty] = $thisNode.innerText
    }
}

#clean up
remove-item $zipDirectoryName -force -recurse

# return the properties hashtable. To do this, just write the object to the output stream
$docProperties
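
To get from a single file to the list the question asks for, the same idea can be applied across a folder tree. The following is only a sketch under a few assumptions: the function names `Get-OfficeCoreProperty` and `Get-FolderDocumentProperty` are made up for illustration, the property list defaults to the `title`, `creator`, `description` and `created` elements of core.xml (which element Word's Abstract field maps to is not covered here, so `description` is just a stand-in), and the core.xml part is read straight from the archive rather than unzipped to a temp folder.

```powershell
Add-Type -AssemblyName System.IO.Compression.FileSystem

# Hypothetical helper: read docProps/core.xml directly from the package and
# return the requested core properties as one object per document.
function Get-OfficeCoreProperty {
    param(
        [Parameter(Mandatory = $true)][string] $Path,
        # Local element names in core.xml (assumed defaults, adjust as needed)
        [string[]] $Name = @('title', 'creator', 'description', 'created')
    )

    $result = [ordered]@{ Path = $Path }
    $zip = [System.IO.Compression.ZipFile]::OpenRead($Path)
    try {
        $core = $null
        $entry = $zip.GetEntry('docProps/core.xml')
        if ($null -ne $entry) {
            $core = New-Object System.Xml.XmlDocument
            $stream = $entry.Open()
            try { $core.Load($stream) } finally { $stream.Dispose() }
        }
        foreach ($n in $Name) {
            $value = $null
            if ($null -ne $core) {
                # Match on the local name only, ignoring the namespace prefix
                $node = $core.DocumentElement.SelectSingleNode("*[local-name() = '$n']")
                if ($null -ne $node) { $value = $node.InnerText }
            }
            $result[$n] = $value
        }
    }
    finally {
        $zip.Dispose()
    }
    [pscustomobject]$result
}

# Hypothetical helper: walk a folder tree and emit one row per .docx file.
function Get-FolderDocumentProperty {
    param([Parameter(Mandatory = $true)][string] $Root)

    Get-ChildItem -Path $Root -Recurse -File -Filter '*.docx' |
        # Skip the "~$" lock files Word leaves next to open documents
        Where-Object { $_.Name -notlike '~$*' } |
        ForEach-Object { Get-OfficeCoreProperty -Path $_.FullName }
}
```

Something like `Get-FolderDocumentProperty -Root 'D:\documents' | Export-Csv -Path 'D:\documents\index.csv' -NoTypeInformation` (both paths are placeholders) would then give a table that can be pasted or imported into the summary document. Writing the result into a new Word file is a separate step that this sketch does not attempt.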

My guess is that your best bet is VB or C# and the Office Interop Assemblies. I'm unaware of a native way (within PowerShell) to do what you want.

That said, if you use VB or C#, you could write a PowerShell cmdlet to do the collation. But at that point, it might be simpler to just write a console app that runs as a scheduled task instead.

I recently learned from watching a DNRTV episode that Office 2007 documents are just zipped XML. Therefore, you can rename "Document.docx" to "Document.docx.zip" and see the XML files within. You could probably get the properties via an interop assembly in .NET, but it may be more efficient to look directly at the XML (perhaps with LINQ to XML or some native way I am unaware of).
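
The rename is not strictly needed to look inside. As a minimal sketch, assuming .NET 4.5 or later is available, the parts of the package can be listed from PowerShell with `System.IO.Compression.ZipFile`; the function name `Get-OfficePackageEntry` is made up for this example.

```powershell
Add-Type -AssemblyName System.IO.Compression.FileSystem

# Hypothetical helper: list the parts inside an Office 2007+ file
# (.docx, .xlsx, .pptx) without renaming or extracting it.
function Get-OfficePackageEntry {
    param([Parameter(Mandatory = $true)][string] $Path)

    $zip = [System.IO.Compression.ZipFile]::OpenRead($Path)
    try {
        # Collect the names before the archive is disposed
        @($zip.Entries | ForEach-Object { $_.FullName })
    }
    finally {
        $zip.Dispose()
    }
}
```

Calling `Get-OfficePackageEntry -Path 'Document.docx'` should print the part names; the document properties discussed in this thread live in the `docProps/core.xml` part.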

I wrote up how to do this back in the Monad beta days. I think it should still work.
