
Extracting data from MS Word

I am looking for a way to extract / scrape data from Word files into a database. Our corporate procedures have Minutes of Meetings with clients documented in MS Word files, mostly due to history and inertia.

I want to be able to pull the action items from these meeting minutes into a database so that we can access them from a web-interface, turn them into tasks and update them as they are completed.

Which is the best way to do this?

  1. VBA macro from inside Word to create CSV and then upload to the DB?
  2. VBA macro in Word with connection to DB (how does one connect to MySQL from VBA?)
  3. Python script via win32com then upload to DB?

The last one is attractive to me as the web-interface is being built with Django, but I've never used win32com or tried scripting Word from Python.

EDIT: I've started extracting the text with VBA because it makes it a little easier to deal with the Word Object Model. I am having a problem though: all the text is in tables, and when I pull the strings out of the cells I want, I get a strange little box character at the end of each string. My code looks like:

sFile = "D:\temp\output.txt"
fnum = FreeFile
Open sFile For Output As #fnum

num_rows = Application.ActiveDocument.Tables(2).Rows.Count

For n = 1 To num_rows
    Descr = Application.ActiveDocument.Tables(2).Cell(n, 2).Range.Text
    Assign = Application.ActiveDocument.Tables(2).Cell(n, 3).Range.Text
    Target = Application.ActiveDocument.Tables(2).Cell(n, 4).Range.Text
    If Target = "" Then
        ExportText = ""
    Else
        ExportText = Descr & Chr(44) & Assign & Chr(44) & _
            Target & Chr(13) & Chr(10)
        Print #fnum, ExportText
    End If
Next n

Close #fnum

What's up with the little control character box? Is some kind of character code coming across from Word?

Word has a little marker thingy that it puts at the end of every cell of text in a table.

It is used just like an end-of-paragraph marker in paragraphs: to store the formatting for the entire paragraph.

Just use the Left() function to strip it out, i.e.

 Left(Target, Len(Target)-1)

By the way, instead of:

 num_rows = Application.ActiveDocument.Tables(2).Rows.Count
 For n = 1 To num_rows
      Descr = Application.ActiveDocument.Tables(2).Cell(n, 2).Range.Text

Try this:

 For Each row in Application.ActiveDocument.Tables(2).Rows
      Descr = row.Cells(2).Range.Text
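
For anyone taking the Python route instead, the same clean-up can be sketched roughly as follows. This is only a sketch under assumptions, not the asker's code: `clean_cell_text` and `rows_to_csv` are hypothetical helper names, the cell strings are assumed to have been pulled out of Word already, and each one is assumed to end in a carriage return and/or the end-of-cell marker (character code 7).

```python
import csv
import io

# Assumption: Word terminates each cell's text with a carriage return
# and/or the end-of-cell marker (character code 7).
CELL_TERMINATORS = "\r\x07"


def clean_cell_text(text):
    """Strip trailing cell terminators, then surrounding whitespace."""
    return text.rstrip(CELL_TERMINATORS).strip()


def rows_to_csv(rows):
    """Return CSV text for (description, assignee, target) rows.

    Rows whose target is empty after clean-up are skipped, mirroring
    the intent of the `If Target = ""` check in the VBA loop.
    """
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for row in rows:
        descr, assign, target = (clean_cell_text(cell) for cell in row)
        if target:
            writer.writerow([descr, assign, target])
    return buffer.getvalue()
```

Using the `csv` module rather than joining with a literal comma means a description that itself contains a comma gets quoted instead of splitting into extra columns.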

Well, I've never scripted Word, but it's pretty easy to do simple stuff with win32com. Something like:

from win32com.client import Dispatch
word = Dispatch('Word.Application')
doc = word.Documents.Open('d:\\stuff\\myfile.doc')  # Open lives on the Documents collection
doc.SaveAs(FileName='d:\\stuff\\text\\myfile.txt', FileFormat=2)  # 2 = wdFormatText (plain text)
doc.Close()
word.Quit()

This is untested, but I think something like that will just open the file and save it as plain text (provided FileFormat is set to a plain-text format); you could then read the text into Python and manipulate it from there. There is probably a way to grab the contents of the file directly, too, but I don't know it offhand. Documentation can be hard to find, but if you've got VBA docs or experience, you should be able to carry them across.
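
One way to grab the table contents directly, rather than going through a text file, might look like the sketch below. It assumes pywin32 on a Windows machine with Word installed, uses the same object model members as the VBA in the question (`Tables`, `Rows.Count`, `Cell(row, column).Range.Text`), and has not been run against the asker's documents. `read_table` and `open_document` are hypothetical helper names, and a table with merged cells may raise an error on `Cell`.

```python
def read_table(doc, table_index, columns):
    """Return the text of the given 1-based columns for each table row.

    `doc` is a Word Document COM object (or anything exposing the same
    Tables / Rows.Count / Cell(row, col).Range.Text members).
    """
    table = doc.Tables(table_index)
    rows = []
    for n in range(1, table.Rows.Count + 1):  # Word collections are 1-based
        cells = []
        for col in columns:
            text = table.Cell(n, col).Range.Text
            # Assumption: cell text ends with a carriage return and/or
            # the end-of-cell marker (character code 7); drop both.
            cells.append(text.rstrip("\r\x07"))
        rows.append(tuple(cells))
    return rows


def open_document(path):
    """Open a document in Word via COM; needs Windows, Word and pywin32."""
    from win32com.client import Dispatch  # imported lazily: Windows-only

    word = Dispatch("Word.Application")
    return word, word.Documents.Open(path)
```

On a suitable machine, `word, doc = open_document('d:\\stuff\\myfile.doc')` followed by `read_table(doc, 2, (2, 3, 4))` would return the description, assignee and target columns of the second table; calling `doc.Close(False)` and `word.Quit()` afterwards closes the document without saving and shuts Word down.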

Have a look at this post from a while ago: http://mail.python.org/pipermail/python-list/2002-October/168785.html Scroll down to COMTools.py; there are some good examples there.

You can also run makepy.py (part of the pythonwin distribution) to generate Python "signatures" for the COM functions available, and then look through it as a kind of documentation.

You could use OpenOffice. It can open Word files, and it can also run Python macros.

I'd point to the related questions on the right: the top one seems to have some good ideas for going the Python route.

How about saving the file as XML, then using Python or something else to pull the data out of Word and into the database?
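
As a rough illustration of the XML route, the sketch below reads table cells using only the standard library. It assumes the minutes are saved as .docx (a zip archive whose main part is `word/document.xml`) rather than the older Word 2003 XML format, and that tables are not nested; `docx_tables` is a hypothetical helper name.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by elements in word/document.xml.
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_tables(docx_file):
    """Return every table in a .docx as a list of rows of cell strings.

    `docx_file` is a path or a binary file-like object. Paragraphs
    within one cell are joined with newlines.
    """
    with zipfile.ZipFile(docx_file) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))

    tables = []
    for tbl in root.iter(W + "tbl"):
        rows = []
        for tr in tbl.findall(W + "tr"):  # direct child rows only
            row = []
            for tc in tr.findall(W + "tc"):  # direct child cells only
                paragraphs = [
                    "".join(t.text or "" for t in p.iter(W + "t"))
                    for p in tc.iter(W + "p")
                ]
                row.append("\n".join(paragraphs))
            rows.append(row)
        tables.append(rows)
    return tables
```

Because the text comes from the XML rather than from `Range.Text`, there is no end-of-cell marker to strip with this approach.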

It is possible to programmatically save a Word document as HTML and to import the table(s) contained into Access. This requires very little effort.
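
If the HTML export is to be read from Python rather than imported into Access (a swapped-in alternative, not what this answer describes), a minimal sketch with the standard library's `html.parser` could look like this. `TableExtractor` and `html_tables` are hypothetical names, nested tables are not handled, and the HTML that Word actually produces may need more clean-up than shown here.

```python
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by table and row."""

    def __init__(self):
        super().__init__()
        self.tables = []
        self._cell = None  # list of text fragments while inside a cell

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self.tables[-1].append([])
        elif tag in ("td", "th") and self.tables and self.tables[-1]:
            self._cell = []

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # Collapse runs of whitespace left over from the markup.
            text = " ".join("".join(self._cell).split())
            self.tables[-1][-1].append(text)
            self._cell = None

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)


def html_tables(html):
    """Return all tables in an HTML string as lists of rows of cell text."""
    parser = TableExtractor()
    parser.feed(html)
    parser.close()
    return parser.tables
```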
