Extracting data from MS Word
I am looking for a way to extract (scrape) data from Word files into a database. Our corporate procedures have minutes of client meetings documented in MS Word files, mostly due to history and inertia.
I want to be able to pull the action items from these meeting minutes into a database so that we can access them from a web interface, turn them into tasks, and update them as they are completed.
Which is the best way to do this:
The last one is attractive to me as the web interface is being built with Django, but I've never used win32com or tried scripting Word from Python.
EDIT: I've started extracting the text with VBA because it makes it a little easier to deal with the Word Object Model. I am having a problem though: all the text is in tables, and when I pull the strings out of the cells I want, I get a strange little box character at the end of each string. My code looks like:
sFile = "D:\temp\output.txt"
fnum = FreeFile
Open sFile For Output As #fnum
num_rows = Application.ActiveDocument.Tables(2).Rows.Count
For n = 1 To num_rows
    Descr = Application.ActiveDocument.Tables(2).Cell(n, 2).Range.Text
    Assign = Application.ActiveDocument.Tables(2).Cell(n, 3).Range.Text
    Target = Application.ActiveDocument.Tables(2).Cell(n, 4).Range.Text
    If Target = "" Then
        ExportText = ""
    Else
        ExportText = Descr & Chr(44) & Assign & Chr(44) & _
            Target & Chr(13) & Chr(10)
        Print #fnum, ExportText
    End If
Next n
Close #fnum
What's up with the little control character box? Is some kind of character code coming across from Word?
Word puts a small marker at the end of every cell of text in a table.
It is used just like the end-of-paragraph marker in paragraphs: to store the formatting for the entire paragraph.
Just use the Left() function to strip it out, i.e.
    Left(Target, Len(Target) - 1)
By the way, instead of
    num_rows = Application.ActiveDocument.Tables(2).Rows.Count
    For n = 1 To num_rows
        Descr = Application.ActiveDocument.Tables(2).Cell(n, 2).Range.Text
try this:
    For Each row In Application.ActiveDocument.Tables(2).Rows
        Descr = row.Cells(2).Range.Text
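If the cell text ends up being handled in Python (the question mentions Django and win32com), the same cleanup could look roughly like the sketch below. It assumes the cell string ends with a carriage return followed by the bell character (Chr(13) and Chr(7)), which is how the end-of-cell marker usually shows up in Range.Text as far as I know; if so, the VBA version may need Len(Target) - 2 rather than - 1. The name strip_cell_marker is made up for illustration.

```python
# Characters assumed to make up Word's end-of-cell marker in Range.Text.
CELL_END_CHARS = "\r\x07"


def strip_cell_marker(text):
    """Remove the trailing end-of-cell marker from a Word table cell string."""
    # rstrip removes any run of these characters at the end of the string,
    # so trailing empty paragraphs inside the cell are dropped as well.
    return text.rstrip(CELL_END_CHARS)
```

With this in place an empty cell becomes an empty string, so a check like the question's If Target = "" behaves as intended.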
Well, I've never scripted Word, but it's pretty easy to do simple stuff with win32com. Something like:
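from win32com.client import Dispatch
word = Dispatch('Word.Application')
# Documents are opened through the Documents collection, not the Application object
doc = word.Documents.Open('d:\\stuff\\myfile.doc')
# FileFormat=2 is wdFormatText (plain text) in Word's WdSaveFormat enumeration
doc.SaveAs(FileName='d:\\stuff\\text\\myfile.txt', FileFormat=2)
doc.Close()
word.Quit()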
This is untested, but I think something like that will just open the file and save it as plain text (provided the right FileFormat value is used); you could then read the text into Python and manipulate it from there. There is probably a way to grab the contents of the file directly, too, but I don't know it offhand; documentation can be hard to find, but if you've got VBA docs or experience, you should be able to carry them across.
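For grabbing the table contents directly rather than going through a text file, something along these lines might work. It is an untested sketch: the column numbers (2 to 4) and Tables(2) are taken from the question's VBA code, the function names are invented here, and the trailing "\r\x07" cell marker is an assumption about what Range.Text returns. The COM import sits inside export_to_csv so the row-handling part can be tried without Word installed.

```python
import csv

CELL_END = "\r\x07"  # assumed end-of-cell marker on Word cell text


def extract_action_items(doc, table_index=2):
    """Return (description, assignee, target) tuples from one table of a Word document."""
    items = []
    for row in doc.Tables(table_index).Rows:
        # Columns 2-4 mirror the layout used in the question's VBA code.
        descr, assign, target = (
            row.Cells(col).Range.Text.rstrip(CELL_END) for col in (2, 3, 4)
        )
        if target:  # skip rows with an empty target cell
            items.append((descr, assign, target))
    return items


def export_to_csv(doc_path, csv_path):
    """Open doc_path in Word via COM and write its action items to csv_path."""
    from win32com.client import Dispatch  # needs pywin32 and an installed Word

    word = Dispatch("Word.Application")
    doc = word.Documents.Open(doc_path)
    try:
        with open(csv_path, "w", newline="", encoding="utf-8") as f:
            csv.writer(f).writerows(extract_action_items(doc))
    finally:
        doc.Close(False)  # False: close without saving changes
        word.Quit()
```

On a Windows machine with Word, calling export_to_csv(r"d:\stuff\myfile.doc", r"d:\temp\output.csv") would be the intended use. The csv module quotes commas and line breaks inside cell text, which a hand-built comma-joined string does not.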
Have a look at this post from a while ago: http://mail.python.org/pipermail/python-list/2002-October/168785.html Scroll down to COMTools.py; there are some good examples there.
You can also run makepy.py (part of the pythonwin distribution) to generate Python "signatures" for the available COM functions, and then look through the generated file as a kind of documentation.
You could use OpenOffice. It can open Word files and can also run Python macros.
I'd say look at the related questions on the right -> The top one seems to have some good ideas for going the Python route.
How about saving the file as XML, then using Python or something else to pull the data out of Word and into the database?
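One way the XML route could look, as a rough sketch: it assumes the minutes are saved as .docx (a zip archive holding word/document.xml), which is not necessarily the XML format meant above, and docx_tables is a made-up name. Only the standard library is used, so Word does not need to be installed on the machine doing the parsing.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by elements in word/document.xml.
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_tables(source):
    """Return every table in a .docx as a list of rows, each a list of cell strings.

    source may be a file path or a binary file-like object.
    """
    with zipfile.ZipFile(source) as zf:
        root = ET.fromstring(zf.read("word/document.xml"))

    tables = []
    for tbl in root.iter(W + "tbl"):  # also visits tables nested in cells
        rows = []
        for tr in tbl.findall(W + "tr"):
            cells = []
            for tc in tr.findall(W + "tc"):
                # A cell holds one or more paragraphs; join their text runs.
                paragraphs = [
                    "".join(t.text or "" for t in p.iter(W + "t"))
                    for p in tc.findall(W + "p")
                ]
                cells.append("\n".join(paragraphs))
            rows.append(cells)
        tables.append(rows)
    return tables
```

The second table of a document would then be docx_tables(path)[1], and there is no end-of-cell marker to strip because the text comes straight from the XML.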
It is possible to programmatically save a Word document as HTML and to import the table(s) it contains into Access. This requires very little effort.
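If the target is not Access, the saved HTML can also be read from Python. The sketch below is not the Access import described above; it is a stand-in using only the standard library's html.parser, with invented names (TableRows, html_table_rows), and it assumes simple, non-nested tables.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of each table cell, grouped by row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None outside <tr>
        self._cell = None  # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # Collapse the whitespace and line breaks that exported HTML tends to contain.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def html_table_rows(html):
    """Return all table rows found in an HTML string as lists of cell text."""
    parser = TableRows()
    parser.feed(html)
    parser.close()
    return parser.rows
```

Rows from every table in the page end up in one flat list, so picking out the action-item table would need an extra step, for example matching on its header row.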