How can I parse a large DOCX file and pick out key words/strings that appear n number of times in Python?
I have very large DOCX files that I was hoping to parse through and be able to build a database of sorts that shows the frequency of a word/string in the documents. From what I gather this is definitely not an easy task. I was just hoping for some direction as to a library that I could use to help me with this.

This is an example of what one may look like. The structure isn't consistent, so that will complicate things as well. Any direction will be appreciated!
If (as per your comment) you're able to do this in Python, look at the following snippets:
So first thing to realise is that docx files are actually .zip archives containing a number of XML files. Most text content will be stored in word/document.xml. Word does some complicated things with numbered lists, which will require you to also load other XMLs like styles.xml.
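You can see this layout with nothing but the standard-library zipfile module. The snippet below builds a tiny in-memory archive that mimics the docx structure (a real file would simply be opened with zipfile.ZipFile("yourfile.docx")):

```python
import io
import zipfile

# Build a minimal stand-in for a .docx archive in memory; the member
# names mirror the layout of a real Word file.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("word/document.xml", "<w:document/>")
    zf.writestr("word/styles.xml", "<w:styles/>")

# Reopen it and pull out the main document part, exactly as you
# would with an actual .docx on disk.
with zipfile.ZipFile(buf) as zf:
    print(zf.namelist())  # ['word/document.xml', 'word/styles.xml']
    xml_bytes = zf.read("word/document.xml")
```

From here, xml_bytes is ordinary XML that any parser (lxml, xml.etree) can load.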
The markup of DOCX files can be a pain, as the document is structured in w:p (paragraphs) and arbitrary w:r (runs). These runs are basically 'a bit of typing', so a single run can be one letter or a couple of words together.
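A minimal illustration of that paragraph/run structure, parsed with the standard library (the XML fragment below is hand-written for the example, not taken from a real document):

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# One paragraph (w:p) whose text Word has split across three runs (w:r).
xml = """
<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
  <w:body>
    <w:p>
      <w:r><w:t>Hel</w:t></w:r>
      <w:r><w:t>lo </w:t></w:r>
      <w:r><w:t>world</w:t></w:r>
    </w:p>
  </w:body>
</w:document>
"""

root = ET.fromstring(xml)
paragraphs = []
for p in root.iter("{%s}p" % W):
    # Join the w:t text of every run to recover the paragraph's full text.
    paragraphs.append("".join(t.text or "" for t in p.iter("{%s}t" % W)))

print(paragraphs)  # ['Hello world']
```

This is why you generally want to gather text per paragraph rather than per run.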
We use UpdateableZipFile from https://stackoverflow.com/a/35435548 . This was primarily because we also wanted to be able to edit the documents, so you could potentially just use snippets from it.
import os
from lxml import etree
# UpdateableZipFile is the class copied from the answer linked above

source_file = UpdateableZipFile(os.path.join(path, input_file))
nsmap = {'w': "http://schemas.openxmlformats.org/wordprocessingml/2006/main",
         'mc': "http://schemas.openxmlformats.org/markup-compatibility/2006",
         }  # you might need a few more namespace definitions if you get funky docx inputs
# read_member returns the root of an lxml etree object based on the document.xml tree
document = source_file.read_member('word/document.xml')
# Query the XML using XPath (don't use regex); this gives the text of all paragraph nodes:
paragraph_list = document.xpath("//w:p/descendant-or-self::*/text()", namespaces=nsmap)
You can then feed the text to an NLP library such as spaCy:
import spacy

nlp = spacy.load("en_core_web_sm")
word_counts = {}
for paragraph in paragraph_list:
    doc = nlp(paragraph)
    for token in doc:
        if token.text in word_counts:
            word_counts[token.text] += 1
        else:
            word_counts[token.text] = 1
spaCy will tokenize the text for you, and can do lots more in terms of Named Entity Recognition, Part-of-Speech tagging, etc.
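If you only need raw frequencies and want to avoid the spaCy dependency, collections.Counter with a crude regex tokenizer gives the same kind of tally (the sample paragraph_list below is stand-in data; a \w+ split is much rougher than spaCy's tokenizer):

```python
import re
from collections import Counter

paragraph_list = ["The quick brown fox", "the fox jumps"]  # stand-in data

word_counts = Counter()
for paragraph in paragraph_list:
    # Lowercase so "The" and "the" count as one word; \w+ is a crude split.
    word_counts.update(re.findall(r"\w+", paragraph.lower()))

print(word_counts.most_common(2))  # [('the', 2), ('fox', 2)]

# Words that appear at least n times, for the "appears n times" requirement:
n = 2
frequent = {word: count for word, count in word_counts.items() if count >= n}
print(frequent)  # {'the': 2, 'fox': 2}
```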