How can I parse a large DOCX file and pick out key words/strings that appear n number of times in Python?

I have very large DOCX files that I was hoping to parse so that I can build a database of sorts showing the frequency of each word/string in the documents. From what I gather, this is definitely not an easy task. I was just hoping for some direction as to a library that could help me with this.

[image]

This is an example of what one may look like. The structure isn't consistent, so that will complicate things as well. Any direction will be appreciated!

Python-based solution

If (as per your comment) you're able to do this in Python, look at the following snippets:

So the first thing to realise is that DOCX files are actually .zip archives containing a number of XML files. Most text content is stored in word/document.xml. Word does some complicated things with numbered lists, which will require you to also load other XML files such as styles.xml.
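To see the archive structure for yourself, here is a minimal sketch using only the standard-library zipfile module. The helper name list_docx_parts is hypothetical, and the archive is a tiny in-memory stand-in so the snippet runs without a real file. A real DOCX will contain more parts than the two written here.

```python
import io
import zipfile


def list_docx_parts(docx):
    """Return the member names of a DOCX archive (path or file-like object)."""
    with zipfile.ZipFile(docx) as archive:
        return archive.namelist()


# Build a tiny stand-in archive in memory so the sketch runs without a real file.
# A real DOCX contains more parts (content types, relationships, and so on).
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("word/document.xml", "<doc/>")
    archive.writestr("word/styles.xml", "<styles/>")

parts = list_docx_parts(buffer)
print(parts)
```

For your own documents, pass the file path instead of the in-memory buffer, e.g. list_docx_parts("report.docx") (a placeholder file name).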

The markup of DOCX files can be a pain, as the document is structured in w:p (paragraphs) and arbitrary w:r (runs). These runs are basically 'a bit of typing', so a run can be a single letter or a couple of words together.
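Because a single word can be split across several runs, it may help to join the text of all runs in a paragraph before counting anything. The following is a rough sketch of that idea, not the approach used elsewhere in this answer: it uses the standard-library zipfile and xml.etree.ElementTree modules, and the function name paragraph_texts and the sample XML are made up for illustration.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NSMAP = {"w": W_NS}


def paragraph_texts(docx):
    """Return one string per w:p, made by joining the text of its w:t nodes."""
    with zipfile.ZipFile(docx) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    texts = []
    for paragraph in root.iter(f"{{{W_NS}}}p"):
        pieces = [t.text or "" for t in paragraph.iterfind(".//w:t", NSMAP)]
        texts.append("".join(pieces))
    return texts


# Hypothetical sample: the word "Hello" is split across two runs.
SAMPLE_XML = (
    f'<w:document xmlns:w="{W_NS}"><w:body>'
    "<w:p><w:r><w:t>Hel</w:t></w:r><w:r><w:t>lo world</w:t></w:r></w:p>"
    "<w:p><w:r><w:t>Second paragraph</w:t></w:r></w:p>"
    "</w:body></w:document>"
)
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("word/document.xml", SAMPLE_XML)

paragraphs = paragraph_texts(buffer)
print(paragraphs)
```

This sketch ignores tabs, line breaks, and other non-text run content, so treat it as a starting point rather than a complete extractor.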

We use UpdateableZipFile from https://stackoverflow.com/a/35435548. This was primarily because we also wanted to be able to edit the documents, so you could potentially just use snippets from it.

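import os
from lxml import etree
# UpdateableZipFile is the ZipFile subclass from https://stackoverflow.com/a/35435548;
# this import assumes you saved that class locally as updateable_zip_file.py.
from updateable_zip_file import UpdateableZipFile

# path and input_file are placeholders for the location of your own document.
source_file = UpdateableZipFile(os.path.join(path, input_file))
nsmap = {'w': "http://schemas.openxmlformats.org/wordprocessingml/2006/main",
         'mc': "http://schemas.openxmlformats.org/markup-compatibility/2006",
        }  # you might need a few more namespace definitions if you get funky docx inputs

# Parse word/document.xml into the root element of an lxml tree
# (read() is inherited from zipfile.ZipFile and returns the raw bytes).
document = etree.fromstring(source_file.read('word/document.xml'))

# Query the XML using XPath (don't use regex); this returns every text node
# found under the paragraph nodes, one string per text fragment:
paragraph_list = document.xpath("//w:p/descendant-or-self::*/text()", namespaces=nsmap)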

You can then feed the text to an NLP library such as spaCy:

from collections import Counter

import spacy

nlp = spacy.load("en_core_web_sm")
word_counts = Counter()

# nlp.pipe processes the texts in batches; each doc yields its tokens.
for doc in nlp.pipe(paragraph_list):
    word_counts.update(token.text for token in doc)

spaCy will tokenize the text for you, and can do a lot more in terms of Named Entity Recognition, part-of-speech tagging, etc.
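To get from raw counts to the words that appear at least n times, a small filter over the counts is enough. The sketch below swaps spaCy for a simple regular-expression tokenizer so that it runs with the standard library only. The function name words_appearing_at_least and the sample sentences are hypothetical, and the lowercasing is an assumption about how you want to group words.

```python
import re
from collections import Counter


def words_appearing_at_least(paragraphs, n):
    """Count lowercase word tokens and keep those seen at least n times."""
    counts = Counter()
    for paragraph in paragraphs:
        counts.update(re.findall(r"\w+", paragraph.lower()))
    return {word: count for word, count in counts.items() if count >= n}


sample_paragraphs = ["The cat sat.", "The dog sat.", "A bird flew."]
frequent = words_appearing_at_least(sample_paragraphs, 2)
print(frequent)
```

The same filter works on counts produced by spaCy tokens. Only the tokenizing step differs.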
