
How can I extract data from a docx file?

I want to find the number of paragraphs, sentences, words and unique words in a docx file. I have already installed python-docx and nltk. I tried many things, but nothing worked, and I'm out of ideas right now.

This, for example, gives me unique letters instead of unique words:

import docx
from nltk import FreqDist

def getText(filename):
    doc = docx.Document(filename)
    fullText = []
    for para in doc.paragraphs:
        fullText.append(para.text)
    return '\n'.join(fullText)

letexte = getText('demo.docx')
#print(letexte)

dist = FreqDist(letexte)
vocab = dist.keys()

print(len(dist))
print(vocab)

Anyway, I'm lost.

Can you show how you'd do it with a random demo.docx with more than 4 pages? Thank you.

To find the unique words in a text you can use a simple Python script: pass the result of your getText() to it and you get a list containing only the unique items. Applying len() to that list gives the number of unique words.

import re

...

def count_unique_words(text_string):
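    # Split on "; ", ", ", "*", newline + space, or any single whitespace character.
    word_list = re.split(r'; |, |\*|\n |\s', text_string)
    # dict.fromkeys() drops duplicates and keeps first-seen order; skip empty strings.
    return list(dict.fromkeys(w for w in word_list if w))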

...
print(len(count_unique_words(letexte)))
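
As for why the code in the question prints letters: FreqDist(letexte) receives a plain string, and iterating over a string yields single characters, so the characters get counted. Passing a list of words fixes that. Below is a minimal sketch that returns all four counts the question asks for, using only the standard library. The names text_stats and docx_stats are hypothetical, and the regex-based sentence and word splitting is a rough assumption about what counts as a sentence or a word; nltk's sent_tokenize and word_tokenize could probably be swapped in for better results.

```python
import re
from collections import Counter


def text_stats(paragraphs):
    """Return paragraph, sentence, word and unique-word counts.

    `paragraphs` is a list of strings, e.g. [p.text for p in doc.paragraphs].
    """
    # Skip empty paragraphs (blank lines in the document show up as empty strings).
    paragraphs = [p for p in paragraphs if p.strip()]

    # Naive sentence split: break after '.', '!' or '?' followed by whitespace.
    sentences = []
    for p in paragraphs:
        sentences.extend(s for s in re.split(r'(?<=[.!?])\s+', p.strip()) if s)

    # Words: runs of letters/digits (apostrophes allowed inside),
    # lowercased so that "The" and "the" count as the same word.
    words = []
    for p in paragraphs:
        words.extend(re.findall(r"[^\W_]+(?:'[^\W_]+)*", p.lower()))

    freq = Counter(words)  # counts whole words because `words` is a list
    return {
        'paragraphs': len(paragraphs),
        'sentences': len(sentences),
        'words': len(words),
        'unique_words': len(freq),
    }


def docx_stats(filename):
    """Hypothetical wrapper: read a .docx with python-docx, then count."""
    import docx  # assumes python-docx is installed
    doc = docx.Document(filename)
    return text_stats([p.text for p in doc.paragraphs])


sample = ["The cat sat. The dog barked!", "", "Is it late? Yes."]
print(text_stats(sample))
```

With a real file you would call docx_stats('demo.docx') instead of passing a list. Counting per paragraph rather than on the '\n'-joined string keeps the paragraph count available and avoids a sentence running across a paragraph break.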

