
How do I write a python script that can read doc/docx files and convert them to txt?

Basically I have a folder with plenty of .doc/.docx files. I need them in .txt format. The script should iterate over all the files in a directory, convert them to .txt files and store them in another folder.

How can I do it?

Does there exist a module that can do this?

I figured this would make an interesting quick programming project. This has only been tested on a simple .docx file containing "Hello, world!", but the train of logic should give you a place to work from to parse more complex documents.

from shutil import copyfile, rmtree
import sys
import os
import zipfile
from lxml import etree

# command format: python3 docx_to_txt.py Hello.docx

# let's get the file name
zip_dir = sys.argv[1]
# cut off the .docx, make it a .zip
zip_dir_zip_ext = os.path.splitext(zip_dir)[0] + '.zip'
# make a copy of the .docx and put it in .zip
copyfile(zip_dir, zip_dir_zip_ext)
# unzip the .zip
zip_ref = zipfile.ZipFile(zip_dir_zip_ext, 'r')
zip_ref.extractall('./temp')
# get the xml out of /word/document.xml
data = etree.parse('./temp/word/document.xml')
# we'll want to go over all 't' elements in the xml node tree.
# note that MS office uses namespaces and that the w must be defined in the namespaces dictionary args
# each :t element is the "text" of the file. that's what we're looking for
# result is a list filled with the text of each t node in the xml document model
result = [(node.text or '').strip() for node in data.xpath("//w:t", namespaces={'w':'http://schemas.openxmlformats.org/wordprocessingml/2006/main'})]
# dump result into a new .txt file
with open(os.path.splitext(zip_dir)[0]+'.txt', 'w', encoding='utf-8') as txt:
    # join the elements of result together since txt.write can't take lists
    joined_result = '\n'.join(result)
    # write it into the new file
    txt.write(joined_result)
# close the zip_ref file
zip_ref.close()
# get rid of our mess of working directories
rmtree('./temp')
os.remove(zip_dir_zip_ext)

I'm sure there's a more elegant or pythonic way to accomplish this. You'll need to have the file you want to convert in the same directory as the python file. Command format is python3 docx_to_txt.py file_name.docx
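One way to tighten the script above is to read word/document.xml straight out of the archive, which avoids the temporary .zip copy and the ./temp directory. The following is a minimal sketch using only the standard library, not the answer's own method. The name docx_to_text is hypothetical, and joining the runs of each w:p element into one line is an assumption about the desired output. It ignores tabs, manual line breaks, headers and footnotes.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace, written in ElementTree's {uri}tag form
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_to_text(path):
    """Return the text of a .docx file, one line per paragraph (hypothetical helper)."""
    with zipfile.ZipFile(path) as archive:
        # read the main document part directly, without extracting to disk
        root = ET.fromstring(archive.read("word/document.xml"))

    lines = []
    # each w:p element is a paragraph; its w:t descendants hold the text runs
    for paragraph in root.iter(W_NS + "p"):
        runs = [node.text or "" for node in paragraph.iter(W_NS + "t")]
        lines.append("".join(runs))
    return "\n".join(lines)
```

Because nothing is written next to the input file, the document no longer has to sit in the same directory as the script.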

conda install -c conda-forge python-docx

from docx import Document

doc = Document(file)  # file: path to a .docx file

for p in doc.paragraphs:
    print(p.text)

Thought I would share my approach. It boils down to two commands that convert either a .doc or a .docx file to a string; each option requires its own package:

import docx
import os
import glob
import subprocess
import sys

# .docx (pip3 install python-docx)
doctext = "\n".join(p.text for p in docx.Document(infile).paragraphs)
# .doc (apt-get install antiword)
doctext = subprocess.check_output(["antiword", infile]).decode("utf-8")
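Since the .doc command shells out to an external program, it may help to check that the program is installed before calling it. A minimal sketch: doc_to_text and its tool parameter are hypothetical names, and decoding the output as UTF-8 simply follows the command above.

```python
import shutil
import subprocess


def doc_to_text(infile, tool="antiword"):
    """Return the text of a legacy .doc file by running an external converter."""
    # shutil.which returns None when the executable is not on PATH
    if shutil.which(tool) is None:
        raise RuntimeError("{0} is not installed or not on PATH".format(tool))
    # check_output raises CalledProcessError if the tool exits with an error
    output = subprocess.check_output([tool, infile])
    # errors="replace" avoids a crash if the output is not valid UTF-8
    return output.decode("utf-8", errors="replace")
```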

I then wrap these solutions up in a function, that can either return the result as a python string, or write to a file (with the option of appending or replacing).

import docx
import os
import glob
import subprocess
import sys

def doc2txt(infile, outfile, return_string=False, append=False):
    if os.path.exists(infile):
        if infile.endswith(".docx"):
            try:
                doctext = "\n".join(p.text for p in docx.Document(infile).paragraphs)
            except Exception as e:
                print("Exception in converting .docx to str: ", e)
                return None
        elif infile.endswith(".doc"):
            try:
                doctext = subprocess.check_output(["antiword", infile]).decode("utf-8")
            except Exception as e:
                print("Exception in converting .doc to str: ", e)
                return None
        else:
            print("{0} is not .doc or .docx".format(infile))
            return None

        if return_string:
            return doctext
        else:
            writemode = "a" if append else "w"
            # the with block closes the file, so no explicit close() is needed
            with open(outfile, writemode) as f:
                f.write(doctext)
    else:
        print("{0} does not exist".format(infile))
        return None

I then would call this function via something like:

files = glob.glob("/path/to/filedir/**/*.doc*", recursive=True)
outfile = "/path/to/out.txt"
for file in files:
    doc2txt(file, outfile, return_string=False, append=True)
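The loop above appends everything to a single out.txt, whereas the question asks for one .txt per document in a separate folder. A minimal sketch of that variant follows. convert_folder and its extract_text parameter are hypothetical names, and extract_text can be any callable that takes a path and returns a string, or None on failure.

```python
from pathlib import Path


def convert_folder(src_dir, dst_dir, extract_text):
    """Write one .txt file into dst_dir for every .doc/.docx file in src_dir.

    extract_text is any callable that takes a file path (str) and returns
    the document text as a str, or None if the conversion failed.
    """
    src, dst = Path(src_dir), Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)  # create the output folder if needed
    written = []
    for path in sorted(src.iterdir()):
        # skip subdirectories and files with other extensions
        if not path.is_file() or path.suffix.lower() not in (".doc", ".docx"):
            continue
        text = extract_text(str(path))
        if text is None:
            continue
        # report.docx becomes report.txt in the output folder
        target = dst / (path.stem + ".txt")
        target.write_text(text, encoding="utf-8")
        written.append(target)
    return written
```

With the function defined earlier, a call could look like convert_folder("/path/to/filedir", "/path/to/txtdir", lambda p: doc2txt(p, None, return_string=True)). Note that report.doc and report.docx would both map to report.txt, so the later one overwrites the earlier one.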

I don't need this operation often, but so far the script has worked for all my needs. If you find a bug in this function, let me know in a comment.


 