I would like to convert a large batch of MS Word files into the plain text format. I have no idea how to do it in Python. I found the following code online. My path is local and all file names are like cx-xxx (ie c1-000, c1-001, c2-000, c2-001 etc.):
from docx import [name of file]
import io
import shutil
import os
def convertDocxToText(path):
for d in os.listdir(path):
fileExtension=d.split(".")[-1]
if fileExtension =="docx":
docxFilename = path + d
print(docxFilename)
document = Document(docxFilename)
textFilename = path + d.split(".")[0] + ".txt"
with io.open(textFilename,"c", encoding="utf-8") as textFile:
for para in document.paragraphs:
textFile.write(unicode(para.text))
path= "/home/python/resumes/"
convertDocxToText(path)
Convert docx to txt with pypandoc:
import pypandoc
# Example file:
docxFilename = 'somefile.docx'
output = pypandoc.convert_file(docxFilename, 'plain', outputfile="somefile.txt")
assert output == ""
See the official documentation here:
You can also use the library docx2txt in Python. Here's an example:
I use glob to iter over all DOCX files in the folder. Note: I use a little list comprehension on the original name in order to re-use it in the TXT filename.
If there's anything I've forgotten to explain, tag me and I'll edit it in.
import docx2txt
import glob
directory = glob.glob('C:/folder_name/*.docx')
for file_name in directory:
with open(file_name, 'rb') as infile:
outfile = open(file_name[:-5]+'.txt', 'w', encoding='utf-8')
doc = docx2txt.process(infile)
outfile.write(doc)
outfile.close()
infile.close()
print("=========")
print("All done!")`
GroupDocs.Conversion Cloud SDK for Python supports 50+ file formats conversion. Its free plan provides 150 free API calls monthly.
# Import module
import groupdocs_conversion_cloud
from shutil import copyfile
# Get your client_id and client_key at https://dashboard.groupdocs.cloud (free registration is required).
client_id = "xxxxx-xxxx-xxxx-xxxx-xxxxxxxx"
client_key = "xxxxxxxxxxxxxxxxxxxxxxxxxxxx"
# Create instance of the API
convert_api = groupdocs_conversion_cloud.ConvertApi.from_keys(client_id, client_key)
try:
#Convert DOCX to txt
# Prepare request
request = groupdocs_conversion_cloud.ConvertDocumentDirectRequest("txt", "C:/Temp/sample.docx")
# Convert
result = convert_api.convert_document_direct(request)
copyfile(result, 'C:/Temp/sample.txt')
except groupdocs_conversion_cloud.ApiException as e:
print("Exception when calling get_supported_conversion_types: {0}".format(e.message))
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.