Python textract ImportError

I have begun using the Python library textract to parse text from PowerPoint (.pptx), Word documents (.docx), and text files (.txt). I wrote a simple script to test it.

# Python textract test script
import textract

# Raw string so the backslashes in the Windows path are taken literally
text = textract.process(r"H:\My Documents\Test.docx")
print(text)

When I run it, either on the command line or in IDLE, I get a traceback whose last few lines are:

File: "C:...\\textract\\parsers\\docx_parser.py", line 1 in
    import docx2txt
ImportError: No module named docx2txt

I am using version 1.5.0, downloaded from https://pypi.python.org/pypi/textract . I don't know why it would not include its dependencies. Will I have to install docx2txt and, in turn, its own dependencies? Why would the textract package not contain everything I need?

I would recommend using pip install xxx to install the module. That installs it in a location Python normally searches, and it should also take care of dependencies.

If you installed it manually or just extracted it to a folder, then set your path correctly, as described in "How to add to the pythonpath in windows 7?" or "Python - PYTHONPATH in linux".

If you think you've set it correctly, then post its value, your current working directory (pwd), etc.
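A minimal sketch of how those details could be collected from inside Python, using only the standard library. The helper name describe_environment is made up for this illustration; it is not part of textract or pip.

```python
import os
import sys

def describe_environment():
    """Collect the details worth posting when an import fails.

    Hypothetical helper for illustration; standard library only.
    """
    return {
        "executable": sys.executable,           # which interpreter is running
        "version": sys.version.split()[0],      # e.g. "3.11.4"
        "cwd": os.getcwd(),                     # what pwd would print
        "pythonpath": os.environ.get("PYTHONPATH", ""),  # empty if unset
        "sys_path": list(sys.path),             # where imports are searched
    }

if __name__ == "__main__":
    for key, value in describe_environment().items():
        print(key, "=", value)
```

If the folder that contains the module is not listed under sys_path, the import will fail no matter what is installed elsewhere.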

textract does not automatically install the dependencies for all the file types it supports. You selectively install the ones you're interested in.

While this is not as elegant as one might hope, I think it is the appropriate design choice here. Python doesn't have the ability to install dependencies on demand, so the only alternative would be for textract to install all of the dozen or more possible dependencies, which would tend to bloat your Python environment.

So in this case, as Kashyap mentions, the appropriate action is:

pip install docx2txt

and similarly for any other file type dependencies you might need.
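To see which of these modules are missing before textract trips over them, something like the following sketch may help. It assumes Python 3 (for importlib.util.find_spec) and top-level module names only. The helper missing_modules and the list wanted are made up for this example, and only docx2txt is confirmed by the traceback in the question.

```python
import importlib.util

def missing_modules(module_names):
    """Return the names from module_names that cannot be imported here.

    Hypothetical helper; expects top-level module names (no dots).
    """
    return [name for name in module_names
            if importlib.util.find_spec(name) is None]

# Illustrative list: only docx2txt is known from the traceback in the question.
wanted = ["docx2txt"]

for name in missing_modules(wanted):
    # Assumption: the pip package name matches the module name,
    # which is not true for every package.
    print("missing:", name, "-> try: pip install", name)
```

Add further module names to wanted as ImportError messages reveal them.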

This worked for me.

Open a terminal and run the following commands:

# Create and activate a virtual environment
python -m venv env
source ./env/bin/activate
# Install pip and the system packages listed for textract's parsers
sudo apt update
sudo apt install python-pip && pip install --upgrade pip
sudo apt install python-dev libxml2-dev libxslt1-dev antiword unrtf poppler-utils pstotext tesseract-ocr flac ffmpeg lame libmad0 libsox-fmt-mp3 sox libjpeg-dev swig
# Install textract itself
pip install textract

If you run into any errors, try the following:

pip install https://pypi.python.org/packages/ce/c7/ab6cd0d00ddf8dc3b537cfb922f3f049f8018f38c88d71fd164f3acb8416/SpeechRecognition-3.6.3-py2.py3-none-any.whl
sudo apt install libpulse-dev
pip install textract
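Because these steps mix a virtual environment with system-wide apt packages, it can be worth confirming that the interpreter you run afterwards is the one inside the environment. A rough check, assuming Python 3 and an environment created with venv (the function name in_virtualenv is made up for this sketch):

```python
import sys

def in_virtualenv():
    """Return True if this interpreter appears to run inside a venv.

    Hypothetical helper: inside a venv, sys.prefix points at the
    environment while sys.base_prefix points at the base installation.
    """
    return sys.prefix != getattr(sys, "base_prefix", sys.prefix)

if __name__ == "__main__":
    print("interpreter:", sys.executable)
    print("inside a virtual environment:", in_virtualenv())
```

If this reports False after activation, pip install textract probably went into a different Python than the one running your script.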
