
How can I convert a file of any format to text using Python 3.6?

I am trying to build a converter that can turn a file of any format into text, so that processing becomes easier for me. I have used the Python textract library.
Here is the documentation: https://textract.readthedocs.io/en/stable/

I installed it with pip and tried to use it, but I got an error and could not work out how to resolve it.

>>> import textract
>>> text = textract.process('C:\Users\beta\Desktop\Projects Done With Specification.pdf', method='pdfminer')
  File "<stdin>", line 1
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 2-3: truncated \UXXXXXXXX escape

I also tried the command without specifying a method.

>>> import textract
>>> text = textract.process('C:\Users\beta\Desktop\Projects Done With Specification.pdf')
  File "<stdin>", line 1
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 2-3: truncated \UXXXXXXXX escape

Kindly let me know how I can get rid of this issue. If there is anything else that would be handy instead of textract, please suggest that as well.

The \ character means different things in different contexts. In Windows pathnames, it is the directory separator. In Python strings, it introduces escape sequences. When specifying paths, you have to account for this.

Try any one of these:

text = textract.process('C:\\Users\\beta\\Desktop\\Projects Done With Specification.pdf', method='pdfminer')
text = textract.process(r'C:\Users\beta\Desktop\Projects Done With Specification.pdf', method='pdfminer')
text = textract.process('C:/Users/beta/Desktop/Projects Done With Specification.pdf', method='pdfminer')
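To see that the three spellings above name the same file, a small check like the following can help. It is only a sketch: it compares the strings and uses `pathlib.PureWindowsPath` from the standard library to normalise the forward-slash form. It does not call textract or touch the disk, and the variable names are made up for illustration.

```python
from pathlib import PureWindowsPath

# The three spellings suggested above (variable names are illustrative only).
escaped = 'C:\\Users\\beta\\Desktop\\Projects Done With Specification.pdf'
raw = r'C:\Users\beta\Desktop\Projects Done With Specification.pdf'
forward = 'C:/Users/beta/Desktop/Projects Done With Specification.pdf'

# Doubled backslashes and the raw string yield exactly the same characters.
same_string = (escaped == raw)

# The forward-slash form is a different string, but it is the same
# path under Windows path semantics.
same_path = (PureWindowsPath(forward) == PureWindowsPath(raw))

print(same_string, same_path)
```

Because `PureWindowsPath` is a pure path class, this check runs on any operating system, not just Windows.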

The problem is with the string:

'C:\Users\beta\Desktop\Projects Done With Specification.pdf'

The \U starts an eight-character Unicode escape, such as '\U00014321'. In your code, the \U is followed by the character 's', which is not a hexadecimal digit, so the escape is invalid.

You need to either double every backslash or prefix the string with r (to produce a raw string).
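The failure can be reproduced without textract at all, because it happens while Python parses the literal, before any function is called. The sketch below holds the source text of two literals as data and asks `compile()` whether each one parses. The helper name `parses` and the shortened path are assumptions for illustration.

```python
# Source text of a string literal, stored in a raw string so that this
# script itself parses cleanly. The shortened path is illustrative only.
bad_literal = r"'C:\Users\beta'"

# The same literal with an r prefix, i.e. a raw string literal.
good_literal = "r" + bad_literal


def parses(literal_source):
    """Return True if Python can compile the literal, False on SyntaxError."""
    try:
        compile(literal_source, "<example>", "eval")
        return True
    except SyntaxError:
        return False


print(parses(bad_literal))   # the \U escape is truncated, so this is False
print(parses(good_literal))  # the raw literal is accepted, so this is True
```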

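Try encoding='utf-8'. Note that the path still has to be written as a raw string (or with doubled backslashes), because the SyntaxError is raised when the literal is parsed, before the encoding argument is ever used.

textract.process(r'C:\Users\beta\Desktop\Projects Done With Specification.pdf', encoding='utf-8')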

In your case, the error is due to an invalid path. Try one of these instead: 'C:\\Users\\beta\\Desktop\\Projects Done With Specification.pdf' or 'C:/Users/beta/Desktop/Projects Done With Specification.pdf'

import textract
text = textract.process(r'C:\Users\myname\Desktop\doc\an.docx', encoding='utf-8')

This worked for me. Give it a try.

textract didn't work for me when I was trying to convert a slurm output file to a text file, but a simple with open did.

with open('disktest.o1761955', 'r') as f:
    txt = f.read()
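For files that are already plain text, a slightly more defensive version of the above might look like this. The helper name `read_plain_text`, the UTF-8 default and the `errors="replace"` choice are assumptions for illustration, not part of the original answer.

```python
def read_plain_text(path, encoding="utf-8"):
    """Read a file that is already plain text.

    Bytes that cannot be decoded with the given encoding are replaced
    with U+FFFD instead of raising UnicodeDecodeError.
    """
    with open(path, "r", encoding=encoding, errors="replace") as f:
        return f.read()
```

It would be called as `txt = read_plain_text('disktest.o1761955')`. Passing the encoding explicitly avoids depending on the platform default, which differs between Windows and Linux.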
