
How to display PDF file contents as well as its full name in the browser using a CGI Python script?

I want to display the full path of a PDF file together with its contents in the browser. My script has an input HTML form where the user enters a file name and submits it. The script searches for the file and, if it is found in the subdirectories, outputs the file contents to the browser and should also display its name. I am able to display the contents, but not the full file name at the same time; if I display the file name, the contents show up as garbage characters. Please guide.


Script a.py:

import os
import cgi
import cgitb 
cgitb.enable()
import sys
import webbrowser

def check_file_extension(display_file):
    input_file = display_file
    nm,file_extension = os.path.splitext(display_file)
    return file_extension

form = cgi.FieldStorage()

type_of_file =''
file_nm = ''
nm =''
not_found = 3

if form.has_key("file1"):
    file_nm = form["file1"].value

type_of_file = check_file_extension(file_nm)

pdf_paths = [ '/home/nancy/Documents/',]

# Change the path while executing on the server , else it will throw error 500
image_paths = [ '/home/nancy/Documents/']


if type_of_file == '.pdf':
    search_paths = pdf_paths
else:
    # .jpg
    search_paths = image_paths
for path in search_paths:
    for root, dirnames, filenames in os.walk(path):
        for f in filenames:
            if f == str(file_nm).strip():
                absolute_path_of_file = os.path.join(root,f)
                # print 'Content-type: text/html\n\n'
                # print '<html><head></head><body>'
                # print absolute_path_of_file
                # print '</body></html>'
#                 print """Content-type: text/html\n\n
# <html><head>absolute_path_of_file</head><body>
# <img src=file_display.py />
# </body></html>"""
                not_found = 2
                if  search_paths == pdf_paths:
                    print 'Content-type: application/pdf\n'
                else:
                    print 'Content-type: image/jpg\n'
                file_read = file(absolute_path_of_file,'rb').read()
                print file_read
                print 'Content-type: text/html\n\n'
                print absolute_path_of_file
                break
        break
    break

if not_found == 3:
    print  'Content-type: text/html\n'
    print '%s not found' % file_nm

The HTML is a regular page with just one input field for the file name.

It is not possible, at least not that simply. Some web browsers don't display PDFs but ask the user to download the file, some display them themselves, some embed an external PDF viewer component, and some start an external PDF viewer. There is no standard, cross-browser way to embed a PDF into HTML, which is what you would need in order to display arbitrary text and the PDF content.

A fallback solution that works in every browser is to render the PDF pages on the server as images and serve those to the client. This puts some stress on the server (processor, memory/disk for caching, bandwidth).

Some modern, HTML5-capable browsers can render PDFs with Mozilla's pdf.js on a canvas element.

For others you could try <embed> / <object> to use Adobe's plugin, as described on Adobe's The PDF Developer Junkie Blog.
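Where a browser can show PDFs inline, the usual way to get both the path and the content on screen is to split the work into two requests: the first returns an HTML page with the path and an <embed> element, the second returns the raw PDF bytes with their own Content-Type. The following is only a sketch of the first half, written for Python 3. The function name `build_viewer_page` and the `raw` query parameter are hypothetical, not part of the original script, and whether the embedded viewer appears still depends on the browser.

```python
import html
from urllib.parse import urlencode


def build_viewer_page(script_url, filename, absolute_path):
    """Return an HTML page that names the file and embeds the PDF.

    The embedded URL points back at the same script with a hypothetical
    ``raw=1`` parameter; a second request carrying that parameter is
    expected to answer with ``Content-Type: application/pdf`` and the
    file bytes only.
    """
    pdf_url = script_url + '?' + urlencode({'file': filename, 'raw': '1'})
    return (
        '<html><head><title>{title}</title></head><body>'
        '<p>{path}</p>'
        '<embed src="{src}" type="application/pdf" width="100%" height="800">'
        '</body></html>'
    ).format(
        title=html.escape(filename),
        path=html.escape(absolute_path),
        # The URL sits inside an attribute, so quotes and "&" are escaped.
        src=html.escape(pdf_url, quote=True),
    )


if __name__ == '__main__':
    print('Content-Type: text/html\n')
    print(build_viewer_page('/cgi-bin/a.py', 'report.pdf',
                            '/home/nancy/Documents/sub/report.pdf'))
```

The point of the split is that one HTTP response carries exactly one Content-Type, which is why printing a second header block after the PDF bytes cannot work.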


Rendering the pages on the server

Rendering and serving the PDF pages as images needs some software on the server to query the number of pages and to extract and render a given page as an image.

The number of pages can be determined with the pdfinfo program from Xpdf or the libpoppler command line utilities. Converting a page from the PDF file to a JPG image can be done with convert from the ImageMagick tools. A very simple CGI program using these programs:

#!/usr/bin/env python
import cgi
import cgitb; cgitb.enable()
import os
from itertools import imap
from subprocess import check_output

PDFINFO = '/usr/bin/pdfinfo'
CONVERT = '/usr/bin/convert'
DOC_ROOT = '/home/bj/Documents'

BASE_TEMPLATE = (
    'Content-type: text/html\n\n'
    '<html><head><title>{title}</title></head><body>{body}</body></html>'
)
PDF_PAGE_TEMPLATE = (
    '<h1>{filename}</h1>'
    '<p>{prev_link} {page}/{page_count} {next_link}</p>'
    '<p><img src="{image_url}" style="border: solid thin gray;"></p>'
)

SCRIPT_NAME = os.environ['SCRIPT_NAME']


def create_page_url(filename, page_number, type_):
    return '{0}?file={1}&page={2}&type={3}'.format(
        cgi.escape(SCRIPT_NAME, True),
        cgi.escape(filename, True),
        page_number,
        type_
    )


def create_page_link(text, filename, page_number):
    text = cgi.escape(text)
    if page_number is None:
        return '<span style="color: gray;">{0}</span>'.format(text)
    else:
        return '<a href="{0}">{1}</a>'.format(
            create_page_url(filename, page_number, 'html'), text
        )


def get_page_count(filename):

    def parse_line(line):
        key, _, value = line.partition(':')
        return key, value.strip()

    info = dict(
        imap(parse_line, check_output([PDFINFO, filename]).splitlines())
    )
    return int(info['Pages'])


def get_page(filename, page_index):
    return check_output(
        [
            CONVERT,
            '-density', '96',
            '{0}[{1}]'.format(filename, page_index),
            'jpg:-'
        ]
    )


def send_error(message):
    print BASE_TEMPLATE.format(
        title='Error', body='<h1>Error</h1>{0}'.format(message)
    )


def send_page_html(_pdf_path, filename, page_number, page_count):
    body = PDF_PAGE_TEMPLATE.format(
        filename=cgi.escape(filename),
        page=page_number,
        page_count=page_count,
        image_url=create_page_url(filename, page_number, 'jpg'),
        prev_link=create_page_link(
            '<<', filename, page_number - 1 if page_number > 1 else None
        ),
        next_link=create_page_link(
            '>>',
            filename,
            page_number + 1 if page_number < page_count else None
        )
    )
    print BASE_TEMPLATE.format(title='PDF', body=body)


def send_page_image(pdf_path, _filename, page_number, _page_count):
    image_data = get_page(pdf_path, page_number - 1)
    print 'Content-type: image/jpeg'
    print 'Content-Length:', len(image_data)
    print
    print image_data


TYPE2SEND_FUNCTION = {
    'html': send_page_html,
    'jpg': send_page_image,
}


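def main():
    form = cgi.FieldStorage()
    # Default to an empty name so a missing parameter yields the error page.
    filename = form.getfirst('file', '')
    try:
        page_number = int(form.getfirst('page', 1))
    except ValueError:
        page_number = 1
    # Unknown types fall back to the HTML view instead of raising KeyError.
    send_function = TYPE2SEND_FUNCTION.get(
        form.getfirst('type', 'html'), send_page_html
    )

    pdf_path = os.path.abspath(os.path.join(DOC_ROOT, filename))
    # Compare against DOC_ROOT plus a separator so that a sibling directory
    # such as '/home/bj/Documents2' does not pass the prefix check.
    if os.path.isfile(pdf_path) and pdf_path.startswith(DOC_ROOT + os.sep):
        page_count = get_page_count(pdf_path)
        page_number = min(max(1, page_number), page_count)
        send_function(pdf_path, filename, page_number, page_count)
    else:
        send_error(
            '<p>PDF file <em>{0!r}</em> not found.</p>'.format(
                cgi.escape(filename)
            )
        )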


main()
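The script above targets Python 2, where print can write the raw JPEG string. On Python 3, print would emit the repr of a bytes object, so binary data has to go to the underlying byte stream instead. The following is a minimal sketch of that step, assuming Python 3; `build_binary_response` and `send_binary` are hypothetical helper names, not part of the original answer.

```python
import sys


def build_binary_response(content_type, payload):
    """Return a complete CGI response (headers plus body) as bytes."""
    headers = (
        'Content-Type: {0}\r\n'
        'Content-Length: {1}\r\n'
        '\r\n'
    ).format(content_type, len(payload))
    return headers.encode('ascii') + payload


def send_binary(content_type, payload, stream=None):
    """Write the response to the binary layer of stdout (or to `stream`)."""
    stream = stream if stream is not None else sys.stdout.buffer
    stream.write(build_binary_response(content_type, payload))
    stream.flush()
```

In this sketch, `send_binary('image/jpeg', image_data)` would take the place of the four print statements in send_page_image. It also sends exactly the number of bytes announced in Content-Length, with no trailing newline.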

There are Python bindings for libpoppler, so the call to the external pdfinfo program could be replaced quite easily with that module. It may also be used to extract more information from the pages, such as the links on them, to create HTML image maps. With the libcairo Python bindings installed it may even be possible to render a page without an external process.
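Since every image request starts a convert process, the processor cost mentioned earlier can be reduced by keeping rendered pages on disk. The following is a rough sketch of such a cache for Python 3, not something the original answer provides. The names `cache_path_for` and `get_page_cached` and the cache directory layout are assumptions, and `render` stands for any function that returns the page image as bytes, for example a Python 3 port of get_page.

```python
import hashlib
import os


def cache_path_for(cache_dir, pdf_path, page_index):
    """Build a cache file name that changes whenever the PDF changes."""
    stat = os.stat(pdf_path)
    key = '{0}|{1}|{2}|{3}'.format(
        pdf_path, stat.st_mtime_ns, stat.st_size, page_index
    )
    digest = hashlib.sha256(key.encode('utf-8')).hexdigest()
    return os.path.join(cache_dir, digest + '.jpg')


def get_page_cached(cache_dir, pdf_path, page_index, render):
    """Return the JPEG bytes for a page, rendering only on a cache miss.

    `render(pdf_path, page_index)` must return the image as bytes.
    """
    target = cache_path_for(cache_dir, pdf_path, page_index)
    try:
        with open(target, 'rb') as cached:
            return cached.read()
    except FileNotFoundError:
        pass
    image_data = render(pdf_path, page_index)
    os.makedirs(cache_dir, exist_ok=True)
    # Write to a temporary name first so that a concurrent request never
    # reads a half-written file, then rename into place.
    temp_name = '{0}.{1}.tmp'.format(target, os.getpid())
    with open(temp_name, 'wb') as handle:
        handle.write(image_data)
    os.replace(temp_name, target)
    return image_data
```

Including the modification time and size in the key means a replaced PDF gets fresh images without any explicit invalidation. Stale files are never removed in this sketch, so a real deployment would need some cleanup policy.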
