
What application does Google use to show PDF attachments in Gmail?

I watched the traffic when Google displays PDF attachments from Gmail in a new window. The content is served as one PNG image per PDF page, and the text can still be selected. What does Google use on the server side to generate a PNG file for a particular page in a PDF file? How does text selection on a PNG file work? Any ideas?

By default attachments are viewed securely using https://docs.google.com/gview, however it turns out you are allowed to request files over plain HTTP. This makes it a little easier to figure out what is going on using Wireshark.

As you indicated, it was already clear that the PDF is converted on the server side to PNG (ImageMagick is indeed a reasonable solution for this purpose). The obvious reason for this is to preserve the exact layout while still letting people view the file without a PDF viewer.
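For anyone who wants to reproduce the PDF-to-PNG step with ImageMagick, here is a minimal Python sketch. It is not what Google runs; it only shows one way to rasterize a single page. It assumes ImageMagick 7 (the `magick` binary; version 6 uses `convert` instead) with Ghostscript installed for PDF input, and the function names and file names are made up for illustration.

```python
import subprocess


def build_render_command(pdf_path, page_index, png_path, dpi=150):
    """Build an ImageMagick command that rasterizes one PDF page to a PNG.

    Hypothetical helper: assumes the ImageMagick 7 `magick` binary is on PATH.
    """
    return [
        "magick",
        "-density", str(dpi),           # resolution; must come before the input file
        f"{pdf_path}[{page_index}]",    # [n] selects a single zero-based page
        png_path,
    ]


def render_page(pdf_path, page_index, png_path, dpi=150):
    """Run the command; raises CalledProcessError if ImageMagick fails."""
    subprocess.run(build_render_command(pdf_path, page_index, png_path, dpi), check=True)
```

Calling `render_page("document.pdf", 0, "page-0.png")` would write the first page as a PNG, provided the local ImageMagick security policy allows reading PDFs.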

However, from looking at the traffic I found out that the entire PDF is also converted to a custom XML format when calling /gview?a=gt&docid=&chan=&thid= (this is done as soon as you request the document). As I couldn't use Wireshark to copy the XML, I resorted to the Firefox extension Live HTTP Headers. Here's an excerpt:

<pdf2xml>
    <meta name="Author" content="Bruce van der Kooij"/>
    <meta name="Creator" content="Writer"/>
    <meta name="Producer" content="OpenOffice.org 3.0"/>
    <meta name="CreationDate" content="20090218171300+01'00'"/>
    <page t="0" l="0" w="595" h="842">
        <text l="188" t="99" w="213" h="27" p="188,213">Programmabureau</text>
        <text l="85" t="127" w="425" h="27" p="85,117,209,61,277,21,305,124,436,75">Nederland Open in Verbinding (NOiV)</text>
    </page>
</pdf2xml>

I'm not quite sure yet what all the attributes on the text element stand for (with the exception of w and h), but they're obviously the coordinates of the text and possibly its length. As the JavaScript Google uses is minified (or possibly obfuscated, but this is not likely), figuring out precisely how the client-side selection function works is not that easy. But most likely it uses this XML file to figure out what text the user is looking at and then copies that to the user's clipboard.
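To illustrate how such an XML file could drive text selection on top of an image, here is a rough Python sketch that parses the excerpt and looks up the line of text under a point. It rests on guesses: that `l` and `t` are the left and top of each line's box, and that the XML coordinates map 1:1 to image pixels. The function names are hypothetical, and Google's actual client-side code may work quite differently.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the excerpt captured from the gview response.
SAMPLE = """<pdf2xml>
    <page t="0" l="0" w="595" h="842">
        <text l="188" t="99" w="213" h="27" p="188,213">Programmabureau</text>
        <text l="85" t="127" w="425" h="27" p="85,117,209,61,277,21,305,124,436,75">Nederland Open in Verbinding (NOiV)</text>
    </page>
</pdf2xml>"""


def load_lines(xml_text):
    """Return a (left, top, width, height, text) tuple for every <text> element."""
    root = ET.fromstring(xml_text)
    lines = []
    for node in root.iter("text"):
        # Assumption: l/t/w/h are left, top, width and height in page units.
        box = tuple(int(node.get(key)) for key in ("l", "t", "w", "h"))
        lines.append(box + (node.text,))
    return lines


def text_at(lines, x, y):
    """Return the text whose bounding box contains the point (x, y), or None."""
    for left, top, width, height, text in lines:
        if left <= x < left + width and top <= y < top + height:
            return text
    return None
```

A viewer could call `text_at(load_lines(SAMPLE), x, y)` with the mouse position to decide which line the user is pointing at.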

Note that there is an open source (GPL licensed) tool called pdf2xml which has similar but not quite the same output. Here's the example from their homepage:

<?xml version="1.0" encoding="utf-8" ?>
<pdf2xml pages="3">
  <title>My Title</title>
  <page width="780" height="1152">
    <font size="10" face="MHCJMH+FuturaT-Bold" color="#FF0000">
      <text x="324" y="37" width="132" height="10">Friday, September 27, 2002</text>
      <img x="324" y="232" width="277" height="340" src="text_pic0001.png"/>
      <link x="324" y="232" width="277" height="340" dest_page="2" dest_x="141" dest_y="187"/>
    </font>
    <font size="12" face="AGaramond-Regular" italic="true" bold="true">
      <text x="509" y="68" width="121" height="12">This is a test PDF file</text>
      <link x="509" y="68" width="121" height="12" href="www.mobipocket.com"/>
    </font>
  </page>
</pdf2xml>

Hope this information is useful in some way. However, as one of the other posters mentioned, the only way to be sure what Google does is to ask them. It's a shame Google doesn't have an official IRC channel, but they do have a forum for Google Docs support questions.

Good luck.

Google uses a non-open-sourced PDF converter app developed in-house. So you're better off looking into the links posted in the other answers, since you can't get your hands on the Google version. Sorry!

If you have the text, you can of course do whatever you want with it.

More specifically, you should check out this link: pdf to png using php

So ImageMagick will be needed.

Edit: another interesting link.

Edit: I found this at Google and it looks interesting, so you could use the Google Document List Data API. Here is a blog post about it: Google API Now Lets You Get Documents in Many Formats

Of course, to be sure what Google uses, you need an answer from them :)

Good luck!

To see what a PDF was created with, right-click on it and go to Document Properties (in Adobe Reader). The tool is listed as the "PDF Producer". I think Google uses both Prince and iText (though not in combination) for creating PDFs. Google has made some major modifications to these toolkits to create the end product.

Well, this might just be the pdf2xml tool Google is using. They only abbreviated the full words width, height, etc. and added the p attribute, which turns out to be the attribute containing the coordinates of the words inside the line. Just played with it and found out :) Going to use this pdf2xml from Google :P Upload, let them convert, then use the XML to transform to... EPUB? :P
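Here is a small Python sketch of how the p attribute could be decoded. It reads the numbers as (left, width) pairs, one per whitespace-separated word. That reading is an inference from the sample excerpt in the earlier answer, where each pair's left edge falls just after the previous word ends, and is not a documented format. The function names are hypothetical.

```python
def word_boxes(text, p):
    """Pair each word of a line with its (left, width) taken from the p attribute.

    Assumption: p holds alternating left/width values, one pair per word.
    """
    words = text.split()
    numbers = [int(n) for n in p.split(",")]
    pairs = list(zip(numbers[0::2], numbers[1::2]))
    if len(pairs) != len(words):
        raise ValueError("p does not contain one (left, width) pair per word")
    return [(word, left, width) for word, (left, width) in zip(words, pairs)]


def word_at(text, p, x):
    """Return the word whose horizontal span contains x, or None if x is in a gap."""
    for word, left, width in word_boxes(text, p):
        if left <= x < left + width:
            return word
    return None
```

With the second line of the excerpt, `word_at("Nederland Open in Verbinding (NOiV)", "85,117,209,61,277,21,305,124,436,75", 310)` picks out "Verbinding", which is the kind of lookup a word-level selection would need.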

You may also want to investigate using Lucene to index those big PDF files and serve the relevant pages to your users.

See http://www.jguru.com/faq/view.jsp?EID=1074237 for more ideas.


 