
Capturing PDF files using Python Selenium WebDriver

Original post: 2013-02-01 17:40:43 | Tags: python, pdf, selenium, webdriver, selenium-webdriver

We test an in-house application with a Python test suite that drives web navigation and interaction through Selenium WebDriver. A tricky part of our web testing is dealing with a series of PDF reports in the app. We are testing a planned upgrade of Firefox from v3.6 to v16.0.1, and it turns out that the way we captured reports before no longer works, because the directory structure of Firefox's temp folder has changed. I didn't write the original PDF-capturing code, but I will be refactoring it for whatever we end up using with v16.0.1, so I was wondering if there's a better way to save a PDF using Python's Selenium WebDriver bindings than what we're currently doing.

Previously, for Firefox v3.6, after clicking a link that generates a report, we would scan the "C:\\Documents and Settings\\\\Local Settings\\Temp\\plugtmp" directory until a PDF file (with a specific naming convention) appeared. To be clear, we're not saving the report from the web page itself; we're just using the copy generated in Firefox's Temp folder.
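For context, the old capture amounted to polling that directory until a matching file showed up. The following is a minimal sketch of the idea, not our actual code: the function name `wait_for_report`, the default `*.pdf` pattern, and the timeout values are placeholders made up for illustration, and our real naming convention is not shown here.

```python
import glob
import os
import time


def wait_for_report(directory, pattern="*.pdf", timeout=30.0, poll=0.5):
    """Poll `directory` until a file matching `pattern` appears.

    Returns the path of the matching file, or None if nothing matched
    before `timeout` seconds elapsed. `pattern` is a hypothetical stand-in
    for the report naming convention.
    """
    deadline = time.time() + timeout
    while True:
        matches = glob.glob(os.path.join(directory, pattern))
        if matches:
            # If several files match, prefer the most recently modified one.
            return max(matches, key=os.path.getmtime)
        if time.time() >= deadline:
            return None
        time.sleep(poll)
```

The test would click the report link and then call something like `wait_for_report(plugtmp_dir, "report_*.pdf")`, where both arguments are again illustrative.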

In Firefox 16.0.1, after clicking a link that generates a report, the file is generated in "C:\\Documents and Settings\\ \\Local Settings\\Temp\\tmp*\\cache*", with a random file name that does not end in ".pdf". This makes capturing the file somewhat more difficult with a technique similar to our previous one: each browser instance has a different tmp*** folder, which contains a cache full of folders, and the report is generated inside one of them with a random file name.

The easiest solution I can see would be to save the PDF directly, but I haven't found a way to do that yet.

To use the same approach as we used in FF3.6 (finding the PDF in the Temp folder directory), I'm thinking we'll need to do the following:

Figure out which tmp*** folder belongs to this particular browser instance (which we can do by inspecting the tmp*** folders that exist before and after the browser is instantiated).
Look inside that browser's cache for a file created immediately after the PDF report was generated (which we can do by comparing timestamps).
In cases where multiple files are generated in the cache, we could possibly sort by size and take the largest file, since the PDF will almost certainly be the largest temp file (although this seems flaky and will need to be tested in practice).
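The steps above could be sketched roughly as follows. This is an untested outline under stated assumptions, not working capture code: the helper names (`list_tmp_dirs`, `new_tmp_dirs`, `candidate_files`) are made up for illustration, and it assumes the browser's folder matches `tmp*` and its cache folders match `cache*`, as described above.

```python
import glob
import os


def list_tmp_dirs(temp_root):
    """Return the set of tmp* directories currently under temp_root."""
    pattern = os.path.join(temp_root, "tmp*")
    return {p for p in glob.glob(pattern) if os.path.isdir(p)}


def new_tmp_dirs(before, temp_root):
    """Step 1: directories that appeared since the `before` snapshot."""
    return list_tmp_dirs(temp_root) - before


def candidate_files(profile_dir, since):
    """Steps 2 and 3: files under profile_dir/cache* modified at or after
    the timestamp `since`, sorted largest first."""
    found = []
    for cache_dir in glob.glob(os.path.join(profile_dir, "cache*")):
        for root, _dirs, files in os.walk(cache_dir):
            for name in files:
                path = os.path.join(root, name)
                try:
                    st = os.stat(path)
                except OSError:
                    continue  # file vanished while we were scanning
                if st.st_mtime >= since:
                    found.append((st.st_size, path))
    found.sort(reverse=True)
    return [path for _size, path in found]


# Intended use (hypothetical; needs a real browser and the Temp folder path):
#   before = list_tmp_dirs(temp_root)
#   driver = webdriver.Firefox()
#   (profile_dir,) = new_tmp_dirs(before, temp_root)  # expects exactly one
#   clicked_at = time.time()
#   ... click the report link and wait for it to load ...
#   candidates = candidate_files(profile_dir, clicked_at)
```

Two weak spots are already visible in the sketch: step 1 breaks if another browser starts at the same moment, and the timestamp comparison may need a small margin if the filesystem records modification times coarsely.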

I'm not feeling great about this approach, and was wondering if there's a better way to capture PDF files. Can anyone suggest a better approach?

Note: the actual scraping of the PDF file is still working fine.