Scraping data with Python from a web page that uses JavaScript

Question

I'm trying to scrape the title from a web page. Initially I tried BeautifulSoup, but found that the page itself won't load without JavaScript. So I'm using some code I found via Google that uses the requests-html library:

from requests_html import HTMLSession
from bs4 import BeautifulSoup
session = HTMLSession()
resp = session.get("https://www150.statcan.gc.ca/t1/tbl1/en/tv.action?pid=3210001601")
resp.html.render()
soup = BeautifulSoup(resp.html.html, "lxml")

soup.find_all('h1')

But I always get an error along the lines of:

D:\Python\TitleSraping\venv\Scripts\python.exe "D:/Python/TitleSraping/venv/Text Scraping.py"
Traceback (most recent call last):
  File "D:\Python\TitleSraping\venv\lib\site-packages\pyppeteer\execution_context.py", line 106, in evaluateHandle
    'userGesture': True,
pyppeteer.errors.NetworkError: Protocol error (Runtime.callFunctionOn): Cannot find context with specified id

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "D:/Python/TitleSraping/venv/Text Scraping.py", line 5, in <module>
    resp.html.render()
  File "D:\Python\TitleSraping\venv\lib\site-packages\requests_html.py", line 598, in render
    content, result, page = self.session.loop.run_until_complete(self._async_render(url=self.url, script=script, sleep=sleep, wait=wait, content=self.html, reload=reload, scrolldown=scrolldown, timeout=timeout, keep_page=keep_page))
  File "D:\Program Files (x86)\Python\lib\asyncio\base_events.py", line 584, in run_until_complete
    return future.result()
  File "D:\Python\TitleSraping\venv\lib\site-packages\requests_html.py", line 531, in _async_render
    content = await page.content()
  File "D:\Python\TitleSraping\venv\lib\site-packages\pyppeteer\page.py", line 780, in content
    return await frame.content()
  File "D:\Python\TitleSraping\venv\lib\site-packages\pyppeteer\frame_manager.py", line 379, in content
    '''.strip())
  File "D:\Python\TitleSraping\venv\lib\site-packages\pyppeteer\frame_manager.py", line 295, in evaluate
    pageFunction, *args, force_expr=force_expr)
  File "D:\Python\TitleSraping\venv\lib\site-packages\pyppeteer\execution_context.py", line 55, in evaluate
    pageFunction, *args, force_expr=force_expr)
  File "D:\Python\TitleSraping\venv\lib\site-packages\pyppeteer\execution_context.py", line 109, in evaluateHandle
    _rewriteError(e)
  File "D:\Python\TitleSraping\venv\lib\site-packages\pyppeteer\execution_context.py", line 238, in _rewriteError
    raise type(error)(msg)
pyppeteer.errors.NetworkError: Execution context was destroyed, most likely because of a navigation.

Process finished with exit code 1

Does anyone know what this means? I'm quite new to this, so I apologize if I'm using any terminology incorrectly.

Answer 1

This looks like a bug in the underlying library, pyppeteer, triggered while it processes some of the page's JavaScript. Here's one workaround from https://github.com/kennethreitz/requests-html/issues/251 ; maybe it'll help.

resp.html.render(sleep=1, keep_page=True)

Answer 2

You need to run the JavaScript, because without it the page's HTML won't load. You can use Selenium.

Answer 3

Try Selenium.

Selenium is a library that lets programs interact with web pages by taking control of the browser.

Here is an example in an answer to someone else's question.
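For reference, a minimal sketch of what the Selenium approach might look like for this page. It assumes Selenium and a matching Chrome installation are available; the names make_headless_chrome and fetch_title, the driver_factory parameter, and the polling interval are made up for illustration and are not part of Selenium itself.

```python
import time


def make_headless_chrome():
    """Build a headless Chrome driver (assumes Selenium and Chrome are installed)."""
    # Imported lazily so the rest of the module loads without Selenium.
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    options.add_argument("--headless")
    return webdriver.Chrome(options=options)


def fetch_title(url, driver_factory=make_headless_chrome, timeout=10.0):
    """Open url in a real browser and return the page title once it is set.

    driver_factory is a hypothetical hook: any callable returning an object
    with get(), title and quit() works, which also makes this easy to test.
    """
    driver = driver_factory()
    try:
        driver.get(url)
        # The title may only appear after the page's JavaScript has run,
        # so poll for a short while instead of reading it immediately.
        deadline = time.monotonic() + timeout
        while not driver.title.strip() and time.monotonic() < deadline:
            time.sleep(0.25)
        return driver.title.strip()
    finally:
        # Always close the browser, even if loading the page failed.
        driver.quit()
```

Calling fetch_title("https://www150.statcan.gc.ca/t1/tbl1/en/tv.action?pid=3210001601") would then drive a real browser, so the JavaScript on the page runs the same way it does for a human visitor.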

Answer 4

As Ivan said, sleep=1, keep_page=True does the trick. Here is the full code:

from requests_html import HTMLSession
from bs4 import BeautifulSoup

session = HTMLSession()
resp = session.get("https://www150.statcan.gc.ca/t1/tbl1/en/tv.action?pid=3210001601")
resp.html.render(sleep=1, keep_page=True)
soup = BeautifulSoup(resp.html.html, "lxml")
print(soup.find_all('title'))

Response:

[<title>
    Milled wheat and wheat flour produced</title>]
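The result above is a list of tags, and the title text still carries the newline and indentation from the page source. If only the clean text is wanted, soup.title.get_text(strip=True) should do it with BeautifulSoup. The sketch below shows the same idea using only the standard library; TitleParser and extract_title are illustrative names, not part of requests-html or BeautifulSoup.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collects the text inside the first <title> element."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self._done = False
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "title" and not self._done:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self._done = True

    def handle_data(self, data):
        # Text can arrive in several pieces, so collect all of them.
        if self._in_title:
            self._chunks.append(data)

    @property
    def title(self):
        # Collapse newlines and runs of spaces into single spaces.
        return " ".join("".join(self._chunks).split())


def extract_title(html):
    """Return the whitespace-normalised title of an HTML document."""
    parser = TitleParser()
    parser.feed(html)
    parser.close()
    return parser.title


sample = "<html><head><title>\n    Milled wheat and wheat flour produced</title></head></html>"
print(extract_title(sample))  # prints: Milled wheat and wheat flour produced
```

With the code above, extract_title(resp.html.html) could be used in place of the BeautifulSoup lines once render() has finished.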
