Data scraping from a webpage with JavaScript using Python

I'm trying to scrape the title off of a webpage. Initially, I tried using BeautifulSoup but found out that the page itself wouldn't load without JavaScript. So I'm using some code I found through Google that uses the requests-html library:

from requests_html import HTMLSession
from bs4 import BeautifulSoup
session = HTMLSession()
resp = session.get("https://www150.statcan.gc.ca/t1/tbl1/en/tv.action?pid=3210001601")
resp.html.render()
soup = BeautifulSoup(resp.html.html, "lxml")

soup.find_all('h1')

But there's always an error along the lines of:

D:\Python\TitleSraping\venv\Scripts\python.exe "D:/Python/TitleSraping/venv/Text Scraping.py"
Traceback (most recent call last):
  File "D:\Python\TitleSraping\venv\lib\site-packages\pyppeteer\execution_context.py", line 106, in evaluateHandle
    'userGesture': True,
pyppeteer.errors.NetworkError: Protocol error (Runtime.callFunctionOn): Cannot find context with specified id

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "D:/Python/TitleSraping/venv/Text Scraping.py", line 5, in <module>
    resp.html.render()
  File "D:\Python\TitleSraping\venv\lib\site-packages\requests_html.py", line 598, in render
    content, result, page = self.session.loop.run_until_complete(self._async_render(url=self.url, script=script, sleep=sleep, wait=wait, content=self.html, reload=reload, scrolldown=scrolldown, timeout=timeout, keep_page=keep_page))
  File "D:\Program Files (x86)\Python\lib\asyncio\base_events.py", line 584, in run_until_complete
    return future.result()
  File "D:\Python\TitleSraping\venv\lib\site-packages\requests_html.py", line 531, in _async_render
    content = await page.content()
  File "D:\Python\TitleSraping\venv\lib\site-packages\pyppeteer\page.py", line 780, in content
    return await frame.content()
  File "D:\Python\TitleSraping\venv\lib\site-packages\pyppeteer\frame_manager.py", line 379, in content
    '''.strip())
  File "D:\Python\TitleSraping\venv\lib\site-packages\pyppeteer\frame_manager.py", line 295, in evaluate
    pageFunction, *args, force_expr=force_expr)
  File "D:\Python\TitleSraping\venv\lib\site-packages\pyppeteer\execution_context.py", line 55, in evaluate
    pageFunction, *args, force_expr=force_expr)
  File "D:\Python\TitleSraping\venv\lib\site-packages\pyppeteer\execution_context.py", line 109, in evaluateHandle
    _rewriteError(e)
  File "D:\Python\TitleSraping\venv\lib\site-packages\pyppeteer\execution_context.py", line 238, in _rewriteError
    raise type(error)(msg)
pyppeteer.errors.NetworkError: Execution context was destroyed, most likely because of a navigation.

Process finished with exit code 1

Does anyone know what this means? I'm quite new to this, so I apologize if I'm using any terminology improperly.

This looks like a bug in the underlying library, pyppeteer (the package named in your traceback), triggered while it processes some of the page's JavaScript. Here's one workaround from https://github.com/kennethreitz/requests-html/issues/251 that may help:

resp.html.render(sleep=1, keep_page=True)
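
If a single fixed delay turns out to be unreliable, one option is to retry the render with progressively longer sleep values. The helper below is only a sketch under that assumption: `render_with_backoff`, `sleeps`, and `retry_on` are hypothetical names, not part of requests-html, and whether a render can be safely repeated after a failure is untested here. It accepts the render callable as an argument so it can be exercised without a browser.

```python
from typing import Callable, Iterable, Tuple, Type


def render_with_backoff(
    render: Callable[..., object],
    sleeps: Iterable[float] = (1, 2, 4),
    retry_on: Tuple[Type[BaseException], ...] = (Exception,),
) -> float:
    """Call render(sleep=s, keep_page=True) for each s until one succeeds.

    Returns the sleep value that worked. If every attempt fails,
    the last error is re-raised.
    """
    last_error = None
    for sleep in sleeps:
        try:
            render(sleep=sleep, keep_page=True)
            return sleep
        except retry_on as error:
            # Remember the failure and try again with a longer delay.
            last_error = error
    if last_error is None:
        raise ValueError("sleeps must contain at least one value")
    raise last_error
```

With the code from the question, this could be called as render_with_backoff(resp.html.render, retry_on=(pyppeteer.errors.NetworkError,)), after importing pyppeteer.errors.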

You need to run the page's JavaScript, because without it the HTML content won't load. You can use Selenium for that.

Try Selenium.

Selenium is a library that lets programs interact with web pages by taking control of a browser.

Here is an example in an answer to someone else's question.
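
For the page in the question, a Selenium version might look roughly like the sketch below. It assumes Selenium 4 with Chrome and a matching driver available on the machine; `fetch_title`, `make_headless_chrome`, and the `driver_factory` parameter are hypothetical names introduced for illustration. The function polls `driver.title` until the page's JavaScript has filled it in, and always closes the browser afterwards.

```python
import time


def make_headless_chrome():
    """Start a headless Chrome session (assumes Chrome and its driver are installed)."""
    # Imported here so the rest of the module loads without Selenium.
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    options.add_argument("--headless")
    return webdriver.Chrome(options=options)


def fetch_title(url, driver_factory=make_headless_chrome, timeout=10.0, poll=0.25):
    """Open url in a browser and return the page title once it is non-empty."""
    driver = driver_factory()
    try:
        driver.get(url)
        deadline = time.monotonic() + timeout
        while time.monotonic() < deadline:
            # The title may be empty until the page's scripts have run.
            title = driver.title.strip()
            if title:
                return title
            time.sleep(poll)
        raise TimeoutError(f"no title after {timeout} seconds: {url}")
    finally:
        # Close the browser whether or not a title was found.
        driver.quit()
```

Calling fetch_title("https://www150.statcan.gc.ca/t1/tbl1/en/tv.action?pid=3210001601") would then return the title as a plain string. The driver is passed in through a factory so that the polling logic can be checked with a stand-in object instead of a real browser.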

As Ivan said, sleep=1 and keep_page=True do the trick. Here is the full code:

from requests_html import HTMLSession
from bs4 import BeautifulSoup

session = HTMLSession()
resp = session.get("https://www150.statcan.gc.ca/t1/tbl1/en/tv.action?pid=3210001601")
resp.html.render(sleep=1, keep_page=True)
soup = BeautifulSoup(resp.html.html, "lxml")
print(soup.find_all('title'))

Output:

[<title>
    Milled wheat and wheat flour produced</title>]
