
Web scraping a websocket stream using Python 3.x

I've been web scraping for a long time and recently decided to scrape a video stream delivered over a websocket. I understand websockets and how they work, but I don't fully understand the streaming part. I'm using Python 3.10 to scrape a stream that sends base64 data, and when I decode it the result is unreadable (precisely because it is data from the video stream). The stream comes from a company that provides weather data, and I need to get that data without using Selenium or some other testing library. Is there an effective way to do this? Maybe a well-performing library, or some way to "read" the data from the stream?

Here is a screenshot of the data received from the websocket: [image]

Even after base64-decoding the data and trying to read it as UTF-8, the result looks the same as in the image above.

I can recommend this package: https://github.com/websocket-client/websocket-client

It is simple and stable, and it has worked reliably for me. Note that it is a synchronous (blocking) client; for asyncio-based code, a separate library such as websockets may be a better fit.

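import websocket  # pip install websocket-client

def on_message(ws, message):
    # Called for every message received from the server.
    ...

def on_open(ws):
    # Called once the connection is established.
    ...

def on_close(ws, close_status_code, close_msg):
    # Called when the connection is closed.
    ...

def on_error(ws, error):
    # Called when an exception or connection error occurs.
    ...

ws = websocket.WebSocketApp(
    "wss://<address>",
    on_open=on_open,
    on_message=on_message,
    on_error=on_error,
    on_close=on_close,
)
ws.run_forever()  # Blocks and dispatches the callbacks above.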

Usually, when scraping a websocket, you need to start the process by sending some command to the server. You can find it in the browser's Dev Tools, where outgoing messages are marked with a green up arrow. Then you can reproduce it with ws.send("<message>"), for example from on_open once the connection is established.
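
As a rough sketch of how the callbacks might be filled in, assuming the server sends base64 text frames as described in the question: the decoded result is raw bytes (video data), so it is appended to a file in binary mode rather than decoded as UTF-8. `SUBSCRIBE_MESSAGE`, `decode_payload` and `make_callbacks` are hypothetical names used only for illustration, and the subscribe payload is a placeholder; the real command has to be copied from Dev Tools.

```python
import base64
import json

# Hypothetical subscribe command: replace it with the exact message
# your browser sends (the one marked with the green up arrow).
SUBSCRIBE_MESSAGE = json.dumps({"action": "subscribe"})


def decode_payload(message):
    """Return the raw bytes carried by one websocket message."""
    if isinstance(message, (bytes, bytearray)):
        # Binary frame: the payload is already raw bytes.
        return bytes(message)
    # Text frame: assumed here to be plain base64 with no prefix.
    return base64.b64decode(message)


def make_callbacks(output_path):
    """Build on_open/on_message callbacks that dump the stream to a file."""

    def on_open(ws):
        # Start the stream by replaying the command seen in Dev Tools.
        ws.send(SUBSCRIBE_MESSAGE)

    def on_message(ws, message):
        chunk = decode_payload(message)
        # Append in binary mode; do not call chunk.decode("utf-8").
        with open(output_path, "ab") as f:
            f.write(chunk)

    return on_open, on_message
```

The two functions returned by make_callbacks("stream.bin") can be passed as on_open and on_message to websocket.WebSocketApp. Whether the resulting file is playable depends on the container or codec the server uses, which you would need to identify separately (for example by inspecting the first bytes of the file).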
