
How to crawl obfuscated web page contents using Python requests, BeautifulSoup and/or Scrapy or Selenium

I was able to crawl Twitter content using GET parameters and parse the data with BeautifulSoup, but now the HTML elements across the whole site seem to be obfuscated.

https://www.twitter.com/search?q=donald%20trump&src=typed_query&f=user

This is what I was using to fetch the joining dates of multiple users named "Donald Trump":

Python and BeautifulSoup:

op_date_time=soup.find_all(class_='ProfileHeaderCard-joinDateText js-tooltip u-dir')
print(op_date_time)

This is how the obfuscated code looks now:

 <span class="css-901oao css-16my406 r-1re7ezh r-4qtqp9 r-1qd0xha r-ad9z0x r-zso239 r-bcqeeo r-qvutc0"><svg viewBox="0 0 24 24" class="r-1re7ezh r-4qtqp9 r-yyyyoo r-1xvli5t r-7o8qx1 r-dnmrzs r-bnwqim r-1plcrui r-lrvibr"><g><path d="M19.708 2H4.292C3.028 2 2 3.028 2 4.292v15.416C2 20.972 3.028 22 4.292 22h15.416C20.972 22 22 20.972 22 19.708V4.292C22 3.028 20.972 2 19.708 2zm.792 17.708c0 .437-.355.792-.792.792H4.292c-.437 0-.792-.355-.792-.792V6.418c0-.437.354-.79.79-.792h15.42c.436 0 .79.355.79.79V19.71z"></path><circle cx="7.032" cy="8.75" r="1.285"></circle><circle cx="7.032" cy="13.156" r="1.285"></circle><circle cx="16.968" cy="8.75" r="1.285"></circle><circle cx="16.968" cy="13.156" r="1.285"></circle><circle cx="12" cy="8.75" r="1.285"></circle><circle cx="12" cy="13.156" r="1.285"></circle><circle cx="7.032" cy="17.486" r="1.285"></circle><circle cx="12" cy="17.486" r="1.285"></circle></g></svg>Joined March 2009</span>

I don't think this is obfuscated code; it is just HTML without id attributes.

You can use the built-in CSS selector support in BeautifulSoup4.

For your test on Donald Trump's Twitter profile, an example would be:

joindt = soup.select('span.r-1re7ezh:nth-child(3)')
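Class names such as r-1re7ezh look generated and may change, and the nth-child position depends on the surrounding markup. As a rough sketch, the selector can be combined with a check on the visible text. The HTML below is an abbreviated copy of the span from the question; the wrapping div and the second span are made up for illustration.

```python
from bs4 import BeautifulSoup

# Abbreviated copy of the markup from the question (SVG contents omitted).
# The wrapping <div> and the second <span> are assumptions for the demo.
html = """
<div>
  <span class="css-901oao css-16my406 r-1re7ezh r-4qtqp9"><svg viewBox="0 0 24 24"></svg>Joined March 2009</span>
  <span class="css-901oao css-16my406 r-1re7ezh r-4qtqp9">Some other text</span>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Select by one of the generated classes, then keep only spans whose
# visible text starts with "Joined".
join_dates = [
    span.get_text(strip=True)
    for span in soup.select("span.r-1re7ezh")
    if span.get_text(strip=True).startswith("Joined")
]
print(join_dates)  # ['Joined March 2009']
```

Filtering on the text makes the result less dependent on which generated class happens to be attached to the element.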

Edit 1: Twitter seems to render its content with JavaScript, which the Python requests library cannot handle. If you use Selenium or a similar tool that runs JavaScript, look for its CSS selector function.
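With Selenium, the same idea might look like the sketch below. It assumes Selenium 4 with Chrome and a matching driver installed, and that the page can be viewed without logging in, which may not hold. The function names is_join_date and fetch_join_dates are hypothetical helpers, not part of any library.

```python
def is_join_date(text):
    """Return True for strings such as 'Joined March 2009'."""
    return text.strip().startswith("Joined ")


def fetch_join_dates(url, timeout=10):
    """Load the page in a real browser and collect the 'Joined ...' spans."""
    # Selenium is imported here so is_join_date works without it installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    driver = webdriver.Chrome()  # assumption: Chrome and its driver are available
    try:
        driver.get(url)
        # Wait until JavaScript has rendered at least one <span>.
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, "span"))
        )
        spans = driver.find_elements(By.CSS_SELECTOR, "span")
        return [s.text for s in spans if is_join_date(s.text)]
    finally:
        driver.quit()  # always close the browser, even on errors


# Example call (opens a browser window):
# fetch_join_dates("https://www.twitter.com/search?q=donald%20trump&src=typed_query&f=user")
```

Nested spans can report the same text more than once, so the returned list may need de-duplicating.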
