Can't scrape some "div" tags from a site
I am trying to scrape job posts from this page: https://www.fl.ru.
This is probably a newbie problem, but I can get certain tags while others seem to be unreachable, e.g.:
from urllib.request import urlopen
from bs4 import BeautifulSoup
html = urlopen("https://www.fl.ru/projects/")
bsObj = BeautifulSoup(html, "lxml")
textTags = bsObj.findAll("div", class_="b-post__txt ")
print(str(textTags))
Thanks
If you download the page HTML with a downloader such as wget or curl, you will see that the elements are not in the page at all. They are generated by JavaScript.
For example (snippet from the source of the page):
<script type="text/javascript">document.write('<div class="b-post__body b-post__body_padtop_15 b-post__body_overflow_hidden b-layuot_width_full"> <div class="b-post__txt "> У нас есть для вас вакансия Full-stack PHP-разработчика на удаленную работу (полный рабочий день) или в офис (г. Москва). Работать нужно будет над нашими проектами, в том... </div> <div id="project-reason-3728923" style="display: none"> </div> </div>');</script>
You have two options: execute the JavaScript (with a browser and something like Selenium to drive it), or parse it manually by using Beautiful Soup to get the <script> tag contents, extracting the text inside document.write(), and reparsing that text with Beautiful Soup.
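The manual route could look roughly like the sketch below. It assumes the markup always sits in a single-quoted document.write('...') call inside an inline script, as in the snippet above; the function names and the regular expression are illustrative, not part of any library. The class name is given without the trailing space, since Beautiful Soup splits the class attribute into whitespace-separated names.

```python
import re
from bs4 import BeautifulSoup

# Matches document.write('...') with a single-quoted argument.
# Assumption: the page always uses this exact call shape.
DOC_WRITE_RE = re.compile(r"document\.write\('((?:[^'\\]|\\.)*)'\)", re.DOTALL)


def extract_written_html(page_html):
    """Return the HTML fragments passed to document.write() in inline scripts."""
    soup = BeautifulSoup(page_html, "html.parser")
    fragments = []
    for script in soup.find_all("script"):
        source = script.string or ""  # None for empty or external scripts
        for match in DOC_WRITE_RE.finditer(source):
            # Undo the two most common JavaScript string escapes.
            fragment = match.group(1).replace("\\'", "'").replace("\\/", "/")
            fragments.append(fragment)
    return fragments


def find_post_texts(page_html):
    """Re-parse each written fragment and collect the post text divs."""
    texts = []
    for fragment in extract_written_html(page_html):
        inner = BeautifulSoup(fragment, "html.parser")
        for div in inner.find_all("div", class_="b-post__txt"):
            texts.append(div.get_text(strip=True))
    return texts
```

Passing the downloaded page source to find_post_texts() should return the post texts, provided the site still emits them this way.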
Many modern webpages build the DOM dynamically in the browser using JavaScript, and the parts you're looking for do not exist until the browser has finished building the page.
If you're not using a browser or a library that can execute JavaScript, the page elements you're looking for will simply not exist.
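As a rough sketch of the browser-driven approach, the code below loads the page with Selenium and hands the rendered source to Beautiful Soup. It assumes Selenium plus Firefox and its driver are installed; make_driver and rendered_post_texts are hypothetical helper names, and whether a plain page load is enough (without explicit waits) depends on how the site builds its content.

```python
from bs4 import BeautifulSoup


def make_driver():
    """Start a browser session (assumes Selenium and Firefox are installed)."""
    from selenium import webdriver  # imported lazily; only needed for a real browser
    return webdriver.Firefox()


def rendered_post_texts(driver, url):
    """Load url in the browser and return post texts from the rendered page."""
    driver.get(url)
    # page_source is read after the browser has run the page's scripts.
    soup = BeautifulSoup(driver.page_source, "html.parser")
    return [div.get_text(strip=True)
            for div in soup.find_all("div", class_="b-post__txt")]


# Example usage (opens a real browser window):
# driver = make_driver()
# try:
#     print(rendered_post_texts(driver, "https://www.fl.ru/projects/"))
# finally:
#     driver.quit()
```

Taking the driver as a parameter keeps the parsing step independent of which browser is used.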