How to scrape information off of a blog post via Selenium (Python)

I'm trying to scrape a blog: https://blog.naver.com/ssamssam48/221271075217

I am trying to get the name of the blog and its author from the URL above. If you look at the page source, both pieces of information are available in this portion:

<title>용의주도미스고의 행복만들기♪ : 네이버 블로그</title>
</head>
<script type="text/javascript" src="https://ssl.pstatic.net/t.static.blog/mylog/versioning/Frameset-584891086_https.js" charset="UTF-8"></script>

<script type="text/javascript" charset="UTF-8">
var photoContent="";
var postContent="";

var videoId       = "";
var thumbnail     = "";
var inKey         = "";
var movieFileSize = "";
var playTime      = "";
var screenSize    = "";

var blogId = 'ssamssam48';
var blogURL = 'https://blog.naver.com';
var eventCnt = '';

var g_ShareObject = {};
g_ShareObject.referer = "";

The name of the blog is inside the title tags and the author's id is in var blogId = 'ssamssam48'. I am currently working with Selenium via Python, but when I try browser.title I get the title of the post, not the title of the blog as shown in the source code. As for the author's id, I have absolutely no idea how to get to those var sections.

I also tried getting the information a different way: instead of looking at the source code, I looked at the Elements section of the Developer Tools. There you can find a section within the wrapper with XPath //*[@id="blog-profile"]/div/div[2] that has the information about the author, but when I search for it through Selenium, it says no such element exists.

I think part of the problem might be that the body of the post is hidden inside a section labeled #document.

Can anyone help me get the title of the blog and the name of the author? Also, what does the hash sign in #document mean?

To retrieve the page title, i.e. 오사카 유니버셜스튜디오 입장권 알뜰 구매 완전.. : 네이버블로그, the name of the blog, i.e. 용의주도미스고, and the name of the author, i.e. (ssamssam48), you can use the following code block:

  • Code Block:

     # -*- coding: UTF-8 -*-
     from selenium import webdriver
     from selenium.webdriver.common.by import By
     from selenium.webdriver.support.ui import WebDriverWait
     from selenium.webdriver.support import expected_conditions as EC

     # Selenium 3 style; newer Selenium releases pass the driver path via a Service object instead.
     driver = webdriver.Firefox(executable_path=r'C:\Utility\BrowserDrivers\geckodriver.exe')
     driver.get("https://blog.naver.com/ssamssam48/221271075217")
     # Title of the top-level page (the post title)
     print(driver.title)
     # Wait up to 10 seconds for the frame, then switch into it
     WebDriverWait(driver, 10).until(EC.frame_to_be_available_and_switch_to_it((By.XPATH, "//frame[@id='mainFrame']")))
     # These elements live inside the frame, so they are only reachable after the switch
     blogName = driver.find_element(By.XPATH, "//div[@class='nick']/strong").text
     print(blogName)
     blogAuthor = driver.find_element(By.XPATH, "//span[@class='itemfont col']").text
     print(blogAuthor)
     driver.quit()

  • Console Output:

     오사카 유니버셜스튜디오 입장권 알뜰 구매 완전.. : 네이버블로그
     용의주도미스고
     (ssamssam48)

Update

As per your question in the comments: through WebDriverWait(driver, 10).until(EC.frame_to_be_available_and_switch_to_it((By.XPATH,"//frame[@id='mainFrame']"))) we set up an explicit wait that waits up to 10 seconds for the frame matching the XPath //frame[@id='mainFrame'] to become available and then switches to it.

Why wait for the frame?

You invoked the URL https://blog.naver.com/ssamssam48/221271075217 in the previous step. The browser client (i.e. the web browser) returns control to the WebDriver instance once document.readyState equals "complete", but that still doesn't guarantee that all the WebElements (e.g. frames, buttons) on the page have finished loading. Hence, to wait specifically for the desired frame to finish loading, we used the frame_to_be_available_and_switch_to_it() expected condition.
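Conceptually, an explicit wait is just a polling loop: check a condition, sleep briefly, and give up after a timeout. The sketch below illustrates that idea in plain Python so it runs without a browser. It is not Selenium's implementation; wait_until, WaitTimeout and frame_is_ready are hypothetical names made up for the illustration.

```python
import time


class WaitTimeout(Exception):
    """Raised when the condition never becomes truthy (hypothetical name)."""


def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Call condition() repeatedly until it returns a truthy value.

    Returns that value, or raises WaitTimeout once `timeout` seconds pass.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise WaitTimeout("condition not met within %.1f s" % timeout)
        time.sleep(poll_interval)


# Stand-in for "the frame is not there yet": it becomes ready on the third check.
attempts = {"count": 0}


def frame_is_ready():
    attempts["count"] += 1
    return "mainFrame" if attempts["count"] >= 3 else None


print(wait_until(frame_is_ready, timeout=2.0, poll_interval=0.01))
```

With Selenium itself, WebDriverWait(driver, 10).until(...) plays the role of wait_until, and the expected condition plays the role of frame_is_ready.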

You will find a detailed discussion in :

You can do this directly using the execute_script method.

driver.get('https://blog.naver.com/ssamssam48/221271075217')
print(driver.execute_script('return blogId'))

The above code prints

ssamssam48

You can modify the above code to get almost all the JS variables defined in the script tag.
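As a rough sketch of that idea, a small helper can read several globals in one execute_script call. The helper name read_js_globals is hypothetical. It relies on top-level var declarations being reachable as properties of window, which holds for variables declared with var in a classic script such as blogId and blogURL in the question's source.

```python
def read_js_globals(driver, names):
    """Return a dict mapping each name in `names` to the page's global JS value.

    `driver` is any Selenium WebDriver instance; `names` is a list of
    variable names. The names reach the script as arguments[0].
    """
    script = (
        "var out = {};"
        "for (var i = 0; i < arguments[0].length; i++) {"
        "  var name = arguments[0][i];"
        "  out[name] = window[name];"
        "}"
        "return out;"
    )
    return driver.execute_script(script, list(names))


# Usage with a live driver (not run here):
# driver.get('https://blog.naver.com/ssamssam48/221271075217')
# print(read_js_globals(driver, ['blogId', 'blogURL']))
```

Variables that are not defined on the page may come back as None or be missing from the result, so check for that before using a value.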

As for the title, running print(driver.title) returns

오사카 유니버셜스튜디오 입장권 알뜰 구매 완전.. : 네이버블로그

That looks right, considering you are currently on a particular post. If you want the title of the blog, consider navigating to the blog's home page and running driver.title there.
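One way to get from a post URL to the blog's home page is to keep only the first path segment. This is a minimal sketch that assumes the home page lives at https://blog.naver.com/<blogId>, which is inferred from the shape of the post URL and not confirmed here. blog_home_url is a hypothetical helper name.

```python
from urllib.parse import urlsplit, urlunsplit


def blog_home_url(post_url):
    """Build the assumed blog home URL by keeping only the first path segment."""
    parts = urlsplit(post_url)
    segments = [s for s in parts.path.split("/") if s]
    if not segments:
        raise ValueError("no blog id in URL path: %r" % post_url)
    # Drop the post id, the query string and the fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/" + segments[0], "", ""))


print(blog_home_url("https://blog.naver.com/ssamssam48/221271075217"))
# prints https://blog.naver.com/ssamssam48
```

You could then call driver.get(blog_home_url(...)) and read driver.title; whether that title contains only the blog name depends on the site.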
