
How to scrape information off of a blog post via Selenium (Python)

I'm trying to webscrape a blog: https://blog.naver.com/ssamssam48/221271075217

I am trying to get the name of the blog and the author of the blog at the above URL. If you look at the page source, both pieces of information are available in this portion:

<title>용의주도미스고의 행복만들기♪ : 네이버 블로그</title>
</head>
<script type="text/javascript" src="https://ssl.pstatic.net/t.static.blog/mylog/versioning/Frameset-584891086_https.js" charset="UTF-8"></script>

<script type="text/javascript" charset="UTF-8">
var photoContent="";
var postContent="";

var videoId       = "";
var thumbnail     = "";
var inKey         = "";
var movieFileSize = "";
var playTime      = "";
var screenSize    = "";

var blogId = 'ssamssam48';
var blogURL = 'https://blog.naver.com';
var eventCnt = '';

var g_ShareObject = {};
g_ShareObject.referer = "";

The name of the blog is within the title tags and the author's id is in var blogId = 'ssamssam48'. I am currently working with Selenium via Python, but when I try browser.title I get the title of the post rather than the title of the blog shown in the source code. As for the author's id, I have no idea how to get at those var sections.

I also tried going about the information a different way - instead of looking at the source code, just looking at the elements section of the Developer Tools bar. Here you can find a section within the wrapper with xpath //*[@id="blog-profile"]/div/div[2] that has the information about the author, but when I search for it through Selenium, it says such element does not exist.

I think part of the problem might be that the body of the post is all hidden within this websection that says #document


Can anyone help me get the title of the blog and the name of the author? Also what does the hashtag in #document mean??

To retrieve the page title, i.e. 오사카 유니버셜스튜디오 입장권 알뜰 구매 완전.. : 네이버블로그, the name of the blog, i.e. 용의주도미스고, and the name of the author, i.e. (ssamssam48), you can use the following code block:

  • Code Block :

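     # -*- coding: UTF-8 -*-
     from selenium import webdriver
     from selenium.webdriver.common.by import By
     from selenium.webdriver.firefox.service import Service
     from selenium.webdriver.support.ui import WebDriverWait
     from selenium.webdriver.support import expected_conditions as EC

     # Selenium 4 style: the geckodriver path is passed through Service.
     # The path is the one from the original answer; adjust it for your machine.
     driver = webdriver.Firefox(service=Service(r'C:\\Utility\\BrowserDrivers\\geckodriver.exe'))
     driver.get("https://blog.naver.com/ssamssam48/221271075217")
     print(driver.title)
     # The blog profile lives inside the frame with id "mainFrame": wait for it, then switch into it.
     WebDriverWait(driver, 10).until(EC.frame_to_be_available_and_switch_to_it((By.XPATH, "//frame[@id='mainFrame']")))
     # find_element(By.XPATH, ...) replaces the older find_element_by_xpath helper.
     blogName = driver.find_element(By.XPATH, "//div[@class='nick']/strong").text
     print(blogName)
     blogAuthor = driver.find_element(By.XPATH, "//span[@class='itemfont col']").text
     print(blogAuthor)
     driver.quit()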
  • Console Output :

     오사카 유니버셜스튜디오 입장권 알뜰 구매 완전.. : 네이버블로그
     용의주도미스고
     (ssamssam48)

Update

As per your question in the comments: the call WebDriverWait(driver, 10).until(EC.frame_to_be_available_and_switch_to_it((By.XPATH,"//frame[@id='mainFrame']"))) sets up a wait that polls for up to 10 seconds until the frame matching the XPath //frame[@id='mainFrame'] is available, and then switches to it.

Why wait for the frame?

After you invoke the URL https://blog.naver.com/ssamssam48/221271075217 in the previous step, the browser client (i.e. the web browser) returns control to the WebDriver instance once document.readyState equals "complete". That still doesn't guarantee that all the WebElements (e.g. frames, buttons) on the page have finished loading. Hence, to wait specifically for the desired frame to finish loading, we used the frame_to_be_available_and_switch_to_it() method.
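
Once the driver has switched into a frame, every lookup is scoped to that frame until you switch back out, which is also why an XPath copied from the Elements panel can fail when it is searched from the top-level document. Below is a minimal sketch of a helper that reads one element's text inside a frame and then returns to the top-level document. The name text_in_frame is hypothetical, it assumes the frame has already loaded (pair it with a WebDriverWait if it may not have), and it passes the locator strategy as the plain string "xpath", which is the value of By.XPATH.

```python
def text_in_frame(driver, frame_id, xpath):
    """Return the text of the first element matching xpath inside the given frame."""
    # switch_to.frame accepts a frame's id or name as a string.
    driver.switch_to.frame(frame_id)
    try:
        return driver.find_element("xpath", xpath).text
    finally:
        # Always return to the top-level document, even if the lookup raised.
        driver.switch_to.default_content()
```

With the page from the question, a call might look like text_in_frame(driver, "mainFrame", "//div[@class='nick']/strong"), using the frame id and XPath from the code block above.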

You will find a detailed discussion in :

You can do this directly using the execute_script method.

driver.get('https://blog.naver.com/ssamssam48/221271075217')
print(driver.execute_script('return blogId'))

The above code prints

ssamssam48

You can modify the above code to get almost all the js variables defined in the script tag.
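
If you would rather not execute JavaScript, the same value can be read out of the page source as text. This is only a sketch under assumptions: extract_js_string_var is a hypothetical helper, it handles only simple assignments of the form var name = '...' or var name = "..." like the ones shown in the question, and it does not cope with escaped quotes inside the value.

```python
import re

def extract_js_string_var(page_source, name):
    """Return the string literal assigned to `var <name>` in page_source, or None."""
    # Match: var <name> = '<value>'  (single or double quotes, same quote on both sides)
    pattern = r"\bvar\s+" + re.escape(name) + r"\s*=\s*(['\"])(.*?)\1"
    match = re.search(pattern, page_source)
    return match.group(2) if match else None

# A fragment of the script block quoted in the question.
sample = """
var playTime      = "";
var blogId = 'ssamssam48';
var blogURL = 'https://blog.naver.com';
"""
print(extract_js_string_var(sample, "blogId"))  # ssamssam48
```

With Selenium you could pass driver.page_source as the first argument, assuming the driver is still on the top-level document where this script block appears.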

As for the title, running print(driver.title) returns

오사카 유니버셜스튜디오 입장권 알뜰 구매 완전.. : 네이버블로그

This looks right, considering you are currently on a particular post. If you want the title of the blog, consider navigating to the blog's home page and reading driver.title there.
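
The page source quoted in the question does carry the blog's name in its title tag, so another option is to parse that raw HTML as a string. The sketch below uses only the standard library. TitleParser and blog_name_from_html are hypothetical names, and it assumes you already hold the raw HTML (as the question shows it) and that the title follows the "blog name : site name" pattern.

```python
from html.parser import HTMLParser

class TitleParser(HTMLParser):
    """Collects the text found between <title> and </title>."""
    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data

def blog_name_from_html(html):
    """Return the part of the <title> text before the last ' : ' separator."""
    parser = TitleParser()
    parser.feed(html)
    parser.close()
    title = parser.title.strip()
    name, sep, _site = title.rpartition(" : ")
    # Fall back to the whole title when the separator is absent.
    return name if sep else title

raw = "<html><head><title>용의주도미스고의 행복만들기♪ : 네이버 블로그</title></head></html>"
print(blog_name_from_html(raw))  # 용의주도미스고의 행복만들기♪
```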
