
How can I scrape div sections that BeautifulSoup cannot find?

I want to scrape the company info from this page.

The div related to the data is div class="col-xs-12 col-md-6 col-lg-6", but when I run the following code, this class does not appear in the output.

import requests
from bs4 import BeautifulSoup

page = requests.get("http://gyeonquartz.com/distributors-detailers/")
soup = BeautifulSoup(page.content, 'html.parser')

print(soup.prettify())

When we inspect the page in the browser, all the dealers' details are under div class="col-xs-12 col-md-6 col-lg-6", but in the parsed output there is no such div.

The data you want to scrape is populated after the page loads, through an AJAX request. When you make a request with the Python Requests library, you only get the initial page HTML.
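One way to confirm this is to check whether the target class shows up in the HTML that requests returns. The sketch below uses only the standard library, and the two HTML strings are made-up stand-ins for "what requests receives" and "what the browser shows after the AJAX call"; with the real page you would pass page.text instead.

```python
from html.parser import HTMLParser


class DivClassCollector(HTMLParser):
    """Collect the class attribute of every <div> in an HTML string."""

    def __init__(self):
        super().__init__()
        self.div_classes = []

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            # attrs is a list of (name, value) pairs; value may be None
            class_value = dict(attrs).get("class") or ""
            self.div_classes.append(class_value)


def has_div_class(html, wanted):
    """Return True if some <div> carries every class token in `wanted`."""
    collector = DivClassCollector()
    collector.feed(html)
    wanted_tokens = set(wanted.split())
    return any(wanted_tokens <= set(c.split()) for c in collector.div_classes)


# Hypothetical stand-ins, not the real page content
static_html = '<div id="partners"></div><script src="load.js"></script>'
rendered_html = (
    '<div id="partners">'
    '<div class="col-xs-12 col-md-6 col-lg-6">Dealer</div>'
    '</div>'
)

TARGET = "col-xs-12 col-md-6 col-lg-6"
print(has_div_class(static_html, TARGET))    # False
print(has_div_class(rendered_html, TARGET))  # True
```

If the check returns False for page.text, the div is being added by JavaScript and no parser will find it in that response.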

You have 2 options.

  1. Use Selenium (or another option such as requests-html) to render the JavaScript-loaded content.

  2. Make the AJAX request directly and read the JSON response. You can find the request in the Network tab of your browser's developer tools.

The second option in this case is as follows.

import requests

# Call the AJAX endpoint the page uses and print the JSON it returns
page = requests.get("http://gyeonquartz.com/wp-admin/admin-ajax.php?action=gyeon_load_partners")
print(page.json())

This will output a very long JSON response. I have converted it into a DataFrame to make it easier to read.

import requests
from bs4 import BeautifulSoup
import pandas as pd

page = requests.get("http://gyeonquartz.com/wp-admin/admin-ajax.php?action=gyeon_load_partners")
df = pd.DataFrame.from_dict(page.json())
# The address field contains HTML, so strip the tags and the line breaks
df['address'] = [BeautifulSoup(text, 'html.parser').get_text().replace("\r\n", "") for text in df['address']]
print(df)  # just use df if in a Jupyter notebook
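To see what the address cleanup step does without hitting the network, here is a small offline sketch. It swaps BeautifulSoup for the standard library's HTMLParser, and the records are invented for illustration; the real field names and values in the JSON may differ.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Accumulate only the text nodes of an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_html(fragment):
    """Return the text of `fragment` with tags and CR/LF pairs removed."""
    extractor = TextExtractor()
    extractor.feed(fragment)
    extractor.close()
    return "".join(extractor.parts).replace("\r\n", "")


# Made-up records standing in for the JSON response
sample_records = [
    {"name": "Example Detailer", "address": "<p>1 Main Street\r\nSpringfield</p>"},
    {"name": "Another Shop", "address": "<b>22 High Road</b>"},
]

for record in sample_records:
    record["address"] = strip_html(record["address"])

print(sample_records[0]["address"])  # 1 Main StreetSpringfield
print(sample_records[1]["address"])  # 22 High Road
```

Note that removing "\r\n" outright joins the two lines with no space between them; replacing it with ", " or a space instead may give more readable addresses.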

Sample output from my Jupyter notebook is as follows. [image]

If you look at the page source, you'll see that none of the div tags you are looking for exist in it. Because requests only makes the initial request and does not run the JavaScript that loads dynamic content, the tags you are looking for are not in the returned HTML.

To get the dynamic content, you would instead need to mimic the requests the page makes (for example, with a curl request) or load the page in a headless browser (for example, with Selenium). The problem is not with the parser but with the content.
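The headless-browser route could look roughly like the sketch below. It assumes Selenium 4 with Chrome installed, and it assumes the dealer divs appear within the timeout once the page's JavaScript has run; render_page is a hypothetical helper name, not something from the original answers.

```python
def render_page(url, css_selector, timeout=10):
    """Load `url` in headless Chrome and return the HTML once `css_selector` appears."""
    # Imported here so the module still loads where Selenium is not installed
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    options = Options()
    options.add_argument("--headless")
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        # Block until the JavaScript has inserted at least one matching element
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, css_selector))
        )
        return driver.page_source
    finally:
        # Always close the browser, even if the wait times out
        driver.quit()


# Example call (needs Chrome and network access, so it is left commented out):
# html = render_page(
#     "http://gyeonquartz.com/distributors-detailers/",
#     "div.col-xs-12.col-md-6.col-lg-6",
# )
# soup = BeautifulSoup(html, "html.parser")
```

The returned HTML can then be passed to BeautifulSoup as in the question, since it now includes the divs added after page load. Calling the AJAX endpoint directly is usually faster and lighter when the endpoint is easy to find.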

This is very similar to the solution for "How to use requests or other module to get data from a page where the url doesn't change?"


 