
Is there any way to bypass Incapsula protection in Python 3 while scraping?

I'm new to scraping, and I'm already being blocked by Incapsula protection.

from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup

my_url = 'https://www.immoweb.be/fr/recherche/immeuble-de-rapport/a-vendre'

# Open the connection and grab the page
uClient = uReq(my_url)
page_html = uClient.read()
uClient.close()

# Parse the HTML
page_soup = soup(page_html, "html.parser")

page_soup.h1

I can't access any data from the website because Incapsula blocks the request.
When I run:

print(page_soup)

I get this message:

<html style="height:100%"><head><meta content="NOINDEX, NOFOLLOW" name="ROBOTS"/><meta content="telephone=no" name="format-detection"/>
[...]
Request unsuccessful. Incapsula incident ID: 936002200207012991-
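
A script can check for this block page before trying to parse anything. The following is only a minimal sketch: it assumes the "Incapsula incident ID" marker stays worded as in the response above, and `incapsula_incident_id` is a hypothetical helper name, not part of any library.

```python
import re

# Marker text copied from the block page shown above.
# Assumption: Incapsula keeps this wording; adjust the pattern if it changes.
_INCIDENT_RE = re.compile(r"Incapsula incident ID:\s*([\d-]+)")


def incapsula_incident_id(html):
    """Return the incident ID if html looks like a block page, else None."""
    # urlopen().read() returns bytes, so accept both bytes and str.
    if isinstance(html, bytes):
        html = html.decode("utf-8", errors="replace")
    match = _INCIDENT_RE.search(html)
    return match.group(1) if match else None


blocked = "Request unsuccessful. Incapsula incident ID: 936002200207012991-"
print(incapsula_incident_id(blocked))
print(incapsula_incident_id("<html><h1>Listing</h1></html>"))
```

Calling it on `page_html` right after `uClient.read()` makes it obvious whether the request was blocked or the page simply has no `h1`.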

I tried the approaches described in "Getting 'wrong' page source when calling url from python", and only @Karl Anka's workaround, loading the page in a real browser through Selenium, worked.

See the example below:

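from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.service import Service

url = 'https://www.immoweb.be/fr/recherche/immeuble-de-rapport/a-vendre'

# Selenium 4 takes the driver path through a Service object;
# the older executable_path keyword is no longer accepted.
driver = webdriver.Chrome(service=Service('./chromedriver'))
try:
    driver.get(url)
    # page_source holds the HTML after the browser has run the page's JavaScript
    soup = BeautifulSoup(driver.page_source, features='html.parser')
finally:
    # Always close the browser, even if loading or parsing fails
    driver.quit()

print(soup.prettify())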

Output:

<html class="js flexbox rgba borderradius boxshadow opacity cssgradients csstransitions generatedcontent localstorage sessionstorage" style="" xmlns="http://www.w3.org/1999/xhtml" xmlns:fb="http://www.facebook.com/2008/fbml" xmlns:og="http://opengraphprotocol.org/schema/">
 <head>
  <script async="" src="https://c.pebblemedia.be/js/data/david/_david_publishers_master_produpress.js" type="text/javascript">
  </script>
  <script async="" src="https://scdn.cxense.com/cx.js" type="text/javascript">
  </script>
  <script async="" src="https://connect.facebook.net/signals/plugins/inferredEvents.js?v=2.8.47">
  </script>
  <script async="" src="https://connect.facebook.net/signals/config/1554445828209863?v=2.8.47&amp;r=stable">
  </script>
[...]
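
With the rendered source in hand, the `h1` the question was after can be read the same way as before. This is a rough sketch only: `first_heading` is a hypothetical helper, and the sample markup is invented for illustration, not taken from the real page.

```python
from bs4 import BeautifulSoup


def first_heading(html):
    """Return the text of the first <h1> in html, or None if there is none."""
    soup = BeautifulSoup(html, "html.parser")
    h1 = soup.find("h1")
    return h1.get_text(strip=True) if h1 else None


# Invented sample markup (assumption); in practice pass driver.page_source.
sample = "<html><body><h1> Immeubles de rapport </h1></body></html>"
print(first_heading(sample))
```

Returning None instead of raising keeps the caller able to tell "no heading on this page" apart from a crash.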
