
soup.find_all doesn't return any data

I'm trying to get the locations of the houses, but I don't get any data, just "[]". I'm new to Python and even newer to web scraping. Here's my code:

from bs4 import BeautifulSoup
import requests
import pandas as pd

url = 'https://www.inmuebles24.com/casas-en-venta-en-tijuana.html'
page = requests.get(url)
soup = BeautifulSoup(page.text, 'html.parser')

location = soup.find_all(class_='posting-location go-to-posting')
print(location)

On close examination, your find_all call is correct and should work as expected when the HTML actually contains the listings. Other ways of matching multiple CSS classes with find_all are included below:


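# a list matches <span> tags that carry at least one of these classes
location = soup.find_all('span', class_=['posting-location', 'go-to-posting'])

# or: match the exact class attribute string, in this order
location = soup.find_all(class_='posting-location go-to-posting')

# or: the same exact-string match, restricted to <span> tags
location = soup.find_all('span', {'class': 'posting-location go-to-posting'})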

The above was tested after manually copying the page's HTML source from a browser. I received 20 items.
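
To make the difference between these forms concrete, here is a minimal sketch run against a small hand-written HTML string rather than the live page. The markup and place names are made up for illustration; only the class names come from the question. It also shows `select`, a CSS-selector alternative that is not part of the answer above and that requires both classes regardless of order.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration; only the class names come from the question.
html = """
<span class="posting-location go-to-posting">Zona Centro</span>
<span class="go-to-posting posting-location">Playas</span>
<span class="posting-location">Otay</span>
<span class="other">ignored</span>
"""
soup = BeautifulSoup(html, "html.parser")

# A string with a space only matches the whole class attribute verbatim,
# so the order of the class names matters.
exact = soup.find_all("span", class_="posting-location go-to-posting")

# A list matches tags that have at least one of the listed classes.
either = soup.find_all("span", class_=["posting-location", "go-to-posting"])

# A CSS selector requires both classes, in any order.
both = soup.select("span.posting-location.go-to-posting")

print(len(exact), len(either), len(both))
```

If the real page ever lists the two classes in a different order, the selector form is likely the most robust of the three.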

Problem

Your real issue lies with the website you are trying to scrape. The site uses a CAPTCHA challenge to block bots and anyone else who tries to scrape its content, so the HTML your script receives is the challenge page, not the listings.

You can see this by printing the body of the response:

print(page.text)

I have copied a snippet of this for your perusal:

![CDATA[\n    var a = function() {try{return !!window.addEventListener} catch(e) {return !1} },\n      b = function(b, c) {a() ? document.addEventListener("DOMContentLoaded", b, c) : document.attachEvent("onreadystatechange", b)};\n      b(function(){\n        var cookiesEnabled=(navigator.cookieEnabled)? true : false;\n        if(!cookiesEnabled){\n          var q = document.getElementById(\'no-cookie-warning\');q.style.display = \'block\';\n        }\n      });\n  //]]>\n  </script>\n  <div id="trk_captcha_js" style="background-image:url(\'/cdn-cgi/images/trace/captcha/nojs/h/transparent.gif?ray=5d997e89698b1414\')"></div>\n</form>\n\n              </div>\n            </div>\n\n            <div class="cf-column">\n              <div class="cf-screenshot-container">\n              \n                <span class="cf-no-screenshot"></span>\n              \n              </div>\n            </div>\n          </div><!-- /.columns -->\n        </div>\n      </div><!-- /.captcha-container -->\n\n      <div class="cf-section cf-wrapper">\n        <div class="cf-columns two">\n          <div class="cf-column">\n            <h2 data-translate="why_captcha_headline">Why do I have to complete a CAPTCHA?</h2>\n            \n            <p data-translate="why_captcha_detail">Completing the CAPTCHA proves you are a human and gives you temporary access to the web property.</p>\n          </div>\n\n          <div class="cf-column">\n            <h2 data-translate="resolve_captcha_headline">What can I do to prevent this in the future?</h2>\n            \n\n            <p data-translate="resolve_captcha_antivirus">If you are on a personal connection, like at home, you can run an anti-virus scan on your device to make sure it is not infected with malware.</p>\n\n            <p data-translate="resolve_captcha_network">If you are at an office or shared network, you can ask the network administrator to run a scan across the network looking for misconfigured or infected devices.</p>\n    
        \n              \n            \n          </div>\n        </div>\n      </div><!-- /.section -->\n      \n\n      <div class="cf-error-footer cf-wrapper w-240 lg:w-full py-10 sm:py-4 sm:px-8 mx-auto text-center sm:text-left border-solid border-0 border-t border-gray-300">\n  <p c
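
A rough way to catch this situation in a script is to look for marker strings from the challenge page before parsing. The sketch below is only a heuristic: the function name `looks_like_captcha` is hypothetical, and the markers are copied from the snippet above, so they may change whenever the site changes its challenge page.

```python
# Marker strings copied from the challenge-page snippet above.
CAPTCHA_MARKERS = ("trk_captcha_js", "captcha-container", "why_captcha_headline")


def looks_like_captcha(html: str) -> bool:
    """Return True if the HTML contains any known challenge-page marker."""
    return any(marker in html for marker in CAPTCHA_MARKERS)


# Two small hand-written samples: one shaped like the challenge page,
# one shaped like a normal listing.
blocked_html = '<div id="trk_captcha_js"></div><h2 data-translate="why_captcha_headline">...</h2>'
listing_html = '<span class="posting-location go-to-posting">Zona Centro</span>'

print(looks_like_captcha(blocked_html))  # True
print(looks_like_captcha(listing_html))  # False
```

In the question's script, calling `looks_like_captcha(page.text)` before building the soup would explain an empty result without having to read the raw HTML.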

Recommendations

Consider looking for an official API, or another access method that the site owners allow and approve.
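
One common, partial signal of what a site allows is its robots.txt file. The sketch below uses Python's standard `urllib.robotparser` on hand-written rules so that it runs without network access. The rules and the example.com URLs are hypothetical and are not the real robots.txt of inmuebles24.com or any other site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules for illustration; NOT the real robots.txt of any site.
robots_lines = [
    "User-agent: *",
    "Disallow: /private/",
]

parser = RobotFileParser()
parser.parse(robots_lines)  # parse rules from memory instead of fetching them

# Ask whether a generic crawler may fetch each (made-up) URL.
allowed = parser.can_fetch("*", "https://example.com/casas-en-venta.html")
blocked = parser.can_fetch("*", "https://example.com/private/data.html")

print(allowed, blocked)  # True False
```

For a real site, `set_url()` followed by `read()` fetches the actual file. robots.txt does not replace the site's terms of service, so it is worth checking both.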

Try this:

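span_tags = soup.find_all('span')
for span in span_tags:
    # 'class' is parsed into a list of names; .get() avoids a KeyError
    # on <span> tags that have no class attribute at all
    if span.get('class') == ['posting-location', 'go-to-posting']:
        print(span.text)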
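
For a loop like the one above, it helps to know that Beautiful Soup parses `class` into a list of names, so comparing it with a single string never matches, and that indexing a tag that has no `class` attribute raises `KeyError`. A minimal sketch on made-up markup (only the class names come from the question):

```python
from bs4 import BeautifulSoup

# Hypothetical markup; only the class names come from the question.
html = (
    '<span class="posting-location go-to-posting">Zona Centro</span>'
    '<span>no class</span>'
)
soup = BeautifulSoup(html, "html.parser")
first, second = soup.find_all("span")

classes = first["class"]       # a list of class names, not one string
missing = second.get("class")  # None, whereas second["class"] raises KeyError

# Comparing the list with a plain string is always False.
is_string_match = classes == "posting-location go-to-posting"

print(classes, missing, is_string_match)
```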
