
Can't parse phone numbers from a webpage using requests module

I'm trying to scrape phone numbers from a webpage using the requests module. I've succeeded with selenium, but I'd like to achieve the same with requests. I spent a long time in Chrome dev tools watching the network activity for a clue, but found nothing useful. In case it helps, here is the selenium script I used:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

url = 'https://www.numberbarn.com/search?state=New%20Jersey'

with webdriver.Chrome() as driver:
    driver.get(url)
    # Wait up to 10 seconds for the result containers to be rendered by JavaScript
    for item in WebDriverWait(driver, 10).until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, ".results-list .container"))):
        # Within each container, wait for the element holding the phone number
        phone = WebDriverWait(item, 10).until(EC.presence_of_element_located((By.CSS_SELECTOR, ".telephone-number"))).text
        print(phone)

How can I parse phone numbers from the above webpage using the requests module?

Requests is a module that performs raw GETs of a URI; in this case it fetches the HTML of that webpage exactly as the server sends it, without running any JavaScript.

If you open that webpage in a browser and view its source (the raw HTML, not the rendered DOM shown in the Developer Tools Elements panel), you will see that none of those phone numbers are actually in the HTML, so Requests (fetch) + XPath (parse), or tools like Scrapy, probably will not help you. The page is basically just a JavaScript bundle:

  <meta name="twitter:domain" content="numberbarn.com">
  <base href="/">
  <link id="favicon" rel="icon" type="image/x-icon">
  <script async src="//www.googletagmanager.com/gtag/js"></script>
  <script>
    window.dataLayer = window.dataLayer || [];
    function gtag(){dataLayer.push(arguments);}
    gtag('js', new Date());
  </script>
<link rel="stylesheet" href="/angular/styles.d7cd4c8476c1236343ec.css"></head>
<body>
<app-root></app-root>
<link id="brand-stylesheet" rel="stylesheet"/>
<script src="//browser.sentry-cdn.com/5.17.0/bundle.min.js" integrity="sha384-lowBFC6YTkvMIWPORr7+TERnCkZdo5ab00oH5NkFLeQUAmBTLGwJpFjF6djuxJ/5" crossorigin="anonymous"></script>
<script src="/angular/runtime-es2015.cd8b7003cdbc6c84c9fd.js" type="module"></script><script src="/angular/runtime-es5.cd8b7003cdbc6c84c9fd.js" nomodule defer></script><script src="/angular/polyfills-es5.3c509d0a8908a60997e3.js" nomodule defer></script><script src="/angular/polyfills-es2015.ce03948e69242dd06dc0.js" type="module"></script><script src="/angular/vendor-es2015.a7e86119a8ea99d5add3.js" type="module"></script><script src="/angular/vendor-es5.a7e86119a8ea99d5add3.js" nomodule defer></script><script src="/angular/main-es2015.e16ee0657047312eb515.js" type="module"></script><script src="/angular/main-es5.e16ee0657047312eb515.js" nomodule defer></script></body>
</html>

You can also see this with:

curl "https://www.numberbarn.com/search?state=New%20Jersey" > blob.html

and opening blob.html in a text editor.
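The same check can be done from Python instead of by eye. This is a minimal sketch, not a definitive tool: the helper names `contains_class` and `fetch_html` are made up for illustration, and the class name `telephone-number` is taken from the selenium selector in the question. It uses only the standard-library HTML parser to test whether any tag in the fetched HTML carries that class.

```python
from html.parser import HTMLParser


class ClassFinder(HTMLParser):
    """Records whether any tag in the document carries a given CSS class."""

    def __init__(self, class_name):
        super().__init__()
        self.class_name = class_name
        self.found = False

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None
        for name, value in attrs:
            if name == "class" and value and self.class_name in value.split():
                self.found = True


def contains_class(html, class_name):
    """Return True if the raw HTML contains a tag with the given class."""
    finder = ClassFinder(class_name)
    finder.feed(html)
    return finder.found


def fetch_html(url):
    """Fetch the raw HTML exactly as the server sends it (no JavaScript runs)."""
    # Imported here so contains_class() works even without requests installed
    import requests

    response = requests.get(url, timeout=10)
    response.raise_for_status()
    return response.text
```

Calling `contains_class(fetch_html(url), "telephone-number")` with the URL from the question should return False for as long as the page is rendered client-side, which is the situation described above.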

You really do need something like Selenium, which drives a real browser and can parse the page after its JavaScript has rendered.

TL;DR: Requests + XPath can only be used when the HTML the server returns already contains the data you want.
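For contrast, here is a sketch of the case where fetch-then-parse does work, because the data is already in the markup. Everything in it is an assumption for illustration: the `STATIC_HTML` snippet is invented (it only borrows the class names from the selenium script), the phone numbers are fictional, and `extract_phones` is a hypothetical helper. It uses the limited XPath support in the standard library's ElementTree, which only handles well-formed markup; real-world HTML would typically need a more forgiving parser.

```python
import xml.etree.ElementTree as ET

# Invented, well-formed markup standing in for a server-rendered page
STATIC_HTML = """
<div class="results-list">
  <div class="container"><span class="telephone-number">(201) 555-0101</span></div>
  <div class="container"><span class="telephone-number">(201) 555-0102</span></div>
</div>
"""


def extract_phones(markup):
    """Return the text of every span whose class is exactly 'telephone-number'."""
    root = ET.fromstring(markup)
    # ElementTree supports a subset of XPath, including attribute predicates
    return [el.text for el in root.findall(".//span[@class='telephone-number']")]


print(extract_phones(STATIC_HTML))
# → ['(201) 555-0101', '(201) 555-0102']
```

Run against markup like the `<app-root></app-root>` shell shown earlier, the same function returns an empty list, since there is nothing for the XPath expression to match.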

