
Can't get data from inside of span-tag with beautifulsoup

I am trying to scrape an Instagram page and want to access the div tags inside the span tag, but I can't. The HTML of the Instagram page looks like this:

 <head>--</head>
    <body>
       <span id="react-root" aria-hidden="false">
       <form enctype="multipart/form-data" method="POST" role="presentation">…</form>
       <section class="_9eogI E3X2T">
          <main class="SCxLW  o64aR" role="main">
             <div class="v9tJq VfzDr">
                 <header class=" HVbuG">…</header>
                 <div class="_4bSq7">…</div>
                 <div class="fx7hk">…</div>
             </div>
          </main>
      </section>
    </body>

This is my code:

from bs4 import BeautifulSoup
import urllib.request as urllib2
html_page = urllib2.urlopen("https://www.instagram.com/cherrified_/?hl=en")
soup = BeautifulSoup(html_page,"lxml")
span_tag = soup.find('span')  # returns the span tag correctly
span_tag.find_all('div')      # returns an empty list. Why?

Please also provide an example.

Instagram is a Single Page Application powered by React, which means its source is just a simple "empty" page that loads JavaScript to dynamically generate the content in the browser after downloading.

Click "View source" or go to view-source:https://www.instagram.com/cherrified_/?hl=en in Chrome. This is the HTML you download with urllib.request.

You can see that there is a single <span> tag, and it does not contain any <div> tags. (Note: a <div> inside a <span> is not allowed.)
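Since an example was requested, here is a minimal sketch of the difference between the raw source and what the browser shows after JavaScript has run. It uses only the standard library's html.parser instead of BeautifulSoup so it runs without extra installs. The two HTML strings and the names DivInSpanCounter and count_divs_inside_span are made up for illustration; they are not Instagram's real markup.

```python
from html.parser import HTMLParser


class DivInSpanCounter(HTMLParser):
    """Count <div> start tags that appear while a <span> is open."""

    def __init__(self):
        super().__init__()
        self.span_depth = 0
        self.div_count = 0

    def handle_starttag(self, tag, attrs):
        if tag == "span":
            self.span_depth += 1
        elif tag == "div" and self.span_depth > 0:
            self.div_count += 1

    def handle_endtag(self, tag):
        if tag == "span" and self.span_depth > 0:
            self.span_depth -= 1


def count_divs_inside_span(html):
    """Return how many <div> tags are nested inside any <span> in html."""
    parser = DivInSpanCounter()
    parser.feed(html)
    parser.close()
    return parser.div_count


# Hypothetical stand-in for the raw page source (what urllib downloads).
RAW_SOURCE = """
<html><head><script src="app.js"></script></head>
<body><span id="react-root"></span></body></html>
"""

# Hypothetical stand-in for the DOM after JavaScript has run in a browser.
RENDERED_DOM = """
<html><body><span id="react-root">
<section><main><div class="v9tJq"><div class="fx7hk"></div></div></main></section>
</span></body></html>
"""

print(count_divs_inside_span(RAW_SOURCE))    # 0
print(count_divs_inside_span(RENDERED_DOM))  # 2
```

The first string plays the role of the downloaded source, so the count is 0, which corresponds to the empty list from find_all('div'). The second plays the role of what the browser's developer tools display.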

Scraping instagram.com this way is not possible. It also might not be legal (I am not a lawyer).

Notes:

  • Your HTML code example doesn't include a closing tag for <span>.
  • Your HTML code example doesn't match the link you provide in the Python snippet.
  • In the last line of the Python snippet you probably meant span_tag.find_all('div') (note the variable name and the singular 'div').

