
How to use Python and Beautiful Soup to scrape tags from Instagram

I am trying to find the related tags for the top trending tags on Instagram, but I get nothing back when using BeautifulSoup.

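import requests
import html5lib
import csv
from bs4 import BeautifulSoup

def list_of_tags(tags):
    related_tags = []
    tmp = []
    # for el in tags:  (loop disabled while testing one hard-coded tag)
    url = "https://www.instagram.com/explore/tags/love/"
    req = requests.get(url)
    # Parse the raw response body with the html5lib parser
    soup = BeautifulSoup(req.content, 'html5lib')
    print(soup)  # dump the parsed document to inspect what came back
    # Look up the div expected to hold the related tags
    r_tag = soup.find('div', attrs = {'class' : 'WSpok'})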

I have used similar code to scrape other websites successfully. On Instagram, however, the soup contains no page content; the <body> is empty:

<!DOCTYPE html>
<html class="no-js not-logged-in client-root" lang="en"><head>
        <meta charset="utf-8"/>
        <meta content="IE=edge" http-equiv="X-UA-Compatible"/>

        <title>
#love hashtag on Instagram • Photos and Videos
</title>

        
        <meta content="noimageindex, noarchive" name="robots"/>
        <meta content="default" name="apple-mobile-web-app-status-bar-style"/>
        <meta content="yes" name="mobile-web-app-capable"/>
        <meta content="#ffffff" name="theme-color"/>
        <meta content="width=device-width, initial-scale=1, minimum-scale=1, maximum-scale=1, viewport-fit=cover" id="viewport" name="viewport"/>
        <link href="/data/manifest.json" rel="manifest"/>

        <link as="style" crossorigin="anonymous" href="/static/bundles/metro/ConsumerUICommons.css/0d73027e4285.css" rel="preload" type="text/css"/>
<link as="style" crossorigin="anonymous" href="/static/bundles/metro/ConsumerAsyncCommons.css/638f1bd337c8.css" rel="preload" type="text/css"/>
<link as="style" crossorigin="anonymous" href="/static/bundles/metro/Consumer.css/3e0c88f3bf5f.css" rel="preload" type="text/css"/>
<link as="style" crossorigin="anonymous" href="/static/bundles/metro/TagPageContainer.css/47d968faa0fd.css" rel="preload" type="text/css"/>
<link as="script" crossorigin="anonymous" href="/static/bundles/metro/Vendor.js/5a56d51ae30f.js" rel="preload" type="text/javascript"/>
<link as="script" crossorigin="anonymous" href="/static/bundles/metro/en_US.js/d9caef98221d.js" rel="preload" type="text/javascript"/>
<link as="script" crossorigin="anonymous" href="/static/bundles/metro/ConsumerLibCommons.js/e38c6c343804.js" rel="preload" type="text/javascript"/>
<link as="script" crossorigin="anonymous" href="/static/bundles/metro/ConsumerUICommons.js/7906f44838ea.js" rel="preload" type="text/javascript"/>
<link as="script" crossorigin="anonymous" href="/static/bundles/metro/ConsumerAsyncCommons.js/2196e3e614ee.js" rel="preload" type="text/javascript"/>
<link as="script" crossorigin="anonymous" href="/static/bundles/metro/Consumer.js/624d9b8ef745.js" rel="preload" type="text/javascript"/>
<link as="script" crossorigin="anonymous" href="/static/bundles/metro/TagPageContainer.js/63ead1147672.js" rel="preload" type="text/javascript"/>
        
        

        <script type="text/javascript">
        (function() {
  var docElement = document.documentElement;
  var classRE = new RegExp('(^|\\s)no-js(\\s|$)');
  var className = docElement.className;
  docElement.className = className.replace(classRE, '$1js$2');
})();
</script>
        <script type="text/javascript">
(function() {
  if ('PerformanceObserver' in window && 'PerformancePaintTiming' in window) {
    window.__bufferedPerformance = [];
    var ob = new PerformanceObserver(function(e) {
      window.__bufferedPerformance.push.apply(window.__bufferedPerformance,e.getEntries());
    });
    ob.observe({entryTypes:['paint']});
  }

  window.__bufferedErrors = [];
  window.onerror = function(message, url, line, column, error) {
    window.__bufferedErrors.push({
      message: message,
      url: url,
      line: line,
      column: column,
      error: error
    });
    return false;
  };
  window.__initialData = {
    pending: true,
    waiting: []
  };
  function asyncFetchSharedData(extra) {
    var sharedDataReq = new XMLHttpRequest();
    sharedDataReq.onreadystatechange = function() {
          if (sharedDataReq.readyState === 4) {
            if(sharedDataReq.status === 200){
              var sharedData = JSON.parse(sharedDataReq.responseText);
              window.__initialDataLoaded(sharedData, extra);
            }
          }
        }
    sharedDataReq.open('GET', '/data/shared_data/', true);
    sharedDataReq.send(null);
  }
  function notifyLoaded(item, data) {
    item.pending = false;
    item.data = data;
    for (var i = 0;i < item.waiting.length; ++i) {
      item.waiting[i].resolve(item.data);
    }
    item.waiting = [];
  }
  function notifyError(item, msg) {
    item.pending = false;
    item.error = new Error(msg);
    for (var i = 0;i < item.waiting.length; ++i) {
      item.waiting[i].reject(item.error);
    }
    item.waiting = [];
  }
  window.__initialDataLoaded = function(initialData, extraData) {
    if (extraData) {
      for (var key in extraData) {
        initialData[key] = extraData[key];
      }
    }
    notifyLoaded(window.__initialData, initialData);
  };
  window.__initialDataError = function(msg) {
    notifyError(window.__initialData, msg);
  };
  window.__additionalData = {};
  window.__pendingAdditionalData = function(paths) {
    for (var i = 0;i < paths.length; ++i) {
      window.__additionalData[paths[i]] = {
        pending: true,
        waiting: []
      };
    }
  };
  window.__additionalDataLoaded = function(path, data) {
    if (path in window.__additionalData) {
      notifyLoaded(window.__additionalData[path], data);
    } else {
      console.error('Unexpected additional data loaded "' + path + '"');
    }
  };
  window.__additionalDataError = function(path, msg) {
    if (path in window.__additionalData) {
      notifyError(window.__additionalData[path], msg);
    } else {
      console.error('Unexpected additional data encountered an error "' + path + '": ' + msg);
    }
  };
  
})();
</script><script type="text/javascript">

/*
 Copyright 2018 Google Inc. All Rights Reserved.
 Licensed under the Apache License, Version 2.0 (the "License");
 you may not use this file except in compliance with the License.
 You may obtain a copy of the License at

     http://www.apache.org/licenses/LICENSE-2.0

 Unless required by applicable law or agreed to in writing, software
 distributed under the License is distributed on an "AS IS" BASIS,
 WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 See the License for the specific language governing permissions and
 limitations under the License.
*/

(function(){function g(a,c){b||(b=a,f=c,h.forEach(function(a){removeEventListener(a,l,e)}),m())}function m(){b&&f&&0<d.length&&(d.forEach(function(a){a(b,f)}),d=[])}function n(a,c){function k(){g(a,c);d()}function b(){d()}function d(){removeEventListener("pointerup",k,e);removeEventListener("pointercancel",b,e)}addEventListener("pointerup",k,e);addEventListener("pointercancel",b,e)}function l(a){if(a.cancelable){var c=performance.now(),b=a.timeStamp;b>c&&(c=+new Date);c-=b;"pointerdown"==a.type?n(c,
a):g(c,a)}}var e={passive:!0,capture:!0},h=["click","mousedown","keydown","touchstart","pointerdown"],b,f,d=[];h.forEach(function(a){addEventListener(a,l,e)});window.perfMetrics=window.perfMetrics||{};window.perfMetrics.onFirstInputDelay=function(a){d.push(a);m()}})();
</script>
    
                <link href="/static/images/ico/apple-touch-icon-76x76-precomposed.png/666282be8229.png" rel="apple-touch-icon-precomposed" sizes="76x76"/>
                <link href="/static/images/ico/apple-touch-icon-120x120-precomposed.png/8a5bd3f267b1.png" rel="apple-touch-icon-precomposed" sizes="120x120"/>
                <link href="/static/images/ico/apple-touch-icon-152x152-precomposed.png/68193576ffc5.png" rel="apple-touch-icon-precomposed" sizes="152x152"/>
                <link href="/static/images/ico/apple-touch-icon-167x167-precomposed.png/4985e31c9100.png" rel="apple-touch-icon-precomposed" sizes="167x167"/>
                <link href="/static/images/ico/apple-touch-icon-180x180-precomposed.png/c06fdb2357bd.png" rel="apple-touch-icon-precomposed" sizes="180x180"/>
                
                    <link href="/static/images/ico/favicon-192.png/68d99ba29cc8.png" rel="icon" sizes="192x192"/>
                
            
            
                    <link color="#262626" href="/static/images/ico/favicon.svg/fc72dd4bfde8.svg" rel="mask-icon"/>
                  
                  <link href="/static/images/ico/favicon.ico/36b3ee2d91ed.ico" rel="shortcut icon" type="image/x-icon"/>
                
            
            
            
   

</head>
    <body class="" style="
    background: white;
">
        
 

</body></html>

I tried selecting a specific div, but it did not help. I know of other approaches, such as requesting the JSON directly, but I want to know how I can improve this version. Thanks in advance.
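
The dump above suggests why `soup.find` returns nothing: the `<body>` has no elements, and the page content appears to be filled in later by the page's JavaScript, which `requests` does not run. Below is a minimal diagnostic sketch, using only the standard library, that counts the elements inside `<body>` so an empty shell can be told apart from a server-rendered page. The names `BodyElementCounter` and `count_body_elements` and the two sample strings are made up for illustration and are not part of the original post.

```python
from html.parser import HTMLParser


class BodyElementCounter(HTMLParser):
    """Count the start tags that appear between <body> and </body>."""

    def __init__(self):
        super().__init__()
        self.in_body = False
        self.count = 0

    def handle_starttag(self, tag, attrs):
        if tag == "body":
            self.in_body = True
        elif self.in_body:
            self.count += 1

    def handle_endtag(self, tag):
        if tag == "body":
            self.in_body = False


def count_body_elements(html_text):
    """Return how many elements the markup contains inside <body>."""
    parser = BodyElementCounter()
    parser.feed(html_text)
    parser.close()
    return parser.count


# Hypothetical samples: an empty shell like the dump above, and a page
# whose content is already present in the HTML sent by the server.
empty_shell = '<html><head><script>var x = 1;</script></head><body class=""></body></html>'
rendered = '<html><body><div class="WSpok"><a>#tag</a></div></body></html>'

print(count_body_elements(empty_shell))  # 0
print(count_body_elements(rendered))     # 2
```

Passing `req.text` from the question's code to `count_body_elements` should show whether the response is such a shell; a count of 0 means no parser or selector can find the div in that response.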

I finally did it with JSON:

import json
import requests

def list_of_tags(tags):
    """Return the names of the tags related to each tag in `tags`."""
    related_tags = []
    for el in tags:
        # Appending ?__a=1 requests the tag page's data as JSON
        url = "https://www.instagram.com/explore/tags/" + el + "/?__a=1"
        req = requests.get(url)
        data = json.loads(req.text)
        edges = data['graphql']['hashtag']['edge_hashtag_to_related_tags']['edges']
        for item in edges:
            related_tags.append(item['node']['name'])
    print(related_tags)
    return related_tags

This gives you all the tags related to the tags you pass in.
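
The key path used above is taken from the response as it looked when this was written. The `?__a=1` endpoint is not a documented API, so the shape may change or the request may return something other than the expected JSON. Here is a small sketch of a more defensive extraction step, kept separate from the network call so it can be tried on a sample payload. The helper name `extract_related_tags` and the sample tag names are invented for illustration.

```python
import json


def extract_related_tags(payload):
    """Return related tag names from a decoded response, or [] if the shape differs."""
    try:
        edges = payload["graphql"]["hashtag"]["edge_hashtag_to_related_tags"]["edges"]
    except (KeyError, TypeError):
        return []
    if not isinstance(edges, list):
        return []

    names = []
    for edge in edges:
        try:
            names.append(edge["node"]["name"])
        except (KeyError, TypeError):
            continue  # skip entries that lack the expected node/name keys
    return names


# Invented sample shaped like the key path in the answer above.
sample_text = json.dumps({
    "graphql": {
        "hashtag": {
            "edge_hashtag_to_related_tags": {
                "edges": [
                    {"node": {"name": "instagood"}},
                    {"node": {"name": "photooftheday"}},
                ]
            }
        }
    }
})

print(extract_related_tags(json.loads(sample_text)))  # ['instagood', 'photooftheday']
print(extract_related_tags({"unexpected": "shape"}))  # []
```

In `list_of_tags`, the loop body could call `related_tags.extend(extract_related_tags(data))` so that one tag with an unexpected response does not abort the whole run.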

I hope this helps someone.


 