
How to use Python and Beautiful-soup to scrape tags from Instagram

I am trying to find the tags related to the most popular tags on Instagram, but I get nothing back when using BeautifulSoup.

import requests
import html5lib
import csv
from bs4 import BeautifulSoup

def list_of_tags(tags):
    related_tags = []
    tmp = []
    #for el in tags:
    url = "https://www.instagram.com/explore/tags/love/"
    req = requests.get(url)
    soup = BeautifulSoup(req.content, 'html5lib')
    print(soup)
    r_tag = soup.find('div', attrs = {'class' : 'WSpok'})

I have scraped other sites with similar code and it worked. With Instagram, however, the soup contains none of the page's content; the <body> comes back empty:

<!DOCTYPE html>
<html class="no-js not-logged-in client-root" lang="en"><head>
        <meta charset="utf-8"/>
        <meta content="IE=edge" http-equiv="X-UA-Compatible"/>

        <title>
#love hashtag on Instagram • Photos and Videos
</title>

        
        <meta content="noimageindex, noarchive" name="robots"/>
        <meta content="default" name="apple-mobile-web-app-status-bar-style"/>
        <meta content="yes" name="mobile-web-app-capable"/>
        <meta content="#ffffff" name="theme-color"/>
        <meta content="width=device-width, initial-scale=1, minimum-scale=1, maximum-scale=1, viewport-fit=cover" id="viewport" name="viewport"/>
        <link href="/data/manifest.json" rel="manifest"/>

        <link as="style" crossorigin="anonymous" href="/static/bundles/metro/ConsumerUICommons.css/0d73027e4285.css" rel="preload" type="text/css"/>
<link as="style" crossorigin="anonymous" href="/static/bundles/metro/ConsumerAsyncCommons.css/638f1bd337c8.css" rel="preload" type="text/css"/>
<link as="style" crossorigin="anonymous" href="/static/bundles/metro/Consumer.css/3e0c88f3bf5f.css" rel="preload" type="text/css"/>
<link as="style" crossorigin="anonymous" href="/static/bundles/metro/TagPageContainer.css/47d968faa0fd.css" rel="preload" type="text/css"/>
<link as="script" crossorigin="anonymous" href="/static/bundles/metro/Vendor.js/5a56d51ae30f.js" rel="preload" type="text/javascript"/>
<link as="script" crossorigin="anonymous" href="/static/bundles/metro/en_US.js/d9caef98221d.js" rel="preload" type="text/javascript"/>
<link as="script" crossorigin="anonymous" href="/static/bundles/metro/ConsumerLibCommons.js/e38c6c343804.js" rel="preload" type="text/javascript"/>
<link as="script" crossorigin="anonymous" href="/static/bundles/metro/ConsumerUICommons.js/7906f44838ea.js" rel="preload" type="text/javascript"/>
<link as="script" crossorigin="anonymous" href="/static/bundles/metro/ConsumerAsyncCommons.js/2196e3e614ee.js" rel="preload" type="text/javascript"/>
<link as="script" crossorigin="anonymous" href="/static/bundles/metro/Consumer.js/624d9b8ef745.js" rel="preload" type="text/javascript"/>
<link as="script" crossorigin="anonymous" href="/static/bundles/metro/TagPageContainer.js/63ead1147672.js" rel="preload" type="text/javascript"/>
        
        

        <script type="text/javascript">
        (function() {
  var docElement = document.documentElement;
  var classRE = new RegExp('(^|\\s)no-js(\\s|$)');
  var className = docElement.className;
  docElement.className = className.replace(classRE, '$1js$2');
})();
</script>
        <script type="text/javascript">
(function() {
  if ('PerformanceObserver' in window && 'PerformancePaintTiming' in window) {
    window.__bufferedPerformance = [];
    var ob = new PerformanceObserver(function(e) {
      window.__bufferedPerformance.push.apply(window.__bufferedPerformance,e.getEntries());
    });
    ob.observe({entryTypes:['paint']});
  }

  window.__bufferedErrors = [];
  window.onerror = function(message, url, line, column, error) {
    window.__bufferedErrors.push({
      message: message,
      url: url,
      line: line,
      column: column,
      error: error
    });
    return false;
  };
  window.__initialData = {
    pending: true,
    waiting: []
  };
  function asyncFetchSharedData(extra) {
    var sharedDataReq = new XMLHttpRequest();
    sharedDataReq.onreadystatechange = function() {
          if (sharedDataReq.readyState === 4) {
            if(sharedDataReq.status === 200){
              var sharedData = JSON.parse(sharedDataReq.responseText);
              window.__initialDataLoaded(sharedData, extra);
            }
          }
        }
    sharedDataReq.open('GET', '/data/shared_data/', true);
    sharedDataReq.send(null);
  }
  function notifyLoaded(item, data) {
    item.pending = false;
    item.data = data;
    for (var i = 0;i < item.waiting.length; ++i) {
      item.waiting[i].resolve(item.data);
    }
    item.waiting = [];
  }
  function notifyError(item, msg) {
    item.pending = false;
    item.error = new Error(msg);
    for (var i = 0;i < item.waiting.length; ++i) {
      item.waiting[i].reject(item.error);
    }
    item.waiting = [];
  }
  window.__initialDataLoaded = function(initialData, extraData) {
    if (extraData) {
      for (var key in extraData) {
        initialData[key] = extraData[key];
      }
    }
    notifyLoaded(window.__initialData, initialData);
  };
  window.__initialDataError = function(msg) {
    notifyError(window.__initialData, msg);
  };
  window.__additionalData = {};
  window.__pendingAdditionalData = function(paths) {
    for (var i = 0;i < paths.length; ++i) {
      window.__additionalData[paths[i]] = {
        pending: true,
        waiting: []
      };
    }
  };
  window.__additionalDataLoaded = function(path, data) {
    if (path in window.__additionalData) {
      notifyLoaded(window.__additionalData[path], data);
    } else {
      console.error('Unexpected additional data loaded "' + path + '"');
    }
  };
  window.__additionalDataError = function(path, msg) {
    if (path in window.__additionalData) {
      notifyError(window.__additionalData[path], msg);
    } else {
      console.error('Unexpected additional data encountered an error "' + path + '": ' + msg);
    }
  };
  
})();
</script><script type="text/javascript">

/*
 Copyright 2018 Google Inc. All Rights Reserved.
 Licensed under the Apache License, Version 2.0 (the "License");
 you may not use this file except in compliance with the License.
 You may obtain a copy of the License at

     http://www.apache.org/licenses/LICENSE-2.0

 Unless required by applicable law or agreed to in writing, software
 distributed under the License is distributed on an "AS IS" BASIS,
 WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 See the License for the specific language governing permissions and
 limitations under the License.
*/

(function(){function g(a,c){b||(b=a,f=c,h.forEach(function(a){removeEventListener(a,l,e)}),m())}function m(){b&&f&&0<d.length&&(d.forEach(function(a){a(b,f)}),d=[])}function n(a,c){function k(){g(a,c);d()}function b(){d()}function d(){removeEventListener("pointerup",k,e);removeEventListener("pointercancel",b,e)}addEventListener("pointerup",k,e);addEventListener("pointercancel",b,e)}function l(a){if(a.cancelable){var c=performance.now(),b=a.timeStamp;b>c&&(c=+new Date);c-=b;"pointerdown"==a.type?n(c,
a):g(c,a)}}var e={passive:!0,capture:!0},h=["click","mousedown","keydown","touchstart","pointerdown"],b,f,d=[];h.forEach(function(a){addEventListener(a,l,e)});window.perfMetrics=window.perfMetrics||{};window.perfMetrics.onFirstInputDelay=function(a){d.push(a);m()}})();
</script>
    
                <link href="/static/images/ico/apple-touch-icon-76x76-precomposed.png/666282be8229.png" rel="apple-touch-icon-precomposed" sizes="76x76"/>
                <link href="/static/images/ico/apple-touch-icon-120x120-precomposed.png/8a5bd3f267b1.png" rel="apple-touch-icon-precomposed" sizes="120x120"/>
                <link href="/static/images/ico/apple-touch-icon-152x152-precomposed.png/68193576ffc5.png" rel="apple-touch-icon-precomposed" sizes="152x152"/>
                <link href="/static/images/ico/apple-touch-icon-167x167-precomposed.png/4985e31c9100.png" rel="apple-touch-icon-precomposed" sizes="167x167"/>
                <link href="/static/images/ico/apple-touch-icon-180x180-precomposed.png/c06fdb2357bd.png" rel="apple-touch-icon-precomposed" sizes="180x180"/>
                
                    <link href="/static/images/ico/favicon-192.png/68d99ba29cc8.png" rel="icon" sizes="192x192"/>
                
            
            
                    <link color="#262626" href="/static/images/ico/favicon.svg/fc72dd4bfde8.svg" rel="mask-icon"/>
                  
                  <link href="/static/images/ico/favicon.ico/36b3ee2d91ed.ico" rel="shortcut icon" type="image/x-icon"/>
                
            
            
            
   

</head>
    <body class="" style="
    background: white;
">
        
 

</body></html>

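I tried selecting a specific div, but that did not work. I have another approach that uses JSON requests, but I would like to know how to improve this version. Thanks in advance.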
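One way to confirm what the dump above shows is to count the elements inside <body>. The sketch below uses only the standard library; `BodyElementCounter` and `count_body_elements` are names made up for this illustration, and the HTML strings in the usage note are hand-written, not real Instagram responses. A count of 0 would suggest that the markup (including any `WSpok` div) is added later by the page's scripts, so `soup.find` has nothing to match in the raw response.

```python
from html.parser import HTMLParser


class BodyElementCounter(HTMLParser):
    """Count the element start tags that appear inside <body>."""

    def __init__(self):
        super().__init__()
        self.in_body = False
        self.count = 0

    def handle_starttag(self, tag, attrs):
        if tag == "body":
            self.in_body = True
        elif self.in_body:
            self.count += 1

    def handle_endtag(self, tag):
        if tag == "body":
            self.in_body = False


def count_body_elements(html_text):
    """Return how many elements the given HTML text has inside <body>."""
    parser = BodyElementCounter()
    parser.feed(html_text)
    parser.close()
    return parser.count
```

With the request from the question, this could be called as `count_body_elements(req.text)`.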

I finally got it working with JSON:

import json

import requests


def list_of_tags(tags):
    """Return the names of the tags related to each tag in `tags`."""
    related_tags = []
    for el in tags:
        # Request the tag page with ?__a=1 and parse the response as JSON
        url = "https://www.instagram.com/explore/tags/" + el + "/?__a=1"
        req = requests.get(url)
        data = json.loads(req.text)
        # The related tags sit under graphql -> hashtag -> edge_hashtag_to_related_tags
        edges = data['graphql']['hashtag']['edge_hashtag_to_related_tags']['edges']
        for item in edges:
            related_tags.append(item['node']['name'])
    print(related_tags)
    return related_tags

This gives you all the tags related to the tags you are looking for.

Hope this helps someone.
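The chained lookup in the code above raises a `KeyError` if the parsed JSON lacks any of those keys. A minimal sketch of a more defensive version follows; `extract_related_tags` is a hypothetical helper name, and `sample` is hand-written data shaped like the path used above, not a real Instagram response.

```python
def extract_related_tags(data):
    """Return related tag names, or [] if the data lacks the expected shape."""
    try:
        edges = data["graphql"]["hashtag"]["edge_hashtag_to_related_tags"]["edges"]
    except (KeyError, TypeError):
        # A key is missing, or one of the levels is not a dict
        return []
    if not isinstance(edges, list):
        return []
    names = []
    for edge in edges:
        node = edge.get("node", {}) if isinstance(edge, dict) else {}
        if isinstance(node, dict) and "name" in node:
            names.append(node["name"])
    return names


# Hand-written sample (an assumption for illustration only)
sample = {
    "graphql": {
        "hashtag": {
            "edge_hashtag_to_related_tags": {
                "edges": [
                    {"node": {"name": "life"}},
                    {"node": {"name": "happy"}},
                ]
            }
        }
    }
}

print(extract_related_tags(sample))  # ['life', 'happy']
```

Inside `list_of_tags`, the loop over `edges` could then be replaced with `related_tags.extend(extract_related_tags(data))`.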

