
Scrape a JavaScript table with BeautifulSoup when the table data loads on every click

I'm trying to scrape https://search-intelligence.co.uk/niche-finder/. The site contains a JavaScript table that I'm trying to scrape with Beautiful Soup, but I can't get at the table data.

import requests
from bs4 import BeautifulSoup
  
url = 'https://search-intelligence.co.uk/niche-finder'
headers = {
   'value': 'application/json, text/javascript, */*; q=0.01',
   'accept': 'application/json, text/javascript, */*; q=0.01',
   'cookie': '_ga=GA1.3.280184480.1658298012; XSRF-TOKEN=eyJpdiI6ImxJbXN4NHpuRHRkSnB0RjNnMEU1cVE9PSIsInZhbHVlIjoiTEd6V3BibHEvTG84aCtkajIxQTJ4Szk4cTZQQ2dNODJmcHpSZkltK3FZQTJOOUIvSHJDU05EV3Ztd0tUMEJCbmhxTzVFSThoQ1NNaldDWWpGNEo1Z0ZsYjlUSTVEZVBJcGw5NmIrK2NqdHh2cURVTUJyQy9JcUdMYmNWZ1FwQ3MiLCJtYWMiOiI1ZmRjNDE4YWI4OTVhZDZkMjA1OTlhNGU5Zjk1YTNkMDQxNTQyNTc0MmU3MjhiMGE5NjM0YzFkY2Q0ZjQ1NmZjIiwidGFnIjoiIn0%3D; laravel_session=eyJpdiI6InJRSDVHMEFOY3BCcHk1OGNjY2srdGc9PSIsInZhbHVlIjoiS1pxbEJJNHZtT3ZuNkgzMTlhTG1CZXBwY3VYK3JIamNaajhSU1JXRmpwS3lFMElFenpVdVpNd0Q0VlRQdFdUYi9ndFRGQytJcUxIV1pCUmxpVkEzTVRXSzZiOWFXVCs0ZHhGVldxbTRucGJjTjRNM0tQWHg4NUNUdk9aZUNxaGIiLCJtYWMiOiIwZTU0OTMzZTM1MDZkZmI5MjdjZmIzNDczNzViZGI4YzM1ODdhN2RiYzk0YzUzMzI3Njg1MjE4MTlmYzg4MmVlIiwidGFnIjoiIn0%3D; _gid=GA1.3.1135025661.1662360353'
}
  
page = requests.get(url, headers=headers)

soup=BeautifulSoup(page.text, 'html.parser')

print (soup)

Am I missing something here? The table shows 100 rows per page across 3188 pages. Clicking through to the second page loads another 100 rows of data.


Here's the output that I'm getting.

The data in that table is loaded dynamically via an XHR call (you can see this in the Network tab of your browser's dev tools), so it is not in the HTML that requests downloads and BeautifulSoup parses. EDIT: the following code will pull all websites (100k rows per request):

import requests
import pandas as pd
from tqdm import tqdm

headers = {
    'x-requested-with': 'XMLHttpRequest',
    'referer': 'https://search-intelligence.co.uk/niche-finder/',
    'accept': 'application/json, text/javascript, */*; q=0.01',
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.5112.79 Safari/537.36'
}

frames = []
# The table has about 3188 pages x 100 rows, so four requests of 100,000 rows cover it.
for x in tqdm(range(1, 5)):
    # 'start' is the row offset and must advance with each request;
    # otherwise every iteration fetches the same first 100,000 rows.
    start = (x - 1) * 100000
    url = f'https://search-intelligence.co.uk/niche-finder/data-table?draw={x}&columns%5B0%5D%5Bdata%5D=domain&columns%5B0%5D%5Bname%5D=domain&columns%5B0%5D%5Bsearchable%5D=true&columns%5B0%5D%5Borderable%5D=true&columns%5B0%5D%5Bsearch%5D%5Bvalue%5D=&columns%5B0%5D%5Bsearch%5D%5Bregex%5D=false&columns%5B1%5D%5Bdata%5D=import_date&columns%5B1%5D%5Bname%5D=import_date&columns%5B1%5D%5Bsearchable%5D=true&columns%5B1%5D%5Borderable%5D=true&columns%5B1%5D%5Bsearch%5D%5Bvalue%5D=&columns%5B1%5D%5Bsearch%5D%5Bregex%5D=false&columns%5B2%5D%5Bdata%5D=source&columns%5B2%5D%5Bname%5D=source&columns%5B2%5D%5Bsearchable%5D=true&columns%5B2%5D%5Borderable%5D=true&columns%5B2%5D%5Bsearch%5D%5Bvalue%5D=&columns%5B2%5D%5Bsearch%5D%5Bregex%5D=false&columns%5B3%5D%5Bdata%5D=referring_domains&columns%5B3%5D%5Bname%5D=referring_domains&columns%5B3%5D%5Bsearchable%5D=true&columns%5B3%5D%5Borderable%5D=true&columns%5B3%5D%5Bsearch%5D%5Bvalue%5D=&columns%5B3%5D%5Bsearch%5D%5Bregex%5D=false&columns%5B4%5D%5Bdata%5D=dr&columns%5B4%5D%5Bname%5D=dr&columns%5B4%5D%5Bsearchable%5D=true&columns%5B4%5D%5Borderable%5D=true&columns%5B4%5D%5Bsearch%5D%5Bvalue%5D=&columns%5B4%5D%5Bsearch%5D%5Bregex%5D=false&columns%5B5%5D%5Bdata%5D=traffic&columns%5B5%5D%5Bname%5D=traffic&columns%5B5%5D%5Bsearchable%5D=true&columns%5B5%5D%5Borderable%5D=true&columns%5B5%5D%5Bsearch%5D%5Bvalue%5D=&columns%5B5%5D%5Bsearch%5D%5Bregex%5D=false&columns%5B6%5D%5Bdata%5D=pages&columns%5B6%5D%5Bname%5D=pages&columns%5B6%5D%5Bsearchable%5D=true&columns%5B6%5D%5Borderable%5D=true&columns%5B6%5D%5Bsearch%5D%5Bvalue%5D=&columns%5B6%5D%5Bsearch%5D%5Bregex%5D=false&order%5B0%5D%5Bcolumn%5D=0&order%5B0%5D%5Bdir%5D=asc&start={start}&length=100000&search%5Bvalue%5D=&search%5Bregex%5D=false&_=1662370391010'

    r = requests.get(url, headers=headers)
    # The JSON response keeps the table rows under the 'data' key.
    frames.append(pd.json_normalize(r.json()['data']))
# Combine all chunks once, then drop any rows that were returned twice.
big_df = pd.concat(frames, axis=0, ignore_index=True).drop_duplicates()
print(big_df)
big_df.to_csv('all_that_jazz.csv')

This prints the following in the terminal:

    id  domain  import_date processed   processed_date  source  referring_domains   dr  traffic created_at  updated_at  pages
0   66064   0-60specs.com   2022-05-10 15:53:50 1   2022-07-18T21:27:02.000000Z janetpanic-competitors  2,156   29  70,448  2022-05-10T15:53:50.000000Z 2022-07-18T21:27:02.000000Z 2,633
1   115705  0-jayparts.com  2022-05-12 20:33:18 1   2022-07-18T03:52:10.000000Z coinchefs-competitors   129 7   65  2022-05-12T20:33:18.000000Z 2022-07-18T03:52:10.000000Z 360,422
2   238990  000webhost.com  2022-05-14 20:56:52 1   2022-07-16T09:59:45.000000Z stackoverflow-competitors   110,073 91  20,503  2022-05-14T20:56:52.000000Z 2022-07-16T09:59:45.000000Z 752,768
3   120971  000webhostapp.com   2022-05-14 15:02:26 1   2022-07-18T02:02:28.000000Z enduringworld-competitors   161,883 89  2,266   2022-05-14T15:02:26.000000Z 2022-07-18T02:02:28.000000Z 8,031,647
4   160779  001.com.ua  2022-05-14 16:42:57 1   2022-07-17T12:32:44.000000Z flightpedia-competitors 198 5   1   2022-05-14T16:42:57.000000Z 2022-07-17T12:32:44.000000Z 54,885
... ... ... ... ... ... ... ... ... ... ... ... ...
895 195393  10015.io    2022-05-14 17:30:52 1   2022-07-17T00:44:48.000000Z calculatorsoup-competitors  666 45  3,650   2022-05-14T17:30:52.000000Z 2022-07-17T00:44:48.000000Z 369
896 206115  1001albumsgenerator.com 2022-05-14 17:41:34 1   2022-07-16T21:08:14.000000Z azlyrics-competitors    415 23  71  2022-05-14T17:41:34.000000Z 2022-07-16T21:08:14.000000Z 18,779
897 6005    1001ebook.net   2022-04-01 08:44:19 1   2022-07-19T19:24:26.000000Z ezoic   113     44  2022-04-01T08:44:19.000000Z 2022-07-19T19:24:26.000000Z 222
898 202571  1001fonts.com   2022-05-14 17:37:19 1   2022-07-16T22:20:39.000000Z calculatorsoup-competitors  15,471  78  676,780 2022-05-14T17:37:19.000000Z 2022-07-16T22:20:39.000000Z 7,776,775
899 152162  1001freefonts.com   2022-05-14 16:35:22 1   2022-07-17T15:28:19.000000Z flightpedia-competitors 16,984  77  274,239 2022-05-14T16:35:22.000000Z 2022-07-17T15:28:19.000000Z 1,848,434
[....]
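The long query string in the request above can also be built from a dictionary, which makes the paging parameters (`start` and `length`) easier to see and change. The following is only a sketch under some assumptions: the helper names `build_params`, `page_offsets` and `fetch_all` are made up for illustration, the `_` cache-busting parameter is left out on the guess that the server does not need it, and the response is assumed to carry a `recordsTotal` field, as DataTables server-side endpoints normally do.

```python
BASE_URL = "https://search-intelligence.co.uk/niche-finder/data-table"
COLUMNS = ["domain", "import_date", "source", "referring_domains",
           "dr", "traffic", "pages"]
HEADERS = {
    "x-requested-with": "XMLHttpRequest",
    "referer": "https://search-intelligence.co.uk/niche-finder/",
    "accept": "application/json, text/javascript, */*; q=0.01",
}


def build_params(start, length, draw=1):
    """Build the query parameters for one page of the table."""
    params = {
        "draw": draw,
        "start": start,    # row offset of the first record to return
        "length": length,  # number of records to return
        "order[0][column]": 0,
        "order[0][dir]": "asc",
        "search[value]": "",
        "search[regex]": "false",
    }
    for i, name in enumerate(COLUMNS):
        params[f"columns[{i}][data]"] = name
        params[f"columns[{i}][name]"] = name
        params[f"columns[{i}][searchable]"] = "true"
        params[f"columns[{i}][orderable]"] = "true"
        params[f"columns[{i}][search][value]"] = ""
        params[f"columns[{i}][search][regex]"] = "false"
    return params


def page_offsets(total, length):
    """Return the 'start' offsets needed to cover 'total' rows."""
    return list(range(0, total, length))


def fetch_all(length=100_000):
    """Download every row, 'length' rows per request (hypothetical helper)."""
    # Third-party imports live here so the helpers above work without them.
    import pandas as pd
    import requests

    session = requests.Session()
    first = session.get(BASE_URL, headers=HEADERS,
                        params=build_params(0, length), timeout=60)
    first.raise_for_status()
    payload = first.json()
    # Assumption: the endpoint reports the total row count as 'recordsTotal'.
    total = int(payload["recordsTotal"])
    frames = [pd.json_normalize(payload["data"])]
    # The first chunk is already downloaded, so skip offset 0.
    for draw, start in enumerate(page_offsets(total, length)[1:], start=2):
        r = session.get(BASE_URL, headers=HEADERS,
                        params=build_params(start, length, draw), timeout=60)
        r.raise_for_status()
        frames.append(pd.json_normalize(r.json()["data"]))
    return pd.concat(frames, ignore_index=True)
```

Calling `fetch_all()` should return one DataFrame that can be written out with `to_csv`, as in the code above. If the server caps `length` at a lower value than requested, a smaller `length` (for example 100, the page size the site itself uses) would be the safer choice, at the cost of many more requests.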

