I'm trying to scrape https://search-intelligence.co.uk/niche-finder/ . The site contains a JavaScript-rendered table, which I'm trying to scrape with Beautiful Soup, but I'm not getting any of the table data.
import requests
from bs4 import BeautifulSoup

url = 'https://search-intelligence.co.uk/niche-finder'
headers = {
    'accept': 'application/json, text/javascript, */*; q=0.01',
    'cookie': '_ga=GA1.3.280184480.1658298012; XSRF-TOKEN=eyJpdiI6ImxJbXN4NHpuRHRkSnB0RjNnMEU1cVE9PSIsInZhbHVlIjoiTEd6V3BibHEvTG84aCtkajIxQTJ4Szk4cTZQQ2dNODJmcHpSZkltK3FZQTJOOUIvSHJDU05EV3Ztd0tUMEJCbmhxTzVFSThoQ1NNaldDWWpGNEo1Z0ZsYjlUSTVEZVBJcGw5NmIrK2NqdHh2cURVTUJyQy9JcUdMYmNWZ1FwQ3MiLCJtYWMiOiI1ZmRjNDE4YWI4OTVhZDZkMjA1OTlhNGU5Zjk1YTNkMDQxNTQyNTc0MmU3MjhiMGE5NjM0YzFkY2Q0ZjQ1NmZjIiwidGFnIjoiIn0%3D; laravel_session=eyJpdiI6InJRSDVHMEFOY3BCcHk1OGNjY2srdGc9PSIsInZhbHVlIjoiS1pxbEJJNHZtT3ZuNkgzMTlhTG1CZXBwY3VYK3JIamNaajhSU1JXRmpwS3lFMElFenpVdVpNd0Q0VlRQdFdUYi9ndFRGQytJcUxIV1pCUmxpVkEzTVRXSzZiOWFXVCs0ZHhGVldxbTRucGJjTjRNM0tQWHg4NUNUdk9aZUNxaGIiLCJtYWMiOiIwZTU0OTMzZTM1MDZkZmI5MjdjZmIzNDczNzViZGI4YzM1ODdhN2RiYzk0YzUzMzI3Njg1MjE4MTlmYzg4MmVlIiwidGFnIjoiIn0%3D; _gid=GA1.3.1135025661.1662360353'
}
page = requests.get(url, headers=headers)
soup = BeautifulSoup(page.text, 'html.parser')
print(soup)
Am I missing something here? The table shows 100 rows per page across 3,188 pages; clicking the second page loads another 100 rows.
Here's the result I'm getting: just the page's static HTML, with none of the table data.
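To double-check that the rows aren't in the server-side HTML at all, a minimal row count on the parsed page comes back empty (the table tbody tr selector is just a guess at where rendered rows would sit):

import requests
from bs4 import BeautifulSoup

# Sanity check: count table rows in the raw HTML returned by the server.
# The selector is a guess at where a rendered table's rows would live; the
# point is only that the initial response carries no row data.
page = requests.get('https://search-intelligence.co.uk/niche-finder/')
soup = BeautifulSoup(page.text, 'html.parser')
print(len(soup.select('table tbody tr')))  # expected 0: rows are injected by JavaScript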
Data in that table is loaded dynamically, via an XHR call (you can see this by inspecting the Network tab in your browser's dev tools). EDIT: the following code will pull all websites (100,000 rows at once):
import requests
import pandas as pd
from tqdm import tqdm

headers = {
    'x-requested-with': 'XMLHttpRequest',
    'referer': 'https://search-intelligence.co.uk/niche-finder/',
    'accept': 'application/json, text/javascript, */*; q=0.01',
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.5112.79 Safari/537.36'
}

big_df = pd.DataFrame()
for x in tqdm(range(1, 5)):
    # DataTables endpoint discovered in the Network tab; length=100000
    # asks the server for 100,000 rows in a single response
    url = f'https://search-intelligence.co.uk/niche-finder/data-table?draw={x}&columns%5B0%5D%5Bdata%5D=domain&columns%5B0%5D%5Bname%5D=domain&columns%5B0%5D%5Bsearchable%5D=true&columns%5B0%5D%5Borderable%5D=true&columns%5B0%5D%5Bsearch%5D%5Bvalue%5D=&columns%5B0%5D%5Bsearch%5D%5Bregex%5D=false&columns%5B1%5D%5Bdata%5D=import_date&columns%5B1%5D%5Bname%5D=import_date&columns%5B1%5D%5Bsearchable%5D=true&columns%5B1%5D%5Borderable%5D=true&columns%5B1%5D%5Bsearch%5D%5Bvalue%5D=&columns%5B1%5D%5Bsearch%5D%5Bregex%5D=false&columns%5B2%5D%5Bdata%5D=source&columns%5B2%5D%5Bname%5D=source&columns%5B2%5D%5Bsearchable%5D=true&columns%5B2%5D%5Borderable%5D=true&columns%5B2%5D%5Bsearch%5D%5Bvalue%5D=&columns%5B2%5D%5Bsearch%5D%5Bregex%5D=false&columns%5B3%5D%5Bdata%5D=referring_domains&columns%5B3%5D%5Bname%5D=referring_domains&columns%5B3%5D%5Bsearchable%5D=true&columns%5B3%5D%5Borderable%5D=true&columns%5B3%5D%5Bsearch%5D%5Bvalue%5D=&columns%5B3%5D%5Bsearch%5D%5Bregex%5D=false&columns%5B4%5D%5Bdata%5D=dr&columns%5B4%5D%5Bname%5D=dr&columns%5B4%5D%5Bsearchable%5D=true&columns%5B4%5D%5Borderable%5D=true&columns%5B4%5D%5Bsearch%5D%5Bvalue%5D=&columns%5B4%5D%5Bsearch%5D%5Bregex%5D=false&columns%5B5%5D%5Bdata%5D=traffic&columns%5B5%5D%5Bname%5D=traffic&columns%5B5%5D%5Bsearchable%5D=true&columns%5B5%5D%5Borderable%5D=true&columns%5B5%5D%5Bsearch%5D%5Bvalue%5D=&columns%5B5%5D%5Bsearch%5D%5Bregex%5D=false&columns%5B6%5D%5Bdata%5D=pages&columns%5B6%5D%5Bname%5D=pages&columns%5B6%5D%5Bsearchable%5D=true&columns%5B6%5D%5Borderable%5D=true&columns%5B6%5D%5Bsearch%5D%5Bvalue%5D=&columns%5B6%5D%5Bsearch%5D%5Bregex%5D=false&order%5B0%5D%5Bcolumn%5D=0&order%5B0%5D%5Bdir%5D=asc&start=0&length=100000&search%5Bvalue%5D=&search%5Bregex%5D=false&_=1662370391010'
    r = requests.get(url, headers=headers)
    # flatten the JSON 'data' array into a DataFrame
    df = pd.json_normalize(r.json()['data'])
    big_df = pd.concat([big_df, df], axis=0, ignore_index=True)
big_df = big_df.drop_duplicates()
print(big_df)
big_df.to_csv('all_that_jazz.csv')
This will print out in the terminal:
id domain import_date processed processed_date source referring_domains dr traffic created_at updated_at pages
0 66064 0-60specs.com 2022-05-10 15:53:50 1 2022-07-18T21:27:02.000000Z janetpanic-competitors 2,156 29 70,448 2022-05-10T15:53:50.000000Z 2022-07-18T21:27:02.000000Z 2,633
1 115705 0-jayparts.com 2022-05-12 20:33:18 1 2022-07-18T03:52:10.000000Z coinchefs-competitors 129 7 65 2022-05-12T20:33:18.000000Z 2022-07-18T03:52:10.000000Z 360,422
2 238990 000webhost.com 2022-05-14 20:56:52 1 2022-07-16T09:59:45.000000Z stackoverflow-competitors 110,073 91 20,503 2022-05-14T20:56:52.000000Z 2022-07-16T09:59:45.000000Z 752,768
3 120971 000webhostapp.com 2022-05-14 15:02:26 1 2022-07-18T02:02:28.000000Z enduringworld-competitors 161,883 89 2,266 2022-05-14T15:02:26.000000Z 2022-07-18T02:02:28.000000Z 8,031,647
4 160779 001.com.ua 2022-05-14 16:42:57 1 2022-07-17T12:32:44.000000Z flightpedia-competitors 198 5 1 2022-05-14T16:42:57.000000Z 2022-07-17T12:32:44.000000Z 54,885
... ... ... ... ... ... ... ... ... ... ... ... ...
895 195393 10015.io 2022-05-14 17:30:52 1 2022-07-17T00:44:48.000000Z calculatorsoup-competitors 666 45 3,650 2022-05-14T17:30:52.000000Z 2022-07-17T00:44:48.000000Z 369
896 206115 1001albumsgenerator.com 2022-05-14 17:41:34 1 2022-07-16T21:08:14.000000Z azlyrics-competitors 415 23 71 2022-05-14T17:41:34.000000Z 2022-07-16T21:08:14.000000Z 18,779
897 6005 1001ebook.net 2022-04-01 08:44:19 1 2022-07-19T19:24:26.000000Z ezoic 113 44 2022-04-01T08:44:19.000000Z 2022-07-19T19:24:26.000000Z 222
898 202571 1001fonts.com 2022-05-14 17:37:19 1 2022-07-16T22:20:39.000000Z calculatorsoup-competitors 15,471 78 676,780 2022-05-14T17:37:19.000000Z 2022-07-16T22:20:39.000000Z 7,776,775
899 152162 1001freefonts.com 2022-05-14 16:35:22 1 2022-07-17T15:28:19.000000Z flightpedia-competitors 16,984 77 274,239 2022-05-14T16:35:22.000000Z 2022-07-17T15:28:19.000000Z 1,848,434
[....]
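Note that the long query string is just URL-encoded DataTables parameters (%5B/%5D are [ and ]), so columns%5B0%5D%5Bdata%5D=domain reads columns[0][data]=domain, and start/length control server-side paging. A gentler alternative to one 100,000-row request is to page through with start/length. The following is a minimal sketch, not tested against this site: it assumes the endpoint honours bare start/length parameters without the full columns[...] block, and that the response includes the standard DataTables recordsTotal field; if the server rejects the stripped-down request, the original query-string parameters would need to be carried over.

import requests
import pandas as pd

headers = {
    'x-requested-with': 'XMLHttpRequest',
    'referer': 'https://search-intelligence.co.uk/niche-finder/',
    'accept': 'application/json, text/javascript, */*; q=0.01',
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.5112.79 Safari/537.36'
}
url = 'https://search-intelligence.co.uk/niche-finder/data-table'

page_size = 1000  # rows per request; far smaller than 100,000 to be polite
frames = []
start = 0
while True:
    # start/length are the standard DataTables server-side paging parameters
    params = {'draw': 1, 'start': start, 'length': page_size}
    payload = requests.get(url, headers=headers, params=params).json()
    rows = payload.get('data', [])
    if not rows:
        break  # empty page: nothing left to fetch
    frames.append(pd.json_normalize(rows))
    start += page_size
    # recordsTotal is assumed present, per the usual DataTables protocol
    total = payload.get('recordsTotal')
    if total is not None and start >= int(total):
        break

if frames:
    big_df = pd.concat(frames, ignore_index=True).drop_duplicates()
    big_df.to_csv('all_that_jazz_paged.csv')  # hypothetical output filename
    print(big_df)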