
Can't scrape all the company names from a webpage

I'm trying to parse all the company names from this webpage. There are around 2431 companies listed there. However, the approach I've tried below only fetches 1000 results.

This is what I can see about the number of results in the response while going through the dev tools:

hitsPerPage: 1000
index: "YCCompany_production"
nbHits: 2431      <------------------------       
nbPages: 1
page: 0

How can I get the rest of the results using requests?

I've tried so far:

import requests

url = 'https://45bwzj1sgc-dsn.algolia.net/1/indexes/*/queries?'

params = {
    'x-algolia-agent': 'Algolia for JavaScript (3.35.1); Browser; JS Helper (3.1.0)',
    'x-algolia-application-id': '45BWZJ1SGC',
    'x-algolia-api-key': 'NDYzYmNmMTRjYzU4MDE0ZWY0MTVmMTNiYzcwYzMyODFlMjQxMWI5YmZkMjEwMDAxMzE0OTZhZGZkNDNkYWZjMHJlc3RyaWN0SW5kaWNlcz0lNUIlMjJZQ0NvbXBhbnlfcHJvZHVjdGlvbiUyMiU1RCZ0YWdGaWx0ZXJzPSU1QiUyMiUyMiU1RCZhbmFseXRpY3NUYWdzPSU1QiUyMnljZGMlMjIlNUQ='
}
payload = {"requests":[{"indexName":"YCCompany_production","params":"hitsPerPage=1000&query=&page=0&facets=%5B%22top100%22%2C%22isHiring%22%2C%22nonprofit%22%2C%22batch%22%2C%22industries%22%2C%22subindustry%22%2C%22status%22%2C%22regions%22%5D&tagFilters="}]}

with requests.Session() as s:
    s.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36'
    r = s.post(url, params=params, json=payload)
    print(len(r.json()['results'][0]['hits']))

As a workaround, you can simulate a search using the letters of the alphabet as search patterns. With the code below you will get all 2431 companies as a dictionary, with the ID as the key and the full company data dictionary as the value.

import requests
import string

params = {
    'x-algolia-agent': 'Algolia for JavaScript (3.35.1); Browser; JS Helper (3.1.0)',
    'x-algolia-application-id': '45BWZJ1SGC',
    'x-algolia-api-key': 'NDYzYmNmMTRjYzU4MDE0ZWY0MTVmMTNiYzcwYzMyODFlMjQxMWI5YmZkMjEwMDAxMzE0OTZhZGZkNDNkYWZjMHJl'
                         'c3RyaWN0SW5kaWNlcz0lNUIlMjJZQ0NvbXBhbnlfcHJvZHVjdGlvbiUyMiU1RCZ0YWdGaWx0ZXJzPSU1QiUyMiUy'
                         'MiU1RCZhbmFseXRpY3NUYWdzPSU1QiUyMnljZGMlMjIlNUQ='
}

url = 'https://45bwzj1sgc-dsn.algolia.net/1/indexes/*/queries'
result = dict()
for letter in string.ascii_lowercase:
    print(letter)

    payload = {
        "requests": [{
            "indexName": "YCCompany_production",
            "params": "hitsPerPage=1000&query=" + letter + "&page=0&facets=%5B%22top100%22%2C%22isHiring%22%2C%22nonprofit%22%2C%22batch%22%2C%22industries%22%2C%22subindustry%22%2C%22status%22%2C%22regions%22%5D&tagFilters="
        }]
    }

    r = requests.post(url, params=params, json=payload)
    result.update({h['id']: h for h in r.json()['results'][0]['hits']})

print(len(result))

UPDATE 01-04-2021

After reviewing the "fine print" in the Algolia API documentation, I discovered that the paginationLimitedTo parameter CANNOT be used in a query. This parameter can only be set during indexing by the data's owner.

It seems that you can use the query and offset this way:

payload = {"requests":[{"indexName":"YCCompany_production",
                        "params": "query=&offset=1000&length=500&facets=%5B%22top100%22%2C%22isHiring%22%2C%22nonprofit"
                                 "%22%2C%22batch%22%2C%22industries%22%2C%22subindustry%22%2C%22status%22%2C%22regions%22%5D&tagFilters="}]}

Unfortunately, the paginationLimitedTo index set by the customer will not let you retrieve more than 1000 records via the API.

{
    "hits": [],
    "nbHits": 2432,
    "offset": 1000,
    "length": 500,
    "message": "you can only fetch the 1000 hits for this query. You can extend the number of hits returned via the paginationLimitedTo index parameter or use the browse method. You can read our FAQ for more details about browsing: https://www.algolia.com/doc/faq/index-configuration/how-can-i-retrieve-all-the-records-in-my-index"
}

The browse method mentioned in that message requires the ApplicationID and the Admin API key.
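For completeness, here is a minimal sketch of driving the browse method directly over Algolia's REST endpoint, assuming you had a key carrying the `browse` ACL (the public search key in the question does not, which is why this path is closed here). The credentials are placeholders:

```python
import json
import urllib.request

def browse_url(app_id: str, index: str = 'YCCompany_production') -> str:
    """Build the REST endpoint for Algolia's browse method."""
    return f'https://{app_id.lower()}-dsn.algolia.net/1/indexes/{index}/browse'

def browse_all(app_id: str, api_key: str) -> list:
    """Follow the `cursor` field page by page until the index is exhausted."""
    hits, body = [], {}
    while True:
        req = urllib.request.Request(
            browse_url(app_id),
            data=json.dumps(body).encode(),
            headers={'X-Algolia-Application-Id': app_id,
                     'X-Algolia-API-Key': api_key,   # needs the `browse` ACL
                     'Content-Type': 'application/json'})
        page = json.load(urllib.request.urlopen(req))
        hits.extend(page['hits'])
        if 'cursor' not in page:   # the last page carries no cursor
            return hits
        body = {'cursor': page['cursor']}
```

Unlike paginated search, browse is cursor-based, so it is not subject to the paginationLimitedTo cap.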


ORIGINAL POST

Based on the Algolia API documentation there is a query hit limit of 1000.

The documentation lists several ways to override or bypass this limit.

Part of the API is paginationLimitedTo, which by default is set to 1000 for performance and "scraping protection."

The syntax is:

'paginationLimitedTo': number_of_records

Another method mentioned in the documentation is setting the parameters offset and length.

offset lets you specify the starting hit (or record)

length sets the number of records returned

You could use these parameters to walk the records, thus potentially not impacting your scraping performance.

For instance you could scrape in blocks of 500.

  • records 1-500 (offset=0 and length=500)
  • records 501-1000 (offset=500 and length=500)
  • records 1001-1500 (offset=1000 and length=500)
  • etc...
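The non-overlapping walk above can be sketched as a small generator; `total` would come from the `nbHits` field of the first response:

```python
def offset_blocks(total: int, length: int = 500):
    """Yield (offset, length) pairs covering `total` records without overlap."""
    for offset in range(0, total, length):
        yield offset, min(length, total - offset)

def block_params(offset: int, length: int) -> str:
    """Drop one block into a params string of the shape used in the question."""
    return f'query=&offset={offset}&length={length}&tagFilters='
```

For nbHits = 2431 this yields (0, 500), (500, 500), (1000, 500), (1500, 500), (2000, 431). Note that on this particular index the walk still stops at 1000 hits because of the paginationLimitedTo cap.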

or

  • records 1-500 (offset=0 and length=500)
  • records 500-999 (offset=499 and length=500)
  • records 999-1498 (offset=998 and length=500)
  • etc...

The latter scheme produces a few duplicates (each block re-fetches the last record of the previous one), which can easily be removed when adding the records to your in-memory storage (list, dictionary, dataframe).

----------------------------------------
My system information
----------------------------------------
Platform:    macOS
Python:      3.8.0
Requests:    2.25.1
----------------------------------------

Try an explicit limit value in the payload to override the API default. For instance, insert limit=2500 into your request string.

It looks like you need to set the parameter like this to override the defaults, with

   index.set_settings

  'paginationLimitedTo': number_of_records

Example use for Python:

 index.set_settings({'customRanking': ['desc(followers)']})

Further info: https://www.algolia.com/doc/api-reference/api-methods/set-settings/#examples
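A hedged sketch of that call for this case, with hypothetical credentials; set_settings only works with the index owner's Admin API key, which is why it cannot fix the scrape from the outside:

```python
# Settings payload: raise the pagination cap above the default 1000.
settings = {'paginationLimitedTo': 5000}

def apply_settings(app_id: str, admin_api_key: str,
                   index_name: str = 'YCCompany_production') -> None:
    """Push the settings with the official client (pip install algoliasearch)."""
    from algoliasearch.search_client import SearchClient
    index = SearchClient.create(app_id, admin_api_key).init_index(index_name)
    index.set_settings(settings)
```

After the setting is applied (by the owner), paginated queries on the index can return up to 5000 hits.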

There is another way to solve this problem. First, you can add &filters=objectID:SomeID.
Algolia allows you to send 1000 different queries in one request. This body will return two objects:
{"requests":[{"indexName":"YCCompany_production","params":"hitsPerPage=1000&query&filters=objectID:271"}, {"indexName":"YCCompany_production","params":"hitsPerPage=1000&query&filters=objectID:5"}]}
You can check the objectID values; they range from roughly 1 to 30000. Just send consecutive objectIDs from 1 to 30000, and with only 30 requests you will get all 3602 companies.

Here is my Java code:

    import java.io.IOException;
    import java.io.OutputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.nio.charset.StandardCharsets;
    import java.sql.Timestamp;
    import java.util.*;
    import com.google.gson.JsonArray;
    import com.google.gson.JsonObject;
    import com.google.gson.JsonParser;

    public static void main(String[] args) throws IOException {
        System.out.println("Start scraping content...>> " + new Timestamp(new Date().getTime()));
        Set<Integer> allIds = new HashSet<>();
        URL target = new URL("https://45bwzj1sgc-dsn.algolia.net/1/indexes/*/queries?x-algolia-agent=Algolia%20for%20JavaScript%20(3.35.1)%3B%20Browser%3B%20JS%20Helper%20(3.7.0)&x-algolia-application-id=45BWZJ1SGC&x-algolia-api-key=Zjk5ZmFjMzg2NmQxNTA0NGM5OGNiNWY4MzQ0NDUyNTg0MDZjMzdmMWY1NTU2YzZkZGVmYjg1ZGZjMGJlYjhkN3Jlc3RyaWN0SW5kaWNlcz1ZQ0NvbXBhbnlfcHJvZHVjdGlvbiZ0YWdGaWx0ZXJzPSU1QiUyMnljZGNfcHVibGljJTIyJTVEJmFuYWx5dGljc1RhZ3M9JTVCJTIyeWNkYyUyMiU1RA%3D%3D");
        String requestBody = "{\"requests\":[{\"indexName\":\"YCCompany_production\",\"params\":\"hitsPerPage=1000&query&filters=objectID:24638\"}]}";
        int index = 1;
        List<String> results = new ArrayList<>();
        String bodyIndex = "{\"indexName\":\"YCCompany_production\",\"params\":\"hitsPerPage=1000&query&filters=objectID:%d\"}";
        for (int i = 1; i <= 30; i++) {
            StringBuilder body = new StringBuilder("{\"requests\":[");
            for (int j = 1; j <= 1000; j++) {
                body.append(String.format(bodyIndex, index));
                body.append(",");
                index++;
            }
            body = new StringBuilder(body.substring(0, body.length() - 1));
            body.append("]}");
            HttpURLConnection con = (HttpURLConnection) target.openConnection();
            con.setDoOutput(true);
            con.setRequestMethod("POST");
            con.setRequestProperty("Content-Type", "application/json");
            OutputStream os = con.getOutputStream();
            os.write(body.toString().getBytes(StandardCharsets.UTF_8));
            os.close();
            con.connect();
            String response = new String(con.getInputStream().readAllBytes(), StandardCharsets.UTF_8);
            results.add(response);
        }
        results.forEach(result -> {
            JsonArray array = JsonParser.parseString(result).getAsJsonObject().get("results").getAsJsonArray();
            array.forEach(data -> {
                if (((JsonObject) data).get("nbHits").getAsInt() == 0) {
                    return;
                } else {
                    allIds.add(((JsonObject) data).get("hits").getAsJsonArray().get(0).getAsJsonObject().get("id").getAsInt());
                }
            });
        });
        System.out.println("Total scraped ids " + allIds.size());
        System.out.println("Finish scraping content...>>>> " + new Timestamp(new Date().getTime()));
    }
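The same batching idea can be sketched in Python, the language of the question. The body builder is pure, so only the final POST (with the same URL and x-algolia-* parameters as above) touches the network; the 3602-company figure and ID range are taken from the answer, not verified here:

```python
import json

URL = 'https://45bwzj1sgc-dsn.algolia.net/1/indexes/*/queries'

def batched_body(start_id: int, batch: int = 1000) -> str:
    """One multi-query body filtering on `batch` consecutive objectIDs."""
    reqs = [{'indexName': 'YCCompany_production',
             'params': f'hitsPerPage=1000&query=&filters=objectID:{i}'}
            for i in range(start_id, start_id + batch)]
    return json.dumps({'requests': reqs})

def collect_hits(response_json: dict) -> dict:
    """Keep only the non-empty results, keyed by company id."""
    return {r['hits'][0]['id']: r['hits'][0]
            for r in response_json['results'] if r['nbHits'] > 0}
```

POST batched_body(s) for s = 1, 1001, ..., 29001 and merge the collect_hits dictionaries to cover the whole ID range in 30 requests.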
