
Fastest way to retrieve data from API

I am using this API to retrieve around 24,000 items.

So first I take the items list from here (Warning, slow browsers may crash).

Then I loop over all the items and find all the info for each item. Something like:

https://api.guildwars2.com/v2/items/itemidhere

and then insert the info into a MySQL database.

PS: the real question starts below this line.


I'm trying to find the fastest way to get the info from those links and insert it. For this I am using:

-GSON library (easiest and fastest way to control JSON)
-HikariCP (for database connection pools)
-Threads (each thread takes care of 1000 items if there are 24 threads)
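The "1000 items per thread with 24 threads" split can be sketched roughly as below. This is only an illustration of the partitioning idea, not the code from the question: `ItemPartitioner` and `partition` are hypothetical names, and the item ids are assumed to be integers.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not from the question): splits the item-id list
// into one contiguous slice per worker thread.
final class ItemPartitioner {
    static List<List<Integer>> partition(List<Integer> itemIds, int workers) {
        if (workers <= 0) {
            throw new IllegalArgumentException("workers must be positive");
        }
        List<List<Integer>> slices = new ArrayList<>();
        int base = itemIds.size() / workers;   // minimum slice size
        int extra = itemIds.size() % workers;  // the first 'extra' slices get one more id
        int from = 0;
        for (int w = 0; w < workers; w++) {
            int size = base + (w < extra ? 1 : 0);
            // Copy the view so each worker owns an independent list.
            slices.add(new ArrayList<>(itemIds.subList(from, from + size)));
            from += size;
        }
        return slices;
    }
}
```

With 24,000 ids and 24 workers every slice holds exactly 1000 ids; when the count does not divide evenly, the leftover ids are spread over the first few slices.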

I did some tests and here are the results for collecting and inserting the 24,000 items:

-Threads: 50
-DB Pool size: 10
-Time: 644 seconds

-Threads: 100
-DB Pool size: 10
-Time: 607 seconds

-Threads: 250
-DB Pool size: 15
-Time: 662 seconds

-Threads: 500
-DB Pool size: 20
-Time: 689 seconds

I know the slowest thing here is the network.
My computer and internet aren't that slow:
-300 Mb/s internet
-Intel 5820k
-16GB DDR4

So what's left may be the code implementation...

    HikariConfig config = new HikariConfig();
    config.setDriverClassName("com.mysql.jdbc.Driver");
    config.setJdbcUrl("jdbc:mysql://localhost:3306/mydb");
    config.setUsername("root");
    config.setPassword("none");
    config.addDataSourceProperty("cachePrepStmts", "true");
    config.addDataSourceProperty("prepStmtCacheSize", "250");
    config.addDataSourceProperty("prepStmtCacheSqlLimit", "2048");
    config.setMaximumPoolSize(poolSize); // poolSize varies per test run (10, 15, or 20 above)
    Core.ds = new HikariDataSource(config);

This is the setup for the database connection pool. I start the threads with a CyclicBarrier:

    final CyclicBarrier _threadGate = new CyclicBarrier(threadCount); // threadCount varies per test run (50 to 500 above)
    ArrayList<Thread> _threadList = new ArrayList<>();

And then

     _threadList.add(new Thread() {
            @Override
            public void run() {
                try {
                    _threadGate.await();
                    //Parsing happens a bit later

Here I loop through all the list and get the info from the url (I skipped variable declaration):

                        _id = _itemList.get(i);
                        _stringUrl = "https://api.guildwars2.com/v2/items/" + _id;
                        _responseText = new URL(_stringUrl);
                        _requestUrl = (HttpURLConnection) _responseText.openConnection();
                        _requestUrl.connect();
                        _requestStatus = _requestUrl.getResponseCode();
                        if(_requestStatus == 200){
                            _jsonParser = new JsonParser();
                            _rootElement = _jsonParser.parse(new InputStreamReader((InputStream) _requestUrl.getContent(), "UTF-8"));
                            _rootObject = _rootElement.getAsJsonObject();

Then, from the _rootObject, I do a lot of parsing and checking whether each JSON field exists, and so on. At the end comes the insertion.

Here is how I start the threads after everything is processed in the main class:

    for (int i = 0; i < _threadList.size(); i++) {
        _threadList.get(i).start();
    }

INFO: see here for why I didn't use a bigger pool size.

What I DO NOT understand is:
-why the result is slower when there are more threads
-I mean, come on, maybe the network is slow, but could a few requests really fill 300 Mb/s?
-would a better code implementation make this faster?

I actually see it like this:
-more threads -> slower internet, which makes pulling info slow.
-bigger pool size -> slower insertion due to many connections
-more threads and small connection pool -> inserts queued and stalled
-few threads and small connection pool -> slow pull info

Updates
-tried classic connection over pooled, 1 connection/pool and results are slower, like 30 seconds slower
-tried ExecutorService over CyclicBarrier, results slower by 10 seconds.
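For reference, a bounded worker pool of the kind mentioned in the second update might look roughly like the sketch below. It is not the code that was timed above: `BoundedFetcher`, `fetchAll` and the `fetchOne` callback are hypothetical names, and the callback stands in for the real HTTP request and JSON parsing.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

// Hypothetical sketch: runs one task per item id on a fixed-size pool,
// so the number of concurrent requests never exceeds poolSize.
final class BoundedFetcher {
    static <T> List<T> fetchAll(List<Integer> itemIds, int poolSize,
                                Function<Integer, T> fetchOne) {
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        try {
            List<Callable<T>> tasks = new ArrayList<>();
            for (Integer id : itemIds) {
                tasks.add(() -> fetchOne.apply(id));
            }
            List<T> results = new ArrayList<>();
            // invokeAll blocks until every task is done and keeps input order.
            for (Future<T> future : pool.invokeAll(tasks)) {
                results.add(future.get());
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while fetching", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("a fetch task failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```

The point of this shape is that the thread count becomes a single tunable number that is independent of how the id list is split, which makes experiments like the ones above easier to repeat.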

(Too many questions for a comment.)

I'm confused -- you are "retrieving data" and you are "inserting" it. Which side should we focus on? Which side do you have control over?

You are getting only 30-40 rows inserted per second? That is pathetic.
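That rate comes straight from the timings in the question: 24,000 items divided by the measured seconds. A small sketch of the arithmetic, using only the numbers posted above (the class and method names are made up for illustration):

```java
// Hypothetical helper: converts the posted timings into items per second.
final class Throughput {
    static double itemsPerSecond(int items, double seconds) {
        return items / seconds;
    }

    public static void main(String[] args) {
        int items = 24_000;
        int[] threads = {50, 100, 250, 500};  // thread counts from the question
        int[] seconds = {644, 607, 662, 689}; // measured times from the question
        for (int i = 0; i < threads.length; i++) {
            System.out.printf("%d threads: %.1f items/s%n",
                    threads[i], itemsPerSecond(items, seconds[i]));
        }
    }
}
```

All four runs land between roughly 35 and 40 items per second, so changing the thread count from 50 to 500 barely moves the rate.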

Let's focus on how you are doing the INSERTs into the MySQL table. Please provide SHOW CREATE TABLE -- I need to see the Engine and indexes and other stuff. Please provide some clues about the INSERTs -- one row at a time versus batched (10x improvement here)? Sequential versus random PRIMARY KEY? How big are the tables? How big is the buffer_pool? What version of MySQL (newer versions have some extra tricks)?
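To make "batched" concrete, here is a minimal JDBC sketch. The table `items` and its columns `id` and `name` are assumptions, since the real schema has not been posted, and `BatchInserter` is a made-up name. It queues rows with `addBatch()` and sends them in groups with `executeBatch()` inside one transaction.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.List;

// Hypothetical sketch of batched inserts; table and column names are assumed.
final class BatchInserter {
    static final String SQL = "INSERT INTO items (id, name) VALUES (?, ?)";

    // Number of executeBatch() round trips needed for 'rows' rows.
    static int batchCount(int rows, int batchSize) {
        return (rows + batchSize - 1) / batchSize;
    }

    // Each row is {id, name}; rows are sent in groups of batchSize.
    static void insertAll(Connection conn, List<String[]> rows, int batchSize)
            throws SQLException {
        boolean oldAutoCommit = conn.getAutoCommit();
        conn.setAutoCommit(false); // one commit at the end instead of one per row
        try (PreparedStatement ps = conn.prepareStatement(SQL)) {
            int pending = 0;
            for (String[] row : rows) {
                ps.setInt(1, Integer.parseInt(row[0]));
                ps.setString(2, row[1]);
                ps.addBatch();
                if (++pending == batchSize) {
                    ps.executeBatch();
                    pending = 0;
                }
            }
            if (pending > 0) {
                ps.executeBatch(); // flush the final partial batch
            }
            conn.commit();
        } catch (SQLException e) {
            conn.rollback();
            throw e;
        } finally {
            conn.setAutoCommit(oldAutoCommit);
        }
    }
}
```

With a batch size of 500, the 24,000 rows go to the server in 48 batches instead of 24,000 single-row statements. With MySQL Connector/J it may also be worth testing the `rewriteBatchedStatements=true` connection property, which lets the driver combine a batch into multi-row INSERTs; whether it helps here is something to measure, not assume.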

There is some contention between threads, so "too many" threads can actually slow down activity. But I think that is a secondary issue.

What percent of the 306Mb/s is consumed?
