
Speed up operations on MySQL

I'm currently writing a Java project against MySQL in a cluster with ten nodes. The program simply pulls some information from the database, does some calculation, then pushes some data back to the database. However, there are millions of rows in the table. Is there any way to split up the job and utilize the cluster architecture? How can I do multi-threading across different nodes?

I watched an interesting presentation on using Gearman to do Map/Reduce-style work on a MySQL database. It might be what you are looking for: see here. There is a recording on the MySQL webpage here (you have to register for mysql.com, though).

Unless your calculation is very complex, most of the time will be spent retrieving data from MySQL and sending the results back to MySQL.

As you have a single database, no amount of parallelism or clustering on the application side will make much difference.

So your best options would be to do the update in pure SQL if that is at all possible, or to use a stored procedure, so that all processing takes place within the MySQL server and no data movement is required.
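
To illustrate the difference, here is a minimal JDBC sketch of the pure-SQL approach. The table and column names (`accounts`, `balance`, `account_type`) and the connection details are hypothetical placeholders; the point is that the whole calculation runs inside the server and no rows cross the network to the Java tier:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;
import java.sql.Statement;

public class ServerSideUpdate {
    public static void main(String[] args) throws SQLException {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:mysql://dbhost:3306/mydb", "user", "password");
             Statement stmt = conn.createStatement()) {
            // The entire read-calculate-write cycle happens inside MySQL;
            // only this statement and the row count cross the wire.
            int updated = stmt.executeUpdate(
                "UPDATE accounts SET balance = balance * 1.05 " +
                "WHERE account_type = 'SAVINGS'");
            System.out.println("Rows updated: " + updated);
        }
    }
}
```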

If this is not fast enough, then you will need to split your database among several instances of MySQL and come up with some scheme to partition the data based on an application key.
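
A rough sketch of what such partitioning could look like on the application side, assuming (purely for illustration) that rows are split across instances by hashing a numeric customer id; the hosts and credentials are placeholders:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;

public class ShardRouter {
    // One JDBC URL per MySQL instance; hosts are illustrative.
    private static final String[] SHARD_URLS = {
        "jdbc:mysql://db0:3306/mydb",
        "jdbc:mysql://db1:3306/mydb",
        "jdbc:mysql://db2:3306/mydb"
    };

    /** Pick a shard deterministically from the application key. */
    public static Connection connectionFor(long customerId) throws SQLException {
        // floorMod keeps the index non-negative even for negative keys.
        int shard = (int) Math.floorMod(customerId, (long) SHARD_URLS.length);
        return DriverManager.getConnection(SHARD_URLS[shard], "user", "password");
    }
}
```

Every query for a given customer must then go through this router, which is exactly the kind of change to the database access layer that a later answer warns about.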

Since you only have one MySQL server, make sure you're using the InnoDB engine to reduce table locking on updates (InnoDB locks individual rows, whereas MyISAM locks the whole table on every write).
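
If the table is still on MyISAM, converting it is a one-off DDL statement; the table name and connection details below are placeholders:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;
import java.sql.Statement;

public class ConvertToInnoDb {
    public static void main(String[] args) throws SQLException {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:mysql://dbhost:3306/mydb", "user", "password");
             Statement stmt = conn.createStatement()) {
            // After this, concurrent updates from several worker
            // nodes contend on rows rather than on the whole table.
            stmt.executeUpdate("ALTER TABLE accounts ENGINE=InnoDB");
        }
    }
}
```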

Also, I'd try to keep your queries as simple as possible, even if you have to run more of them. This increases the chances of query-cache hits and reduces the overall workload on the backend, offloading some of the query matching and work to the frontends (where you have more resources). It also reduces the time a row lock is held, thereby decreasing contention.
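
One hedged example of that trade-off: instead of a join that keeps locks on the large table while it runs, issue two simple queries and do the matching in the Java tier. All table and column names here are illustrative:

```java
import java.sql.*;
import java.util.HashMap;
import java.util.Map;

public class SimpleQueries {
    public static void main(String[] args) throws SQLException {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:mysql://dbhost:3306/mydb", "user", "password");
             Statement stmt = conn.createStatement()) {

            // First simple query: a small lookup table, cheap and cacheable.
            Map<Long, String> customerNames = new HashMap<>();
            try (ResultSet rs = stmt.executeQuery(
                    "SELECT id, name FROM customers")) {
                while (rs.next()) {
                    customerNames.put(rs.getLong("id"), rs.getString("name"));
                }
            }

            // Second simple query: no join, so locks on the big table
            // are held for a shorter time.
            try (ResultSet rs = stmt.executeQuery(
                    "SELECT id, customer_id, total FROM orders WHERE processed = 0")) {
                while (rs.next()) {
                    // The "join" happens here, in the application tier.
                    String name = customerNames.get(rs.getLong("customer_id"));
                    System.out.println(rs.getLong("id") + " " + name
                            + " " + rs.getBigDecimal("total"));
                }
            }
        }
    }
}
```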

The proposed Gearman solution is probably the right tool for this job, as it will let you offload batch processing from MySQL to the cluster transparently.

You could set up sharding with a MySQL instance on each machine, but the setup time, maintenance, and changes to the database access layer might be a lot of work compared to a Gearman solution. You might also want to look at the experimental Spider storage engine, which could allow you to use multiple MySQL servers in unison.

I'd think about doing that calculation in a stored procedure on the database server and passing on bringing millions of rows to the middle tier. You'll save yourself a lot of bytes on the wire. Depending on the nature of the calculation, your schema, indexing, etc., you might find that the database server is well equipped to do the calculation without your having to resort to multi-threading.
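
Calling such a procedure from Java is a one-liner with a CallableStatement. `recalc_balances` and its parameter are hypothetical; substitute whatever procedure you define on the server:

```java
import java.sql.CallableStatement;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;

public class StoredProcCall {
    public static void main(String[] args) throws SQLException {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:mysql://dbhost:3306/mydb", "user", "password");
             // recalc_balances is an assumed server-side procedure that
             // scans the table and writes its results in place; only the
             // call itself crosses the wire.
             CallableStatement cs = conn.prepareCall("{call recalc_balances(?)}")) {
            cs.setInt(1, 2024); // e.g. a reporting-period parameter
            cs.execute();
        }
    }
}
```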

I could be wrong, but it's worth a prototype to see.

Assume the table (A) you want to process has 10 million rows. Create a table B in the database to record which ranges of rows each node has claimed. Write the Java program so that each node first fetches the last row processed by the other nodes, then adds an entry to the same table telling the others which range of rows it is going to process (you can decide the batch size). In our case, let's assume each node processes 1000 rows at a time. Node 1 reads table B and finds it empty, so it inserts a row ('Node1', 1000), announcing that it will process rows with primary key of A <= 1000 (assuming the primary key of table A is numeric and ascending). Node 2 then comes along, sees that the first 1000 primary keys are being processed by another node, and inserts a row ('Node2', 2000), announcing that it will process rows 1001 through 2000. Please note that access to table B must be synchronized, i.e. only one node can work on it at a time.
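
A sketch of that claim protocol in JDBC follows, with table and column names taken from the description above (`B`, `node`, `last_row`). One way to get the required synchronization, assuming InnoDB, is a `SELECT ... FOR UPDATE` inside a transaction, which blocks concurrent claimers until the insert commits:

```java
import java.sql.*;

public class RangeClaimer {
    private static final int BATCH = 1000;

    /**
     * Claims the next batch for this node and returns the upper bound
     * of the claimed range; the node then owns rows (end - BATCH, end].
     */
    public static long claimRange(Connection conn, String nodeName) throws SQLException {
        conn.setAutoCommit(false);
        try {
            long last = 0;
            // FOR UPDATE locks the rows (and gap) read here, so only one
            // node at a time can read the high-water mark and extend it.
            try (Statement stmt = conn.createStatement();
                 ResultSet rs = stmt.executeQuery(
                     "SELECT COALESCE(MAX(last_row), 0) FROM B FOR UPDATE")) {
                if (rs.next()) last = rs.getLong(1);
            }
            long end = last + BATCH;
            try (PreparedStatement ps = conn.prepareStatement(
                    "INSERT INTO B (node, last_row) VALUES (?, ?)")) {
                ps.setString(1, nodeName);
                ps.setLong(2, end);
                ps.executeUpdate();
            }
            conn.commit();
            return end;
        } catch (SQLException e) {
            conn.rollback();
            throw e;
        } finally {
            conn.setAutoCommit(true);
        }
    }
}
```

Each node then loops: claim a range, process the rows of A whose primary key falls in that range, and claim again until the end of the table is reached.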
