AWS Aurora, as AWS describes, is a MySQL and PostgreSQL compatible relational database built for the cloud.
We moved one of our largest database to AWS Aurora in April 2017 and experienced a series of interesting events with it. So I think it’s time for me to write something about it, (right before reInvent 2017!!).
Note this post will be focus on AWS Aurora MySQL, I haven’t used Aurora PostgreSQL yet.
First of all, I have to say that I really like the design of Aurora, it innovatively replacing the storage layer of MySQL with a multi-tenant scale-out storage layer and did lots of other smart things which you can find out more from this paper so I won’t spend too much time there.
Although Aurora implements its storage layer in a distributed fault tolerant fashion, it still behaves like your favourite MySQL database and it is a perfect drop-in scalable replacement for your MySQL stores.
Aurora still only has one writer but you can add up to 15 readers to one single cluster, because storage layer is shared between writer and readers, the replica lag is minimized in most of the cases :-)
Those are enough good words, what I really want to share in this post are the issues we met with Aurora, some of them got solved eventually but some of them are ongoing.
This happened short after we moved to Aurora and were doing our first Aurora maintenance - upgrade Aurora version.
And tragically we found out, after we shutdown Aurora to restart it, it couldn’t standup anymore!
It took AWS and us quite some time to find out, this tragic was due to an ALTER TABLE statements we did before that corrupted the metadata written to the storage layer. The cluster was healthy before the restart because Aurora caches everything in its memory and didn’t load the corrupted metadata until a restart was initiated.
It is really an edge case but we are unfortunate enough to hit. AWS later fixed this in Aurora 1.15, the fix is described as
Fixed InnoDB metadata inconsistency during ALTER TABLE RENAME query which improved stability. Example: When columns of table t1(c1, c2) are renamed cyclically to t1(c2,c3) within the same ALTER statement.
As I mentioned and Auora proud of itself, writer and readers are sharing one distributed storage layer and writer is only passing page caches to readers which significantly reduces the IO between instances thus Aurora can provide much better write and read performance than traditional MySQL.
This characteristic of Aurora gives the readers an advantage that when they see a hugh replication lag, they can restart to recover. Because no data is passed from master, everything is in the storage layer, if a reader ever sees a 100 seconds replication lag, there must be something wrong by itself and the quickest way to recover itself, according to Aurora, is restart…
This reader auto recovery mechanism has been torturing us. When our writer workload went really high, readers restart because it couldn’t tell if it’s something wrong in themselves or it’s because writer failed to pass the freaking page caches in time. To be honest, neither do we.
Because AWS couldn’t figure out why their readers keep restarting, our services have to be more fault tolerant. So queries will be retried automatically when we found high replication lag and receive an connection error from Aurora. Circuit breaker wraps the connection so it won’t cause catastrophic damages, writer needs to slow down a bit when this happens etc.
Reader Endpoint Load Balancing
Aurora provides a reader endpoint which helps you balance the load to its reader instances, it behaves like a TCP ELB which probably really is what it is. It works very well in the ideal cases, load came, load balanced, load returned. Easy job, isn’t it?
But the fact that readers can restart broke the idealism. MySQL connections are persistent and clearly reader endpoing (ELB) balances the workload in a request based basis, so it will not dispatch requests to the newly restarted instances for quite sometime because clients still hold connections to other instances and it won’t create new connections until it reaches the limit of the connection pool so no load can actually be balanced by the ELB!
So we have to force timeout Aurora connections from client side periodically, basically refresh the connection to reader instances to ensure the load will be balanced.
If you are not familiar with it, Go’s
sql package has a function
SetConnMaxLifetime could help you with it.
OK, this is something really interesting. The tables we have in Aurora are quite large and that’s the main reason we moved from RDS MySQL to Aurora, however, we haven’t deleted the old table in RDS MySQL yet just for the sake of safety guard. And we have a particular query pattern which behaves very well in RDS MySQL but poorly in Aurora which is quite surprising because Aurora claims it has much better performance and in most of the cases it does.
I can’t share too much about the query itself here but it contains an IN clause and a time range constraint, for some reasons Aurora keeps choosing the wrong index for this query and we are planning to modify the value of
innodb_persistent_stats_sample_pages then see if it helps Aurora to figure better indexes, this is still a work in progress and I will try to share the optimization results later in the blog.
In summary, Aurora is good product and I think we will keep using it, I just hope AWS will give enough and long-lasting support to it because it is still a young database and needs lots of improvements.
And finally, I hope AWS will announce Aurora launching in Singapore during reInvent, we have been using it in Mumbai from Singapore for quite some time and the latency penalty is not just a bit :-(»
It has been a long while since the last time I sat down and let feelings flow from mind to paper.
2016 has been generally gorgeous so far, especially I moved from Singapore to Seattle in October and have not regretted the decision yet :-)
I visited Qinghai and Tibet in September, the journey includes a 2200KM road trip, 48 hours travel by train and 3 flghts across China, it was so fun and inspiring that no words can really describe it. So let me try with a few photos.
In terms of study and work, I have read some good books and tried lots of new things during 2016 that I really enjoyed and learnt a lot.
I have spent most of this year building varies data services and streaming platform in Grab and be honored to work with some of the most brilliant engineers I have ever met, we have managed to provide core data accesses to 90% of services in Grab with under 10 milliseconds latency averagely and stays reliable under 15000+ requests per second peak. The streaming platform we’ve built now produces hundreds of gigabytes of data per day and delivers billions of events to multiple subscribers in subsecond.
And this is just a beginning, hopefully we can keep learning, keep innovating and keep growing in 2017.
For some of you might interested in the books I found interesting and inspiring, here is list of them in different categories:
- Code Complete: A Practical Handbook of Software Construction, Second Edition
- Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems
- Site Reliability Engineering: How Google Runs Production Systems
- Onward: How Starbucks Fought for Its Life without Losing Its Soul
- Too Big to Fail: The Inside Story of How Wall Street and Washington Fought to Save the Financial System–and Themselves
For those who find what we are doing in Grab interesting and want to explore more, we are actually hiring!»