From 2020 to early 2022, a lot of companies, from startups to Big Tech, are spending flagrantly. They spend a ton of money on hiring top talents, office perks, and tech infrastructure. All the big Cloud Providers are enjoying this moment. Everyone was hosting their app infrastructure in the cloud.
But the party was over towards the end of 2022. A lot of companies cut their spending. This is a journey where I cut our team’s Cloud Infrastructure cost. Toward the end, we cut the cloud bill by over 75%.
There are 3 interesting areas that we explored for cost savings. Database consolidation, Kubernetes cluster rightsizing, and autoscaling GitLab runner. In this article, I want to deep dive into the database consolidation process. I think the most challenging part of the journey.
For context, our backend infrastructure uses PostgreSQL for the database. My predecessor(s) designed the system with microservices architecture. For reasons unknown, the system evolved into a big pool of services, and each service has its database. With scaling in mind (I assume), it was set up so that each database has its own instance (CPU, memory, and disk) on our cloud provider (GCP). Additionally, each database has read replica, for even more scale (I think?). Everything is set up with Cloud SQL – the managed PostgreSQL product of Google Cloud.
Our backend had 9 database instances, each with read replicas (18 instances in total). To give an idea, GCP charges around $307 per month for an instance with 2 vCPU, 8 GB memory, and 100 GB disk. This is with high availability in our preferred region.
After getting access and all the credentials, I looked at the various metrics provided by the GCP. Utilization rate (CPU usage, memory usage, transaction/sec, etc). All metrics going back 6 months average 0.1x utilization, with the peak around 0.2. After spending more time with the team, I had a better understanding of our apps and services. I determined we don’t need this capacity.
My goal was to consolidate all our databases into a single Cloud SQL instance. It will have enough capacity for our production needs. It will be able to handle the peak traffic and be cost-efficient.
There’s no easy way to migrate from a Cloud SQL instance to an existing one. Google provides great tools to move data into or out of a Cloud SQL instance. But there’s no way to merge Cloud SQL instances into one. Such features might have been in their roadmap, but I don’t think we want to pay the bill for 18 instances while waiting.
With the goal set, I started researching. I explored different approaches to migrate PostgreSQL DB, with Cloud SQL restrictions in mind.For extra challenges (pain or fun, you tell me), I wanted to do it with minimal disruption for our customers. We want to have zero downtime or as little as possible. With production systems that handle financial transactions, this is the right approach.
After a ton of Google searches, I narrowed down our approach. We are going to use PostgreSQL logical replication tool called pglogical. This was my first time using this tool, but I know the theory enough to be able to wing it. We’ll improvise if we’re stuck. Since replication won’t impact the production data, it’s relatively safe.
After migrating 2-3 database instances, I was able to cut down the time from 2 weeks for 1 database, to 1 day for a database.
Here are the general steps.
Create a new database in the target instance
Cloud SQL has this easy menu on their web console to create a new database in an existing instance.
Create a replication user with the right credentials
CREATE USER replication_user WITH REPLICATION IN ROLE cloudsqlsuperuser LOGIN PASSWORD '<replication_user_password>';
Create the same database users in the target instance
This allows the app to connect to the database correctly without changing credentials. Cloud SQL also has this easy menu on their web console.
Export data from the source database to the target database
Again, Cloud SQL is nice in that it has an export and import button right up top.
Enable the pglogical plugin on all Cloud SQL instances
In Cloud SQL, I enabled the following database flags. Cloud SQL web console provides the value range as hints, I just used the minimum value (see utilization rate above for why I chose this). This needs a database instance reboot, so this is one of the downtimes that I had. I repeated this for all instances so I can control the downtime.
cloudsql.enable_pglogical cloudsql.logical_decoding max_replication_slots max_worker_processes max_wal_senders
Create pglogical extension
Do this as the
CREATE EXTENSION pglogical;
By this point, I have a copy of the
source_db in the
target_db. However, as new data is inserted into the
target_db starts lagging behind. Especially on tables that are write-heavy. I listed all my write-heavy tables in my notes for future use.
Create 2 pglogical nodes
One in the
source_db, and another in the
target_db. Use the internal IP address of the Cloud SQL instance to make sure they can connect to each other.
-- on source_db SELECT pglogical.create_node( node_name := 'provider', dsn := 'host=<source_db_ip_address> port=5432 dbname=<database_name> user=replication_user password=<source_db_replication_user_password>' ); -- on target_db SELECT pglogical.create_node( node_name := 'subscriber', dsn := 'host=<target_db_ip_address> port=5432 dbname=<database_name> user=replication_user password=<target_db_replication_user_password>' );
On the source_db, add all tables and sequences to the replica set
In my case, since I have everything on the
default schema, I use the following query.
SELECT pglogical.replication_set_add_all_tables('default', ARRAY['public']); SELECT pglogical.replication_set_add_all_sequences(set_name := 'default', schema_names := ARRAY['public'], synchronize_data := true); SELECT pglogical.synchronize_sequence( seqoid ) FROM pglogical.sequence_state;
target_db, create a pglogical subscription
This is one of the most complicated parts. Not because it’s hard, but it’s dependent on how the database is set up. Even with only 2 users, a few tables, and a few sequences in Postgres, it was a lot of trial and error for me.
-- on target_db SELECT pglogical.create_subscription( subscription_name := '<subscription_name>', provider_dsn := 'host=<source_db_ip_address> port=5432 dbname=<database_name> user=replication_user password=<source_db_replication_user_password>' );
Depending on the outcome, I checked whether your subscription was replicating or down.
-- on target_db SELECT * FROM pglogical.show_subscription_status('<subscription_name>');
Additionally, I often had to resynchronize the subscription, using this query:
SELECT * FROM pglogical.alter_subscription_synchronize('<subscription_name>'); SELECT * FROM pglogical.wait_for_subscription_sync_complete('<subscription_name>');
There are a few common issues that I encountered:
- Permission issue
For this, there are a couple of things I check. Basically, I want to make sure my
source_db_replication_userhas read privilege on all tables and sequences. Additionally,
target_db_replication_userhas write privilege on all tables and sequences. I checked the
pglogical.queuetable and the error log for hints on which table I need to grant privilege to.
- Duplicate key issue
This happens when my
target_dbcan't insert new data because the primary key (id) already exists. My solution was to truncate the table and resynchronize.
TRUNCATE <table_name> RESTART IDENTITY, recreate the subscription, then
SELECT * FROM pglogical.alter_subscription_resynchronize_table(<subscription_name>, <table_name>)
- Table out-of-sync issue
You can use the same
pglogical.alter_subscription_resynchronize_tableas above. I do this for write-heavy tables that I noted above.
- Connection timeout issue
For this, I changed the database instance flag from the Cloud SQL web console and set
wal_sender_timeoutto 5 minutes.
Make sure I have all tables have the same row count, and that the most recent rows are replicated. Additionally, validate sequences, indexes, views, and functions (if my database has them).
By the end of this step, I had 2 databases in 2 separate Cloud SQL instances, replicating.
After everything is set up and the pglogical subscription is replicating, it’s time to update the app to use the
Since we use Kubernetes, this step is simple. I changed the configuration of the app to point to the
target_db IP address. Then, I initiated a rolling restart. This ensures all ongoing requests are completed, and new requests go to new pods. The new pods will connect to
target_db. Like all things migration, the best practice is to choose the time when traffic is low.
As with all config changes, it is wise to observe error logs of the impacted systems. I found a couple of issues in my switchovers:
- Database permission issue.
Different from the replication permission issue above, my app does not use the replication user’s credentials. My app user didn’t have permission to a couple of tables in
target_db. I solved it by granting the appropriate permissions (read and write). A quick Google search gave me the Postgres command to do it.
- Duplicate keys error.
This is an issue with the Postgres sequence. In my case, my sequence didn’t have the right next value, so inserts were failing because new data didn’t have a unique id. I fixed it with a simple Postgres query
SELECT setval('table_name_id_seq', (SELECT MAX(id) FROM table_name)+1);
By the end of this step, I observed that the
source_db has little to no traffic. I observed the following metrics: transaction/sec, disk read, disk write, and connections.
First, on the
target_db, drop the pglogical subscription and node.
SELECT * FROM pglogical.drop_subscription('<subscription_name>') SELECT * FROM pglogical.drop_node('subscriber');
Then, I just stop and delete the
source_db instance. This is the most satisfying, yet scary part. I validated that all the data are copied over, and services are working fine with the new database. Yet, it is still scary to delete a database instance. I stopped the instance, and delete the instance in the GCP console.
I observed the billing usage a couple of days after to see the satisfying drop in daily cost.
I go back to step 1, and repeat several more times, for all our migration targets.
Going from 9 instances with read replicas to a single instance is a significant cost savings. I choose high availability so that GCP hosts it in multiple zones. This is still a production system, and we need the availability.
With this approach, I managed to mitigate the downtime to a single database restart per instance. This is because the PostgreSQL instance requires a restart to enable the pglogical plugin.
Besides cost savings, I learned a lot about cloud infrastructure, Cloud SQL, and PostgreSQL. Their characteristics, limitation, and constraints are interesting. I learned more about capacity planning and the tradeoff between scale and cost.