Migrating: Cassandra
Rationale
Cassandra is a distributed NoSQL database. It runs on many production systems, and sooner or later a cluster may need to be migrated to different or bigger machines, or even to a different cloud or datacenter. Fortunately, Cassandra's distributed architecture makes such a migration possible with almost no downtime.
Let’s start
For this test we have two clusters configured and running the same C* version. It is important that both clusters are configured with the same cluster name but with different datacenter names.
Clusters
Existing cluster:
cluster name: Cluster1
datacenter: DC1
rack: RAC1
nodes: 3
replication: 3
ips: 10.10.10.{1,2,3}
seeds: 10.10.10.1,10.10.10.3
New cluster:
cluster name: Cluster1 (must match the existing cluster)
datacenter: DC2
rack: RAC1
nodes: 3
replication: 3
ips: 10.11.10.{1,2,3}
seeds: 10.11.10.1,10.11.10.3
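For the new nodes to come up under DC2, the snitch must report the right datacenter and rack. Assuming GossipingPropertyFileSnitch (the usual choice for multi-datacenter setups), each new node's cassandra-rackdc.properties would look like this:

```
dc=DC2
rack=RAC1
```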
Check replication and datacenter details for all keyspaces.
cqlsh> SELECT * FROM system.schema_keyspaces ;
keyspace_name | durable_writes | strategy_class | strategy_options
---------------+----------------+------------------------------------------------------+------------------
Keyspace1 | True | org.apache.cassandra.locator.NetworkTopologyStrategy | {"DC1":"3"}
dse_system | True | org.apache.cassandra.locator.EverywhereStrategy | {}
Keyspace2 | True | org.apache.cassandra.locator.NetworkTopologyStrategy | {"DC1":"3"}
Keyspace3 | True | org.apache.cassandra.locator.NetworkTopologyStrategy | {"DC1":"3"}
system | True | org.apache.cassandra.locator.LocalStrategy | {}
dse_perf | True | org.apache.cassandra.locator.NetworkTopologyStrategy | {"DC1":"3"}
system_traces | True | org.apache.cassandra.locator.NetworkTopologyStrategy | {"DC1":"3"}
Keyspace4 | True | org.apache.cassandra.locator.NetworkTopologyStrategy | {"DC1":"3"}
Keyspace5 | True | org.apache.cassandra.locator.NetworkTopologyStrategy | {"DC1":"3"}
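Note that system.schema_keyspaces only exists up to Cassandra 2.x (and the DSE releases based on it); on Cassandra 3.0 and later the same information lives in system_schema.keyspaces:

```sql
SELECT keyspace_name, replication FROM system_schema.keyspaces;
```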
If a keyspace is not yet using NetworkTopologyStrategy with per-datacenter replication, alter its replication settings as follows:
ALTER KEYSPACE Keyspace1 WITH replication = {'class': 'NetworkTopologyStrategy', 'DC1': '3'} AND durable_writes = true;
ALTER KEYSPACE Keyspace2 WITH replication = {'class': 'NetworkTopologyStrategy', 'DC1': '3'} AND durable_writes = true;
ALTER KEYSPACE Keyspace3 WITH replication = {'class': 'NetworkTopologyStrategy', 'DC1': '3'} AND durable_writes = true;
ALTER KEYSPACE Keyspace4 WITH replication = {'class': 'NetworkTopologyStrategy', 'DC1': '3'} AND durable_writes = true;
ALTER KEYSPACE Keyspace5 WITH replication = {'class': 'NetworkTopologyStrategy', 'DC1': '3'} AND durable_writes = true;
ALTER KEYSPACE system_traces WITH replication = {'class': 'NetworkTopologyStrategy', 'DC1': '3'} AND durable_writes = true;
ALTER KEYSPACE dse_perf WITH replication = {'class': 'NetworkTopologyStrategy', 'DC1': '3'} AND durable_writes = true;
ALTER KEYSPACE dse_system WITH replication = {'class': 'NetworkTopologyStrategy', 'DC1': '3'} AND durable_writes = true;
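The per-keyspace statements above can also be generated with a small loop, which is less error-prone when there are many keyspaces. A sketch: the keyspace list is hard-coded here and would normally be pulled from cqlsh, and the output should be reviewed before piping it into cqlsh.

```shell
# Emit an ALTER KEYSPACE statement per keyspace; the list is illustrative.
gen_alters() {
  for ks in Keyspace1 Keyspace2 Keyspace3 Keyspace4 Keyspace5 system_traces dse_perf dse_system; do
    printf "ALTER KEYSPACE %s WITH replication = {'class': 'NetworkTopologyStrategy', 'DC1': '3'} AND durable_writes = true;\n" "$ks"
  done
}

gen_alters
```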
Expand Cluster
Start each new node and wait for it to join the cluster under the new DC2 datacenter.
root@new-dse-0:~# systemctl start dse
Sometimes the systemctl timeout is shorter than the time Cassandra needs to come online. The start may then look like it failed when in reality the command just gave up waiting while the node started successfully. Be sure to check the logs, and also run systemctl start dse again so systemd knows that the service is running.
Some funny comments from other engineers who had problems with that.
Job for dse.service failed because the control process exited with error code. See "systemctl status dse.service" and "journalctl -xe" for details.
This prick of a service says it failes and gives me this message while in reality it succeeds....this is very scumbag
and on journalctl
Oct 23 08:18:25 new-dse-0 dse[11273]: WARNING: Timed out while waiting for DSE to start. This does not mean that the service failed to
Oct 23 08:18:25 new-dse-0 dse[11273]: ...fail!
it waits for the node to bootstrap and it timeouts...waiting because it takes a long time...but it succeeds...and everytime I get a heartattack
Once the new nodes have joined, fix the dse_system keyspace replication settings. Like all schema changes this ALTER is cluster-wide, so it only needs to run once.
ALTER KEYSPACE dse_system WITH replication = {'class': 'NetworkTopologyStrategy', 'DC1': '3'} AND durable_writes = true;
Check the cluster status with nodetool status to see that all nodes have joined under the DC2 datacenter.
root@new-dse-0:# nodetool status
Datacenter: DC1
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns (effective) Host ID Rack
UN 10.10.10.1 1.93 MB 256 100.0% fd3661db-03de-4c3e-889a-bdf29d8b15d9 RAC1
UN 10.10.10.2 1.94 MB 256 100.0% 1e0f18f8-970q-4559-a291-1d0e1035a4b3 RAC1
UN 10.10.10.3 1.94 MB 256 100.0% 4e0c18a8-370d-4559-a291-1d0e1035a4b3 RAC1
Datacenter: DC2
================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns (effective) Host ID Rack
UN 10.11.10.1 240.02 KB 256 0.0% ef90a431-5f75-4f46-b47a-c1e58f73607f RAC1
UN 10.11.10.2 118.82 KB 256 0.0% a9941e4f-b16c-4293-bb35-a6c57218aa50 RAC1
UN 10.11.10.3 177.08 KB 256 0.0% f392d17b-cf30-47e8-a641-5a3d59935db4 RAC1
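To check at a glance that every node is Up/Normal, the status output can be summarized with awk. A sketch that counts UN lines per datacenter; the sample input is inlined here for illustration, and in real use nodetool status would be piped in directly.

```shell
# Count Up/Normal (UN) nodes per datacenter in `nodetool status` output.
count_un() {
  awk '/^Datacenter:/ { dc = $2 } /^UN / { up[dc]++ } END { for (d in up) print d, up[d] }'
}

# Inlined sample; in real use: nodetool status | count_un
count_un <<'EOF'
Datacenter: DC1
UN 10.10.10.1 1.93 MB 256 100.0% fd3661db-03de-4c3e-889a-bdf29d8b15d9 RAC1
UN 10.10.10.2 1.94 MB 256 100.0% 4e0c18a8-370d-4559-a291-1d0e1035a4b3 RAC1
Datacenter: DC2
UN 10.11.10.1 240.02 KB 256 0.0% ef90a431-5f75-4f46-b47a-c1e58f73607f RAC1
EOF
```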
Expand keyspaces
Expand keyspaces with 3 replicas for the new DC2 datacenter.
ALTER KEYSPACE keyspace1 WITH replication = {'class': 'NetworkTopologyStrategy', 'DC1': '3', 'DC2': '3'} AND durable_writes = true;
ALTER KEYSPACE keyspace2 WITH replication = {'class': 'NetworkTopologyStrategy', 'DC1': '3', 'DC2': '3'} AND durable_writes = true;
ALTER KEYSPACE keyspace3 WITH replication = {'class': 'NetworkTopologyStrategy', 'DC1': '3', 'DC2': '3'} AND durable_writes = true;
ALTER KEYSPACE keyspace4 WITH replication = {'class': 'NetworkTopologyStrategy', 'DC1': '3', 'DC2': '3'} AND durable_writes = true;
ALTER KEYSPACE keyspace5 WITH replication = {'class': 'NetworkTopologyStrategy', 'DC1': '3', 'DC2': '3'} AND durable_writes = true;
ALTER KEYSPACE system_traces WITH replication = {'class': 'NetworkTopologyStrategy', 'DC1': '3', 'DC2': '3'} AND durable_writes = true;
ALTER KEYSPACE dse_perf WITH replication = {'class': 'NetworkTopologyStrategy', 'DC1': '3', 'DC2': '3'} AND durable_writes = true;
ALTER KEYSPACE dse_system WITH replication = {'class': 'NetworkTopologyStrategy', 'DC1': '3', 'DC2': '3'} AND durable_writes = true;
Rebuild new nodes
Before we start rebuilding each new node, we might need to think about our inter-DC streaming limit. By default it should be low enough to cause no issues, but if time is of the essence and the source cluster is fast enough, we can bump these limits (in Mb/s) with the following commands.
Inter DC streaming.
nodetool setinterdcstreamthroughput 50
Inter node streaming.
nodetool setstreamthroughput 50
Now we rebuild each node. This may take a long time depending on throttling, load, and cluster capacity. Nodes added to a new datacenter do not stream existing data automatically (they join with auto_bootstrap effectively doing nothing for them, so they start out empty). Therefore, after joining the cluster and expanding the keyspaces over the two datacenters, we need to run nodetool rebuild on each new node, specifying the source datacenter.
root@new-dse-0:# nodetool rebuild -- DC1
This might take hours or days depending on cluster size and performance. I would suggest rebuilding one node at a time, and running the command inside screen or tmux.
We can monitor streaming with the following command on any of the nodes.
root@new-dse-0:~# nodetool netstats
Mode: NORMAL
Not sending any streams.
Read Repair Statistics:
Attempted: 23488668
Mismatch (Blocking): 0
Mismatch (Background): 78349
Pool Name Active Pending Completed Dropped
Commands n/a 0 342261203 9
Responses n/a 0 230261668 n/a
Repair
A good practice is to repair each new node after all nodes have been rebuilt. After repairing we can also run cleanup. Both commands operate on all keyspaces by default, or on a single keyspace if its name is given as an argument.
root@new-dse:# nodetool repair
root@new-dse:# nodetool cleanup
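Repairing keyspace-by-keyspace keeps each session smaller and easier to retry. A sketch with a hypothetical keyspace list; it defaults to a dry run that only prints the commands, so set NODETOOL=nodetool to execute them for real.

```shell
# Run repair then cleanup for each keyspace, stopping at the first failure.
# NODETOOL defaults to `echo` (dry run); override with NODETOOL=nodetool.
run_maintenance() {
  NODETOOL=${NODETOOL:-echo}
  for ks in keyspace1 keyspace2 keyspace3 keyspace4 keyspace5; do
    "$NODETOOL" repair "$ks" || return 1
    "$NODETOOL" cleanup "$ks" || return 1
  done
}

run_maintenance
```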
Shrink Keyspaces
After rebuilding and repairing have finished on all nodes, shrink the keyspaces to the new datacenter, so that no replicas are kept in the old DC1 any more.
ALTER KEYSPACE keyspace1 WITH replication = {'class': 'NetworkTopologyStrategy', 'DC2': '3'} AND durable_writes = true;
ALTER KEYSPACE keyspace2 WITH replication = {'class': 'NetworkTopologyStrategy', 'DC2': '3'} AND durable_writes = true;
ALTER KEYSPACE keyspace3 WITH replication = {'class': 'NetworkTopologyStrategy', 'DC2': '3'} AND durable_writes = true;
ALTER KEYSPACE keyspace4 WITH replication = {'class': 'NetworkTopologyStrategy', 'DC2': '3'} AND durable_writes = true;
ALTER KEYSPACE keyspace5 WITH replication = {'class': 'NetworkTopologyStrategy', 'DC2': '3'} AND durable_writes = true;
ALTER KEYSPACE system_traces WITH replication = {'class': 'NetworkTopologyStrategy', 'DC2': '3'} AND durable_writes = true;
ALTER KEYSPACE dse_perf WITH replication = {'class': 'NetworkTopologyStrategy', 'DC2': '3'} AND durable_writes = true;
ALTER KEYSPACE dse_system WITH replication = {'class': 'NetworkTopologyStrategy', 'DC2': '3'} AND durable_writes = true;
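Before touching the old nodes, it is worth double-checking that no keyspace still references DC1. A sketch that greps the replication settings; the sample input is inlined for illustration, and in real use the data would come from cqlsh (on Cassandra 3.0+, e.g. cqlsh -e "SELECT keyspace_name, replication FROM system_schema.keyspaces;").

```shell
# Fail if any keyspace's replication map still mentions DC1.
check_no_dc1() {
  if grep -q "DC1"; then
    echo "some keyspaces still replicate to DC1"
    return 1
  fi
  echo "no keyspace references DC1"
}

# Inlined sample; replace the here-doc with piped cqlsh output in real use.
check_no_dc1 <<'EOF'
keyspace1 | {'class': 'NetworkTopologyStrategy', 'DC2': '3'}
dse_system | {'class': 'NetworkTopologyStrategy', 'DC2': '3'}
EOF
```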
Shrink cluster
Decommission the old nodes one by one using the following command. Run it on every DC1 node.
root@old-dse-0# nodetool decommission
Watch them leave the cluster with nodetool status:
root@new-dse:# nodetool status
Datacenter: DC1
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns (effective) Host ID Rack
UL 10.10.10.1 9.26 MB 256 0.0% 8e0b18c8-970c-4559-a291-1d0e1035a4b3 RAC1
Datacenter: DC2
================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns (effective) Host ID Rack
UN 10.11.10.1 240.02 KB 256 0.0% ef90a431-5f75-4f46-b47a-c1e58f73607f RAC1
UN 10.11.10.2 118.82 KB 256 0.0% a9941e4f-b16c-4293-bb35-a6c57218aa50 RAC1
UN 10.11.10.3 177.08 KB 256 0.0% f392d17b-cf30-47e8-a641-5a3d59935db4 RAC1
General Notes and commands
How to generate load
cassandra-stress write -node 10.11.10.1
Some notes from forums on C*
First, if you run nodetool repair again and very little data is transferred (assuming all nodes have been up since the last time you ran), you know that the data is almost perfectly in sync. You can look at the logs to see numbers on how much data is transferred during this process.
Second, you can verify that all of the nodes are getting a similar number of writes by looking at the write counts with nodetool cfstats. Note that the write count value is reset each time Cassandra restarts, so if they weren't restarted around the same time, you'll have to see how quickly they are each increasing over time.
Last, if you just want to spot check a few recently updated values, you can try reading those values at consistency level ONE. If you always get the most up-to-date version of the data, you'll know that the replicas are likely in sync.
As a general note, replication is such an ingrained part of Cassandra that it's extremely unlikely to fail on its own without you noticing. Typically a node will be marked down shortly after problems start. Also, I'm assuming you're writing at consistency level ONE or ANY; with anything higher, you know for sure that both of the replicas have received the write.