Migrating: Cassandra

Rationale

Cassandra is a distributed NoSQL database. It runs on many production systems, and sooner or later a cluster may need to be migrated to different or bigger machines, or even to a different cloud or datacenter. Fortunately, Cassandra's distributed architecture makes such a migration possible with almost no downtime.

Let’s start

For this test we have two clusters configured, running the same C* version. It is important that both clusters are configured with the same cluster name but with different datacenter names.

Clusters

Existing cluster:

Cluster name: Cluster1
Datacenter: DC1
Rack: RAC1
Nodes: 3
Replication: 3
IPs: 10.10.10.{1,2,3}
Seeds: 10.10.10.1,10.10.10.3

New cluster:

Cluster name: Cluster1
Datacenter: DC2
Rack: RAC1
Nodes: 3
Replication: 3
IPs: 10.11.10.{1,2,3}
Seeds: 10.10.10.1,10.10.10.3
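The datacenter and rack names above are usually defined per node. Assuming the clusters use GossipingPropertyFileSnitch (the common choice for multi-DC setups), each node carries a small properties file; a sketch for a node in the new cluster, mirroring the values above:

```properties
# /etc/dse/cassandra/cassandra-rackdc.properties on a DC2 node
# (path varies by install; values must match the topology listed above)
dc=DC2
rack=RAC1
```

Every node in the new cluster gets dc=DC2; nodes in the existing cluster keep dc=DC1.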

Check the replication and datacenter details for all keyspaces. (On Cassandra 3.0+ this table moved to system_schema.keyspaces.)

cqlsh> SELECT * FROM system.schema_keyspaces;

 keyspace_name | durable_writes | strategy_class                                       | strategy_options
---------------+----------------+------------------------------------------------------+------------------
     Keyspace1 |           True | org.apache.cassandra.locator.NetworkTopologyStrategy |      {"DC1":"3"}
    dse_system |           True |      org.apache.cassandra.locator.EverywhereStrategy |               {}
     Keyspace2 |           True | org.apache.cassandra.locator.NetworkTopologyStrategy |      {"DC1":"3"}
     Keyspace3 |           True | org.apache.cassandra.locator.NetworkTopologyStrategy |      {"DC1":"3"}
        system |           True |           org.apache.cassandra.locator.LocalStrategy |               {}
      dse_perf |           True | org.apache.cassandra.locator.NetworkTopologyStrategy |      {"DC1":"3"}
 system_traces |           True | org.apache.cassandra.locator.NetworkTopologyStrategy |      {"DC1":"3"}
     Keyspace4 |           True | org.apache.cassandra.locator.NetworkTopologyStrategy |      {"DC1":"3"}
     Keyspace5 |           True | org.apache.cassandra.locator.NetworkTopologyStrategy |      {"DC1":"3"}
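Keyspaces still using SimpleStrategy are not datacenter-aware and will need fixing before the migration. One way to spot them is to filter the query output; a sketch using a copy of sample rows (on a live system, pipe the cqlsh output through the same grep/awk instead):

```shell
# Sample rows in the schema_keyspaces output format.
# On a live system: cqlsh -e "SELECT * FROM system.schema_keyspaces;" | grep ...
cat <<'EOF' > keyspaces.txt
 Keyspace1 | True | org.apache.cassandra.locator.SimpleStrategy          | {"replication_factor":"3"}
 Keyspace2 | True | org.apache.cassandra.locator.NetworkTopologyStrategy | {"DC1":"3"}
EOF

# Print the names of keyspaces that still use SimpleStrategy.
grep 'SimpleStrategy' keyspaces.txt | awk -F'|' '{gsub(/ /,"",$1); print $1}'
# → Keyspace1
```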

If a keyspace is still using SimpleStrategy (i.e. it is not datacenter-aware), alter its replication settings to NetworkTopologyStrategy as follows:

ALTER KEYSPACE Keyspace1 WITH replication = {'class': 'NetworkTopologyStrategy', 'DC1': '3'}  AND durable_writes = true;
ALTER KEYSPACE Keyspace2 WITH replication = {'class': 'NetworkTopologyStrategy', 'DC1': '3'}  AND durable_writes = true;
ALTER KEYSPACE Keyspace3 WITH replication = {'class': 'NetworkTopologyStrategy', 'DC1': '3'}  AND durable_writes = true;
ALTER KEYSPACE Keyspace4 WITH replication = {'class': 'NetworkTopologyStrategy', 'DC1': '3'}  AND durable_writes = true;
ALTER KEYSPACE Keyspace5 WITH replication = {'class': 'NetworkTopologyStrategy', 'DC1': '3'}  AND durable_writes = true;
ALTER KEYSPACE system_traces WITH replication = {'class': 'NetworkTopologyStrategy', 'DC1': '3'}  AND durable_writes = true;
ALTER KEYSPACE dse_perf WITH replication = {'class': 'NetworkTopologyStrategy', 'DC1': '3'}  AND durable_writes = true;
ALTER KEYSPACE dse_system WITH replication = {'class': 'NetworkTopologyStrategy', 'DC1': '3'}  AND durable_writes = true;
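With this many keyspaces, the ALTER statements can be generated from a list instead of typed by hand; a sketch that just prints them (review the output, then pipe it into cqlsh to apply):

```shell
# Generate the ALTER statements for all application and DSE keyspaces.
# Apply with:  ./generate-alters.sh | cqlsh <node-ip>
for ks in Keyspace1 Keyspace2 Keyspace3 Keyspace4 Keyspace5 system_traces dse_perf dse_system; do
  printf "ALTER KEYSPACE %s WITH replication = {'class': 'NetworkTopologyStrategy', 'DC1': '3'} AND durable_writes = true;\n" "$ks"
done
```

The same loop works later in the migration: change the replication map to add DC2, and again to drop DC1.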

Expand Cluster

Start each new node and wait for it to join the cluster under the new DC2 datacenter.

root@new-dse-0:~# systemctl start dse

Sometimes the systemctl timeout is shorter than the time Cassandra needs to come online. It can look like the start failed when in reality the command just gave up waiting while the node started successfully. Be sure to check the logs, and run systemctl start dse again so systemd knows the service is running.
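One way to avoid the false failure is to give the unit more time to start. Assuming systemd manages the dse service, a drop-in override along these lines raises the start timeout (the 10-minute value is an example, tune it to your bootstrap times):

```ini
# /etc/systemd/system/dse.service.d/timeout.conf
[Service]
TimeoutStartSec=600
```

After creating the drop-in, run systemctl daemon-reload so it takes effect.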

Some funny comments from other engineers who ran into this:

Job for dse.service failed because the control process exited with error code. See "systemctl status dse.service" and "journalctl -xe" for details.
This prick of a service says it failes and gives me this message while in reality it succeeds....this is very scumbag
and on journalctl
Oct 23 08:18:25 new-dse-0 dse[11273]: WARNING: Timed out while waiting for DSE to start. This does not mean that the service failed to
Oct 23 08:18:25 new-dse-0 dse[11273]:    ...fail!
it waits for the node to bootstrap and it timeouts...waiting because it takes a long time...but it succeeds...and everytime I get a heartattack

Fix the dse_system keyspace replication settings if this hasn't been done already. Keyspace changes are cluster-wide, so this only needs to run once, from any node.

ALTER KEYSPACE dse_system WITH replication = {'class': 'NetworkTopologyStrategy', 'DC1': '3'}  AND durable_writes = true;

Check the cluster status with nodetool status to verify that all nodes have joined under the DC2 datacenter.

root@new-dse-0:~# nodetool status
Datacenter: DC1
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address     Load       Tokens  Owns (effective)  Host ID                               Rack
UN  10.10.10.1  1.93 MB    256     100.0%            fd3661db-03de-4c3e-889a-bdf29d8b15d9  RAC1
UN  10.10.10.2  1.94 MB    256     100.0%            1e0f18f8-970c-4559-a291-1d0e1035a4b3  RAC1
UN  10.10.10.3  1.94 MB    256     100.0%            4e0c18a8-370d-4559-a291-1d0e1035a4b3  RAC1

Datacenter: DC2
================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address     Load       Tokens  Owns (effective)  Host ID                               Rack
UN  10.11.10.1  240.02 KB  256     0.0%              ef90a431-5f75-4f46-b47a-c1e58f73607f  RAC1
UN  10.11.10.2  118.82 KB  256     0.0%              a9941e4f-b16c-4293-bb35-a6c57218aa50  RAC1
UN  10.11.10.3  177.08 KB  256     0.0%              f392d17b-cf30-47e8-a641-5a3d59935db4  RAC1
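When scripting this check, the per-node state can be parsed out of the nodetool status output; a sketch using awk on a captured copy of sample output (on a live system, pipe nodetool status directly):

```shell
# Sample nodetool status node lines; pipe `nodetool status` instead on a live system.
cat <<'EOF' > status.txt
UN  10.11.10.1  240.02 KB  256     0.0%  ef90a431-5f75-4f46-b47a-c1e58f73607f  RAC1
UN  10.11.10.2  118.82 KB  256     0.0%  a9941e4f-b16c-4293-bb35-a6c57218aa50  RAC1
UJ  10.11.10.3  177.08 KB  256     0.0%  f392d17b-cf30-47e8-a641-5a3d59935db4  RAC1
EOF

# Node lines start with a two-letter state: U/D (Up/Down) + N/L/J/M (Normal/Leaving/Joining/Moving).
# Print any node that is not yet "UN" (Up/Normal).
awk '$1 ~ /^[UD][NLJM]$/ && $1 != "UN" {print $2, $1}' status.txt
# → 10.11.10.3 UJ
```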

Expand keyspaces

Expand keyspaces with 3 replicas for the new DC2 datacenter.

ALTER KEYSPACE keyspace1 WITH replication = {'class': 'NetworkTopologyStrategy', 'DC1': '3', 'DC2': '3'}  AND durable_writes = true;
ALTER KEYSPACE keyspace2 WITH replication = {'class': 'NetworkTopologyStrategy', 'DC1': '3', 'DC2': '3'}  AND durable_writes = true;
ALTER KEYSPACE keyspace3 WITH replication = {'class': 'NetworkTopologyStrategy', 'DC1': '3', 'DC2': '3'}  AND durable_writes = true;
ALTER KEYSPACE keyspace4 WITH replication = {'class': 'NetworkTopologyStrategy', 'DC1': '3', 'DC2': '3'}  AND durable_writes = true;
ALTER KEYSPACE keyspace5 WITH replication = {'class': 'NetworkTopologyStrategy', 'DC1': '3', 'DC2': '3'}  AND durable_writes = true;
ALTER KEYSPACE system_traces WITH replication = {'class': 'NetworkTopologyStrategy', 'DC1': '3', 'DC2': '3'}  AND durable_writes = true;
ALTER KEYSPACE dse_perf WITH replication = {'class': 'NetworkTopologyStrategy', 'DC1': '3', 'DC2': '3'}  AND durable_writes = true;
ALTER KEYSPACE dse_system WITH replication = {'class': 'NetworkTopologyStrategy', 'DC1': '3', 'DC2': '3'}  AND durable_writes = true;

Rebuild new nodes

Before we start rebuilding each new node, we might need to think about our inter-DC streaming limit. By default it should be low enough to cause no issues, but if time is of the essence and the source cluster is fast enough, we can raise the limit with the following commands (the value is in megabits per second):

Inter DC streaming.

nodetool setinterdcstreamthroughput 50 

Inter node streaming.

nodetool setstreamthroughput 50 

Now we rebuild each node. This may take a lot of time depending on throttling, load, and cluster capacity. Nodes joining a new datacenter do not stream existing data on their own (auto_bootstrap should be disabled for this procedure), so after joining the cluster and expanding the keyspaces to both datacenters, we need to run nodetool rebuild on each new node, pointing it at the source datacenter.

root@new-dse-0:~# nodetool rebuild -- DC1

This might take hours or days depending on cluster size and performance. I would suggest rebuilding one node at a time, and running the command inside screen or tmux.

We can monitor streaming with the following command on all nodes.

root@new-dse-0:~# nodetool netstats
Mode: NORMAL
Not sending any streams.
Read Repair Statistics:
Attempted: 23488668
Mismatch (Blocking): 0
Mismatch (Background): 78349
Pool Name                    Active   Pending      Completed   Dropped
Commands                        n/a         0      342261203         9
Responses                       n/a         0      230261668       n/a

Repair

A good practice is to repair each new node after all nodes have been rebuilt. After repairing we can also run cleanup. Both commands operate on all keyspaces by default, or on a single keyspace if its name is appended.

root@new-dse-0:~# nodetool repair
root@new-dse-0:~# nodetool cleanup

Shrink Keyspaces

After rebuilding and repairing have finished on all nodes, shrink the keyspaces to the new datacenter, so no writes happen on the old DC1 anymore.

ALTER KEYSPACE keyspace1 WITH replication = {'class': 'NetworkTopologyStrategy', 'DC2': '3'}  AND durable_writes = true;
ALTER KEYSPACE keyspace2 WITH replication = {'class': 'NetworkTopologyStrategy', 'DC2': '3'}  AND durable_writes = true;
ALTER KEYSPACE keyspace3 WITH replication = {'class': 'NetworkTopologyStrategy', 'DC2': '3'}  AND durable_writes = true;
ALTER KEYSPACE keyspace4 WITH replication = {'class': 'NetworkTopologyStrategy', 'DC2': '3'}  AND durable_writes = true;
ALTER KEYSPACE keyspace5 WITH replication = {'class': 'NetworkTopologyStrategy', 'DC2': '3'}  AND durable_writes = true;
ALTER KEYSPACE system_traces WITH replication = {'class': 'NetworkTopologyStrategy', 'DC2': '3'}  AND durable_writes = true;
ALTER KEYSPACE dse_perf WITH replication = {'class': 'NetworkTopologyStrategy', 'DC2': '3'}  AND durable_writes = true;
ALTER KEYSPACE dse_system WITH replication = {'class': 'NetworkTopologyStrategy', 'DC2': '3'}  AND durable_writes = true;

Shrink cluster

Decommission the old nodes one by one using the following command. Run it on every DC1 node.

root@old-dse-0:~# nodetool decommission

Watch them leave the cluster with nodetool status.

root@new-dse-0:~# nodetool status
Datacenter: DC1
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address     Load       Tokens  Owns (effective)  Host ID                               Rack
UL  10.10.10.1  9.26 MB    256     0.0%              8e0b18c8-970c-4559-a291-1d0e1035a4b3  RAC1
Datacenter: DC2
================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address     Load       Tokens  Owns (effective)  Host ID                               Rack
UN  10.11.10.1  240.02 KB  256     0.0%              ef90a431-5f75-4f46-b47a-c1e58f73607f  RAC1
UN  10.11.10.2  118.82 KB  256     0.0%              a9941e4f-b16c-4293-bb35-a6c57218aa50  RAC1
UN  10.11.10.3  177.08 KB  256     0.0%              f392d17b-cf30-47e8-a641-5a3d59935db4  RAC1

General Notes and commands

How to generate load

cassandra-stress write -node 10.11.10.1

Some notes from forums on C*

First, if you run nodetool repair again and very little data is transferred (assuming all nodes have been up since the last time you ran), you know that the data is almost perfectly in sync. You can look at the logs to see numbers on how much data is transferred during this process.

Second, you can verify that all of the nodes are getting a similar number of writes by looking at the write counts with nodetool cfstats. Note that the write count value is reset each time Cassandra restarts, so if they weren't restarted around the same time, you'll have to see how quickly they are each increasing over time.

Last, if you just want to spot check a few recently updated values, you can try reading those values at consistency level ONE. If you always get the most up-to-date version of the data, you'll know that the replicas are likely in sync.

As a general note, replication is such an ingrained part of Cassandra that it's extremely unlikely to fail on its own without you noticing. Typically a node will be marked down shortly after problems start. Also, I'm assuming you're writing at consistency level ONE or ANY; with anything higher, you know for sure that both of the replicas have received the write.

Fin