Simulating A Networking Failure
Using Kurtosis you can control which services can communicate inside and across subnetworks. Kurtosis provides arbitrarily fine grained control over which services can communicate with each other. This can be useful to simulate network faults, partition tolerance or even partial network outages, including package drops.
This guide illustrates the concept of Kurtosis subnetworks by simluating a partial networking failure within a distributed system.
Step 1: Start all services in an enclave
The example system is composed of a Cassandra cluster with 3 nodes, 2 redundant services serving the main application and a single port proxy to load balance the requests.
MAIN_SUBNETWORK = "main_subnetwork"
SECONDARY_SUBNETWORK = "secondary_subnetwork"
def run(args):
add_service(
service_name="cassandra_node_1",
config=ServiceConfig(
...
subnetwork=MAIN_SUBNETWORK,
),
)
add_service(
service_name="cassandra_node_2",
config=ServiceConfig(
...
subnetwork=MAIN_SUBNETWORK,
),
)
add_service(
service_name="cassandra_node_3",
config=ServiceConfig(
...
subnetwork=MAIN_SUBNETWORK,
),
)
add_service(
service_name="app_service_1",
config=ServiceConfig(
...
subnetwork=MAIN_SUBNETWORK,
),
)
add_service(
service_name="app_service_2",
config=ServiceConfig(
...
subnetwork=MAIN_SUBNETWORK,
),
)
add_service(
service_name="single_port_proxy",
config=ServiceConfig(
...
subnetwork=MAIN_SUBNETWORK,
),
)
When all the services are added to the Kurtosis enclave, it will look something like:
Kurtosis enclave
-----------------------------------------------------
| main_subnetwork |
|-----------------------------------------------------|
| |
| cassandra_node_1 |
| cassandra_node_2 |
| cassandra_node_3 |
| app_service_1 |
| app_service_2 |
| single_port_proxy |
| |
-----------------------------------------------------
At the start, all connections are allowed.
- all the Cassandra nodes can communicate with each other
- the two services can reach all Cassandra nodes
- the single port proxy is playing its role of dispatching requests to both
app_service_1
andapp_service_2
Setp 2: Add a second subnetwork by updating services
Subnetworks make it easy to simulate a network failure that completely isolates cassandra_node_2
and app_service_2
from the rest of the system.
First,cassandra_node_2
and app_service_2
need to be assigned to a distinct subnetwork, secondary_subnetwork
. This can be done by updating those services.
This can be done by "updating" those services.
MAIN_SUBNETWORK = "main_subnetwork"
SECONDARY_SUBNETWORK = "secondary_subnetwork"
def run(args):
# ...
# all the services are added using the code snippet above
# ...
update_service(
service_name="cassandra_node_2",
config=UpdateServiceConfig(
subnetwork=SECONDARY_SUBNETWORK,
),
)
update_service(
service_name="app_service_2",
config=UpdateServiceConfig(
subnetwork=SECONDARY_SUBNETWORK,
),
)
Kurtosis has the concept of a default connection between subnetworks. The default connection is the connection configuration that all pairs of subnetworks will inherit. At enclave creation, the default connection is set to allow communication between subnetworks. In this case, when the cassandra_node_2
and app_service_2
are assigned secondary_subnetwork
, nothing changes. main_subnetwork
and secondary_subnetwork
can continue to communicate because the default connection is unblocked.
Kurtosis enclave
-----------------------------------------------------
| main_subnetwork | secondary_subnetwork |
|-----------------------------------------------------|
| . |
| cassandra_node_1 . |
| . cassandra_node_2 |
| cassandra_node_3 . |
| app_service_1 . |
| . app_service_2 |
| single_port_proxy . |
| . |
-----------------------------------------------------
Setp 3: Configure the connection between the two subnetworks
The creation of main_subnetwork
and secondary_subnetwork
now makes it possible to reconfigure the connection between them. Communications between the two subnetworks can now be blocked partially or completely by overriding the default connection for those two specific subnetworks.
MAIN_SUBNETWORK = "main_subnetwork"
SECONDARY_SUBNETWORK = "secondary_subnetwork"
def run(args):
# ...
# all the services are added and `cassandra_node_2` and `app_service_2` are updated using the code snippet above
# ...
set_connection(
(MAIN_SUBNETWORK, SECONDARY_SUBNETWORK),
config=ConnectionConfig(
packet_loss_percentage=90.0, # 90% of the packet are dropped
),
)
The result is the following: cassandra_node_2
and app_service_2
are partially unreachable and the resilience of the system can be tested.
Kurtosis enclave
-----------------------------------------------------
| main_subnetwork | secondary_subnetwork |
|-------------------------|---------------------------|
| | |
| cassandra_node_1 | |
| | cassandra_node_2 |
| cassandra_node_3 | |
| app_service_1 | |
| | app_service_2 |
| single_port_proxy | |
| | |
-----------------------------------------------------