The case for cloud development

I would like to present the design paradigm behind many of Kubernetes' algorithms, in the hope that it will be useful to other teams developing cloud-based software. While these are concepts I found in k8s, they apply in many other settings as well.

Chief among the architectural assumptions of k8s is that the hardware is not 100% reliable. One usually thinks of the network as the unreliable part, but in operation I have also seen hard disks, virtual machines, and even entire clusters fail, and from time to time whole regions. While any attempt to access a cloud resource can indeed end in failure, it is important to understand that these are mostly transient errors: eventually the resource will be restored to an operational state.

Kubernetes handles this in a very direct way: every action is expected to fail. Each object has a required specification, kept in a database (etcd), and then there is the actual state of the world. Every controller in the system reads the required spec and works to match the world to that spec. Once it has done so, it updates the status of the object, and it retries as often as it sees fit.

For this, Kubernetes makes intensive use of rate-limiting queues. Typically, a task is pushed onto a queue rather than executed on the spot; a separate worker pulls from the queue and executes it. It is important to note that this basic ritual is designed with failure in mind. The typical k8s queue does not provide just an atomic extraction; instead it relies on a trio of Get/Done/Forget. A successful operation on a work item goes as follows:


1. The controller retrieves an item from the queue via Get.
2. If the controller successfully executed the required processing on this work item, it calls Forget. This tells the queue that the item will not be retried at this stage.
3. When the controller finishes working with the item, it calls Done.

This works in conjunction with rate limiting, a concept designed to prevent infinite retries when a non-transient error occurs. The following interface declares this concept:

type RateLimitingInterface interface {
    DelayingInterface

    // AddRateLimited adds an item to the workqueue after the rate limiter says it's ok
    AddRateLimited(item interface{})

    // Forget indicates that an item is finished being retried.  Doesn't matter whether it's for perm failing
    // or for success, we'll stop the rate limiter from tracking it.  This only clears the `rateLimiter`, you
    // still have to call `Done` on the queue.
    Forget(item interface{})

    // NumRequeues returns back how many times the item was requeued
    NumRequeues(item interface{}) int
}


There are two options available to the controller:

1. Forget the item. This typically happens when the task succeeded, but there are other valid cases, including permanent failure and even programming errors in other code segments.
2. Retry the item, in the case of a transient error, via AddRateLimited. Internal mechanics of the queue ensure the item is returned after a delay, and this delay typically grows as the number of retries increases.

To conclude, the prevailing mindset in cloud computing is to code against unreliable infrastructure. Sophisticated technologies like k8s, and programming languages like Go, have evolved specifically to support this mantra, although not only for it. Any programmer working with cloud infrastructure should check whether the situation described above applies to the problem at hand.

I give below an example from operational code in the K8s Autoscaler: the DaemonSet controller's worker loop.

// processNextWorkItem deals with one key off the queue.  It returns false when it's time to quit.
func (dsc *DaemonSetsController) processNextWorkItem(ctx context.Context) bool {
    dsKey, quit := dsc.queue.Get()
    if quit {
        return false
    }
    defer dsc.queue.Done(dsKey)

    err := dsc.syncHandler(ctx, dsKey.(string))
    if err == nil {
        dsc.queue.Forget(dsKey)
        return true
    }

    utilruntime.HandleError(fmt.Errorf("%v failed with : %v", dsKey, err))
    dsc.queue.AddRateLimited(dsKey)

    return true
}
