This is actually a fascinatingly complex problem. Some notes about the article:
* The 20s delay before shutdown is called “lame duck mode.” As implemented it’s close to good, but not perfect.
* When in lame duck mode you should fail the pod’s health check. That way you don’t rely on the ALB controller to remove your pod. Your pod is still serving other requests, but gracefully asking everyone to forget about it.
* Make an effort to close http keep-alive connections. This is more important if you’re running another proxy that won’t listen to the health checks above (eg AWS -> Node -> kube-proxy -> pod). Note that you can only do that when a request comes in - but it’s as simple as a Connection: close header on the response (see the sketch after this list).
* On a fun note, the new-ish kubernetes graceful node shutdown feature won’t remove your pod readiness when shutting down.
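For the keep-alive point above, a minimal sketch of what that can look like, assuming a Go net/http service (the draining flag and handler names are illustrative, not anything from the comment itself):

    package main

    import (
        "net/http"
        "sync/atomic"
    )

    // draining is flipped to true when lame-duck mode starts
    // (e.g. on SIGTERM or a preStop-driven signal).
    var draining atomic.Bool

    // closeKeepAlives adds "Connection: close" to every response once the pod
    // is draining, so clients and intermediate proxies drop their keep-alive
    // connections and open the next request against a different backend.
    func closeKeepAlives(next http.Handler) http.Handler {
        return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
            if draining.Load() {
                w.Header().Set("Connection", "close")
            }
            next.ServeHTTP(w, r)
        })
    }

    func main() {
        handler := closeKeepAlives(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
            w.Write([]byte("ok"))
        }))
        http.ListenAndServe(":8080", handler)
    }

Go's http.Server also exposes SetKeepAlivesEnabled(false), which has much the same effect server-side.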
More likely they mean "readiness check" - this is the one that removes you from the Kubernetes load balancer service. Liveness check failing does indeed cause the container to restart.
Yes sorry for not qualifying - that’s right. IMO the liveness check is only rarely useful - but I've not really run any bleeding edge services on kube. I assume it’s more useful if you’re actually working on dangerous code - locking, threading, etc. I’ve mostly only run web apps.
> The truth is that although the AWS Load Balancer Controller is a fantastic piece of software, it is surprisingly tricky to roll out releases without downtime.
20 years ago we used simple bash scripts using curl to do REST calls to take one host out of our load balancers, then scp to the host and shut down the app gracefully, updated the app using scp again, then put it back into the load balancer after testing the host on its own. We had 4 or 5 scripts max, straightforward stuff.
They charge $$$ and you get downtime in this simple scenario?
I used to work in this world, too. What is described here about EKS/K8s sounds tricky but it is actually pretty simple and quite a lot more standardized than what we all used to do. You have two health checks and using those, the app has total control over whether it’s serving traffic or not and gives the scheduler clear guidance about whether or not to restart it. You build it once (20 loc maybe) and then all your apps work the same way. We just have this in our cookie cutter repo.
Does this or any of the strategies listed in the comments properly handle long-lived client connections? It's sufficient to wait for the LB to stop sending traffic when connections are 100s of ms or less, but when connections are minutes or even hours long it doesn't work out well.
Is there a slick strategy for this? Is it possible to have minutes long pre-stop hooks? Is the only option to give client connections an abandon ship message and kick them out hopefully fast enough?
The fact that the state of the art container orchestration system requires you to run a sleep command in order to not drop traffic on the floor is a travesty of system design.
We had perfectly good rolling deploys before k8s came on the scene, but k8s' insistence on a single-phase deployment process means we end up with this silly workaround.
I yelled into the void about this once and I was told that this was inevitable because it's an eventually consistent distributed system. I'm pretty sure it could still have had a 2 phase pod shutdown by encoding a timeout on the first stage. Sure, it would have made some internals require more complex state - but isn't that the point of k8s? Instead everyone has to rediscover the sleep hack over and over again.
In fairness to Kubernetes, this is partially due to AWS and how their ALB/NLB interact with Kubernetes. So, when Kubernetes starts to replace Pods, the Amazon ALB/NLB Controller starts reacting; however, it must make calls to the Amazon API and wait for the ALB/NLB to catch up with the changing state of the cluster. Kubernetes is not aware of this and continues on blindly. If the Ingress Controller were more integrated into the cluster, you wouldn't have this problem. We run Ingress-Nginx at work instead of ALB for this reason.
Thus, this entire system of "Mark me not ready, wait for ALB/NLB to realize I'm not ready and stop sending traffic, wait for that to finish, terminate and Kubernetes continues with rollout."
You would have the same problem if you just started up new images in an autoscaling group and randomly SSHed into the old ones to run "shutdown -h now". The ALB would be shocked by the sudden departure of the VMs and you would probably get traffic going to the old VMs until health checks caught up.
EDIT: Azure/GCP have same issue if you use their provided ALBs.
Nginx ingress has the same problem, it's just much faster at switching over when a pod is marked as unready because it's continuously watching the endpoints.
Kubernetes is missing a mechanism for load balancing services (like ingress, gateways) to ack pods being marked as not ready before the pod itself is terminated.
There are a few warts like this with the core/apps controllers. Nothing unfixable within the general k8s design, imho, but unfortunately most of the community has moved on to newer, shinier things.
It shouldn't. I've not had the braincells yet to fully internalize the entire article, but it seems like we go wrong about here:
> The AWS Load Balancer keeps sending new requests to the target for several seconds after the application is sent the termination signal!
And then concluded a wait is required…? Yes, traffic might not cease immediately, but you drain the connections to the load balancer, and then exit. A decent HTTP framework should be doing this by default on SIGTERM.
> I yelled into the void about this once and I was told that this was inevitable because it's an eventually consistent distributed system.
Yeah, I wouldn't agree with that either. A terminating pod is inherently "not ready", that not-ready state should cause the load balancer to remove it from rotation. Similarly, the pod itself can drain its connections to the load balancer. That could take time; there's always going to be some point at which you'd have to give up on a slowloris request.
The fundamental gap in my opinion, is that k8s has no mechanism (that I am aware of) to notify the load balancing mechanism (whether that's a service, ingress or gateway) that it intends to remove a node - and for the load balancer to confirm this is complete.
This is how all pre-k8s rolling deployment systems I've used have worked.
So instead we move the logic to the application, and put a sleep in the shutdown phase to account for the time it takes for the load balancer to process/acknowledge the shutdown and stop routing new traffic to that node.
K8s made simple things complicated, yet it doesn't have obvious safety (or sanity) mechanisms, making everyday life a PITA. I wonder why it was adopted so quickly despite its flaws, and the only thing coming to my mind is, like Java in the 90s: massive marketing and propaganda that it's "inevitable".
> put a sleep in the shutdown phase to account for the time it takes for the load balancer to process/acknowledge the shutdown and stop routing new traffic to that node.
Again, I don't see why the sleep is required. You're removed from the load balancer when the last connection from the LB closes.
Most http frameworks don't do this right. They typically wait until all known in-flight requests complete and then exit. That's usually too fast for a load balancer that's still sending new requests. Instead you should just wait 30 seconds or so while still accepting new requests and replying not ready to load balancer health checks, and then if you want to wait additional time for long running requests, you can. You can also send clients "connection: close" to convince them to reopen connections against different backends.
> That's usually too fast for a load balancer that's still sending new requests.
How?
A load balancer can't send a new request on a connection that doesn't exist. (Existing connections being gracefully torn down as requests conclude on them & as the underlying protocol permits.) If it cannot open a connection to the backend (the backend should not allow new connections when the drain starts) then by definition new requests cannot end up at the backend.
K8S is overrated, it's actually pretty terrible but everyone has been convinced it's the solution to all of their problems because it's slightly better than what we had 15 years ago (Ansible/Puppet/Bash/immutable deployments) at 10x the complexity. There are so many weird edge cases just waiting to completely ruin your day. Like subPath mounts. If you use subPath then changes to a ConfigMap don't get reflected into the container. The container doesn't get restarted either of course, so you have config drift built in, unless you install one of those weird hacky controllers that restarts pods for you.
I wouldn't throw away k8s just for subPath weirdness, but I hear your general point about complexity. But if you are throwing away Ansible and Puppet, what is your solution? Also I'm not entirely sure what you are getting at with bash (what does shell scripting have to do with it?) and immutable deployments.
That's only one example of K8s weirdness that can wake you up at 3am. How: a change is rolled out during business hours that modifies the service config inside a ConfigMap. The pod doesn't get notified and doesn't reload the change. The pod crashes at night, loads the new (bad/invalid) config, and takes down production. To add insult to injury, the engineers spend hours debugging the issue because it's completely unintuitive that CM changes are not reflected ONLY when using subPath.
That's totally valid. I understand the desire of k8s maintainers to prevent "cascading changes" from happening, but this one is a very reasonable feature they seem to not support.
There's a pretty common hack to make things restart on a config change: add a pod-template annotation containing the ConfigMap's hash, so a config change also changes the pod spec and triggers a rollout.
That's how I do it, with kustomize. Definitely confused me before I learned that, but hasn't been an issue for years. And if you don't use kustomize, you just do... what was it, kubectl rollout? Add that to the end of your deploy script and you're good.
We’re using Argo rollouts without issue. It’s a superset of a Deployment with configuration-based blue green deploy or canary. Works great for us and allows us to get around the problem laid out in this article.
Argo Rollouts is an extra orchestration layer on top of a traffic management provider. Which one are you using? If you use the ALB controller you still have to deal with pod shutdown / target deregistration timing issues.
We’re using the alb controller to expose our kind: Rollouts. The blue green configuration has some sort of delay before cutting over which prevents any 5xx class errors due to target groups (at least for us)
We had to figure this out the hard way, and ended up with this approach (approximately).
K8S provides two (well three, now) health checks.
How this interacts with ALB is quite important.
Liveness should always return 200 OK unless you have hit some fatal condition where your container considers itself dead and wants to be restarted.
Readiness should only return 200 OK if you are ready to serve traffic.
We configure the ALB to only point to the readiness check.
So our application lifecycle looks like this:
* Container starts
* Application loads
* Liveness begins serving 200
* Some internal health checks run and set readiness state to True
* Readiness checks now return 200
* ALB checks begin passing and so pod is added to the target group
* Pod starts getting traffic.
Time passes. Eventually, for some reason, the pod needs to shut down.
* Kube calls the preStop hook
* PreStop sends SIGUSR1 to app and waits for N seconds.
* App handler for SIGUSR1 tells readiness hook to start failing.
* ALB health checks begin failing, and no new requests should be sent.
* ALB takes the pod out of the target group.
* PreStop hook finishes waiting and returns
* Kube sends SIGTERM
* App wraps up any remaining in-flight requests and shuts down.
This allows the app to shut down gracefully, and ensures the ALB doesn't send traffic to a pod that knows it is being shut down.
Oh, and on the Readiness check - your app can use this to (temporarily) signal that it is too busy to serve more traffic. Handy as another signal you can monitor for scaling.
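A minimal sketch of the application side of the lifecycle above, assuming a Go net/http service (the endpoint paths, port, and 30-second drain timeout are illustrative, not the poster's actual setup):

    package main

    import (
        "context"
        "net/http"
        "os"
        "os/signal"
        "sync/atomic"
        "syscall"
        "time"
    )

    func main() {
        var ready atomic.Bool // readiness state, flipped by startup checks and SIGUSR1

        mux := http.NewServeMux()

        // Liveness: 200 unless the process has hit a fatal condition (in which case it exits).
        mux.HandleFunc("/livez", func(w http.ResponseWriter, r *http.Request) {
            w.WriteHeader(http.StatusOK)
        })

        // Readiness: the endpoint the ALB target group health check points at.
        mux.HandleFunc("/readyz", func(w http.ResponseWriter, r *http.Request) {
            if ready.Load() {
                w.WriteHeader(http.StatusOK)
                return
            }
            w.WriteHeader(http.StatusServiceUnavailable)
        })

        srv := &http.Server{Addr: ":8080", Handler: mux}

        // SIGUSR1 (sent by the preStop hook): start failing readiness so the ALB drains us,
        // while continuing to serve traffic. SIGTERM (sent by kubelet after preStop returns):
        // finish in-flight requests and exit.
        sigs := make(chan os.Signal, 2)
        signal.Notify(sigs, syscall.SIGUSR1, syscall.SIGTERM)
        drained := make(chan struct{})
        go func() {
            for s := range sigs {
                if s == syscall.SIGUSR1 {
                    ready.Store(false) // ALB health checks begin failing
                    continue
                }
                // SIGTERM: stop accepting new connections, wait for in-flight requests.
                ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
                srv.Shutdown(ctx)
                cancel()
                close(drained)
                return
            }
        }()

        // ... run internal startup checks here, then:
        ready.Store(true)
        srv.ListenAndServe() // returns as soon as Shutdown is called
        <-drained            // wait for the graceful drain to complete
    }

The preStop hook then only needs to send SIGUSR1 to the process and sleep long enough for the ALB health checks to fail and the target to be deregistered before Kubernetes delivers SIGTERM.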
A lot of this seems like the fault of the ALB, is it? I had the same problem and eventually moved off of it to cloudflare tunnels pointed at service load balancers directly, which changed immediately when pods went bad. With a grace period for normal shutdowns, I haven't seen any downtime for deploys or errors.
The issue with the above setup (maybe I'm doing it wrong?) is that if a pod is removed suddenly, say if it crashes, then some portion of traffic gets errors until the ALB updates. And that can be an agonizingly long time, which seemed to be because it's pointed at IP addresses in the cluster and not the service. It seemed like a shortcoming of the ALB. GKE doesn't have the same behavior.
I'm not the expert but found something that worked.
> A lot of this seems like the fault of the ALB, is it?
I definitely think the ALB Controller should be taking a more active hand in termination of pods that are targets of an ALB.
But the ALB Controller is exhibiting the same symptom I keep running into throughout Kubernetes.
The amount of "X is a problem because the pod dies too quickly before Y has a chance to clean up/whatever, so we add a preStop sleep of 30 seconds" in the Kubernetes world is truly frustrating.
Kubernetes was written by people who have a developer, not ops, background and is full of things like this. The fact that it became a standard is a disaster.
Maybe, or maybe orchestration and load balancing is hard. I think it's too simplistic to dismiss k8s development because the devs weren't ops.
I don't know of a tool that does a significantly better job at this without having other drawbacks and gotchas, and even if it did it doesn't void the value k8s brings.
I have my own set of gripes with software production engineering in general and especially with k8s, having seen first-hand how much effort big corps have to put in just to manage a cluster, but it's disrespectful to qualify this whole endeavour as disastrous.
If you are referring to the 30-second kill time, that would be holding it wrong. As long as your process is PID 1, you can rig up your own process exit handlers, which completely resolves the problem.
Many people don’t run the main process in the container as PID 1, so this “problem” remains.
If it’s not feasible to remove something like a shell process from being the first thing that runs, exec will allow replacing the shell process with the application process.
> If you are referring to the 30-second kill time, that would be holding it wrong. As long as your process is PID 1, you can rig up your own process exit handlers, which completely resolves the problem.
Maybe I am holding it wrong. I'd love not to have to do this work.
But I don't see how being PID 1 or not helps (and yes, for most workloads it is PID 1)
The ALB controller is the one that would need to deregister a target from the target group, and it won't until the pod is gone. So we have to force it by having the app do the functional equivalent with the readiness check.
Yeah, exactly. We just catch the TERM, clean up, and then shut down. But the rest of the top post in the thread is right on.
If I understand correctly, because ALB does its own health checks, you need to catch TERM, wait 30s while returning non-ready for ALB to have time to notice, then clean up and shut down.
> A lot of this seems like the fault of the ALB, is it?
People forget to enable pod readiness gates.
Pod Readiness Gates, unless I'm missing something, only help on startup.
Unless something has changed since I last went digging into this. You will still have the ALB sending traffic to a pod that's in terminating state, unless you do the preStop bits I talked about in the top of the thread.
https://kubernetes-sigs.github.io/aws-load-balancer-controll...
> Pod Readiness Gates, unless I'm missing something, only help on startup.
Also allows graceful rollout of workload.
> You will still have the ALB sending traffic to a pod that's in terminating state
The controller watches endpoints and will remove your pod from the target group on pod deletion.
You don't need the preStop scam as long as your workload respects SIGTERM and does lame-duck.
> You don't need the preStop scam as long as your workload respects SIGTERM and does lame-duck.
Calling it a scam is a bit much.
I think having to put the logic of how the load balancer works into the application is a crossing of concerns. This kind of orchestration does not belong in the app, it belongs in the supporting infrastructure.
The app should not need to know how the load balancer works with regards to scheduling.
The ALB Controller should be doing this. It does not, and so we use preStop until/unless the ALB controller figures it out.
Yes, the app needs to listen for SIGTERM and wait until its outstanding requests are completed before exiting - but not more than that.
Just curious:
- so if a pod goes to the terminating state
- with gates enabled, the alb controller should remove it from targets instantly because it listens to the k8s API pod changes stream?
In my experience there was ALWAYS some delay, even a small one, which in high-frequency systems caused 500s.
Which we solved with an internal API gateway; aws+iptables+cni was always causing issues in every setup without it.
Istio automates this (at the risk of adding more complexity)
Or nginx. In both cases it’s probably more expensive than an ALB but you have better integration with the app side, plus traffic mesh benefits if you’re using istio. The caveat is that you are managing your own public-facing nodes.
Why the additional SIGUSR1 vs just doing those (failing health, sleeping) on SIGTERM?
Presumably, because it'd be annoying waiting for lame duck mode when you actually do want the application to terminate quickly. SIGKILL usually needs special privileges/root and doesn't give the application any time to clean-up/flush/etc. The other workaround I've seen is having the application clean-up immediately upon a second signal, which I reckon could also work, but either solution seems reasonable.
Yeah, there were a bunch of reasons.
Using SIGTERM is a problem because it conflicts with other behavior.
For instance, if you use SIGTERM for this then you have a potential for the app quitting during the preStop, which will be detected as a crash by Kube and so restart your app.
> which will be detected as a crash by Kube and so restart your app.
I don't think kubernetes restarts pods that have been marked for termination
We have a number of concurrent issues.
We don't want to kill in-flight requests - terminating while a request is outstanding will result in clients connected to the ALB getting some HTTP 5xx response.
The AWS ALB Controller inside Kubernetes doesn't give us a nice way to specifically say "deregister this target"
The ALB will continue to send us traffic while we return 'healthy' to its health checks.
So we need some way to signal the application to stop serving 'healthy' responses to the ALB Health Checks, which will force the ALB to mark us as unhealthy in the target group and stop sending us traffic.
SIGUSR1 was an otherwise unused signal that we can send to the application without impacting how other signals might be handled.
So I might be putting words in your mouth, so please correct me if this is wrong. It seems like you don’t actually control the SIGTERM handler code. Otherwise you could just write something like:
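(The snippet seems to have been lost in formatting. A rough sketch of the kind of handler presumably meant, assuming a Go net/http service where srv is the *http.Server and ready is the atomic flag behind the readiness endpoint the ALB polls; imports would be context, net/http, os, os/signal, sync/atomic, syscall, and time:)

    // On SIGTERM: fail the readiness check, give the ALB time to notice and
    // deregister the target, then shut the server down.
    func shutdownOnTerm(srv *http.Server, ready *atomic.Bool) {
        term := make(chan os.Signal, 1)
        signal.Notify(term, syscall.SIGTERM)
        <-term
        ready.Store(false)                 // readiness endpoint starts returning 503
        time.Sleep(30 * time.Second)       // ALB health checks fail; target drains
        srv.Shutdown(context.Background()) // in-flight requests are allowed to finish
    }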
Technically the server shutdown at the end doesn’t even need to be graceful in this case.
Curious, which framework are you using? I've had no issues with NodeJS, Go, and Rust apps directly behind ALB with IP-Target.
I don't think it matters the framework, it's an issue with the ALB controller itself, not the application.
The ALB controller doesn't handle gracefully stopping traffic (by ensuring target group de-registration is complete) before allowing the pod to terminate.
Without a preStop, Kube immediately sends SIGTERM to your application.
> App handler for SIGUSR1 tells readiness hook to start failing.
Doesn't the kubernetes pod shutdown already mark the pod as not-ready before it calls the pre-stop hook?
Racing against an ASG/ALB combo is always a horrifying adrenaline rush.
Nobody should be using ASGs anymore. EKS Auto Mode or Karpenter.
I know this won't be helpful to folks committed to EKS, but AWS ECS (i.e. running Docker containers with AWS controlling the orchestration) does a really great job on this; we've been running ECS for years (at multiple companies), and basically no hiccups.
One of my former co-workers went to a K8S shop, and longs for the simplicity of ECS.
No software is a panacea, but ECS seems to be one of those "it just works" technologies.
I agree that ECS works great for stateless containerized workloads. But you will need other AWS-managed services for state (RDS), caching (ElastiCache), and queueing (SQS).
So your application is now suddenly spread across multiple services, and you'll need an IaC tool like Terraform, etc.
The beauty (and the main reason we use K8s) is that everything is inside our cluster. We use cloudnative-pg, Redis pods, and RabbitMQ if needed, so everything is maintained in a GitOps project, and we have no IaC management overhead.
(We do manually provision S3 buckets for backups and object storage, though.)
Mentioning “no IaC management overhead” is weird. If you’re not using IaC, you’re doing it wrong.
However, GitOps is IaC, just by another name, so you actually do have IaC “overhead”.
Exactly, not only because Flux/ArgoCD are inherently some sort of IaC themselves, but also because on top of those tools you’ll need to have Terraform to manage the K8s cluster itself as a good practice.
Many companies run k8s for compute and use rds/sqs/redis outside of it. For example RDS is not just hosted PG, it has a whole bunch of features that don’t come out of the box (you do pay for it, I’m not giving an opinion as to whether it’s worth the price)
Yup. We do that.
Anything stateful is not allowed inside the cluster. PVs are annoying enough without having to manage a DB bolted onto what was originally designed for stateless web services.
My db is in my cluster. It's been stable for years but I'm afraid to touch it. There's a long outstanding issue in k8s that makes PVs harder to resize than it should be. And they're just more complicated. Trying to move to managed MySQL now. It'll cost me a bunch more but at least I get a failover node which I don't know how to set up myself... Still no master-master though, apparently that's not an option.
Actually, resizing a PVC depends on the CSI driver. Some support easy resizing, some require the volume to be detached. You can double check your CSI driver and might just need to patch your storage class.
Agreed, running databases without operators that can handle replication, master promotion, backups, and PIT restore is super scary. Most of the modern operators support all of these operations.
Yea, RDS makes your life easy: notifications and easy application of security patches at both the OS and DB level (minor version upgrades), easy upgrade of major versions, easy upgrade of storage, RAM, and compute (but not so easy to downgrade), easy options for replication, Blue/Green deployments, etc., to name a few.
In everything you've listed, my conclusion is the opposite. The spread across multiple managed services is not a bad thing, that's actually better considering that using them reduces operational overhead. That is, the spread is irrelevant if the services are managed.
The ugliness of k8s is that you're bringing your points of failure together into one, mega point of failure and complexity.
Final aside - you absolutely should be using IaC for any serious deployments. If you're using clickops or CLI then the context of the discussion is different and the same criteria do not apply.
Yes, but our GitOps repository heavily utilizes Kustomize and Flux, allowing us to reuse a significant amount of code across multiple deployment stages and clusters, which has proven to be very effective.
We have worked with Terraform modules before, but they quickly became difficult to manage.
Additionally, deployments to ECS are typically handled by invoking the AWS API within a GitHub Action, without continuous reconciliation or drift detection.
> Additionally, deployments to ECS are typically handled by invoking the AWS API within a GitHub Action, without continuous reconciliation or drift detection.
No they aren’t. All of the major IaC solutions (TF, CDK, etc) do ECS deployments directly through their own API, including with drift detection and updates.
Good for you for finding something that works, but it sounds like your advice related to IaC solutions is based on a misunderstanding of the benefits of IaC and the tools available.
You’ve replaced IaC overhead with k8s overhead
How do you run all this on developer's machine?
You make a great point that when everything is on kube it’s easier to manage.
But… if you are maintaining storage buckets and stuff elsewhere (to avoid accidental deletion etc, a worthy cause) then you are using terraform regardless. So adding RDS etc to the mix is not as tough as you make it sound.
I see both sides of the fence and both have their pros and cons.
If you have great operational experience with kube though I’d go all in on that. AWS bends you over with management fees… it’s far more affordable to run a DB, RMQ, etc on your own versus RDS, AMQ
AWS Controller for Kubernetes (ACK)[1] provides resources for creating S3 buckets as CR. Also in combination with Pod Identities there is no need for tf.
[1]https://aws-controllers-k8s.github.io/community/docs/user-do...
> One of my former co-workers went to a K8S shop, and longs for the simplicity of ECS.
I was using K8s previously, and I’m currently using ECS in my current team, and I hate it. I would _much_ rather have K8s back. The UX is all over the place, none of my normal tooling works, deployment configs are so much worse than the K8s equivalent.
I think like a lot of things, once you’re used to having the knobs of k8s and its DX, you’ll want them always. But a lot of teams adopt k8s because they need a containerized service in AWS, and have no real opinions about how, and in those cases ECS is almost always easier (even with all its quirks).
(And it’s free, if you don’t mind the mild lock-in).
Checks out - I was reading the rest of this and thought "geez, I use ECS and it's nowhere near as complicated as this". Glad I wasn't missing anything.
I've never used Kubernetes myself, but ECS seems to "just work" for my use case of run a simple web app with autoscaling and no downtime.
Completely agree, unless you are operating a platform for others to deploy to, ECS is a lot simpler, and works really well for a lot of common setups.
If you're on GCP, Google Cloud Run also "just works" quite well, too.
Amazing product, doesn’t get nearly the attention it deserves. ECS is a hot spaghetti mess in comparison.
We've been moving away from K8S to ECS...it just works without all the complexity.
I run https://BareMetalSavings.com.
The number of companies that use K8s when they have no business nor technological justification for it is staggering. It is the number one blocker in moving to bare metal/on prem when costs become too much.
Yes, on prem has its gotchas just like the EKS deployment described in the post, but everything is so much simpler and more straightforward that it's much easier to grasp the on-prem side of things.
I've come at this from a slightly different angle. I've seen many clients running k8s on expensive cloud instances, but to me that is solving the same problems twice. Both k8s and cloud instances solve a highly related and overlapping set of problems.
Instead you can take k8s, deploy it to bare metal, and have much, much more power for a much lower cost. Of course this requires some technical knowledge, but the benefits are significant (lower costs, stable costs, no vendor lock-in, all the postgres extensions you want, response times halved, etc).
k8s smoothes over the vagaries of bare-metal very nicely.
If you'll excuse a quick plug for my work: We [1] offer a middle ground for this, whereby we do and manage all this for you. We take over all DevOps and infrastructure responsibility while also cutting spend by around 50%. (cloud hardware really is that expensive in comparison).
[1]: https://lithus.eu
> Instead you can take k8s, deploy it to bare metal, and have much, much more power for a much lower cost. Of course this requires some technical knowledge, but the benefits are significant (lower costs, stable costs, no vendor lock-in, all the postgres extensions you want, response times halved, etc).
> all the postgres extensions you want
You can run Postgres in any managed K8s environment (say AWS EKS) just fine and enable any extensions you want as well. Unless you're conflating managed Postgres solutions like RDS, which would imply that the only way to run databases is by using a managed service of your cloud of choice, which obviously isn't true.
> You can run Postgres in any managed K8s environment (say AWS EKS) just fine and enable any extensions you want as well.
You absolutely can do this, and we do indeed run Postgres in-cluster.
We generally see that people prefer a managed solution when it comes to operating their databases. Which means that when it comes to their (eg) AWS EKS clusters, they often use RDS rather than running the DB in-cluster.
Our service is also a managed service, and that comes with in-cluster databases. So clients still get a managed service, but without the limitations of (eg) RDS.
Could you expand a bit on the point of K8S being a blocker to moving to on-prem?
Naively, I would think it would be neutral, since I would assume that if a customer gets k8s running on-prem, then apps designed for running in k8s should have a straightforward migration path?
I can expand a little bit, but based on your question, I suspect you may know everything I'm going to type.
In cloud environments, it's pretty common that your cloud provider has specific implementations of Kubernetes objects, either by creating custom resources that you can make use of, or just building opinionated default instances of things like storage classes, load balancers, etc.
It's pretty easy to not think about the implementation details of, say, an object-storage-backed PVC until you need to do it in a K8s instance that doesn't already have your desired storage class. Then you've got to figure out how to map your simple-but-custom $thing from provider-managed to platform-managed. If you're moving into Rancher, for instance, it's relatively batteries-included, but there are definitely considerations you need to make for things like how machines are built from a disk storage perspective and where Longhorn drives are mapped.
It's like that for a ton of stuff, and a whole lot of the Kubernetes/OutsideInfra interface is like that. Networking, storage, maybe even certificate management, those all need considerations if you're migrating from cloud to on-prem.
I think K8S distributions like K3S make this way simpler. If you’re wanting to run distributed object storage on bare metal then you’re in for a lot of complexity, with or without k8s.
I’ve run 3-server k3s instances on bare metal and they work very well with little maintenance. I didn’t do anything special, and while it’s more complex than some ansible scripts and haproxy, I think the breadth of tooling makes it worth it.
I ran K3S locally during the pandemic and the only issue at the time was getting PV/PVC provisioned cleanly; I think Longhorn was just reaching maturity, and five years ago the docs were pretty sparse. But yeah, k3s is a dream to work with in 2025: the docs are great, and as long as you stay on the happy path and your network is set up, it's about as effortless as cluster computing can get.
I've been running one for a couple years now, and even in that short of time Longhorn has made huge leaps in maturity. It was/is definitely the weakest link.
Cost wise it's a no brainer. Three servers with 64 GB ECC and 6 cores for the price of three M5 larges. So 192 GB and 18 cores for the price of 24GB and 6 cores.
I think one reason k8s can get a bad rap is how expensive it is to even approach doing it right with cloud hosting, but to me it seems like a perfect use case for bare metal, where there is no built-in orchestration.
Here is your business justification: K8s / Helm charts have become the de-facto standard for packaging applications for on-premise deployments. If you choose any other deployment option on a setup/support contract, the supplier will likely charge you for additional hours.
This is also what we observe while building Distr. ISVs are in need of a container registry to hand over these images to their customers. Our container registry will be purpose-built for this use case.
> The number of companies that use K8s when they have no business nor technological justification for it is staggering.
I remember a guy I used to work with telling me he'd been at a consulting shop and they used Kubernetes for everything - including static marketing sites. I assume it was a combination of resume and bill padding.
I'm using k8s for my static marketing site. It's in the same cluster as my app tho, so I'm not paying extra for it. Don't think I'd do it otherwise.
Oh agreed - that makes sense.
This guy told me it was just shameless over-engineering.
Out of interest do you recommend any good places to host a machine in the US? A major part of why I like cloud is because it really simplifies the hardware maintenance.
I'm running kubernetes on digital ocean. It was under $100/mo until last week when I upgraded a couple nodes because memory was getting a bit tight. That was just a couple clicks so not a big deal. We've been with them over 10 years now. Mostly pretty happy. They've had a couple small outages.
Talos for on prem k8s is dead simple
A few years ago, while helping build a platform on Google Cloud & GKE for a client, we found the same issues.
At that point we already had a CRD used by most of our tenant apps, which deployed an opinionated (but generally flexible enough) full app stack (Deployment, Service, PodMonitor, many sane defaults for affinity/anti-affinity, and other things, lots of which are configurable).
Because we didn't have an opinion on what tenant apps would use in their containers, we needed a way to make the pre-stop sleep small but OS-agnostic.
We ended up with a 1 LOC (plus headers) C app that compiled to a tiny static binary. This was put in a ConfigMap, which the controller mounted on the Pod, from where it could be executed natively.
Perhaps not the most elegant solution, but a simple enough one that got the job done and was left alone with zero required maintenance for years - it might still be there to this day. It was quite fun to watch the reaction of new platform engineers the first time they'd come across it in the codebase. :D
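For the curious, the shape of it was roughly this - a sketch from memory, with every name made up and the actual base64 payload elided:

    # Rough sketch (names hypothetical, base64 payload elided): a tiny static
    # binary shipped in a ConfigMap and mounted as an executable file that the
    # preStop hook can run, regardless of what's in the container image.
    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: prestop-sleep
    binaryData:
      prestop-sleep: "<base64-of-the-static-binary>"
    ---
    apiVersion: v1
    kind: Pod
    metadata:
      name: tenant-app
    spec:
      containers:
        - name: app
          image: tenant-app:latest
          volumeMounts:
            - name: prestop-sleep
              mountPath: /opt/prestop
          lifecycle:
            preStop:
              exec:
                command: ["/opt/prestop/prestop-sleep"]
      volumes:
        - name: prestop-sleep
          configMap:
            name: prestop-sleep
            defaultMode: 0755   # mount the file as executable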
An executable in a ConfigMap? That's interesting.
I realized somewhat recently I could put my Nginx and PHP ini in a ConfigMap, and that seems to work OK. Even that seems a bit dirty though; doesn't it base64 it and save it with all the other YAML configs? It doesn't seem like it's made for files.
> doesn't it base64 it and save it with all the other yaml configs
It does for binary content (that goes in binaryData, base64-encoded); plain text lives in data as-is. Either way it's mountable in the filesystem: the key becomes the filename and the (decoded) value becomes the file contents.
> Even that seems a bit dirty though
As I mentioned in the previous comment, "Perhaps not the most elegant solution" :D
It's been maintenance-free for years though, and since its introduction there were 0 rollout-related 502s.
Yeah, it's been working for me too! Feels weird but if it works it works I guess
I'm not sure why they state "although the AWS Load Balancer Controller is a fantastic piece of software, it is surprisingly tricky to roll out releases without downtime."
The AWS Load Balancer Controller uses readiness gates by default, exactly as described in the article. Am I missing something?
Edit: Ah, it's not by default, it requires a label in the namespace. I'd forgotten about this. To be fair though, the AWS docs tell you to add this label.
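For anyone else who forgot: if I remember right, the namespace label in question looks roughly like this (namespace name is just an example):

    apiVersion: v1
    kind: Namespace
    metadata:
      name: my-app
      labels:
        # Opts pods in this namespace into readiness gate injection by the
        # AWS Load Balancer Controller.
        elbv2.k8s.aws/pod-readiness-gate-inject: enabled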
Yes, that is what we thought as well, but it turns out there is still a gap: the pod can already be terminated before the load balancer controller has actually taken the target offline. We did some benchmarks to highlight that gap.
You mean the problem you describe in "Part 3" of the article?
Damn it, now you've made me paranoid. I'll have to check the ELB logs for 502 errors during our deployment windows.
Exactly! We initially received some Sentry errors that piqued our curiosity.
I think the "label (edit: annotation) based configuration" has got to be my least favorite thing about the k8s ecosystem. They're super magic, completely undiscoverable outside the documentation, not typed, not validated (for mutually exclusive options), and rely on introspecting the cluster and so aren't part of the k8s solver.
AWS uses them for all of their integrations and they're never not annoying.
I think you mean annotations. Labels and annotations are different things. And btw, annotations can be validated and typed, via validating webhooks.
This is actually a fascinatingly complex problem. Some notes about the article:
* The 20s delay before shutdown is called "lame duck mode." As implemented it's close to good, but not perfect.
* When in lame duck mode you should fail the pod's health check. That way you don't rely on the ALB controller to remove your pod. Your pod is still serving other requests, but gracefully asking everyone to forget about it.
* Make an effort to close HTTP keep-alive connections. This is more important if you're running another proxy that won't listen to the health checks above (eg AWS -> Node -> kube-proxy -> pod). Note that you can only do that when a request comes in - but it's as simple as a Connection: close header on the response.
* On a fun note, the new-ish Kubernetes graceful node shutdown feature won't remove your pod readiness when shutting down.
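A rough sketch of the lame-duck idea, assuming a Go HTTP server; the endpoint path, port, and the 20s/30s windows are purely illustrative:

    // Sketch only: fail readiness during lame duck mode, nudge keep-alive
    // clients off with "Connection: close", then drain. Timings are arbitrary.
    package main

    import (
        "context"
        "net/http"
        "os"
        "os/signal"
        "sync/atomic"
        "syscall"
        "time"
    )

    var lameDuck atomic.Bool

    func main() {
        mux := http.NewServeMux()

        // Readiness: return 503 once we enter lame duck mode so every load
        // balancer watching this check stops sending us new work.
        mux.HandleFunc("/readyz", func(w http.ResponseWriter, r *http.Request) {
            if lameDuck.Load() {
                http.Error(w, "draining", http.StatusServiceUnavailable)
                return
            }
            w.WriteHeader(http.StatusOK)
        })

        mux.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
            // Proxies/clients that ignore readiness still get told to drop
            // the keep-alive connection once we are draining.
            if lameDuck.Load() {
                w.Header().Set("Connection", "close")
            }
            w.Write([]byte("hello"))
        })

        srv := &http.Server{Addr: ":8080", Handler: mux}

        done := make(chan struct{})
        go func() {
            sig := make(chan os.Signal, 1)
            signal.Notify(sig, syscall.SIGTERM)
            <-sig

            lameDuck.Store(true)         // enter lame duck: not ready, still serving
            time.Sleep(20 * time.Second) // give load balancers time to notice

            ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
            defer cancel()
            srv.Shutdown(ctx) // stop accepting, drain in-flight requests
            close(done)
        }()

        srv.ListenAndServe() // returns ErrServerClosed once Shutdown is called
        <-done               // wait for the drain to finish before exiting
    }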
With "health" I presume you mean the readiness check, right? Otherwise it will kill the container when the liveness check fails.
By health check, do you mean the kubernetes liveness check? Does that make kube try to kill or restart your container?
More likely they mean "readiness check" - this is the one that removes you from the Kubernetes load balancer service. Liveness check failing does indeed cause the container to restart.
Yes, sorry for not qualifying - that's right. IMO the liveness check is only rarely useful - but I've not really run any bleeding-edge services on kube. I assume it's more useful if you're actually working on dangerous code - locking, threading, etc. I've mostly only run web apps.
liveness is great for java apps that spend all their time fencing locks. I've seen too many completely deadlock.
> The truth is that although the AWS Load Balancer Controller is a fantastic piece of software, it is surprisingly tricky to roll out releases without downtime.
20 years ago we used simple bash scripts with curl to make REST calls to take one host out of our load balancers, then ssh'd to the host and shut down the app gracefully, updated the app using scp, then put it back into the load balancer after testing the host on its own. We had 4 or 5 scripts max, straightforward stuff.
They charge $$$ and you get downtime in this simple scenario?
I used to work in this world, too. What is described here about EKS/K8s sounds tricky but it is actually pretty simple and quite a lot more standardized than what we all used to do. You have two health checks and using those, the app has total control over whether it’s serving traffic or not and gives the scheduler clear guidance about whether or not to restart it. You build it once (20 loc maybe) and then all your apps work the same way. We just have this in our cookie cutter repo.
Does this or any of the strategies listed in the comments properly handle long-lived client connections? It's sufficient to wait for the LB to stop sending traffic when connections last hundreds of milliseconds or less, but when connections are minutes or even hours long it doesn't work out well.
Is there a slick strategy for this? Is it possible to have minutes-long pre-stop hooks? Is the only option to give client connections an abandon-ship message and kick them out, hopefully fast enough?
Might be noteworthy that in recent enough k8s, lifecycle.preStop.sleep.seconds is implemented (https://github.com/kubernetes/enhancements/blob/master/keps/...), so there's no longer any need to run an external sleep command.
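For reference, something along these lines (the duration is arbitrary; older clusters still need the exec form):

    # Container spec fragment: built-in sleep action vs. the old exec workaround.
    containers:
      - name: app
        image: my-app:latest   # hypothetical image
        lifecycle:
          preStop:
            sleep:
              seconds: 20
            # Older clusters (or setups without the feature enabled) still need:
            # exec:
            #   command: ["sleep", "20"]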
The fact that the state of the art container orchestration system requires you to run a sleep command in order to not drop traffic on the floor is a travesty of system design.
We had perfectly good rolling deploys before k8s came on the scene, but k8s insistence on a single-phase deployment process means we end up with this silly workaround.
I yelled into the void about this once and I was told that this was inevitable because it's an eventually consistent distributed system. I'm pretty sure it could still have had a 2 phase pod shutdown by encoding a timeout on the first stage. Sure, it would have made some internals require more complex state - but isn't that the point of k8s? Instead everyone has to rediscover the sleep hack over and over again.
In fairness to Kubernetes, this is partially due to AWS and how their ALB/NLB interact with Kubernetes. When Kubernetes starts to replace Pods, the AWS ALB/NLB controller starts reacting; however, it must make calls to the Amazon API and wait for the ALB/NLB to catch up with the changing state of the cluster. Kubernetes is not aware of this and continues on blindly. If the ingress controller were more integrated into the cluster, you wouldn't have this problem. We run ingress-nginx at work instead of the ALB for this reason.
Thus this entire dance of "mark me not ready, wait for the ALB/NLB to realize I'm not ready and stop sending traffic, wait for that to finish, then terminate, and Kubernetes continues with the rollout."
You would have the same problem if you just started up new images in an autoscaling group and randomly SSH'd into old ones to run "shutdown -h now". The ALB would be shocked by the sudden departure of VMs and you would probably get traffic going to old VMs until health checks caught up.
EDIT: Azure/GCP have the same issue if you use their provided load balancers.
Nginx ingress has the same problem; it's just much faster at switching over when a pod is marked as unready, because it's continuously watching the endpoints.
Kubernetes is missing a mechanism for load balancing services (like ingress, gateways) to ack pods being marked as not ready before the pod itself is terminated.
There are a few warts like this with the core/apps controllers. Nothing unfixable within the general k8s design IMHO, but unfortunately most of the community has moved on to newer, shinier things.
It shouldn't. I've not had the braincells yet to fully internalize the entire article, but it seems like we go wrong about here:
> The AWS Load Balancer keeps sending new requests to the target for several seconds after the application is sent the termination signal!
And then concluded a wait is required…? Yes, traffic might not cease immediately, but you drain the connections to the load balancer, and then exit. A decent HTTP framework should be doing this by default on SIGTERM.
> I yelled into the void about this once and I was told that this was inevitable because it's an eventually consistent distributed system.
Yeah, I wouldn't agree with that either. A terminating pod is inherently "not ready", that not-ready state should cause the load balancer to remove it from rotation. Similarly, the pod itself can drain its connections to the load balancer. That could take time; there's always going to be some point at which you'd have to give up on a slowloris request.
The fundamental gap in my opinion, is that k8s has no mechanism (that I am aware of) to notify the load balancing mechanism (whether that's a service, ingress or gateway) that it intends to remove a node - and for the load balancer to confirm this is complete.
This is how all pre-k8s rolling deployment systems I've used have worked.
So instead we move the logic to the application, and put a sleep in the shutdown phase to account for the time it takes for the load balancer to process/acknowledge the shutdown and stop routing new traffic to that node.
K8s made simple things complicated, yet it doesn't have obvious safety (or sanity) mechanisms, making everyday life a PITA. I wonder why it was adopted so quickly despite its flaws, and the only thing that comes to mind is, like Java in the 90s: massive marketing and propaganda that it's "inevitable".
> put a sleep in the shutdown phase to account for the time it takes for the load balancer to process/acknowledge the shutdown and stop routing new traffic to that node.
Again, I don't see why the sleep is required. You're removed from the load balancer when the last connection from the LB closes.
That’s how you’d expect it to work, but that’s not how pod deletion works.
The pod delete event is sent out, and the load balancer and the pod itself both receive and react to it at the same time.
So unless the LB switchover is very quick, or the pod shutdown is slow - you get dropped requests - usually 502s.
Try googling for graceful k8s deploys and every article will say you have to put a preStop sleep in
Most http frameworks don't do this right. They typically wait until all known in-flight requests complete and then exit. That's usually too fast for a load balancer that's still sending new requests. Instead you should just wait 30 seconds or so while still accepting new requests and replying not ready to load balancer health checks, and then if you want to wait additional time for long running requests, you can. You can also send clients "connection: close" to convince them to reopen connections against different backends.
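On the probe side, the numbers have to line up with that in-app wait - a sketch with example values, where the wait should exceed periodSeconds * failureThreshold plus whatever the external load balancer needs to deregister the target:

    # Pod spec fragment (values are examples, not recommendations).
    containers:
      - name: app
        image: my-app:latest
        readinessProbe:
          httpGet:
            path: /readyz
            port: 8080
          periodSeconds: 5
          failureThreshold: 2     # ~10s for kube to mark the pod unready
    terminationGracePeriodSeconds: 60   # must cover the wait plus request draining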
> That's usually too fast for a load balancer that's still sending new requests.
How?
A load balancer can't send a new request on a connection that doesn't exist. (Existing connections being gracefully torn down as requests conclude on them & as the underlying protocol permits.) If it cannot open a connection to the backend (the backend should not allow new connections when the drain starts) then by definition new requests cannot end up at the backend.
K8S is overrated, it's actually pretty terrible but everyone has been convinced it's the solution to all of their problems because it's slightly better than what we had 15 years ago (Ansible/Puppet/Bash/immutable deployments) at 10x the complexity. There are so many weird edge cases just waiting to completely ruin your day. Like subPath mounts. If you use subPath then changes to a ConfigMap don't get reflected into the container. The container doesn't get restarted either of course, so you have config drift built in, unless you install one of those weird hacky controllers that restarts pods for you.
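To make the subPath gotcha concrete, a sketch with hypothetical names; a file mounted this way is fixed at pod start and never sees later ConfigMap edits, while a whole-ConfigMap volume mount does (eventually) pick them up:

    apiVersion: v1
    kind: Pod
    metadata:
      name: example
    spec:
      containers:
        - name: app
          image: my-app:latest
          volumeMounts:
            # subPath: the file is projected once at pod start; later ConfigMap
            # changes are NOT propagated into the container.
            - name: app-config
              mountPath: /etc/app/app.conf
              subPath: app.conf
            # A whole-volume mount (no subPath) would eventually reflect updates.
      volumes:
        - name: app-config
          configMap:
            name: app-config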
It's not slightly better, it's way better than Ansible/Puppet/Bash/immutable deployments, because everything follows the same pattern and is standard.
You get observability pretty much for free; solutions from 15 years ago were crap, remember Nagios and the like?
Old solutions would put trash all over the disk in /etc/. How many times did we have to ssh in to fix/repair stuff?
All the health check / load balancer stuff is also much better handled on Kubernetes.
I wouldn't throw away k8s just for subPath weirdness, but I hear your general point about complexity. But if you are throwing away Ansible and Puppet, what is your solution? Also I'm not entirely sure what you are getting at with bash (what does shell scripting have to do with it?) and immutable deployments.
That's only one example of K8s weirdness that can wake you up at 3am. How: a change is rolled out during business hours that modifies the service config inside a ConfigMap. The pod doesn't get notified and doesn't reload the change. The pod crashes at night, loads the new (bad/invalid) config, and takes down production. To add insult to injury, the engineers spend hours debugging the issue because it's completely unintuitive that ConfigMap changes fail to propagate ONLY when you're using subPath.
That's totally valid. I understand the desire of k8s maintainers to prevent "cascading changes" from happening, but this one is a very reasonable feature they seem to not support. There's a pretty common hack to make things restart on a config change by adding a pod annotation with the configmap hash:
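If you're templating with Helm, it's usually something along these lines (the template path is illustrative):

    # Deployment fragment: any change to the rendered ConfigMap changes the hash,
    # which changes the pod template, which forces a rollout.
    spec:
      template:
        metadata:
          annotations:
            checksum/config: {{ include (print $.Template.BasePath "/configmap.yaml") . | sha256sum }}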
But I agree that it shouldn't be needed. There should be built-in and sensible ways to notify of changes and react. This is an argument for 12-factor and env vars for config.
Also, Kustomize can help with some of this, since it rotates the names of ConfigMaps: when any change happens, you get a new ConfigMap and hence a new Deployment rollout.
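Roughly like this in kustomization.yaml (names are made up); the generated ConfigMap gets a content-hash suffix and references in the Deployment are rewritten to match:

    apiVersion: kustomize.config.k8s.io/v1beta1
    kind: Kustomization
    resources:
      - deployment.yaml
    configMapGenerator:
      - name: app-config       # becomes app-config-<hash> in the output
        files:
          - app.conf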
That's how I do it, with kustomize. Definitely confused me before I learned that, but hasn't been an issue for years. And if you don't use kustomize, you just do... what was it, kubectl rollout? Add that to the end of your deploy script and you're good.
I told you that I hear you on K8s complexity. But since you throw out Ansible/Puppet/etc., what technology are you advocating?
Nit: "How we archived" subheading should be "How we achieved".
Thanks, fixed
We’re using Argo Rollouts without issue. It’s a superset of a Deployment with configuration-based blue-green deploys or canaries. Works great for us and allows us to get around the problem laid out in this article.
Argo Rollouts is an extra orchestration layer on top of a traffic management provider. Which one are you using? If you use the ALB controller you still have to deal with pod shutdown / target deregistration timing issues.
https://argoproj.github.io/argo-rollouts/features/traffic-ma...
We’re using the ALB controller to expose our kind: Rollout resources. The blue-green configuration has some sort of delay before cutting over, which prevents any 5xx-class errors due to target groups (at least for us).
Highly recommend Porter if you are a startup that doesn't wanna think about things like this.
https://www.porter.run/
somewhat related https://architect.run/
> Seamless Migrations with Zero Downtime
(I don't work for them but they are friends ;))