Managed services offer big benefits over software. With CF, new stacks, change sets, updates, rollbacks and drift detection are an API call away.
Managed service providers offer big benefits over software. With CF and AWS support, help with problems is a support ticket away.
Using a single cloud provider has big benefits over multi-cloud tooling. I only run workloads on AWS, so the CF syntax, specs and docs unlock endless first-party features. A portable Terraform + Kubernetes contraption is a lowest common denominator approach.
Of course everything depends.
I've configured literally 1000s of systems with CloudFormation with very few problems.
I have seen Terraform turn into a tire-fire of migrations from state files to Terraform enterprise to Atlantis that took an entire DevOps team to care for.
The main use case of Terraform is not portability. Have fun porting your SQS queue, DynamoDB table or VPC config to GCP or Azure equivalents. It won't look similar at all, except for the resource name. However, if you are only running containers and virtual machines, sure, you can benefit from portability.
CloudFormation lags so badly on new features that you end up with hacks or Lambda functions behind custom resources. DynamoDB global tables took 4-5 years to become available in CloudFormation.
I've also seen wrongly constructed CloudFormation delete critical databases, hang (often), time out, and leave rollbacks hanging unsuccessfully, so it's not always rainbows and sunshine there either. I also don't like Terraform for its excessive use of state files: keeping them on S3 and handling locks with DynamoDB.
I won't deny its good features (being managed is a huge plus), but it's slow and lagging behind, its YAML is verbose, and it has stack size limits; it's always a workaround with CloudFormation. My company uses it behind an abstraction for an internal PaaS and deployment automation, and it takes a long time for trivial changes to complete.
So in short, neither is perfect, but for me Terraform is easier to use, easier to debug, and faster, and its features don't lag nearly as much as CF's. Those are good enough reasons to avoid CloudFormation for me. I also don't like CDK because it's too verbose and it's still CF underneath; if I need more logic, I would rather generate Terraform JSON/HCL myself.
Terraform also helps when you need to configure multiple stacks. For a service, for example, you can have a module that reflects your organizational best practices: a Fargate service for running containers, automatic Datadog dashboards, CloudWatch alarms connected to Opsgenie/PagerDuty, and so on.
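A sketch of the kind of module call described above, with every name invented for illustration (not a real registry module):

```hcl
# Hypothetical internal module; source URL, variable names and resource
# references are all illustrative.
module "orders_service" {
  source = "git::https://example.com/modules/fargate-service"

  name            = "orders"
  container_image = "registry.example.com/orders:1.4.2"
  cpu             = 256
  memory          = 512

  # The same module can stamp out the surrounding observability pieces:
  create_datadog_dashboard = true
  alarm_sns_topic_arn      = aws_sns_topic.opsgenie.arn
}
```

One `module` block per service then carries the org's conventions along with it.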
> hang (often), time out, and leave rollbacks hanging unsuccessfully
The timeouts in CF are ridiculous, especially with app deployments. I can't remember which service it is, but something can wait up to 30 min on a failed `apply` and then wait the same on a failed revert. Only then can you actually deploy the next version (as long as it wasn't part of the first deploy; then you get to wait until everything gets deleted as well).
(yes, in many cases you can override the timeouts, but let's be honest - who remembers them all on the first run or ever?)
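For reference, the per-resource override alluded to here is a CreationPolicy signal timeout; a minimal sketch (other required AutoScalingGroup properties elided):

```yaml
Resources:
  WebAsg:
    Type: AWS::AutoScaling::AutoScalingGroup
    CreationPolicy:
      ResourceSignal:
        Count: 2
        Timeout: PT10M   # give up after 10 minutes instead of the long default
    Properties:
      MinSize: "2"
      MaxSize: "2"
      # launch template, subnets, etc. elided
```

There is also a stack-level cap: `aws cloudformation create-stack --timeout-in-minutes 15 ...` fails the whole create if it runs long.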
I've been using CF for a few years now with minimal complaints, but I just hit an endless create-changeset timeout (it took 2 days to finally time out).
The worst part is that there are no error messages. When it fails and I click "Details", it takes me to the stack page and shows 100% green. Support ticket seems slow to get a response too.
That aside, my overall experience has been positive!
Terraform has its fair share of lag as well. One particular case that irks me is that the "meta" field on Vault tokens is unsupported. Vault of course being another first-class Hashicorp product makes this particularly odd.
That being said the Vault provider is open source and it's quite easy to add it and roll your own.
If I'm remembering correctly, I'm pretty sure the Vault provider for Terraform was originally contributed by an outside company rather than inside HashiCorp. My guess would be that it has encountered what seems to be a common fate for software that doesn't cleanly fit into the org chart: HashiCorp can't decide whether the Vault team or the Terraform team ought to own it, and therefore effectively nobody owns it, and so it falls behind.
Pretty sure that's the same problem with CloudFormation, albeit at a different scale: these service teams are providing APIs for this stuff already, so do they really want to spend time writing another layer of API on top? Ideally yes, but when you have deadlines you've gotta prioritize.
I really don't get why features come so late to CloudFormation - I guess AWS doesn't use much CloudFormation internally then, but surely they're not stringing together AWS CLI calls? CDK is reasonably new too, unless they waited a long time to go public with it.
Many/most teams internally using AWS use CloudFormation; an AWS service I was a part of was almost entirely built on top of other AWS services, and the standard development mechanism is to use CFN to maintain your infra. You only do drastic things like "stringing CLI calls" if there's something missing from CFN and not coming out soon enough, in which case maybe someone writes a custom CFN resource and you run it in your service account.
Depending on how old the service is, the ownership of the CFN resource may be on the CFN service team (from back when they white-gloved everything) in which case there are massive development bottlenecks (there are campaigns to migrate these to service team ownership) or more often the resource is maintained by the service team itself, in which case the team may not be properly prioritizing CFN support. There can be a whole lot of crap to deal with for releasing a new CFN _resource_, though features within a resource are relatively easy.
On my last team, we did not consider an API feature delivered until it was available in CFN, and we typically turned it around within a couple of weeks of general API availability.
CDK is a higher-level (and awesome in my experience) way to just generate CloudFormation specs. In other words, you need both CloudFormation and CDK support for features to become available there.
In terms of getting new features fast CDK is strictly worse than CloudFormation.
Agree. CF is not a magic bullet, but neither is ansible or terraform.
We used ansible heavily with AWS for 2 years. Then we decided to gut it out and do CF directly. Why? If we want to switch clouds, it's not like the ansible or terraform modules are transferable ... So might as well go the native supported route.
I agree with the article, messages can be cryptic, but at the end of the day, I have a CF stack that represents an entity. I can blow away the stack, and if there's any failure or issue, I can escalate my permissions and kill it again. Still a problem? Then it's AWS's fault and a ticket away (though I've only had to do this once in 5 years and > 150,000 CF stacks).
I also would argue, if a stack deletion stalls development, you are probably using hard-coded stack names, which isn't wise. Throw in a "random" value like a commit or pipeline identifier.
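That advice can be as small as a helper like this (a hypothetical sketch, not any particular pipeline's code):

```python
import re
import uuid

def stack_name(prefix: str, build_id: str = "") -> str:
    """Build a unique stack name from a prefix plus a commit/pipeline id,
    falling back to a random suffix so repeated runs never collide."""
    suffix = build_id or uuid.uuid4().hex[:8]
    # Stack names allow letters, digits and hyphens, capped at 128 chars,
    # so squash anything else (e.g. slashes in branch names).
    return re.sub(r"[^A-Za-z0-9-]", "-", f"{prefix}-{suffix}")[:128]
```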
I've had far fewer issues with CF than with Terraform or Ansible. I have yet to see CF break backward compatibility, while I had a nightmare day when I couldn't run any playbooks in Ansible because a module had a new required parameter on a minor or patch version bump (which was when I called it quits on Ansible; I then relooked at Terraform and decided to go native).
I will caveat that our use case for AWS involves LOTS of creation and deletion, so I find it super helpful to manage my infrastructure in "stacks" that are created and deleted as a unit. I don't need to worry about partial creations or deletions, like ever. It basically never fails redoing known-working stuff, only the "first time", and usually because we follow least-privilege heavily.
Yes Ansible does have extensions and can be used to provision AWS services.
The approaches of CloudFormation/Terraform/Pulumi and Ansible are entirely different, though.
The former are declarative: they define how the end state should look. Ansible is a task runner: you define a set of tasks it needs to execute to get to the end state.
I strongly advise against using Ansible for provisioning resources. It's idempotent by convention only. When I reluctantly had to use it for jobs, it was extremely difficult to get a repeatable, deterministic environment set up. Each execution led to a different state and was just a nightmare to deal with.
CloudFormation/Terraform/Pulumi are much better in that regard, as they generate a graph of what the end state should be, check the current state, and generate an execution plan for how to make the current state look like the target state.
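A toy sketch of that plan step, to illustrate the idea only (none of these tools implement it this simply):

```python
def plan(current: dict, desired: dict) -> list:
    """Diff a map of existing resources against the desired end state and
    emit the create/update/delete actions needed to converge."""
    actions = []
    for name, props in desired.items():
        if name not in current:
            actions.append(("create", name))
        elif current[name] != props:
            actions.append(("update", name))
    # Anything present but no longer desired gets removed.
    for name in current:
        if name not in desired:
            actions.append(("delete", name))
    return actions
```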
Where Ansible is better than CloudFormation/Terraform/Pulumi is when you have a bunch of servers already set up and you want to change the software or configuration on them. Changing config and provisioning at runtime is a bit of an anti-pattern these days, though. You can shift that slightly and use Ansible with Packer to generate pre-baked images, which works reasonably well and plays to Ansible's strengths, if you don't mind lots of YAML; although these days most people don't pre-bake images, thanks to containerization. Also, if you are only using Ansible for provisioning config on a host, Nix achieves this much more elegantly and reliably.
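The pre-baked image pattern mentioned above looks roughly like this in Packer's HCL2 format (source and playbook names are made up):

```hcl
# Bake the Ansible run into the image once, instead of re-running it
# against live servers.
build {
  sources = ["source.amazon-ebs.base"]

  provisioner "ansible" {
    playbook_file = "./playbooks/web.yml"  # runs at bake time only
  }
}
```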
We already used ansible for other things, so it wasn't too hard to swap over to AWS modules... (Except they were inconsistent and poorly supported, we ultimately found out)
Someone at Hashicorp then convinced mgmt that terraform is almost a write-once system, and we could jump from AWS to Azure or GCP easily "just change the module!"... When actual engineers looked at it, after 3 days there was almost a mutiny and we rejected terraform mostly based on the fact someone lied to our managers to try and get us to adopt it... I know someone who is very happy with terraform nowadays, but that ship sailed for us.
Those were basically the only people in this space, so we started rewriting ansible to CloudFormation. Since we mostly use lambdas to trigger the creation of CF stacks, this really works well for us, since our lambdas exist for less than a second to execute, and then we can check in later to see if there's issues (which is less than 1 in 50,000? 100,000? in my experience... Except for core AWS outages which are irrespective of CF). Compared to our ansible (and limited terraform) setups which required us to run servers or ECS tasks to manage the deploy. We can currently auto-scale at lambda scale-up speed to create up to 30 stacks a second if demand surges (the stack might take 2-3 minutes to be ready, but it's now async). Under ansible/terraform we had to make more servers and worker nodes to watch the processes... And our deployment was .3/.4 stacks per minute per worker (and scaling up required us to make more workers before we could scale up for incoming requests)
If I was building today, I'd probably revisit terraform, but I think the cdk or CF are still what I'd recommend unless there's a need for more-than-AWS... E.g. multi-cloud deployments, or doing post-creation steps that can't be passed in by userdata / cloud-init.. in which case CF can't do the job alone and might not be the right tool.
I'm a big proponent of CF when you are using AWS, but if you are on GCP, don't even bother with their managed tool, just go straight to TF. Their Deployment Manager is very buggy (or at least it was 2 years ago).
CloudFormation/Terraform/etc are also configuration management programs. They just work on the APIs of cloud vendors, rather than a package management tool's command-line options. They've been given a new name because people want to believe they're not just re-inventing the wheel, or that operating on cloud resources makes their tool magically superior.
> We used ansible heavily with AWS for 2 years. Then we decided to gut it out and do CF directly. Why? If we want to switch clouds, it's not like the ansible or terraform modules are transferable ... So might as well go the native supported route.
>
> I agree with the article, messages can be cryptic, but at the end of the day, I have a CF stack that represents an entity. I can blow away the stack, and if there's any failure or issue, I can escalate my permissions and kill it again. Still a problem? Then it's AWS's fault and a ticket away (though I've only had to do this once in 5 years and > 150,000 CF stacks).
Another killer feature is StackSets. I managed to rewrite the Datadog integration CF template (their version required manual steps) into a template containing custom resources that make calls to DD to do the registration on their side.
I then deployed that template through StackSets and bam, every account in the specific OU automatically configures itself without any manual steps.
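The shape of that template, heavily simplified (this is not the vendor's actual template; resource and property names are illustrative):

```yaml
# RegisterFn (a Lambda) and IntegrationRole are assumed to be defined
# elsewhere in the same template.
Resources:
  RegisterAccount:
    Type: Custom::VendorIntegration
    Properties:
      ServiceToken: !GetAtt RegisterFn.Arn    # Lambda that calls the vendor API
      RoleArn: !GetAtt IntegrationRole.Arn    # role the vendor assumes
```

Deployed as a StackSet with auto-deployment on the OU, the custom resource fires once in each new account.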
> Managed services offer big benefits over software. With CF, new stacks, change sets, updates, rollbacks and drift detection are an API call away.
>
> Managed service providers offer big benefits over software. With CF and AWS support, help with problems is a support ticket away.
The problem is when those help tickets get responses like “try deleting everything by hand and see if it recreates without an error next time”. They've worked on CloudFormation over the last year or so, but everyone I've known who's switched to tools like Terraform did so after getting tired of unpredictable deployment times or hitting the many cases where CloudFormation gets itself into an irrecoverable state. I can count on no fingers the number of development teams who used CF and didn't ask for help recovering from an error state in CF which required out-of-band remediation.
I believe they've also gotten better at tracking new AWS features but there were multiple cases where using Terraform got you the ability to use a feature 6+ months ahead of CF.
> A portable Terraform + Kubernetes contraption is a lowest common denominator approach.
Terraform is much, much richer than CloudFormation so I'd compare it to CDK (with the usual aesthetic debate over declarative vs. procedural models) and it doesn't really make sense to call it LCD in the same way that you might use that to describe Kubernetes because it's not trying to build an abstraction which covers up the underlying platform details. Most of the Terraform I've written controls AWS but there's a significant value to also being able to use the same tool to control GCP, GitLab, Cloudflare, Docker, various enterprise tools, etc. with full access to native functionality.
Terraform (and Kubernetes) themselves aren't a lowest common denominator; however, I believe the comment alludes to an approach where you try to abstract away cloud features. This can (kind of) reasonably be done with Terraform and Kubernetes while avoiding vendor-specific services such as various ML services, DynamoDB, etc.
However, you can use Terraform just fine while still leveraging vendor-specific services that actually offer added value, like DynamoDB or Lambda. CloudFormation, however, doesn't really offer much added value (if any) over Terraform, so using Terraform isn't an LCD approach per se.
Yes — that's basically what I was thinking: you could make an argument that using Kubernetes inherently adds an abstraction layer which might not be preferable to using platform-native components but it sounded like the person I was responding to was making the argument that using Terraform requires that approach.
I found that especially puzzling because one of the reasons why we switched to Terraform was because it let us take advantage of new AWS features on average much faster than CloudFormation.
> Managed services offer big benefits over software.
TF can be used as a managed service.
> Managed service providers offer big benefits over software. With CF and AWS support, help with problems is a support ticket away.
The same is true with TF, except 100000% better unless you're paying boatloads of money for higher tiered support.
> I only run workloads on AWS, so the CF syntax, specs and docs unlock endless first-party features.
CF syntax is an abomination. Lots of the bounds of CF are dogmatic and unhelpful.
> I have seen Terraform turn into a tire-fire of migrations from state files to Terraform enterprise to Atlantis that took an entire DevOps team to care for.
CF generally takes an entire DevOps team to care for, for any substantial project.
Sure, but I've never seen that myself. Where TF was used, it was always a self-managed setup at best.
> The same is true with TF, except 100000% better unless you're paying boatloads of money for higher tiered support.
Again, all the places I worked had enterprise support and even an assigned rep. I think I only used support for CF early on; I don't know if it was buggier back then or I just understood it better and didn't run into issues with it.
> CF syntax is an abomination. Lots of the bounds of CF are dogmatic and unhelpful.
I would agree with you if you were talking about JSON, but since they introduced YAML it is actually better than HCL. One great thing about YAML is that it can be easily generated programmatically without using templates. Things like Troposphere make it even better.
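The "generate it programmatically" point can be illustrated with nothing but the standard library, since CloudFormation accepts JSON as well as YAML (troposphere adds typed classes and validation on top of the same idea):

```python
import json

def bucket_template(bucket_names):
    """Build a CloudFormation template from plain Python data structures."""
    return {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Resources": {
            f"Bucket{i}": {
                "Type": "AWS::S3::Bucket",
                "Properties": {"BucketName": name},
            }
            for i, name in enumerate(bucket_names)
        },
    }

# CloudFormation accepts JSON directly, so the dict is the whole artifact:
template_body = json.dumps(bucket_template(["logs-bucket"]), indent=2)
```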
> CF generally takes an entire DevOps team to care for, for any substantial project.
Over nearly 10 years of experience, I've never seen that be the case. I'm currently at a place that has an interesting approach: you're responsible for deployment of your app, so you can use whatever you want, but you're responsible for it.
So now I'm working with both. And IMO I see a lot of resources that are not being cleaned up (because there's no page like CF has, people often forget to deprovision stuff). I'm also seeing bugs, for example TF needing to be run twice (I think the last time I saw it fail, it was trying to set tags on a resource that wasn't fully created yet).
There are also situations where CF is just plain better. I mentioned in another comment how I managed to get the Datadog integration done through a single CF file deployed through a StackSet (this basically ensured that any new account is properly configured). If I had used TF for this, I would likely have had to write some kind of service that listened for events from Control Tower and, whenever a new account was added to the OU, ran Terraform to configure resources on our side and made an API call to DD to configure it to use them.
All I did was write code that generated CF via troposphere and deploy it to a StackSet in the master account once.
Right, your post is mostly "I like the thing that I've used, and I do not like the thing I haven't used". They're apples and different apples.
> Again, all the places I worked had enterprise support and even an assigned rep
So, again, you've worked at places that were deeply invested in CF workflows.
> but since they introduced YAML it is actually better than HCL. One great thing about YAML is that it can be easily generated programmatically without using templates.
Respectfully, this is the first-ever "yaml is good" post I think I've ever seen.
> Over nearly 10 years of experience, I've never seen that be the case. I'm currently at a place that has an interesting approach: you're responsible for deployment of your app, so you can use whatever you want, but you're responsible for it.
I'd love to hear more about this.
> And IMO I see a lot of resources that are not being cleaned up (because there's no page like CF has, people often forget to deprovision stuff). I'm also seeing bugs, for example TF needing to be run twice (I think the last time I saw it fail, it was trying to set tags on a resource that wasn't fully created yet).
I guess we're just ignoring CF's rollback failures/delete failures/undeletable resources that require support tickets then?
> There are also situations where CF is just plain better. I mentioned in another comment how I managed to get the Datadog integration done through a single CF file deployed through a StackSet (this basically ensured that any new account is properly configured). If I had used TF for this, I would likely have had to write some kind of service that listened for events from Control Tower and, whenever a new account was added to the OU, ran Terraform to configure resources on our side and made an API call to DD to configure it to use them.
Again respectfully, yes, the person that both doesn't like and hasn't invested time into using Terraform at scale probably isn't going to find good solutions for complicated problems with it.
While this is true and AWS support is very responsive and useful, it doesn't mean they solve all the problems. Sometimes their help is: "I'll note that as a feature request, in the meantime you can implement this yourself using lambdas".
CloudFormation can be very flexible, especially with tools like sceptre, it can work very well. A huge issue is that WITHOUT tools like sceptre, you can't really use stacks except as dumb silos. You already need additional tooling (sceptre, CDK, SAM, ...) to make CF workable. I think that most people who despise CF haven't got good tooling.
The issue with CloudFormation is that it quite often lags behind all the other AWS services. It seems to be maintained by a separate team. I realize that getting state management for complex interdependent resources right requires time and diligence, BUT it's a factor in driving adoption.
- New EKS feature? Sorry, no knob in CF for MONTHS.
- New EBS gp3 volume type available? Sorry, our AWS::Another::Service resource does not accept gp3 values for months after feature release.
- An AWS API exposes information that you would like to use as an input to another resource or stack. SURPRISE: CloudFormation does not return that attribute to you, despite it being available. SORRY, NOT FIXABLE.
- Try refactoring and moving resources in/out of stacks while preserving state? Welcome to a fresh hell.
- Quality of CloudFormation "modules" varies. AWS::ElasticSearch::Domain used to suck a whole lot, AWS::S3::Bucket as a core service was always very friendly.
- CloudFormation custom resources are a painful way of "extending" CF to suit your needs. So painful that I refuse to pay the cost of AWS not keeping their stuff up to date and well integrated.
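For context on that boilerplate: every Lambda-backed custom resource must PUT a response document back to a pre-signed URL, or the stack hangs in an in-progress state. A minimal sketch of that contract (the field names follow the custom resource response format; the helper names are mine):

```python
import json
import urllib.request

def build_response(event, status, reason="", data=None):
    """Assemble the JSON body CloudFormation expects a custom resource
    to PUT back to the pre-signed ResponseURL."""
    return {
        "Status": status,  # "SUCCESS" or "FAILED"
        "Reason": reason or "See CloudWatch logs",
        "PhysicalResourceId": event.get("PhysicalResourceId", "custom-resource"),
        "StackId": event["StackId"],
        "RequestId": event["RequestId"],
        "LogicalResourceId": event["LogicalResourceId"],
        "Data": data or {},
    }

def send(event, status, **kwargs):
    """PUT the response; until CF receives it, the stack sits in an
    *_IN_PROGRESS state, so every code path must end up here."""
    body = json.dumps(build_response(event, status, **kwargs)).encode()
    req = urllib.request.Request(
        event["ResponseURL"], data=body, method="PUT",
        headers={"Content-Type": ""},  # the pre-signed URL expects no content type
    )
    urllib.request.urlopen(req)
```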
These kinds of lag, this kind of incompleteness when it comes to making information from AWS APIs available have driven me to Terraform for things that are best done in Terraform and require flexibility and CloudFormation for things that work well with it.
At the end of the day, CF is a closed-source flavour of Terraform + the AWS provider. I would have liked to go all in, but it just doesn't work, and it costs hacks, time and flexibility.
That being said, if you have no idea how to work with TF, tell the devs to use CF.
The experience of AWS support is probably also very different when it comes to feature requests. An org that spends half a billion with AWS will get their requests implemented ASAP whereas small fish have to swim with the flow and hope it works for them.
Do you understand the difference between the things you can express with Dhall or ML versus Go or TS?
People are so hung up because we could do so much better by expressing only valid states, and then you would not need to deploy your infra to figure out that an S3 bucket cannot have a redirect_only flag set and a website index document set at the same time.
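The "express only valid states" idea, sketched in Python rather than Dhall or ML (class and field names are invented; the underlying S3 constraint is real: a website configuration is either a redirect or a hosted site, never both):

```python
from dataclasses import dataclass
from typing import Union

@dataclass
class RedirectAll:
    """A bucket that only redirects; no index document can even be written."""
    target_host: str

@dataclass
class HostedSite:
    """A bucket that hosts a site; no redirect target can be expressed."""
    index_document: str
    error_document: str = "error.html"

# The union makes the invalid combination (redirect AND index doc)
# unrepresentable, so no deploy is needed to discover the conflict.
WebsiteConfig = Union[RedirectAll, HostedSite]

def render(cfg: WebsiteConfig) -> dict:
    """Turn the typed config into a simplified property map."""
    if isinstance(cfg, RedirectAll):
        return {"RedirectAllRequestsTo": {"HostName": cfg.target_host}}
    return {"IndexDocument": cfg.index_document,
            "ErrorDocument": cfg.error_document}
```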