Dependency managers (like Apache Maven) changed the way software is developed: they make it much easier to use dependencies in your application, since you no longer have to fetch them manually. As a result, building your own libraries is now less common; using an available (open source) library is much easier.
To get a feel for the scale: NPM serves 2.1 trillion annual downloads (requests). This obviously includes automatic builds, but the growth over the last couple of years has been exponential nonetheless. And using third-party libraries can be a very good thing. Good libraries can be much more secure and have fewer bugs than a library developed in-house.
Types of supply chain attacks:
After this Sean unfortunately dropped from the session.
Reasons to analyze your cloud costs:
When you scale (hyper-growth), you’ll want to know whether your costs grow faster than the number of users/sales/etc. You don’t want to end up in a situation where more success means less profit.
Things to look for in your environment:
Best practices you can arrange from day one:
Customer experience is important for your revenue. You need to deliver your service and make sure you know how well you’re able to do so. Observability helps you find the needle in the haystack, identify the issue and respond before your customers are affected.
Pillars of observability:
Relevant key performance indicators (KPIs) and key result areas (KRAs):
An observability platform gives you more visibility into your systems’ health and performance. It allows you to discover unknown issues. As a result you’ll have fewer problems and outages. You can even catch issues in the build phase of the software development process. The platform helps you understand and debug systems in production via the data you have collected.
AIOps applies machine learning to the data you’ve collected. It’s a next stage of maturity. Its goal is to create a system with automated functions, freeing up engineers to work on other things. Automating remediation of issues can also greatly reduce response time and mean time to repair. This means that the customer experience is restored faster (or is never degraded to begin with).
Failure can cause a deep emotional response: we can get depressed, and it can make us physically sick. On one side of the spectrum, a failure can cause harm to other people. On the other side, we could embrace failure and make things safe to fail. Failure in IT on the project level is quite common. Failure can also happen on a personal level.
Not all failures are created equally. There are three types:
Steps to learn from failures:
Increase return by learning from every failure, share the lessons and review the pattern of failures. Do note that none of this can happen in an environment without psychological safety. You need to feel safe to discuss your failure, doubts or questions to be able to learn.
A manufacturing process was monitored and there was a nice dashboard to show whether there were any problems. However, at a certain moment there was a problem, but the dashboard was still claiming everything was fine.
What was going on? Unreliable network? Erratic monitoring system? Flawed collection of metrics? It turned out to be all of them.
Sampling rates, retention policy and network issues can cause missing measurements in your time series database. This missing information can cause a drop in failure rate if you are unlucky enough that the missing samples are failed ones. So you think everything is fine, but there is something wrong in your environment.
Modelling metrics differently can help. One possible improvement is to have a duration sum in your metrics instead of just the duration. If you now miss a sample, the sum will still indicate that there has been a failure.
Using histograms is even better since you place values in buckets, e.g. failures in our case. A disadvantage however is that the metrics creation system must now also know what qualifies as a success or failure.
Takeaways:
After improving their metrics, the monitoring system matches the actual state of the manufacturing environment again.
Tip: look into hexagonal architecture, also known as “ports and adapters architecture.”
DNS, TLS and bad config are where failures are waiting to haunt us when we least expect it. We need to have a tool to find the issues in our system early on. Health checks can be this essential tool to alert us.
You are able to narrow down where the issue is if you structure your health checks like this:
Examples of tests you can use:
Health checks are cheap to run and give you a fast overview. They do not replace observability and only tell you something is broken, not what is going on. Start simple and only add synthetics when/where needed, since they are more complex.
By using Terraform AWS modules you’ll have to write less Terraform code yourself, compared to using the AWS provider resources directly.
For example: if you write your own Terraform code from scratch for an example infrastructure with 40 resources, you’ll need about 200 lines of code. Once you introduce variables, that code base will grow to 1000 lines of code. When you then split it up into modules, you’ll need even more code.
Instead, if you use terraform-aws-modules, you get more features than your own modules would have, and you only need about 100 lines of code.
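As a sketch of what this looks like in practice, here is the community VPC module from terraform-aws-modules standing up a VPC with subnets in a handful of lines (the name, CIDRs and availability zones are hypothetical; check the module’s documentation for the exact inputs of the version you pin):

```hcl
# Using the community VPC module instead of hand-writing every resource.
module "vpc" {
  source  = "terraform-aws-modules/vpc/aws"
  version = "~> 3.0"

  name = "example"
  cidr = "10.0.0.0/16"

  azs             = ["eu-west-1a", "eu-west-1b"]
  public_subnets  = ["10.0.1.0/24", "10.0.2.0/24"]
  private_subnets = ["10.0.101.0/24", "10.0.102.0/24"]
}
```

Behind this single module block, the module creates the VPC, subnets, route tables and related plumbing that you would otherwise write out resource by resource.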
Questions with regard to terraform-aws-modules:
Testing consists of static checks (tflint and terraform validate) and running Terraform on examples. Anton thinks that testing code should be for humans (so HCL is better than Go).

Most important word for this talk: understandability.
50-67% of the time on software projects is spent on maintenance.
Some useful links Anton listed:
Why DevOps? It can be pretty complicated to explain, even though it is an obvious choice for Nasdaq. For most people Nasdaq is a stock exchange but it is actually a global technology company. Sure, they run the stock exchange, but also provide capital access platforms and protect the financial system.
Nasdaq develops and delivers solutions (value) for their users. They manage and operate complex systems. It has been around for a while. So again, why DevOps? The answer is: to get better at their practices.
Years ago they had manually configured static servers and the development teams were growing. They automated software deployment to a point where the product owners could trigger the deployment, and even pick which branch to deploy. This was an important first evolution. They had a “DevOps team” to handle this automation.
The second evolution for Nasdaq was moving from a data center to the cloud, using infrastructure as code (IaC). The question they asked themselves was what to do first: migrate to cloud or get their data center infrastructure 100% managed via IaC? They made the ambitious decision to do both at once.
By turning your infrastructure into code, you can create and destroy the environment as many times as you like. And this was welcome: after about 2100 times they “got it right” and were able to move the production environment over to the cloud. Without IaC this would not have gone as flawlessly as it did.
The cloud and IaC brought them:
Over time the DevOps team started to handle a lot more work. The team consisted of system administrators, but they were required to work as developers (make code reusable, use Git, etc.). The DevOps team started to complain about being overloaded, and they became a bottleneck since a lot of development teams came to them with problems (failing builds, cloud questions).
On the other side, the development teams started to complain because they were dependent on the DevOps team, which had become the bottleneck. And “just” scaling up the DevOps team would not solve the problem.
Where the second evolution was about the technology, the third evolution was about efficiency. They moved to a “distributed DevOps” model. Developers were empowered: access to logs and metrics, training (cloud, Terraform, Jenkins). By creating a central observability platform, developers could get insight in what is going on, without the need to have access to the production environment.
This resulted in more deployments and enhanced reliability of the deployments because of the observability platform.
A year or three later, new cracks appeared. Standards were diverging because teams were allowed to pick their own path (libraries, databases, pipelines, Terraform code, etc.). It also led to practices that needed to be fixed (e.g. lack of replication). Standardizing this led to quite a burden at the start of a project: lots of basic stuff to set up.
Developers needed to be experts in a lot of technology, from JavaScript and the JavaScript framework in use, via multiple .NET versions to Terraform and other deployment related tech. An easy way to solve this situation was to flip back to the previous situation with a single team responsible for deployment and such. But this would basically mean recreating the bottleneck.
The ownership itself was not the problem. The efficiency was, because of the boilerplate needed for teams. Stuff you want to do the same across the teams. They wanted to empower the development teams, but also give them the standards for databases, messaging, etc. Instead of copying a template and have teams diverge afterwards, they looked into packaging to also make it easier to update afterwards.
This led to evolution four, with marker files, packages, code generators and auto-devops pipelines. The pipeline looks at the markers (“hey, this is a .NET app”) and can then apply a standard pipeline. Nasdaq’s code generators create the boilerplate for the teams, so within two minutes of starting a new application you’re able to write code to solve your business problem, instead of having to create boilerplate code yourself first.
The developers can get up and running quickly, but in a safe way.
The development teams are all DevOps teams now, but Nasdaq also has a specialized team for the complex areas (hardware, networking, etc). There is also a “developer experience” team that focuses on the tools for the developers, like the code generators.
Current status with regard to our three key areas:
No matter how reliable our systems are, they are never 100% — an incident can always happen. When the pager goes off, the first step is to recruit a response team. This team can then observe what is going on. They need to figure out what this means (orient themselves) and decide what to do. And finally they can act to resolve the problem. (The OODA loop.)
Getting the right people involved can be hard for the technical responder; they themselves might want to dive into the technical stuff first. This is where the “incident commander” role comes in. The incident commander will recruit the team to get the right people involved, coordinate who does what, and handle communication with people outside of the response team. (The latter can also be handled by a dedicated “comms lead” if needed.)
But how does the incident commander get involved?
A fairly standard approach for an on-call system will be to have a tiered model: tier 1 (NOC), tier 2 (people generally familiar with the system) and tier 3 (the experts of the system having the issue). The problem with this model: where does the incident commander come from? Tier 1? If so, can the incident commander follow through if the issue is handed over to the next tier?
Another model (“one at a time”): team A gets involved, decides it is not their responsibility, hands over to team B, which kicks it to team C, etc. Where does the incident commander come from in this model?
The aforementioned models only work in the simplest cases. They share a few big problems: handoffs are hard and there is no ownership, which results in loss of context. To mitigate this, some teams have an “all hands” approach where everyone is paged and everyone swarms into the incident response. However, most people on the call (or in the war room) cannot contribute. This leads to a mentality of “how quickly can I get out of here?”
Yet another approach is an Incident Command System (ICS), which comes from emergency services. In this approach the alert goes to the incident commander who then involves the team. While this works in some organizations, in tech it’s usually a bit too regimented.
The ICS morphed to an “adaptive ICS” where the technical team has more autonomy, but the incident commander is still involved. This system can be scaled up to where there’s an “area commander” role which coordinates separate teams (via their respective incident commanders).
Summarizing the roles of the parties in the “response trio”:
Each role will perform their own OODA loop from their own perspective.
But we started the story in the middle. We need to get back to the beginning and ask the question “why is the pager making noise?” Perhaps the first question one should ask is: “is this something actionable?” If it is not or if it is something you can handle in the morning, perhaps you do not have to respond in the middle of the night.
Cut down the noise and focus on the signal.
PagerDuty had an idea what SRE meant: they were enablers. You can hit them up on Slack and they help you out with a problem. Having an SRE team initially reduced the total incident minutes per year. But when PagerDuty grew further, the number went up again. Oops.
The “get well” project required teams to have the following:
Results:
Important elements that made this “get well” project possible:
PagerDuty used Backstage as an internal “one stop shop” developer portal with documentation and insights. It also integrates with the development systems.
When people think about DevOps, they think of CAMS, which stands for:
The term CAMS was formalized by John Willis and he wrote down his idea in the article What Devops Means to Me. This talk will focus on the most important aspect: culture, and specifically organizational learning.
Culture as described by Simon Sinek:
The 5 disciplines from organizational learning:
Automation is about automating the right things. You don’t want to do it just to automate things, but it has to fit in the system. So you have to start with culture before you think about automation. The ultimate goal is GitOps where everything happens via a pull request.
Measurement started out with a focus on the tooling. But measuring how you work is as important as measuring your infrastructure.
Important key metrics, from DORA:
Share how you are doing (in) DevOps. This is how we ended up here now. Share how you are improving your own organization. But also share information within your organization: documentation, videos, presentations, open spaces, lean coffee sessions.
CAMS originated in 2010. The three ways are principles that came out of the book The Phoenix Project. They are:
If we map these to CAMS:
Working on your culture is as important as doing your actual work.
So what’s next? CAMS is still as applicable as 10 years ago. It has always been important, but it was only put into words in 2010. We need to continue sharing to get the full value out of it.
We think of Usain Bolt as the record-breaking athlete, but he’s also the person that worked really hard to get there. It takes a lot of time and effort to become good at something. Pilots and firemen spend most of their time training and not doing what you expect them to do, just to make sure they perform well under pressure. Also note that pilots use a lot of checklists to prevent mistakes.
We should do the same: train for when our platform is in an error state. We should not just be able to detect it, but also solve the problem.
Chaos engineering is the discipline of experimenting on a distributed system in order to build confidence in the system’s capability to withstand turbulent conditions in production. This practice started at Netflix with Chaos Monkey.
We need to become comfortable with experimenting. Have game day exercises and analyze what happened, to improve your training. Do not just focus on the result of the exercise itself, but also ask questions like “was it the right experiment?”
Now that we use containers, add sidecar containers with tools to get metrics or detect errors. Or to do chaos engineering e.g. with Toxiproxy.
Since checklists are boring, we can use gamification to spice things up. Celebrate failure, and learn from it!
Living in the year 3000: breaking production on purpose on Saturdays and have the system remedy the problem itself.
Convince management that failure is normal and expected behaviour. Promising 100% uptime is not realistic. Large, complex systems will always be in a (somewhat) degraded state.
Let engineers be scientists to deal with this complex environment. Give them training, allow them to do tests (experiments), which results in having valid monitoring that lead to actionable alerts. Get the engineers in a state where they are comfortable with failures.
(Slides)
Each AWS service has its own price components. It’s a complex subject. Even a simple service like a load balancer has multiple components. It looks simple with the “$0.008 per LCU-hour” price tag, but now you have to figure out what an LCU-hour is. Then you learn it has four dimensions that are measured: number of new connections per second, active connections, processed bytes and rule evaluations. Good luck predicting the costs.
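As a back-of-the-envelope example of what that price tag turns into (the 10 LCU average is a hypothetical workload; 730 is the approximate number of hours in a month):

```latex
10\ \text{LCU} \times \$0.008\ \text{per LCU-hour} \times 730\ \text{h/month} \approx \$58.40\ \text{per month}
```

The hard part is not this multiplication, but estimating which of the four dimensions dominates your LCU count in the first place.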
This presentation sticks to the basics, since cost optimization is such a big topic.
To start to manage/reduce your costs, you need to enable billing and costs for your DevOps team. If you cannot measure your costs, you cannot manage it. Note that you do not need an excessive amount of tags for cost management. First you need to figure out what you are going to change and how it’s going to affect the bill for your company.
What patterns can we avoid? In most organizations, the most expensive parts of your bill will be:
So we will dive into these subjects.
Tips to reduce costs:
We usually forget to take data transfer costs into account upfront. It’s also a complex subject and there are a lot of considerations to make.
Tips:
Initially it was simple: there were only two storage classes. Currently there are 6 different classes with their own prices and characteristics.
The best option is to use lifecycle rules to move data between the different classes. Note that in a lifecycle policy you cannot filter objects based on an extension (e.g. *.jpg); instead you need to think “from left to right.” So you need to think upfront about the prefixes you are going to want to use in your bucket.
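A minimal sketch of such a prefix-based lifecycle rule in Terraform (the bucket reference, rule name, prefix and transition days are all hypothetical; the resource is `aws_s3_bucket_lifecycle_configuration` from the AWS provider):

```hcl
resource "aws_s3_bucket_lifecycle_configuration" "logs" {
  bucket = aws_s3_bucket.logs.id # hypothetical bucket resource

  rule {
    id     = "archive-logs"
    status = "Enabled"

    # Filter by prefix ("from left to right"), not by extension.
    filter {
      prefix = "logs/"
    }

    # Move to a cheaper class after 30 days, and to Glacier after 90.
    transition {
      days          = 30
      storage_class = "STANDARD_IA"
    }

    transition {
      days          = 90
      storage_class = "GLACIER"
    }
  }
}
```

If you had named your objects by extension instead of by prefix, there would be nothing for `filter` to match on, which is exactly the “think upfront about your prefixes” point.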
Managed services can offer you automatic and manual backups, which can be great. But what is the cost of that? Check how much retention you need for example.
With regard to EBS: for most use cases gp3 is better and cheaper than gp2 (except for very large volumes). Note that you can change the EBS volume type without stopping the machine. In most cases you can get a 20% cost saving without affecting your performance.
After each AWS re:Invent, your deployment is probably outdated with regard to cost optimizations. Examples:

- gp3 instead of gp2 for your EBS volumes.
- m6 might be more interesting than the m5 or m4 you may currently be using.

So keep up to date with the offerings.
Call for Code is a multi year program launched in 2018 to address humanitarian issues and help bridge potential solutions. Last year the global challenge was around climate change and a track was added for the social and business impact of the COVID-19 pandemic.
It’s not just about generating ideas to take on the issues. It should eventually also lead to an adopted open source solution that is sustainable.
The 14 projects discussed today can be found on the Call for Code page on the Linux Foundation website. You can also read about them via the IBM developer site. GitHub is central to how they iterate on the features. The related organizations are:
The key takeaway for this session is for us to help improve how the projects do DevOps. The goal is to ensure that everyone can contribute to the projects, can do so with confidence, and that the projects can be deployed with speed.
The Call for Code for Racial Justice open source projects are categorised in three pillars:
Demi talked about each of these projects in the program and their tech stacks. You can read more about them in the Call for Code for Racial Justice section on the IBM developer site.
Daniel in turn talked about other Call for Code projects:
Even if you cannot code, there are numerous ways you can contribute: conducting user research, writing/reviewing documentation, doing design work, or advocacy like speaking at conferences.
There are multiple ways to get involved:
Chris shared 10 DevSecOps failures and talked about how to change the culture and turn these failures into successes.
This quote nicely sums it up:
To change the culture:
The problem is that nothing ever gets done with this infinity graph. It’s not an accurate representation.
The solution is to talk about pipelines instead and integrating security into them. Code review should include security, vulnerability scanning should be part of the pipeline, etc. Ban the infinity graph.
Creating a specific security team is the opposite of what DevOps is about. It’s about working together. Security isn’t a specialty, it is the responsibility of everybody. This requires knowledge and expertise.
The other way around is also true: teach security people to code. They don’t have to become great coders, but it would be nice if they can review code and make suggestions to make things more secure.
Sometimes we let vendors define what DevOps and security are for us, via the products that they offer. It would be better to find the best of breed outside of your cloud provider’s offering. Take a vendor-independent approach and determine what DevOps means to you.
Looking at the big companies can be discouraging. You most likely have not invested the same time in it as e.g. Netflix, Etsy, etc. have. So while you won’t be at the same level, don’t see this as an excuse to give up. Do the DevOps that you do. Don’t fixate on the top of the class. Get on that path and make incremental progress.
Use the OWASP DevSecOps Maturity Model to create a roadmap.
This can be a complicated subject (see for example the DevSecOps Reference Architecture from Sonatype).
But keep it simple! Start with a small subset of security tools. Everybody related to your project should be able to explain the build pipeline. Don’t try to solve all problems immediately. Take a phased approach.
Security might want to slow down the pipeline and act as a gatekeeper. Don’t say “no,” but “yes, if…” For example: “yes, that would be a great feature if you enable multifactor authentication.”
Practice empathy. Both security people for developers, but also the other way around.
You buy a tool and enable every option to “get your money’s worth.” The result is 10,000 JIRA tickets of things that need to be fixed. This does not help.
It would be better to tune the tools and don’t waste time with security findings that do not matter.
Start with a minimal policy focusing on the largest issue. Developers will then start to trust the tool, and then you can slowly expand the policy.
Your scanning tools cannot find business logic flaws; there’s no pattern to them.
Perform threat modelling outside of the pipeline. It should be done when new feature assignments go out.
There are lots of vulnerabilities in open source software; this is a supply chain problem.
To improve this: embed software composition analysis (SCA) in all your pipelines. Set the SCA policy to fail when a vulnerability is detected. If you filter out a vulnerability (e.g. because there’s no fix yet), make sure the filter will not be active forever.
The 10 DevOps successes we can distil from the failures above:
Besides being a principal developer advocate at Honeycomb, Liz is also a member of the platform on-call rotation. Honeycomb deploys with confidence up to 14 times a day, every day of the week—so also on Fridays. How do they manage to (mostly) meet their Service Level Objectives (SLOs) while also scaling out their user traffic?
Their confidence recipe:
You need to know how broken is “too broken.” You don’t have to alert on all problems when working at scale. You need to measure success of the service and define SLOs. These are a way to measure and quantify your reliability.
Honeycomb’s job is to reliably ingest telemetry, index it, store it safely and let people query it in near-real-time. Honeycomb’s SLOs measure the things that their customers care about.
For example, they have set an SLO that the homepage needs to load quickly (within a few hundred milliseconds) 99.9% of the time. User queries need to run successfully “only” 99% of the time and are allowed to take up to 10 seconds. On the other hand: ingestion needs to succeed 99.99% of the time, since they only have one shot at it.
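Those percentages translate directly into an error budget. Over a 30-day window (43,200 minutes), the arithmetic works out to:

```latex
\begin{aligned}
99.9\%\ \text{(homepage)}  &: (1 - 0.999)  \times 43{,}200\ \text{min} = 43.2\ \text{min of allowed failure} \\
99.99\%\ \text{(ingest)}   &: (1 - 0.9999) \times 43{,}200\ \text{min} \approx 4.3\ \text{min} \\
99\%\ \text{(queries)}     &: (1 - 0.99)   \times 720\ \text{h} = 7.2\ \text{h}
\end{aligned}
```

The stricter the SLO, the smaller the budget: ingestion gets roughly four minutes of failure per month, which is why it is the target they guard most carefully.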
Services are not just 100% down or 100% up (most of the time).
These metrics help Honeycomb make decisions about reliability and product velocity. If the service is down too much, they need to invest in reliability (since having features that cannot be used does not add value). On the other hand: if they exceed the SLO, they can move faster.
Practices used by Honeycomb:
For infrastructure, Honeycomb also uses infrastructure-as-code practices:
Left-over error budget is used for chaos engineering experiments. This is something where you go test a hypothesis. You need to control the percentage of users affected by it and be able to revert the impact you are causing.
Chaos engineering is engineering. It’s not pure chaos.
This works well for stateless things, but how does it work for stateful things? In the case of the Honeycomb infrastructure, they make sure to only restart one server or service at a time. They do not introduce too much chaos to reduce the likelihood that something goes catastrophically wrong.
Two reasons why you will want to do these experiments at 3 PM and not at 3 AM:
With the experiments they measure if they had an impact on the customer experience. If they cause a change, does the telemetry reflect this? (Is the node indeed reported as being offline, for example?) When you fix things, you need to repeat the experiment and make sure the change indeed fixed the issue.
When you burn the error budget, the SRE book states that you should freeze deploys. Liz disagrees. If you freeze deploys, but continue with feature development, the risk of the next deployment only increases. Instead, Liz advocates for using the team’s time to work on reliability (i.e. change the nature of the work instead of stopping work).
Fast and reliable: pick both!
You don’t have to pick between fast and reliable. In a lot of ways fast is reliable. If you exercise your delivery pipelines every hour of every day, stopping becomes the anomaly instead of deploying.
We start with the basics: what is Infrastructure as Code (IaC)? With the advent of cloud providers, you no longer use hardware, but a UI to stand up infrastructure. This led to shadow IT since developers ran off with a credit card to provision what they needed themselves, instead of using slow, internal IT systems.
Developers however rather write code and use developer methodologies than click through a UI. This is where IaC started. With it you can create and manage infrastructure by writing code.
What are the pitfalls? The bad news: you get all the pitfalls of infrastructure and all the pitfalls of code. But you’ve probably already got a lot of experience with those issues, and teams to handle them. You just use a different methodology.
The first pitfall is not fostering the communication between the groups that have experience and tools.
Which framework/tool do you pick? There are basically two categories: multi-cloud or cloud agnostic tools on the one hand (like Terraform and Pulumi) and cloud specific tools on the other (like CloudFormation). Note that for example with Terraform and Pulumi you still have to rewrite code when switching from one cloud provider to another, but at least the tool is familiar.
Security is a huge thing. You still need to know how to design your VPC, IAM policies, security policies, etc. You still need to communicate with all the teams that have the experience. It’s not just Dev and Ops. With tools like Terrascan and Checkov you can shift-left the security aspect instead of trying to bolt it on afterwards.
The biggest issue is with default values. If you use the UI, there are a lot of boxes that may be blank or have stuff in them. Some of the boxes can be left blank; for some you need to specify what you want. The UI is going to yell at you; if you use IaC, things may be less in your face.
You don’t want to deploy something with an open policy. Open Policy Agent can really help you to make sure you stay within your allowed parameters. For instance, you can write a policy to make sure you only use a specific region, don’t deploy an open S3 bucket, or only use certain sizes of EC2 instances.
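OPA policies themselves are written in Rego, but a similar guardrail for one of these examples can also be expressed directly in Terraform with variable validation (a sketch; the allowed instance types are hypothetical):

```hcl
variable "instance_type" {
  type        = string
  description = "EC2 instance type for the service"

  validation {
    # Reject a plan outright if the type is not on the allow-list.
    condition     = contains(["t3.micro", "t3.small", "t3.medium"], var.instance_type)
    error_message = "Only t3.micro, t3.small and t3.medium are allowed."
  }
}
```

Variable validation only covers inputs to your own code; OPA-style policy engines can additionally inspect the full plan, which is why they complement rather than replace each other.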
If you hard code certain values in Terraform or other IaC tools, you might need to copy/paste a lot of code if you want to create e.g. a test, acceptance and production environment. To mitigate these DRY (don’t repeat yourself) issues you can for instance use Terraform modules or Terragrunt.
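For example, a single module can be instantiated once per environment, with only the differences passed in (the module path, variable names and file layout here are hypothetical):

```hcl
# environments/test.tf -- hypothetical layout
module "app_test" {
  source         = "../modules/app"
  environment    = "test"
  instance_count = 1
}

# environments/production.tf
module "app_prod" {
  source         = "../modules/app"
  environment    = "production"
  instance_count = 3
}
```

Everything the environments share lives once in the module; only the knobs that genuinely differ are repeated.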
State size can become a problem. If you want to, you can put all of your infrastructure in the same state file (which is where your tool stores the state of the infrastructure). However it means the tool will have to check all resources in the state file to detect if it needs to do something. To mitigate this, you can again use Terraform modules. It helps with performance and makes the codebase more manageable.
The conference is still ongoing while I publish this post. However, this is it for me for All Day DevOps for this year. I learned new things and got inspired.
Thanks to the organizers, moderators and speakers for hosting another great event.
I selected two Terraform workshops and one about AWS Lambda this year.
Mikhail provided a Git repository to use during the workshop: https://github.com/mikhailadvani/terraform-workshop
To dive right in, have a look at this snippet, which writes a file to the current directory (in this case the part-1 directory of the workshop repo):
resource "local_file" "user" {
content = "..."
filename = "${path.module}/${var.filename}"
}
Terraform does not accept relative paths. But using absolute paths means you’ll probably end up with a path containing a home directory of a specific user, which is annoying for team members that have a different username.
To prevent this, you can use the “${path.module}” variable, just like in the example above. You can also use Docker to have a standard environment to run Terraform in. Using Docker can also be a practical approach when you want to use Terraform in your CI.
You also cannot have one variable refer to another (at least, in Terraform 0.11). You’ll have to use a local value, for example:
locals {
filename = "${var.name}.txt"
}
Mikhail demonstrated the use of:
- terraform plan -var-file=&lt;filename&gt;: this allows you to provide values for your variables (docs).
- terraform plan -out=&lt;filename&gt;: you can save your plan. If you feed the “terraform apply” command this file, it will not ask for confirmation, since Terraform assumes the plan has already been reviewed (docs).
If your (remote) state has changed between generating the plan and running apply with that plan, Terraform will detect that and fail.
depends_on: ensure that resource A is created before B.
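A minimal sketch of an explicit dependency (resource names, AMI and bucket name are hypothetical):

```hcl
resource "aws_s3_bucket" "assets" {
  bucket = "example-assets-bucket"
}

resource "aws_instance" "app" {
  ami           = "ami-12345678" # hypothetical AMI
  instance_type = "t3.micro"

  # Ensure the bucket exists before this instance is created,
  # even though nothing in the instance references it directly.
  depends_on = [aws_s3_bucket.assets]
}
```

Terraform normally infers ordering from references between resources; depends_on is for the cases where the dependency is real but invisible in the configuration.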
If you work in a team, you want to use something like S3 for the state file and DynamoDB for locks. And you probably want to use Terraform to create the S3 bucket and the DynamoDB table. But to be able to create infrastructure you need to have a state file. Catch-22. To solve this, you’ll have to split this up into two phases:
Run init, plan and apply without a backend configuration. This will create the resources, but keep the Terraform state locally.
Add the backend configuration and run terraform init again. Terraform now picks up that the state should be stored remotely and offers to move the current state.
Note that Terraform will create resources in parallel.
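A backend configuration for this shared state could look something like the following sketch (the bucket, key, region and table names are placeholders; the S3 bucket and DynamoDB table must already exist when “terraform init” runs):

```hcl
# Hypothetical backend configuration for shared team state.
terraform {
  backend "s3" {
    bucket         = "example-terraform-state"    # S3 bucket holding the state file
    key            = "workshop/terraform.tfstate" # path of the state object in the bucket
    region         = "eu-west-1"
    dynamodb_table = "terraform-locks"            # DynamoDB table used for state locking
  }
}
```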
How do you manage secrets? Mikhail is using a secret.tfvars file and git-crypt. The benefit of using vars files over environment variables is that you can put the former in version control. The official recommendation is to use Vault (also made by HashiCorp). You can possibly also use e.g. an AWS service to store your secrets.
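A sketch of the vars-file approach (the variable name and value here are made up): declare the secret as an ordinary variable and keep its value out of the main code.

```hcl
# variables.tf — declare the secret without a default value,
# so Terraform refuses to run when no value is supplied.
variable "db_password" {}

# secret.tfvars (encrypted with git-crypt) would then contain a line like:
#   db_password = "not-a-real-password"
# and is passed in with: terraform plan -var-file=secret.tfvars
```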
If you rename resources or move them into modules, Terraform detects that a resource is no longer in your code and also picks up the new resource. It will try to delete the old one and create the new resource. To prevent this, you’ll have to tell Terraform the resource has moved. Use “terraform state mv” (docs) to fix this.
An example of a module using Terratest to test the Terraform code: https://github.com/mikhailadvani/terraform-s3-backend
Tips from the audience: use “terraform validate”.
There are different types of modules:
A composition is a collection of infrastructure modules. An infrastructure module consists of resource modules, which implement resources.
You can use infrastructure modules e.g. to enforce tags and company standards. You can also use things like preprocessors, jsonnet and cookiecutter in them.
Terraform modules frequently cannot be re-used because they are written for a very specific situation and sometimes have hard coded assumptions in them.
Don’t put provider information in your module. For instance: although you can pin provider versions in a module, please do not do this:
# Don't do this!
terraform {
required_providers {
aws = ">= 2.7.0"
}
}
Avoid provisioners in modules. Use public cloud capabilities instead, like user_data. (Mark: for an example, see https://github.com/sjparkinson/terraform-ansible-example/tree/master/terraform)
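For instance, instead of a provisioner a module could hand the bootstrap work to the cloud platform via user_data. A sketch, with a placeholder AMI id and install script:

```hcl
resource "aws_instance" "web" {
  ami           = "ami-12345678" # placeholder
  instance_type = "t2.micro"

  # Runs at first boot via cloud-init, so no SSH access or
  # remote-exec provisioner is needed.
  user_data = <<-EOF
              #!/bin/bash
              apt-get update
              apt-get install -y nginx
              EOF
}
```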
Traits of good modules:
More info: Using Terraform continuously — Common traits in modules
There are two opposite ways of structuring your code.
Terragrunt reduces the amount of code needed to do similar things. It has extra features like execution of hooks and a number of additional functions. Terragrunt is opinionated.
Terraform workspaces are the worst feature of Terraform ever. (Provisioner is the 2nd worst.) Workspaces allow us to execute the same set of Terraform configs but with slightly different properties. (For example: if this is the production environment spin up 5 instances, else 1 instance.) Workspaces are not infrastructure as code friendly. You cannot answer, from the code:
Better: use reusable modules instead of workspaces.
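The per-environment difference a workspace would hide can instead be written out as code with a reusable module (a sketch; the module source and variable names are illustrative):

```hcl
module "web_production" {
  source         = "./modules/web"
  environment    = "production"
  instance_count = 5
}

module "web_staging" {
  source         = "./modules/web"
  environment    = "staging"
  instance_count = 1
}
```

Now a question like “how many instances does production run?” is answered by the code itself.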
Will it help us?
It’s the biggest rewrite of Terraform since its creation. There are backward incompatible changes though.
Main changes:
loops (for_each)
conditionals (... ? ... : ...) that work as you expect
depends_on everywhere
(The HashiCorp blog has a number of articles about Terraform 0.12 if you want to know more.)
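A small sketch of what the 0.12-style syntax looks like (names are illustrative; note that for_each on resources only landed in a later 0.12 point release):

```hcl
variable "production" {
  type = bool
}

resource "aws_instance" "web" {
  # Loop: one instance per key, addressable as aws_instance.web["a"] etc.
  for_each = toset(["a", "b"])

  # First-class expressions: no "${...}" wrapping needed, and the
  # conditional only evaluates the branch that is taken.
  instance_type = var.production ? "m5.large" : "t3.micro"
  ami           = "ami-12345678" # placeholder

  tags = {
    Name = "web-${each.key}"
  }
}
```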
Terraform developers write and support Terraform modules, enforce company standards, etc. Terraform users (everyone) use modules by specifying the right values; they are the domain experts but don’t care too much about the inner workings of the Terraform modules.
Terraform 0.12 allows developers to implement more flexible/dynamic/reusable modules. For Terraform users the only benefit is HCL2’s lightweight syntax.
There is a command to check your code for 0.12 compatibility: “terraform 0.12checklist”. Once everything is fine you can use “terraform 0.12upgrade”. See the upgrade guide for more information.
Note that the 0.12 state file is not compatible with the 0.11 state file. If you have a remote (shared) state, once one member of the team upgrades to 0.12, the whole team needs to upgrade.
Anton has a workshop you can do at your own pace at https://github.com/antonbabenko/terraform-best-practices-workshop. The workshop builds real infrastructure using https://github.com/terraform-aws-modules.
During this section of his session Anton answered a bunch of questions people in the audience had. Tools that were discussed:
Related repository: https://github.com/theburningmonk/getting-started-with-serverless-development-with-lambda-devopsdays-ams
There is a (soft) limit of 1,000 concurrently running Lambdas. This can be increased by sending a request to AWS. But even if you convinced Amazon to increase it to, for instance, 10,000, they only scale up at a rate of 500 per minute. So if you have a workload with sudden spikes and need to scale quickly, Lambda might not be what you need.
Note that you can set a “reserved concurrency” on a Lambda; this setting acts as a max concurrency of a function.
Use tags and have a naming convention. E.g. add a “team” tag so you know which team to contact about a Lambda. This is useful when you have many, many functions.
Separate customers also get their own instances to run their Lambdas; they are isolated from each other.
By default functions don’t have access to resources in your VPC. You can add a function to your VPC, but currently there is a cold start penalty. This penalty can be several seconds, which is a lot if the Lambda itself only takes a few milliseconds to run. So you probably should not put your function in a VPC if you do not need to.
With regards to execution: you are billed in 100ms blocks. So if your function takes 60ms to run, there is—from a cost perspective at least—no benefit in speeding up the Lambda.
If you want to debug your functions locally in VS Code and you have
the serverless framework installed locally, you can update your launch.json
file:
{
"version": "0.2.0",
"configurations": [
{
"type": "node",
"request": "launch",
"name": "Launch Program",
"program": "${workspaceFolder}/node_modules/.bin/sls",
"args": [
"invoke",
"local",
"-f",
"hello",
"-d",
"{}"
]
}
]
}
How do you organize your code? Don’t have a mono repo. Use microservices or at least a service oriented architecture. Give every microservice its own repo, along with code only used by this service.
Advice: give the service name the same name as the repo to make it easier to find things.
Organize functions into repositories according to boundaries you identify within your system.
How do you share code between Lambda functions? It depends. Within the same repo you can use e.g. lib/utils.js. You can also create an npm package. Or create a new service to provide the functionality.
When using Lambda, you’ll probably end up using more external services. The functions themselves are simple, but the risk shifts to the integration with the external services.
Another risk is the security. With microservices you have more control, but also more things to keep secure. And thus also more places you can misconfigure.
Combining those two: the risk profile for a serverless application is completely different from “traditional” applications.
These are just notes. They are not proper summaries of the talks.
The dates for the Open Source Summit next year:
Documentation as code means writing, testing, publishing and maintaining
documentation using the same tools developers use for software code.
Advantages of using plain text for documentation, instead of e.g. Word, are that plain text is more accessible and it is easier to do validation. You can also use the same version control tools you use for your code.
You could use issue trackers to automatically generate e.g. release notes.
You can build and test each commit/pull request (continuous integration). You can test all kinds of things: are components indeed available, are links valid, etc.
Generating documentation is one, publishing it is another thing. You could automatically deploy your documentation (continuous deployment).
To make the barrier to entry for contributing lower, you could use a containerized toolchain. This way people can contribute without having to worry about installing all required tools on their own system.
Multiple parties benefit from documentation as code:
There are challenges though. Both writers and developers may resist. Writers have to retrain and it may be a steep learning curve. Developers may feel that doing documentation properly might slow down the development process. Since the documentation is more visible now, developers might become conscious of their language and grammar skills.
There are also technical challenges. Converting to another format might not be trivial and even lead to a disruption in the release cycle. There may also be resource and staffing issues.
(Slides)
A microservice is observable if we can analyze its metrics, logs and tracing information. But you need more than metrics or logs of individual containers. You want the aggregation of all of them to see what is going on and detect patterns.
Distributed tracing tells a story of a request though our services (which services were touched, when, etc). It is especially useful when doing root cause analysis or you want to optimize performance (where am I spending most of my time).
But how does it work? A “span” is a data structure to store units of work in (what, when did it start, etc) plus metadata. Tracing includes references to other spans so you can build graphs of those spans.
Instrumentation can be both implicit and explicit. Explicit instrumentation is done in your code itself. Implicit instrumentation is done via the frameworks you use (e.g. Spring Boot in a Java project).
The OpenTracing project wants to document terminology so we are all talking about the same things. After documenting, the next step is offering an API (which the project does for 9 languages at the moment). The result can be used in compatible projects like Zipkin and Jaeger.
Jaeger is a client side tracer that collects data. It sends it to a backend component for storage. A frontend component can then display the results. Jaeger runs on bare metal, OpenShift and Kubernetes.
Data storage for Jaeger is typically on Elasticsearch or Cassandra.
Have a look at OpenTracing API contributions GitHub organization.
(Slides)
Apache CloudStack is a scalable,
multi-tenant, open-source, purpose-built, cloud orchestration platform for
delivering turnkey Infrastructure-as-a-Service clouds.
Virtualization is never going to go away.
Features:
Active community: ~200 project committers, last month 25 merged PRs from 16 authors, plenty of mailing list activity. Once a year there is a CloudStack Collaboration Conference.
You can use CloudStack for private cloud, public cloud or a combination of both in a hybrid cloud.
What can CloudStack give you?
CloudStack can be seen as competition to OpenStack. However, CloudStack is vendor neutral. As a result it is a bit of the invisible man in the cloud infrastructure space.
Why CloudStack?
Terraform is a resource orchestration management tool, not a configuration management tool. So you cannot compare Terraform with e.g. Ansible, Puppet or Chef.
It is not a cloud agnostic tool. It does however provide a single configuration language.
About a year ago, the Terraform Module Registry was introduced.
Chef and SaltStack are supported by Terraform; Ansible is not (yet).
Tips:
Use “terraform plan -destroy” to safely review the destructive actions.
Use the “TF_LOG” and “TF_LOG_PATH” environment variables for logging.
Use “terraform state rm type.resource.name” to keep resources but remove them from your state.
(Slides)
A curated list of amazingly awesome open source sysadmin resources.
A curated list of things to read to level up your DevOps skills and knowledge, by Chris Short. (Source: DevOps’ish, issue 043)