Seamless Datadog Agent Upgrades

Nicolas Narbais

16 Apr 2024 — 6 min read

Ensuring your software remains up-to-date is essential to avoid creating legacy systems and to leverage the most recent features and security enhancements. When it comes to Datadog, this means updating not only the Agent but also the libraries for APM and RUM utilized within both backend and frontend applications. In this article, we will delve into each component individually and outline the necessary steps to ensure a successful update process.

While our focus will primarily be on Kubernetes environments, the principles discussed here are applicable across various other environments as well.

Why Upgrade the Datadog Agent and Libraries?

Before going any further, let’s have a quick look at why upgrading is important.

Regularly upgrading software is essential for maintaining the security, performance, and compatibility of your systems. Security is a primary concern, as updates often include patches for known vulnerabilities, safeguarding your data and systems from potential breaches. Furthermore, performance improvements and bug fixes included in updates can enhance the overall stability and speed of your software, leading to a smoother user experience.

Moreover, staying current with software updates ensures compatibility with the latest technologies and platforms, preventing issues that may arise from using outdated software. Additionally, updates often introduce new features and functionalities, providing opportunities to increase productivity and streamline workflows. By prioritizing software upgrades, organizations can mitigate risks, optimize performance, and stay competitive in today’s fast-paced digital landscape.

Upgrading the Datadog Agent

The Datadog Agent serves as a foundational component tasked with centralizing, enriching, sanitizing, aggregating, and performing various other operations on cluster data before transmitting it to the Datadog backend. Given its pivotal role in monitoring, the Agent stands as a cornerstone of any upgrade strategy, necessitating careful consideration and attention during the upgrade process.

Fortunately for us, the upgrade of the Datadog Agent has been made easy.

How to

Linux

For most Linux distributions, rerunning the command used to install the Agent also upgrades it.

Helm

The parameters `agents.image.tag` and `clusterAgent.image.tag` identify the Agent version and the Cluster Agent version. To upgrade, update both values in `values.yaml` and apply the chart again.
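As a rough illustration, the same two parameters can also be passed as `--set` overrides on top of `values.yaml`. The sketch below only builds the command line; the release name `datadog`, the chart reference `datadog/datadog`, and the version tags are assumptions to replace with your own.

```python
import shlex


def helm_upgrade_command(release, agent_tag, cluster_agent_tag,
                         chart="datadog/datadog", values_file="values.yaml"):
    """Build the argv for a Helm upgrade that pins both image tags."""
    return [
        "helm", "upgrade", release, chart,
        "-f", values_file,
        # The two parameters named in the article.
        "--set", f"agents.image.tag={agent_tag}",
        "--set", f"clusterAgent.image.tag={cluster_agent_tag}",
    ]


if __name__ == "__main__":
    # Placeholder tags: substitute the versions you intend to roll out.
    argv = helm_upgrade_command("datadog", "7.52.0", "7.52.0")
    print(shlex.join(argv))
```

Keeping the tags in `values.yaml` itself remains the simpler option when the file is version controlled; the overrides are mostly handy for a quick test in a staging cluster.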

Operator

For now, selecting a specific image requires an override through the parameters `spec.override.nodeAgent.image.name` and `spec.override.clusterAgent.image.name`. Once the new configuration file is applied, the Operator starts a rolling update that brings the new configuration to all Datadog Agents.
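A minimal sketch of such an override, written as a merge patch for the `DatadogAgent` resource. The image references below are placeholders, and the sketch assumes that `image.name` accepts a full image reference including the tag; check this against the Operator version you run.

```python
import json


def image_override_patch(node_agent_image, cluster_agent_image):
    """Return a merge patch setting the two override parameters."""
    return {
        "spec": {
            "override": {
                "nodeAgent": {"image": {"name": node_agent_image}},
                "clusterAgent": {"image": {"name": cluster_agent_image}},
            }
        }
    }


if __name__ == "__main__":
    # Placeholder images and tags, for illustration only.
    patch = image_override_patch(
        "gcr.io/datadoghq/agent:7.52.0",
        "gcr.io/datadoghq/cluster-agent:7.52.0",
    )
    print(json.dumps(patch, indent=2))
```

The printed JSON can be merged into the existing manifest by hand, or applied with a merge patch through `kubectl patch` if that fits your workflow.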

Validate a successful deployment

Ensuring the success of a deployment is crucial, but it’s equally important to verify that everything continues to function smoothly post-upgrade. Let’s explore the key checks to perform to validate a successful upgrade and maintain seamless operations.

Running Datadog Agents

The first place to check whether the Datadog Agents are reporting is Fleet Automation, the part of the platform used to centrally govern and remotely manage Datadog Agents at scale. In addition, metrics can be used to follow the trend and spot any drop in the number of reporting Agents: `sum:datadog.agent.running{*}`.

If there is an issue with a deployment on Kubernetes, the orchestrator view provides additional information on the state of the pods, including the Kubernetes events coming from the Agent. This data depends on what the Cluster Agent collects, so the view may be limited if the Cluster Agent is not responding.
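The trend check on `sum:datadog.agent.running{*}` can be sketched as a comparison of the average Agent count before and after the upgrade. The function below is an illustration, not a Datadog API: it assumes the points come as `[timestamp_ms, value]` pairs (the shape of a `pointlist` in a metrics query response), and the 5% tolerance is an arbitrary choice.

```python
def reporting_drop(pointlist, upgrade_ts_ms, tolerance=0.05):
    """Compare the mean Agent count before and after an upgrade.

    Returns (mean_before, mean_after, dropped), where dropped is True
    when the post-upgrade mean is more than `tolerance` below the
    pre-upgrade mean. Points with a missing value are ignored.
    """
    before = [v for ts, v in pointlist if v is not None and ts < upgrade_ts_ms]
    after = [v for ts, v in pointlist if v is not None and ts >= upgrade_ts_ms]
    if not before or not after:
        raise ValueError("need points on both sides of the upgrade time")
    mean_before = sum(before) / len(before)
    mean_after = sum(after) / len(after)
    dropped = mean_after < mean_before * (1 - tolerance)
    return mean_before, mean_after, dropped


if __name__ == "__main__":
    # Synthetic sample data, not real measurements.
    points = [[1000, 100.0], [2000, 100.0], [3000, 90.0], [4000, None], [5000, 90.0]]
    print(reporting_drop(points, upgrade_ts_ms=3000))
```

A rolling update can cause a brief dip while pods restart, so it is usually better to start the "after" window once the rollout has completed.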

Running Cluster Agent

The Cluster Agent is also an important part of monitoring a Kubernetes cluster. The first step is therefore to check that it is running: `sum:datadog.cluster_agent.running{*}`.

To make sure things are running as expected, the orchestrator view is also a great place to check.

Running Trace, Process, System Probe

The Datadog Agent is composed of multiple subcomponents, depending on which integrations are turned on, and it is important that those work as expected too. To simplify:

The trace agent is responsible for a large part of the APM work at the Agent level. Its health can be validated with `sum:datadog.trace_agent.heartbeat{*}`.
The process agent is responsible for process collection, including container-level data. Its health can be validated with `sum:datadog.process.agent{*}`.
The system probe is the piece of the Agent responsible for network-level data collection. Its health can be validated with `sum:datadog.npm.host_instance{*}`.
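The three checks above can be kept together in a small table and evaluated in one pass. This is only a sketch: the component labels are hypothetical names chosen for the example, and `latest_values` stands for whatever your own tooling returns as the most recent value of each query.

```python
# Health queries taken from the article, keyed by an illustrative label.
HEALTH_QUERIES = {
    "trace-agent": "sum:datadog.trace_agent.heartbeat{*}",
    "process-agent": "sum:datadog.process.agent{*}",
    "system-probe": "sum:datadog.npm.host_instance{*}",
}


def silent_components(latest_values):
    """Return the components whose latest value is missing or zero."""
    return sorted(
        name for name in HEALTH_QUERIES
        if not latest_values.get(name)
    )


if __name__ == "__main__":
    # Synthetic values: the process agent reports 0, the system probe nothing.
    print(silent_components({"trace-agent": 12.0, "process-agent": 0}))
```

A component that is intentionally disabled will also show up as silent, so the table should only list the subcomponents you actually run.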

Upgrading the tracing and RUM libraries

Barring major version upgrades, the internal APIs of the SDK should remain consistent, so the procedure is more straightforward: the primary requirement is to verify that RUM and trace data keep arriving. The library version is pinned through your package manager, so resolving an issue typically means changing the version manually there.

For APM, deployments that use Single Step Instrumentation can control the library version centrally.

Tracking the various versions

Ensuring that the Agents are actually running is critical for a healthy cluster, but this article focuses on keeping the various components up to date. It is therefore important to be able to visualize the version of each component in Datadog.

Agent and Cluster Agent

First and foremost, Fleet Automation is ideal for this use case. More details on Fleet Automation can be found in the Datadog documentation. The same breakdown is available from metrics:

`sum:datadog.agent.running{*} by {version}`
`sum:datadog.cluster_agent.running{*} by {version}`
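To turn the `by {version}` breakdown into a list of Agents that are behind, one option is to post-process the query result. The sketch below assumes each series carries a `tag_set` such as `["version:7.52.0"]` and a `pointlist` of `[timestamp, value]` pairs; the function names and the target version are illustrative.

```python
import re


def parse_version(text):
    """Turn '7.52.1' into (7, 52, 1); any '-suffix' or '+build' is dropped."""
    core = text.split("-")[0].split("+")[0]
    return tuple(int(n) for n in re.findall(r"\d+", core))


def agents_by_version(series):
    """Map each version tag to the latest non-null value of its series."""
    counts = {}
    for s in series:
        version = next(
            (t.split(":", 1)[1] for t in s.get("tag_set", [])
             if t.startswith("version:")),
            "unknown",
        )
        values = [v for _, v in s.get("pointlist", []) if v is not None]
        if values:
            counts[version] = counts.get(version, 0) + values[-1]
    return counts


def outdated(counts, target):
    """Keep only the versions strictly older than `target`.

    Series without a version tag ('unknown') are left out on purpose.
    """
    target_v = parse_version(target)
    return {
        v: n for v, n in counts.items()
        if v != "unknown" and parse_version(v) < target_v
    }


if __name__ == "__main__":
    # Synthetic sample data, not real measurements.
    sample = [
        {"tag_set": ["version:7.50.3"], "pointlist": [[1, 4.0], [2, 3.0]]},
        {"tag_set": ["version:7.52.0"], "pointlist": [[1, 10.0], [2, 11.0]]},
    ]
    print(outdated(agents_by_version(sample), "7.52.0"))
```

Comparing versions as integer tuples avoids the string-ordering trap where `7.9.0` would sort after `7.52.0`.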

Tracing libraries

`sum:datadog.trace_agent.receiver.trace{*} by {tracer_version,lang}.as_count()`

RUM

RUM query to list the SDK versions

Staying up to date

Of course, you may want to be notified when a new version of the Agent or of the APM and RUM libraries is released. For that, you can subscribe to the release RSS feeds on GitHub, and even use the Datadog platform itself to raise an alert when a new version appears. Check our article.
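As a rough sketch of consuming such a feed, the snippet below extracts release titles from an Atom document using only the standard library. The sample feed is made up for the example; in practice the XML would come from a repository's releases feed, for instance `https://github.com/DataDog/datadog-agent/releases.atom` (an assumption to verify for each repository you follow).

```python
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"

# Made-up sample feed; titles and dates are placeholders.
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <entry><title>7.52.1</title><updated>2024-01-02T00:00:00Z</updated></entry>
  <entry><title>7.52.0</title><updated>2024-01-01T00:00:00Z</updated></entry>
</feed>"""


def release_titles(atom_xml):
    """Return (title, updated) pairs for every entry, in feed order."""
    root = ET.fromstring(atom_xml)
    return [
        (
            entry.findtext(f"{ATOM}title", default="").strip(),
            entry.findtext(f"{ATOM}updated", default="").strip(),
        )
        for entry in root.iter(f"{ATOM}entry")
    ]


if __name__ == "__main__":
    for title, updated in release_titles(SAMPLE_FEED):
        print(title, updated)
```

Comparing the first title with the version currently deployed is enough to decide whether a notification is worth sending.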

Tracking upgrades and troubleshooting

While the preceding content offers an array of resources for initiating upgrades and fostering confidence in the process, we’ve taken a step further by consolidating this information into a user-friendly dashboard. This dashboard, available for import into your organization, can be accessed at this repo.

Best practices at scale

Every organization operates uniquely, from their processes and policies to their implementation strategies. However, through my experience, I have identified a highly effective approach.

On a quarterly basis, typically during the final month, the central SRE best practices team within the organization initiates the process of upgrading the agent and its associated libraries. This phase focuses on ensuring smooth operations within smaller environments such as development or staging, addressing any upgrade-related issues promptly. These issues might originate from Datadog itself, prompting the team to raise tickets on help.datadoghq.com, or may stem from adjustments required within the pipeline and applications.

Insights gained from these initial upgrades are meticulously documented in a shared repository for future reference.

Following the initial testing phase and usually in the last two weeks of the quarter, the central team oversees the upgrade process for all clusters and applications. Implementing such a structured timeline serves multiple purposes: it ensures that each team allocates dedicated time for the upgrade, thereby providing the central team with the necessary resources to support and manage the process effectively.

Furthermore, as teams anticipate communications regarding the upgrade, it presents an opportune moment to share project updates, such as expansions, proofs of value, best practices, and forthcoming enhancements.

In conclusion, maintaining an up-to-date Datadog environment is essential for maximizing its value and ensuring the security and performance of your systems. Throughout this guide, we’ve explored the importance of regular software upgrades, particularly focusing on the Datadog Agent and associated libraries. We’ve outlined practical steps for upgrading these components in various environments, including Linux, Helm, and Operator setups.

Additionally, we’ve discussed the criticality of validating a successful deployment post-upgrade, emphasizing the importance of monitoring Datadog Agents, Cluster Agents, and other essential components to ensure smooth operations.

By following best practices and implementing a structured upgrade strategy, organizations can minimize risks, optimize performance, and stay ahead in today’s rapidly evolving technological landscape. Remember, staying proactive with upgrades not only enhances security and performance but also unlocks new features and functionalities that drive business success. With the insights shared in this guide, you’re well-equipped to navigate Datadog upgrades confidently and effectively.