In the current digital age, where nearly all aspects of our lives are deeply connected to online activities, the backbone of these operations lies in sophisticated software applications running on highly scalable and optimized production systems. Gone are the days of monolithic applications; today, most applications are deployed on the cloud and involve multiple distributed systems and microservices. While this architectural shift has brought numerous benefits in terms of scalability and agility, it has also introduced new challenges in monitoring and debugging software.
Issues with traditional monitoring systems
In traditional software development, monitoring systems have been crucial in ensuring the health and performance of applications. These systems typically involve logging and metrics, providing developers with insights into the application’s behaviour. Dashboards and alerts are set up to notify teams of any metric failures or issues. While this approach works well for known issues and anticipated problems, it falls short when dealing with unforeseen complexities and edge cases.
Software systems have become increasingly complex, making it difficult for developers to pinpoint the root cause of unexpected problems. The normal debugging workflow involves searching through logs and metrics to identify the issue. However, what if the problem lies in something that has never been tracked as a metric? In such cases, an investigative approach is often taken: logs are added at potential error points, the change is deployed, and the team waits for the issue to recur.
Moreover, experienced engineers tend to resolve issues more often because of their familiarity with the system. When these key individuals leave the team, the rest of the group may struggle to resolve similar problems in the future. In addition, to resolve an issue engineers must jump from one tool to another, trying to correlate observations between them. This process can be cumbersome and time-consuming, especially when trying to understand complex interactions between various components. Furthermore, the data in monitoring-based systems is pre-aggregated, so it lacks the flexibility needed for in-depth exploration. If engineers want to investigate further or ask new questions, they need to mentally carry the context between different tools, which hinders efficient debugging.
Enter observability
This is where an observable software system changes the game. Observability is a sociotechnical approach that emphasizes understanding and explaining the state of a system, no matter how novel or unusual that state may be. It goes beyond monitoring and aims to give engineers a complete picture of what is happening within their applications.
At its core, observability is about how well one can understand or explain any bizarre state the system can get into without the need for shipping custom code to figure it out. Observability starts during the design phase and extends throughout development and deployment, making it an integral part of the software development lifecycle.
The main benefit of observability is that it allows software engineers to debug and analyze the inner workings of an application with great precision. With observability, the focus is to preserve as much context around any given request as possible. This contextual information is crucial as it helps in reconstructing the environment and circumstances that triggered a specific bug or failure mode. Unlike monitoring-based systems that focus on known unknowns, observability tackles unknown unknowns, which are often the most challenging to handle.
Structured events and OpenTelemetry
So how is observability achieved in practice? The fundamental concept is gathering telemetry data as structured events. These events track the complete journey of each request through the various pathways and services in the application. Distributed tracing is pivotal in this process, allowing engineers to follow a single request’s progression across different services. A structured event is essentially a map of key-value pairs. It carries unique identifiers for the request, which are passed along from service to service, together with as much surrounding context as possible, so that the maximum amount of information is preserved.
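To make the idea of a structured event concrete, here is a minimal sketch in plain Python. It is not taken from any particular tracing library: the field names (trace_id, span_id, parent_id), the in-memory EVENTS list, and the two toy services are all assumptions chosen for illustration.

```python
import time
import uuid

EVENTS = []  # hypothetical stand-in for a telemetry backend


def new_id() -> str:
    """Return a short random identifier."""
    return uuid.uuid4().hex[:16]


def emit_event(service, name, trace_id, parent_id=None, **fields):
    """Record one structured event (a flat map) for a unit of work."""
    event = {
        "trace_id": trace_id,    # shared by every event of one request
        "span_id": new_id(),     # unique to this unit of work
        "parent_id": parent_id,  # links the event to its caller
        "service": service,
        "name": name,
        "timestamp": time.time(),
        **fields,                # arbitrary request context
    }
    EVENTS.append(event)
    return event


def checkout_service(trace_id, parent_id, user_id):
    """Downstream service: reuses the caller's trace_id."""
    return emit_event("checkout", "charge_card", trace_id, parent_id,
                      user_id=user_id, amount_cents=1299)


def api_gateway(user_id):
    """Entry point: starts a new trace and passes the identifiers on."""
    trace_id = new_id()
    root = emit_event("gateway", "POST /orders", trace_id, user_id=user_id)
    checkout_service(trace_id, root["span_id"], user_id)
    return trace_id


trace_id = api_gateway(user_id="u-42")
# Every event of the request can be recovered from the shared identifier.
trace = [e for e in EVENTS if e["trace_id"] == trace_id]
```

Because both events share one trace_id and the second points at the first through parent_id, the request's path across services can be reconstructed after the fact.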
Writing boilerplate code to add traces to every request and service is tedious, so projects like OpenTelemetry have emerged to standardize how telemetry is instrumented and collected. OpenTelemetry supports both automatic and custom instrumentation. Automatic instrumentation hooks into supported frameworks and libraries, ensuring that vital context is preserved for each request without manual effort. Custom instrumentation, on the other hand, allows developers to tailor telemetry data collection to specific parts of the codebase, capturing domain-specific events and context. By combining the two, teams can comprehensively capture events, follow request journeys, and gain a deep understanding of application behaviour.
Vendor-based solutions are also available in the market, providing powerful tools for software engineers to slice and dice data efficiently. The focus is to record information with high cardinality and across as many dimensions as possible, so that when an issue arises anyone on the team can explore the system in an open-ended manner. This flexibility is invaluable when dealing with unexpected situations, as it empowers engineers to adapt and respond quickly to challenges.
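The split between the two kinds of instrumentation can be illustrated without the library itself. The sketch below deliberately does not use OpenTelemetry's API. The instrumented decorator, the add_context helper, and the RECORDED list are hypothetical names that only mimic the idea: a generic wrapper captures timing and errors for every handler, while the handler adds its own domain-specific fields.

```python
import contextvars
import functools
import time

# Holds the event for the handler that is currently executing.
_current_event = contextvars.ContextVar("current_event", default=None)
RECORDED = []  # hypothetical stand-in for an exporter


def instrumented(func):
    """'Automatic' part: wrap a handler and record timing and outcome."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        event = {"name": func.__name__, "error": None}
        token = _current_event.set(event)
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        except Exception as exc:
            event["error"] = type(exc).__name__
            raise
        finally:
            event["duration_ms"] = (time.perf_counter() - start) * 1000
            _current_event.reset(token)
            RECORDED.append(event)
    return wrapper


def add_context(**fields):
    """'Custom' part: attach domain-specific fields to the active event."""
    event = _current_event.get()
    if event is not None:
        event.update(fields)


@instrumented
def place_order(cart_size, coupon):
    # Business logic decides which context is worth keeping.
    add_context(cart_size=cart_size, coupon=coupon)
    return "ok"


place_order(cart_size=3, coupon="SPRING")
```

In a real deployment the wrapper's role is played by the instrumentation that a tracing library ships for web frameworks and clients, and the custom fields are set on the active span instead of a plain dictionary.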
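As a rough illustration of why raw, high-cardinality events matter, the following sketch groups the same events along whichever dimension the investigator picks at query time. The event fields, the sample values, and the slowest_by helper are invented for this example and do not represent any vendor's query language.

```python
from collections import defaultdict

# Hypothetical raw events; user_id and build are high-cardinality fields.
events = [
    {"endpoint": "/orders", "user_id": "u-1", "build": "a1f3", "duration_ms": 40},
    {"endpoint": "/orders", "user_id": "u-2", "build": "b7c9", "duration_ms": 900},
    {"endpoint": "/search", "user_id": "u-2", "build": "b7c9", "duration_ms": 35},
    {"endpoint": "/orders", "user_id": "u-3", "build": "b7c9", "duration_ms": 870},
]


def slowest_by(events, dimension, where=None):
    """Group raw events by any field and return the max duration per group."""
    groups = defaultdict(int)
    for e in events:
        if where is None or where(e):
            groups[e[dimension]] = max(groups[e[dimension]], e["duration_ms"])
    return dict(groups)


# Two different questions asked of the same data, with no new code shipped.
by_build = slowest_by(events, "build")
by_user = slowest_by(events, "user_id",
                     where=lambda e: e["endpoint"] == "/orders")
```

A pre-aggregated metric such as "average latency per endpoint" could not answer either question, because the build and user dimensions would already have been discarded.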
Challenges for building observable systems
While observability offers significant benefits, implementing it comes with its own set of challenges. Gathering telemetry data at full resolution can be resource-intensive, especially in large-scale applications. Engineers must balance capturing enough data to gain valuable insights against the cost of storing and processing overwhelming data volumes. Additionally, ensuring data privacy and security is crucial when dealing with telemetry data. Implementing proper access controls and encryption mechanisms is essential to protect sensitive information.
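One common way to manage data volume, not named in the text above and shown here only as an assumption, is sampling: keep a fixed fraction of traces while always retaining the ones that contain errors. The sketch below decides deterministically from the trace identifier so that every service reaches the same verdict for a given request. The function names and the 10% default rate are illustrative.

```python
import hashlib


def keep_trace(trace_id: str, sample_rate: float) -> bool:
    """Deterministically keep roughly `sample_rate` of all trace IDs."""
    digest = hashlib.sha256(trace_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big")  # uniform in [0, 2**64)
    return bucket < int(sample_rate * 2**64)


def should_record(event: dict, sample_rate: float = 0.1) -> bool:
    """Always keep failing requests; sample the rest by trace ID."""
    if event.get("error"):
        return True
    return keep_trace(event["trace_id"], sample_rate)


kept = sum(keep_trace(f"trace-{i}", 0.1) for i in range(10_000))
```

Hashing the identifier, rather than calling a random number generator in each service, avoids traces in which some events were kept and others were dropped.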
Observability represents a paradigm shift in monitoring and debugging software systems. By adopting an observable approach, teams can proactively handle unknown unknowns and effectively resolve complex issues.