IT application performance monitoring tools can visualize computer metrics and produce alerts for user-defined criteria. Knowing the health of applications, however, is another matter. Even with aggressive regression testing, pushing a software release leaves the IT department breathless, waiting for evidence (good or bad) of application health as the release is pushed from server to cluster to data center. How can IT monitor in real time how an application is performing “in the wild” instead of, well, just waiting for bad news?
Customers’ IT systems data are shipped using stats and collected in a Hadoop data lake. From there, customers use off-the-shelf integrated visualization tools and streaming analytics to fire off alerts.
The currently monitored metrics can be piped directly to Falkonry through a DataDog Agent, and Falkonry examines patterns in multiple metrics simultaneously. Falkonry learns the patterns of behavior for a given application deployed in a cluster at different points in time and in different geographies. When any piece of the software stack used in that application changes, Falkonry recognizes any new pattern of performance in one or more metrics. Because it observes all machines, Falkonry can notice a new pattern of behavior on a subset of them and report it as a shift in the behavior of the cluster. Where this behavior is known to coincide with a new piece of software in the stack, an operator can intervene to prevent the propagation of that change across the rest of the cluster.
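As a rough illustration of spotting a behavior shift on a subset of machines, the Python sketch below compares each host's recent readings against that host's own baseline using a simple z-score. This is not Falkonry's method, which is not described here; it is a deliberately simplified stand-in. The function name `shifted_hosts`, the host names, the metric names, the sample values, and the threshold are all hypothetical.

```python
from statistics import mean, stdev


def shifted_hosts(baseline, current, z_threshold=3.0):
    """Return {host: [metric, ...]} for hosts whose recent readings deviate
    from their own baseline in one or more metrics.

    Both arguments map host -> metric name -> list of numeric readings.
    A plain z-score is an illustrative assumption, not Falkonry's algorithm.
    """
    flagged = {}
    for host, metrics in current.items():
        drifted = []
        for name, values in metrics.items():
            history = baseline.get(host, {}).get(name, [])
            if len(history) < 2 or not values:
                continue  # not enough data to judge this metric
            mu, sigma = mean(history), stdev(history)
            recent = mean(values)
            if sigma == 0:
                # Perfectly flat baseline: treat any change as a shift.
                if recent != mu:
                    drifted.append(name)
                continue
            if abs(recent - mu) / sigma > z_threshold:
                drifted.append(name)
        if drifted:
            flagged[host] = sorted(drifted)
    return flagged


# Hypothetical readings: web-2 is the only host running the new release.
baseline = {
    "web-1": {"cpu": [40, 42, 41, 39, 40], "latency_ms": [120, 118, 122, 121, 119]},
    "web-2": {"cpu": [38, 41, 40, 39, 42], "latency_ms": [115, 117, 116, 118, 114]},
}
current = {
    "web-1": {"cpu": [41, 40], "latency_ms": [119, 121]},
    "web-2": {"cpu": [71, 69], "latency_ms": [240, 236]},
}

result = shifted_hosts(baseline, current)
print(result)
```

With these sample values only `web-2` is flagged, on both metrics. An operator who knows that `web-2` is the host that received the new release could use that signal to pause the rollout before it reaches the rest of the cluster.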
Falkonry Service points out a deterioration of application cluster behavior within minutes, reducing the number of customer satisfaction and revenue issues. Moreover, early recognition reduces the productivity lost while performing upgrades. Because Falkonry results are conveyed to the same data lake, existing tools can be used to create dashboards and notifications for site reliability engineers.