What is Observability? And How Can it Help Your Operations?


Traditional monitoring solutions collect data from endpoints and parse it into a series of metrics. The results are then compared against policies and thresholds to determine the health of applications and systems in the environment. A monitored system is flagged as unhealthy when its metrics breach a configured policy, at which point an alert is raised or remediation actions are triggered.
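
As a minimal sketch of that pattern (the metric names and thresholds here are hypothetical, not taken from any particular product), a conventional check amounts to comparing each sample against a static limit and raising an alert on a breach:

# Minimal sketch of threshold-based monitoring; metric names and limits are hypothetical.
THRESHOLDS = {"cpu_percent": 90.0, "memory_percent": 85.0, "disk_used_percent": 95.0}

def evaluate(host, samples):
    """Return an alert for every metric that breaches its configured threshold."""
    alerts = []
    for metric, value in samples.items():
        limit = THRESHOLDS.get(metric)
        if limit is not None and value > limit:
            alerts.append(f"ALERT {host}: {metric}={value} exceeds {limit}")
    return alerts

# Example: one unhealthy sample from a single host.
print(evaluate("web-01", {"cpu_percent": 97.2, "memory_percent": 60.0}))

Note that the alert only knows about the host and the metric that breached; it carries no context about what caused the breach, which is exactly the gap discussed below.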

This approach is fine for monitoring and troubleshooting traditional system architectures and monolithic applications. However, conventional monitoring solutions perform poorly with cloud-native applications and distributed infrastructure solutions.

Conventional monitoring solutions have two critical weaknesses with newer applications and infrastructure. The first is that alerts report that something is wrong with a specific object, which works while the object still exists; but because the alarm is tied to that object, if the object no longer exists, neither does the alarm.

The second weakness is the inability to provide end-to-end visibility into the processes that caused the alarm to trigger. That information still has to be gathered by manually checking logs and attempting to piece together a timeline of events.

Enter observability platforms

Observability platforms aim to provide operators with an end-to-end view of processes that are related to system degradation, establishing a level of cause and effect instead of just effect. Observability platforms work by determining relationships between objects and actions performed by the objects. Additionally, a deeper set of data is gathered to perform analytics.

Monitoring an application with an observability platform can tell you that a specific service is not running as expected because load has increased and the autoscaling function is failing. A traditional monitoring solution could alert you that the autoscaling function is generating errors, but not necessarily that the trigger was an increase in traffic to a specific page on the public-facing website. The observability platform can go further and identify a particular function on that page as the cause.

All the information needed to link a specific function to higher demand on a backend service is already available to you. But making those links and getting to that level of detail manually is time-consuming, and if the alert is picked up by an infrastructure team instead of the application owners, the connection may be missed entirely.

Observability platforms analyse all the available data to produce a clear, concise series of related events, showing both cause and effect and reducing time to resolution.
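
To make the idea concrete, here is a rough Python sketch (not any vendor's implementation) of the kind of correlation an observability platform performs: events from different sources are joined on a shared trace identifier so they can be read as an ordered chain of cause and effect. The event shapes and field names are invented for illustration.

from collections import defaultdict

# Events from three hypothetical sources, all carrying a shared trace_id.
web_requests = [
    {"trace_id": "t-42", "page": "/checkout", "ts": "10:00:01"},
    {"trace_id": "t-43", "page": "/checkout", "ts": "10:00:02"},
]
autoscaler_events = [
    {"trace_id": "t-43", "error": "scale-out failed: quota exceeded", "ts": "10:00:05"},
]
service_alerts = [
    {"trace_id": "t-43", "service": "payments", "status": "degraded", "ts": "10:00:09"},
]

# Group every event by trace_id, then order each group by timestamp.
timeline = defaultdict(list)
for source, events in (("web", web_requests),
                       ("autoscaler", autoscaler_events),
                       ("service", service_alerts)):
    for event in events:
        timeline[event["trace_id"]].append((event["ts"], source, event))

for trace_id, entries in timeline.items():
    for ts, source, event in sorted(entries, key=lambda e: e[0]):
        print(trace_id, ts, source, event)

Trace t-43 reads as a cause-and-effect chain: a checkout request, a failed scale-out, then the degraded service, which is the story an operator would otherwise have to assemble by hand from separate logs.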

Similar scenarios apply to infrastructure and infrastructure services. Reconfiguring stateful firewall rules can trigger session states to be re-evaluated, which increases the firewall's resource demand. Depending on the firewall, there can be a significant difference in impact between a single substantial rule reconfiguration and many small rule updates.

Consider a workflow that updates firewall rules and has been changed to take in an array of rules and apply each change sequentially instead of submitting a single large request. The development and test firewalls carry only minimal traffic and show no negative impact, but the production firewall suffers a significant performance hit when the updated workflow executes.
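
The difference between the two update patterns is easier to see in code. The sketch below is purely illustrative: FakeFirewall, apply_rule, and apply_rules are made-up stand-ins for whatever management API the real workflow calls, used here only to count round trips.

class FakeFirewall:
    """Stand-in for a firewall management API; it only counts requests."""
    def __init__(self):
        self.requests = 0

    def apply_rule(self, rule):
        self.requests += 1      # each call can force a session-state re-evaluation

    def apply_rules(self, rules):
        self.requests += 1      # one bulk request, one re-evaluation pass

def update_sequentially(firewall, rules):
    # The changed workflow: one request per rule.
    for rule in rules:
        firewall.apply_rule(rule)

def update_in_bulk(firewall, rules):
    # The original pattern: a single large request.
    firewall.apply_rules(rules)

rules = [{"id": i, "action": "allow"} for i in range(50)]
fw_a, fw_b = FakeFirewall(), FakeFirewall()
update_sequentially(fw_a, rules)
update_in_bulk(fw_b, rules)
print(fw_a.requests, fw_b.requests)   # 50 vs. 1

On a quiet test firewall the 50 extra round trips are invisible; on a loaded production firewall each one can trigger the costly state re-evaluation described above.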

In many environments the alert generated by this scenario would go to the networking team, who can see that the rules were updated but may have no visibility into the workflow execution. An observability platform can link the workflow to the resulting impact, connecting cause and effect with the appropriate response, and notify both the automation and network teams.

Moving forward

By now I have painted a cheerful picture of what observability platforms can offer. However, this is the real world, and nothing is as simple as it seems.

Deep analytics depends on the quality of the data received and on every system and service in scope being configured to send logging data to the observability platform. Inconsistencies in log data between systems can prevent automatic relationship detection, in which case relationships may need to be configured manually.

Some observability platforms allow application developers to embed integrations directly into code, providing low-level data. Ensuring a consistent log structure throughout the project enhances the platform's ability to determine relationships automatically.
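
As a small illustration (the services and field names are invented), a consistent structured log line that always carries the same keys, including a shared trace or correlation ID, gives the platform something it can join on automatically:

import json
import logging
import time

logging.basicConfig(level=logging.INFO, format="%(message)s")

def log_event(service, event, trace_id, **fields):
    """Emit one JSON log line with a fixed set of common fields."""
    record = {"ts": time.time(), "service": service, "event": event,
              "trace_id": trace_id, **fields}
    logging.info(json.dumps(record))

# Two different services logging the same trace_id with the same structure.
log_event("web-frontend", "request", "t-42", path="/checkout", status=200)
log_event("payments", "db_query", "t-42", duration_ms=180)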

Observability platforms improve troubleshooting by providing the end-to-end visibility that modern distributed systems require. Correlating the cause and effect of events also enhances a team's ability to determine why a fault occurred.




Why Settle for Just "OK" Network Operations?


More often than not, just “OK” is not an option. After all, OK expectations can only lead to OK outcomes. This is showcased in a recent popular advertising campaign from AT&T, which depicts scenarios where just OK is not acceptable, portraying an “OK surgeon,” “OK babysitter,” and “OK tattoo artist.” While the commercials are comical, they bring to light some of the very real and not so funny problems that many businesses and specifically IT teams are dealing with.

With the explosion of the internet of things (IoT) and advancements in automation, artificial intelligence (AI), software-defined networking (SDN), and DevOps, many IT professionals are realizing that the processes they once relied on to manage critical areas of the network have become just OK. And when it comes to network operations, just OK is not OK. Networks today are mission critical, often relied upon to keep the entire business up and running. In fact, according to Gartner, the average cost of network downtime is around $5,600 per minute (more than $300,000 per hour), a massive expense for any organization, especially when you factor in the amount of time it typically takes a network team to troubleshoot an issue using OK, aka manual, methods.

As our IT environments continue to transform, our processes must as well. The role of the network engineer has already evolved to include more responsibility than ever before, and many are struggling to juggle everything on their plates. As a result, there are several areas where IT teams have accepted an OK standard, but it's not too late to transform just OK into genuinely effective and efficient.

An OK approach for complex dynamic networks

SDN is beginning to show real benefits to organizations that are implementing the technology to create efficient, centralized network management, roll out new applications and services with greater agility, enhance security, and reduce operational costs. On the flip side, however, SDN also brings new operational challenges, creating hybrid environments where SDN architecture is merged with traditional data center and MPLS networks. These hybrid environments are incredibly complex, consisting of hundreds or even thousands of components and undergoing constant change. As networks become more complex and dynamic, they create significant visibility problems for network teams.

Ideally, network engineers are able to see both SDN and non-SDN networks side-by-side so they can visualize the physical and logical interconnections and correlate the layers of abstraction at any moment. This visibility becomes critical during troubleshooting, when speed is of the essence. Remember, downtime can cost an organization $5,600 per minute, a cost that directly impacts the bottom line. Unfortunately, existing troubleshooting and mapping strategies like CLI and network diagramming are less effective in complex hybrid networks, forcing IT teams to race against the clock to identify an issue and increasing MTTR (mean time to repair). End-to-end visibility across hybrid networks is essential for identifying and mitigating potential issues quickly. Without it, existing processes are just OK.

Automation takes things up a notch, far beyond just OK, allowing teams to view both traditional and application-centric infrastructure as well as data integration with the SDN console in a single view. This enables enterprises to acclimate to an application-centric infrastructure and understand how application dependencies map to the underlying fabric. In hybrid environments, where abstraction can lead to a cloudy view of the network, automated processes and the right data integration can give engineers the dynamic visibility they need.
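
As a hypothetical sketch of that kind of data integration (the controller URL, endpoint path, and field names below are placeholders, not a specific vendor's API), the "single view" is essentially a merge of fabric data pulled from the SDN controller with the traditional device inventory:

import requests

def sdn_fabric_nodes(controller_url, token):
    # Pull nodes from a (hypothetical) SDN controller REST endpoint.
    resp = requests.get(f"{controller_url}/api/v1/nodes",
                        headers={"Authorization": f"Bearer {token}"},
                        timeout=10)
    resp.raise_for_status()
    return [{"name": n["name"], "role": n["role"], "source": "sdn"}
            for n in resp.json()]

def legacy_devices(inventory):
    # Normalize a traditional device inventory into the same shape.
    return [{"name": d["hostname"], "role": d["role"], "source": "legacy"}
            for d in inventory]

def combined_view(controller_url, token, inventory):
    # One searchable list covering both the fabric and the traditional network.
    return sdn_fabric_nodes(controller_url, token) + legacy_devices(inventory)

In practice this is the sort of merge an automation platform performs continuously, so engineers are not stitching the two views together by hand in the middle of an outage.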

OK collaboration between network and application teams

As networks become more software-defined and application-centric, the line between the application and network team starts to get blurry. The two often spend time blaming the other department for an issue and rarely take a collaborative approach to troubleshooting. As long as applications depend on the network to function and companies depend on applications to conduct business, the blame game between the two for slow performance, downtime, or otherwise will continue – that is if just OK network processes are in place.

Not only is there tension between application and network teams, but there's also a big knowledge and skills gap between the two, which brings new challenges as network projects start crossing over into application territory and vice versa. This is where automation and visibility come into play. Automation can help network engineers apply existing knowledge to these new environments and allows IT teams to share their critical knowledge effectively, whether that be design information, troubleshooting steps, or network change history. By providing a common visibility framework for troubleshooting and security work, and by enabling teams to codify and share best practices, automation transforms siloed, just-OK IT communication into effective collaboration for better results.

As organizations continue to invest in the latest technology and as a result, networks continue to grow in size and complexity, it’s become clear that automation is no longer a luxury, it’s a necessity. Traditional methods of network management simply don’t cut it with the hybrid environments of today. Stop settling for OK outcomes from your IT operations when automation can ensure the network is performing at its best.




Adapting IT Operations to Emerging Trends: 3 Tips


For infrastructure management professionals, keeping up with new trends is a constant challenge. IT must constantly weigh the potential benefits and risks of adopting new technologies, as well as the pros and cons of continuing to maintain their legacy hardware and applications.

Some experts say that right now is a particularly difficult time for enterprise IT given the massive changes that are occurring. When asked about the trends affecting enterprise IT operations today, Keith Townsend, principal at The CTO Advisor, told me, “Obviously the biggest one is the cloud and the need to integrate cloud.”

In its latest market research, IDC predicts that public cloud services and infrastructure spending will grow 24.4% this year, and Gartner forecasts that the public cloud services market will grow 18% in 2017. By either measure, enterprises are going to be running a lot more of their workloads in the cloud, which means IT operations will need to adapt to deal with this new situation.

Townsend, who also is SAP infrastructure architect at AbbVie, said that the growth in hybrid cloud computing and new advancements like serverless computing and containers pose challenges for IT operations, given “the resulting need for automation and orchestration throughout the enterprise IT infrastructure.” He added, “Ultimately, they need to transform their organizations from a people, process and technology perspective.”

For organizations seeking to accomplish that transformation, Townsend offered three key pieces of advice.

Put the strategy first

Townsend said the biggest mistake he sees enterprises making “is investing in tools before they really understand their strategy.” Organizations know that their approach to IT needs to change, but they don’t always clearly define their goals and objectives.

Instead, Townsend said, they often start by “going out to vendors and asking vendors to solve this problem for them in the form of some tool or dashboard or some framework without understanding what the drivers are internally.”

IT operations groups can save themselves a great deal of time, money and aggravation by focusing on their strategy first before they invest in new tools.

Self-fund your transformation

Attaining the level of agility and flexibility that allows organizations to take advantage of the latest advances in cloud computing isn’t easy or cheap. “That requires some investment, but it’s tough to get that investment,” Townsend acknowledged.

Instead of asking for budget increases, he believes the best way to do that investment is through self-funding.

Most IT teams spend about 80% of their budgets on maintaining existing systems, activities that are colloquially called “keeping the lights on.” That leaves only 20% of the budget for new projects and transformation. “That mix needs to be changed,” said Townsend.

He recommends that organizations look for ways to become more efficient. By carefully deploying automation and adopting new processes, teams can accomplish a “series of mini-transformations” that gradually decreases the amount of money that must be spent on maintenance and frees up more funds and staff resources for new projects.

Focus on agility, not services

In his work, Townsend has seen many IT teams make a common mistake when it comes to dealing with the business side of the organization: not paying enough attention to what is happening in the business and what the business really wants.

When the business comes to IT with a request, IT typically responds with a list of limited options. Townsend said that these limited options are the equivalent of telling the business no. “What they are asking for is agility,” he said.

He told a story about a recent six-month infrastructure project where the business objectives completely changed between the beginning of the project and the end. An IT organization can only adapt to that sort of constant change by adopting a DevOps approach, he said. If IT wants to remain relevant and help organizations capitalize on the new opportunities that the cloud offers, it has to become much more agile and flexible.

You can see Keith Townsend live and in person at Interop ITX, where he will offer more insight about how enterprise IT needs to transform itself in his session, “Holistic IT Operations in the Application Age.” Register now for Interop ITX, May 15-19, in Las Vegas.


