AWS prioritizes the automation of cloud operations for specific reasons. These reasons can be summarized into three key points.
AWS Cloud Operations are structured into three distinct areas:
Setup: The first area of the cloud operations model simplifies setting up a foundation for operations, so you can scale rapidly under enterprise governance. With security and multi-account support built in, you can easily set up, deploy, and enable governance.
Build and Migrate: The cloud operations model helps you determine which parts of your existing infrastructure and applications to keep, which to migrate to the cloud, and how to develop new ones.
Operate: This area focuses on the operational phase. It includes monitoring your application's performance and detecting and remediating compliance and operational risks. Two categories within the Operate area are critical. Observability uses data-driven insights to improve visibility into the status of your resources and applications. It involves data collection, performance monitoring, and pattern analysis, among other aspects. Compliance is a prominent concern for organizations, covering a wide range of internal, industry-specific, local, and global regulations, which are complex and dynamic. AWS Cloud Operations helps streamline compliance and security requirements by automating many manual compliance tasks.
Enterprise customers want to move to the cloud swiftly while staying aligned with their established ITIL and legacy processes. These processes include incident management, change management, problem management, and others, and they reflect experience acquired over decades in different domains. These customers have also developed their own operational models, solutions, and tools. Many operations management tools, however, take a monolithic approach: they operate in isolated silos and lack cloud-native capabilities, which makes automation difficult. In contrast, startups want to integrate operations into their development processes, with a high degree of automation. This approach builds in operational best practices and supports a rapid go-to-market strategy. With this feedback in mind, the cloud operations model steps in with automation.
AWS allows you to automate operational processes across environments using an agent-based management solution called Systems Manager. This solution works on AWS, on-premises, Edge, IoT, and other cloud providers. This is achieved through the Systems Manager agent (SSM Agent), an open-source component that can be installed on your cloud, on-premises, or Edge resources. As long as you can install this agent and establish an internet connection for the respective device, Systems Manager can manage it. This approach allows you to start automating as you migrate. For instance, you can manage on-premises servers alongside EC2 instances through Systems Manager. This means you can use automation from the early stages of your migration without waiting until everything has moved to the cloud. Moreover, in a complex hybrid or multi-cloud environment, you can still rely on a single set of services to automate your operations.
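As a rough illustration of managing on-premises servers alongside EC2 instances, the sketch below builds the request for a single Run Command invocation that targets both kinds of node. It only constructs the request dictionary and does not call AWS. The node IDs and the `uptime` command are made-up placeholders, and the helper names `classify_node` and `build_send_command_request` are hypothetical. The sketch relies on the convention that hybrid-activated machines receive IDs starting with `mi-`, while EC2 instances use `i-`.

```python
import json


def classify_node(node_id: str) -> str:
    """Return where a managed node lives, based on its ID prefix."""
    if node_id.startswith("mi-"):
        return "hybrid"  # on-premises, edge, or other-cloud machine
    if node_id.startswith("i-"):
        return "ec2"
    raise ValueError(f"unrecognized managed node ID: {node_id}")


def build_send_command_request(node_ids, commands):
    """Build keyword arguments for the Systems Manager SendCommand API."""
    for node_id in node_ids:
        classify_node(node_id)  # raises on IDs that are not managed nodes
    return {
        "DocumentName": "AWS-RunShellScript",
        "InstanceIds": list(node_ids),
        "Parameters": {"commands": list(commands)},
        "Comment": "Same command for EC2 and hybrid nodes",
    }


# Placeholder IDs: one EC2 instance and one hybrid-activated server.
fleet = ["i-0123456789abcdef0", "mi-0123456789abcdef0"]
request = build_send_command_request(fleet, ["uptime"])
print(json.dumps(request, indent=2))
```

With credentials configured, a dictionary like this could be passed to `boto3.client("ssm").send_command(**request)`; that call is left out here so the sketch runs without an AWS account.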
Furthermore, Systems Manager serves as an operations hub for users. It brings together data on resource configuration, observability, and compliance, along with AI-driven DevOps insights. It also collects events from sources such as AWS Config, Amazon CloudWatch, AWS CloudTrail, Amazon DevOps Guru, Amazon EventBridge, and AWS Security Hub.
AWS helps automate repetitive, high-risk, or compliance-related tasks. This automation is key to scaling while maintaining security, enhancing resiliency, mitigating risks, and preserving agility. AWS provides over 400 predefined automation documents, or runbooks, for your environment. Alternatively, you can build custom automation using Python or PowerShell. Beyond faster issue resolution and reduced manual workload, this approach improves security by reducing individual access to the underlying infrastructure. This is achieved through delegated admin access for pre-approved runbooks. In essence, the runbook, rather than individual operators, gains direct access to the resources, which limits the number of people with direct access to specific servers.
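To make the idea of a custom Python automation more concrete, here is a minimal sketch of an Automation runbook with a single `aws:executeScript` step, assembled as a Python dictionary. The step name, the `DiskUsagePercent` parameter, the 80 percent threshold, and the script logic are all hypothetical, and the `Runtime` value is an assumption that should be checked against the runtimes Systems Manager currently supports. The last lines run the handler locally, only to show what the script would return.

```python
import json

# Script body that the runbook step would execute. The handler name and the
# threshold logic are illustrative assumptions.
SCRIPT = '''
def script_handler(events, context):
    usage = int(events["DiskUsagePercent"])
    return {"needs_cleanup": usage > 80}
'''


def build_runbook():
    """Return an Automation runbook (schema version 0.3) as a dictionary."""
    return {
        "schemaVersion": "0.3",
        "description": "Hypothetical runbook: flag nodes with high disk usage",
        "assumeRole": "{{ AutomationAssumeRole }}",
        "parameters": {
            "AutomationAssumeRole": {"type": "String"},
            "DiskUsagePercent": {"type": "String"},
        },
        "mainSteps": [
            {
                "name": "CheckDiskUsage",
                "action": "aws:executeScript",
                "inputs": {
                    "Runtime": "python3.11",  # assumption: verify supported runtimes
                    "Handler": "script_handler",
                    "Script": SCRIPT,
                    "InputPayload": {
                        "DiskUsagePercent": "{{ DiskUsagePercent }}"
                    },
                },
            }
        ],
    }


runbook = build_runbook()
print(json.dumps(runbook, indent=2))

# Run the handler locally to preview its result for a sample input.
namespace = {}
exec(SCRIPT, namespace)
result = namespace["script_handler"]({"DiskUsagePercent": "91"}, None)
print(result)
```

Because the runbook assumes the role passed in `AutomationAssumeRole`, operators who start it need permission to run the runbook but not direct access to the servers it touches, which is the access-reduction benefit described above.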
These automations can be initiated by any of your operational processes. AWS Systems Manager Change Manager allows you to define change templates and automatically or manually approve them, depending on your organization's specific processes. Once approved, these changes can be automatically executed at predefined times. For event management, Systems Manager OpsCenter is instrumental in identifying issues from various sources. It offers the capability to launch automation for issue resolution.
Additionally, it can trigger change management operations when necessary or initiate break-glass actions. Incident Manager enables you to define incident response plans for mission-critical applications, alert response teams, provide relevant underlying resource data for troubleshooting, foster collaboration among the team via AWS Chatbot, and resolve incidents using automation or break-glass processes. It offers root cause analysis data and recommends next steps, including problem management, to prevent incidents from recurring. Node management provides diverse capabilities, such as inventory management, patch updates, remote session management, and more, all of which can interact with these automation processes.
For customers using ITSM solutions such as ServiceNow or Jira Service Management, the AWS Service Management Connector offers bi-directional integration. This integration eliminates the need for manual processes and data entry, and it is particularly beneficial for AWS services such as Change Manager, OpsCenter, and Incident Manager.
These processes can be executed at two distinct levels: the Resource/Fleet level and the Application/Workload level, as defined by AWS Service Catalog AppRegistry. Audit data is collected throughout the process, including the state of resources, alerts, alarms, compliance rules, action statuses, who performed each action, and when it was executed. This data can be accessed through Systems Manager or managed comprehensively via AWS Audit Manager to streamline audit data collection and reporting. The key value proposition lies in the seamless integration of all these capabilities. At the same time, most of them can be used independently, allowing your organization to adopt individual service features at a pace that suits your needs. For example, many customers begin their journey with Systems Manager using capabilities such as Session Manager or Patch Manager and gradually expand into more complex IT processes.
The capabilities of AWS Systems Manager can be categorized into three primary areas: Node Management, IT Service Management, and Application Management. Underpinning them is Quick Setup, which simplifies the setup of Systems Manager features for many use cases. These include scanning instances, configuring inventory collection, creating and configuring IAM instance profile roles, and much more. Automation stands out as a key capability within Systems Manager. It allows you to eliminate manual errors in your environment by authoring repeatable runbooks that can be triggered in various ways, including EventBridge, Maintenance Windows, State Manager, Change Manager, and other automation tools.
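As one possible illustration of an EventBridge trigger, the sketch below pairs an event pattern for stopped EC2 instances with a deliberately simplified local matcher. The matcher handles only exact-value lists and nested fields, so it is not the EventBridge matching engine, and the runbook name `Hypothetical-InvestigateStoppedInstance` is made up for the example.

```python
# Event pattern in EventBridge's JSON shape: every listed field must match,
# and a list means "any one of these values".
EC2_STOPPED_PATTERN = {
    "source": ["aws.ec2"],
    "detail-type": ["EC2 Instance State-change Notification"],
    "detail": {"state": ["stopped"]},
}


def matches(pattern, event):
    """Simplified exact-value matcher (a stand-in for EventBridge matching)."""
    for key, expected in pattern.items():
        if key not in event:
            return False
        actual = event[key]
        if isinstance(expected, dict):
            # Nested pattern: recurse into the corresponding event field.
            if not isinstance(actual, dict) or not matches(expected, actual):
                return False
        elif actual not in expected:
            return False
    return True


def runbook_to_start(event):
    """Return the hypothetical runbook to launch for an event, or None."""
    if matches(EC2_STOPPED_PATTERN, event):
        return "Hypothetical-InvestigateStoppedInstance"
    return None


sample_event = {
    "source": "aws.ec2",
    "detail-type": "EC2 Instance State-change Notification",
    "detail": {"instance-id": "i-0123456789abcdef0", "state": "stopped"},
}
print(runbook_to_start(sample_event))
```

In a real setup the pattern would live in an EventBridge rule whose target is the Automation runbook; the local matcher only shows which events such a rule would select.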
Node Management features are designed to provide operators with the essential tools needed for daily activities to maintain the environment. Here's a breakdown of these features:
Fleet Manager: This tool helps you manage a fleet of servers, browse file systems, monitor CPU metrics, and edit Windows registries.
Session Manager: It offers a secure way to access compute nodes, particularly for break glass scenarios, troubleshooting, etc.
Inventory: This feature collects data about your applications, files, network configurations, instance details, Windows registries, and roles.
Run Command: It enables you to execute controlled actions across selected nodes in your fleet, allowing you to manage your resources at scale.
Patch Manager: This tool automates the patching of your cloud and on-premises resources on a large scale.
Parameter Store: This serves as a central repository to externalize configuration values and secrets in an organized manner, categorized by environment or application component hierarchies.
Distributor: It allows you to create software packages, including both AWS-published and third-party packages, and deploy them to multiple managed instances simultaneously.
State Manager: This feature offers secure and scalable configuration management, periodically checking and remediating changes to ensure your resources adhere to your pre-configured desired state.
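The hierarchical organization described for Parameter Store can be sketched with plain Python. The example below models a few parameters named by environment and application component, and a lookup by path that loosely mirrors fetching parameters under a hierarchy prefix. The parameter names and values, and the helper `parameters_by_path`, are hypothetical; nothing here calls AWS.

```python
# Hypothetical parameters, named /<environment>/<component>/<key>.
STORE = {
    "/prod/checkout/db_host": "prod-db.example.internal",
    "/prod/checkout/feature/new_cart": "true",
    "/prod/search/index_name": "products-v2",
    "/dev/checkout/db_host": "dev-db.example.internal",
}


def parameters_by_path(store, path, recursive=True):
    """Return parameters under a hierarchy path.

    With recursive=False, only parameters directly under the path are
    returned; deeper levels of the hierarchy are skipped.
    """
    prefix = path.rstrip("/") + "/"
    result = {}
    for name, value in store.items():
        if not name.startswith(prefix):
            continue
        remainder = name[len(prefix):]
        if recursive or "/" not in remainder:
            result[name] = value
    return result


print(parameters_by_path(STORE, "/prod/checkout"))
print(parameters_by_path(STORE, "/prod/checkout", recursive=False))
```

Naming parameters this way lets an application load everything for its own environment and component with one path lookup, and lets access policies be scoped to a prefix such as `/prod/`.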
The IT Service Management capabilities work with other AWS services, using automation to enhance visibility and automate issue resolution. They can function independently, or if you use an ITSM solution, you can integrate them through the Service Management Connector.
Explorer: This serves as an operational dashboard that spans across Systems Manager. It provides valuable insights into your accounts and regions by drawing data from AWS Config, Trusted Advisor, Compute Optimizer, patching, and more.
OpsCenter: This serves as the event management engine, aiding in the management and resolution of incidents.
Maintenance Window: This feature allows you to define schedules for potentially disruptive actions on your instances, such as OS patching, driver updates, and software installations.
Change Calendar: It enables you to define specific dates and time ranges for actions that may or may not be performed in your account. This helps in avoiding any adverse impacts on your business operations.
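The way a calendar gates potentially disruptive actions can be sketched as a simple time check. The example below assumes a calendar that is open by default and closed during listed blocked ranges; the dates, the freeze period, and the helper names `calendar_state` and `run_if_open` are hypothetical, and the sketch does not call the Change Calendar service.

```python
from datetime import datetime, timezone


def calendar_state(blocked_ranges, at_time):
    """Return "CLOSED" if at_time falls in a blocked range, else "OPEN"."""
    for start, end in blocked_ranges:
        if start <= at_time < end:
            return "CLOSED"
    return "OPEN"


def run_if_open(blocked_ranges, at_time, action_name):
    """Describe whether a potentially disruptive action may proceed."""
    if calendar_state(blocked_ranges, at_time) == "CLOSED":
        return f"skipped {action_name}: calendar is closed"
    return f"running {action_name}"


# Hypothetical change freeze around a sales event (times in UTC).
FREEZE = [
    (
        datetime(2024, 11, 28, 0, 0, tzinfo=timezone.utc),
        datetime(2024, 12, 3, 0, 0, tzinfo=timezone.utc),
    )
]

during_freeze = datetime(2024, 11, 29, 12, 0, tzinfo=timezone.utc)
after_freeze = datetime(2024, 12, 5, 2, 0, tzinfo=timezone.utc)
print(run_if_open(FREEZE, during_freeze, "os-patching"))
print(run_if_open(FREEZE, after_freeze, "os-patching"))
```

A maintenance window would supply the schedule for when `os-patching` is attempted, while the calendar check decides whether that attempt is allowed to proceed.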
Application Management features help you manage and operate your application resources in a context that is relevant to your application teams and resource structures.
Application Manager: This tool helps manage applications spanning multiple AWS services, including CloudFormation stacks, Resource Groups, Launch Wizard, Service Catalog, and AppRegistry.
AppConfig: It allows you to create, manage, and deploy changes to application configurations at run time. You can automatically roll back changes in case of errors and smoothly switch on new application features that require timely deployment.