Opsgenie Resources
No description available.
Opsgenie Resources
The following Opsgenie resources are configured (see resources):
- API Integrations
- Teams
- Users
- Notification Policies
- Alert Policies
- Services
- Service Incident Rules
- Escalations
api_integrations.yaml
NOTE: We provision a Datadog integration without specifying the owning team.
Because of that, all alerts coming to Opsgenie from Datadog do not get assigned to a team automatically (if we specified the owning team,
then all alerts would go to the members of the team).
We assign alerts to the teams in the Alert Policies - when the filter conditions are true
, the incoming alert gets assigned to a team.
This way, we can filter out and assigns to the teams only the actionable alerts (you can still view all alerts in the Opsgenie UI).
api_integrations:
- name: datadog
type: Datadog
# Use an empty value for `owner_team_name` to make it a global integration
owner_team_name:
See Opsgenie API Integration for more details.
teams.yaml
Users are assigned to teams in teams.yaml
.
We can assign the existing users (those that already present in Opsgenie, e.g. from Jira), or we can create new users and assign them to teams.
Describe the existing users in existing_users.yaml
(see below). These users will be looked up using the data source data "opsgenie_user"
.
Describe new users in users.yaml
(see below). These users will be created in Opsgenie.
NOTE: The user’s username
is email and must be unique.
NOTE: Once a user is created by the module, it’s not possible to destroy it using Terraform (not supported by the Opsgenie Terraform provider).
teams:
- name: devops
description: "Infrastructure Team"
members:
- username: user1@example.com
role: admin
- username: user2@example.com
role: user
- name: example-team
description: "Example Team"
members:
- username: user3@example.com
role: admin
- username: user4@example.com
role: user
See Opsgenie Teams for more details.
existing_users.yaml
The existing users (those that are already in Opsgenie) are described here.
These users will be looked up using the data source data "opsgenie_user"
.
existing_users:
- username: user1@example.com
- username: user2@example.com
- username: user3@example.com
- username: user4@example.com
See Opsgenie Users for more details.
users.yaml
New users (to be created by the module) are described here.
NOTE: Once a user is created by the module, it’s not possible to destroy it using Terraform (not supported by the Opsgenie Terraform provider).
users:
- username: user5@example.com
full_name: New User
role: User
locale: "en_US"
timezone: "America/New_York"
See Opsgenie Users for more details.
notification_policies.yaml
Notification Policies are used to apply different operations (e.g. delay/suppress
, auto restart
, and auto close
) to all team alert notifications.
notification_policies:
- name: auto-close-based-on-priority
team_name: test
auto_close_action:
time_unit: minutes
time_amount: 120
filter:
type: match-all-conditions
conditions:
- field: priority
operation: less-than
expected_value: P3
See Opsgenie Notification Policy for more details.
escalations.yaml
Escalations are used to escalate the alerts and incidents to a top-level Team if they do not get acknowledged during the specified amount of time.
Escalations are also used to notify responders according to a given order.
escalations:
- name: example-team-escalation-to-devops
description: "Escalate to 'devops' team if 'example-team' does not acknowledge in 10 minutes"
owner_team_name: example-team
rule:
condition: if-not-acked
notify_type: all
delay: 10
recipients:
- type: team
team_name: devops
repeat:
wait_interval: 10
count: 2
reset_recipient_states: false
close_alert_after_all: false
See Opsgenie Escalations for more details.
Flow
The following flow of events is supported:
-
Datadog sends alerts to Opsgenie. All incoming alerts are shown in the Opsgenie UI, but the alerts don’t get assigned to teams automatically.
-
The Alert Policies get evaluated by looking for a specific text in the alert’s message or description. If the filter conditions in any Alert Policy are evaluated to
true
, the policy gets executed and the alert gets assigned to the team specified in the Alert Policy. Also, a tag with the name of the service gets added to the alert. -
The Service Incident Rules get evaluated. If the filter conditions in any Service Incident Rules are evaluated to
true
, the rule gets executed, and an incident is created for the service and assigned to the team the service belongs to. The users of the team get notifications about the incident (via the configured channels, e.g. email, SMS, Opsgenie app, etc.). On the other hand, if the filter conditions in any Service Incident Rules are evaluated tofalse
, Opsgenie does not create an incident, but instead notifies the users of the team about the alert via the configured channels. -
If the alert or incident is not acknowledged by any of the team members during the specified amount of time, the Team’s Escalations get evaluated. If Opsgenie finds an Escalation for the team, it sends notifications to the recipients of the Escalation (e.g. to the users of a top-level Team).
New Service Setup
The Opsgenie resources for a new service are provided in a separate YAML config file (for readability and easy of management).
To add a new service configuration, create a new YAML file with the name of the service.
See resources/services for details on each service.
Each service’s config file contains the three sections:
service
- provides the name of the service and the name of the team the service belongs toalert_policies
- a list of Opsgenie Alert Policies for the serviceservice_incident_rules
- a list of Opsgenie Service Incident Rules for the service
Below are the steps to create Datadog monitors and Opsgenie alert policies and incident rules for a new service.
NOTE: We will be using example-service
as an example.
-
In the datadog-monitor project, add a new YAML file with Datadog monitor configurations for the new service. For the
example-service
, the file name is example-service.yaml. -
Configure Datadog monitors for the service. For example, to monitor the error rate on
prod
, add the following configuration:
example-service-prod-high-error-rate:
name: "(example-service) Service example-service has a high error rate on env:prod"
type: query alert
query: |
sum(last_10m):( sum:trace.flask.request.errors{service:example-service,env:prod}.as_count() / sum:trace.flask.request.hits{service:example-service,env:prod}.as_count() ) > 0.05
message: |
example-service error rate is too high on env:prod
escalation_message: ""
tags:
- "ManagedBy:Terraform"
- "service:example-service"
- "env:prod"
- "alert:high-error-rate"
notify_no_data: false
notify_audit: true
require_full_window: false
enable_logs_sample: false
force_delete: true
include_tags: true
locked: false
renotify_interval: 0
timeout_h: 0
evaluation_delay: 60
new_host_delay: 300
no_data_timeframe: 10
threshold_windows: { }
thresholds:
critical: 0.05
warning: 0.01
Note that the tags
added to the monitor can be used in Opsgenie alert policies and incident rules to match specific alerts from Datadog.
- Add the users responsible for the service to Opsgenie Users
(or to
users.yaml
if the users don’t yet exist in Opsgenie, and you want to create them with Terraform).
existing_users:
- username: user1@example.com
- Assign the users to the Opsgenie Team
- name: example-team
description: "Example Team"
members:
- username: user1@example.com
role: admin
-
Add The service and Opsgenie Alert Policies and Service Incident Rules
NOTE: The alert policy will assign the Team specified in the
responders
section to the alerts. Theresponders
section is a list, so you can assign many teams as responders to the alerts.
service:
- name: example-service
team_name: example-team
alert_policies:
- name: example-service-alert-policy
owner_team_name:
tags:
- "ManagedBy:terraform"
- "service:example-service"
filter:
type: match-any-condition
conditions:
- field: description
operation: contains
expected_value: "example-service"
- field: message
operation: contains
expected_value: "example-service"
continue_policy: true
ignore_original_responders: true
responders:
- type: team
team_name: example-team
service_incident_rules:
- name: example-service-incident-rule
service_name: example-service
incident_rule:
condition_match_type: match-any-condition
conditions:
- field: tags
operation: contains
expected_value: "service:example-service"
incident_properties:
message: example-service is having issues
priority: P2
stakeholder_properties:
message: example-service is having issues
enable: true
NOTE: In the Alert Policy, condition_match_type: match-any-condition
is a logical OR
, which means if any condition is true
, the alert will be
assigned to the service’s team. In the example above, alerts will be assigned to the example-team
team if the alert’s message or description contains example-service
.
If the condition matches, we also add the tag service:example-service
to the alert, which we use in the conditions of the Service Incident Rule.
NOTE: In the Service Incident Rule, we check if the alert’s tags contain the service name tag (service:example-service
in this case).
If the condition matches, we create an incident and assign it to the team, the members of which get notifications about the incident.
- Provision the
datadog-monitor
andopsgenie
projects with Terraform. Datadog will monitor theexample-servise
with the provisioned monitors and send alerts to Opsgenie.