Job Title: Grafana Architect - Multi-Cloud & On-Prem Observability and Monitoring
Location: Basking Ridge, NJ (Onsite)
Employment Type: Contract-to-Hire (C2H)
Job Summary
We are seeking a seasoned Grafana Architect with strong expertise in designing and implementing observability and monitoring solutions across multi-cloud (AWS, Azure, GCP) and on-premises environments. The ideal candidate will have deep hands-on experience with Grafana, Prometheus, Loki, and Tempo, and with integrating a variety of telemetry sources. You will own the end-to-end observability strategy, architectural governance, and implementation, and will evangelize best practices across teams.
Key Responsibilities
Architect and implement scalable observability solutions across hybrid/multi-cloud and on-premises environments using Grafana OSS/Enterprise.
Define monitoring strategies, SLOs/SLIs, dashboards, alerts, and reporting mechanisms for infrastructure, applications, and services.
Integrate Grafana with Prometheus, Loki, Tempo, InfluxDB, Elasticsearch, cloud-native tools (e.g., AWS CloudWatch, Azure Monitor, GCP Operations Suite), and on-prem systems.
Lead design and implementation of custom plugins, data sources, and dashboards for cross-platform observability.
Build and standardize templates, alerting rules, and RBAC models within Grafana Enterprise.
Collaborate with DevOps, SRE, Cloud, and App teams to define observability needs and onboard them into the platform.
Define and implement monitoring as code (MaC) practices using Terraform/Ansible for observability infrastructure.
Govern and optimize telemetry collection (logs, metrics, traces) for performance, cost, and usability.
Lead capacity planning, HA/DR design, performance tuning, and upgrades for the Grafana stack.
Provide thought leadership on OpenTelemetry, distributed tracing, log aggregation, and AIOps capabilities.
Conduct training, documentation, and internal community engagement around observability tools.
Required Skills & Experience
5+ years of hands-on experience with Grafana, including dashboard design, plugin development, and user management.
Strong expertise with Prometheus, Loki, Tempo, Alertmanager, and OpenTelemetry.
Proven experience designing multi-cloud (AWS, Azure, GCP) observability frameworks.
Experience integrating with on-premises systems (e.g., vSphere, bare-metal monitoring, SNMP, legacy tools).
Hands-on experience with Terraform, Helm, Ansible, and GitOps practices for monitoring infrastructure.
Strong scripting and automation skills (Python, Bash, etc.).
In-depth knowledge of monitoring standards and telemetry formats (Prometheus metrics, OTLP, JSON logs).
Proficient in SRE principles (SLOs, SLIs, error budgets, alerting strategy).
Experience with RBAC, LDAP/SAML integration, and Grafana Enterprise features.
Strong troubleshooting skills in distributed systems and observability pipelines.
Excellent communication, stakeholder management, and leadership skills.
Nice to Have
Experience with AIOps/ML-based anomaly detection in observability.
Knowledge of security and compliance considerations in monitoring (e.g., SOC2, PCI).
Exposure to SIEM tools like Splunk, Chronicle, or Elastic Security.
Experience with Kafka, Fluent Bit, Vector, or similar log forwarding pipelines.
Certifications (Preferred)
Grafana Certified Observability Professional
AWS/GCP/Azure Solutions Architect certification (Associate or Professional level)
Certified Kubernetes Administrator (CKA)
Interested? Please share your resume to