Mission
Guarantee 99.5%+ uptime for a platform that spans edge devices on factory floors, a real-time MR sync layer, and a cloud SaaS backend. You own the infrastructure that every other engineer deploys on – if you fail, everything fails simultaneously.
System Ownership
- Primary: CI/CD pipelines (build, test, deploy automation for all services and edge firmware)
- Primary: Infrastructure provisioning and management (Terraform/Pulumi IaC, Kubernetes clusters, cloud resources)
- Primary: Monitoring, alerting, and observability stack (metrics, logs, traces, dashboards, PagerDuty)
- Primary: OTA (Over-the-Air) update pipeline for edge devices
- Primary: Security hardening and compliance (network policies, secrets management, vulnerability scanning)
- Secondary interface: Backend team (you deploy and scale their services)
- Secondary interface: Edge AI team (you manage OTA firmware delivery)
- Does NOT own: Application code (individual teams), ML model training (AI team), product features (Full Stack team)
What You Will Build
- Kubernetes infrastructure – Multi-node EKS/GKE clusters. Namespaced environments (dev, staging, production). Auto-scaling policies for API servers, analytics workers, and model serving endpoints. Resource limits and quotas per service.
- CI/CD pipelines – GitHub Actions (or equivalent) for every service: lint → unit test → integration test → build container → push to registry → deploy to staging → smoke test → promote to production. < 15 min from push to production for non-breaking changes.
- Monitoring & observability stack – Prometheus + Grafana for metrics. Loki or ELK for logs. Jaeger or Tempo for distributed traces. Pre-configured dashboards for: API latency, error rates, edge device health, inference latency, storage growth, cost tracking.
- OTA update pipeline – Push firmware and model updates to 100+ edge devices (Jetson Orin / Qualcomm XR2) deployed in factories and construction sites. Staged rollout (canary → 10% → 50% → 100%). Automatic rollback on health check failure. Devices may be on unreliable networks.
- Infrastructure as Code – Every cloud resource defined in Terraform. State stored remotely with locking. Module library for common patterns (VPC, EKS cluster, RDS instance, S3 bucket, IAM roles). New environment spin-up from IaC in < 30 minutes.
- Security – Network policies (zero-trust between services), secrets management (HashiCorp Vault or AWS Secrets Manager), container image scanning (Trivy), dependency vulnerability scanning, SOC 2 evidence collection automation.
- Disaster recovery – Automated database backups (point-in-time recovery), cross-region replication strategy, documented and tested recovery runbooks, RTO < 4 hours, RPO < 1 hour.
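The auto-scaling and per-service quota items above might look like the following in practice – a minimal Kubernetes sketch, where the `api-server` name, the `production` namespace, and every number are illustrative assumptions, not prescribed values:

```yaml
# Horizontal Pod Autoscaler: scale api-server on CPU utilisation
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-server
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-server
  minReplicas: 3
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
---
# Per-namespace quota so one service cannot exhaust the cluster
apiVersion: v1
kind: ResourceQuota
metadata:
  name: production-quota
  namespace: production
spec:
  hard:
    requests.cpu: "40"
    requests.memory: 80Gi
    limits.cpu: "60"
    limits.memory: 120Gi
```

The HPA handles elasticity within a service; the ResourceQuota caps the namespace so a runaway service degrades itself rather than its neighbours.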
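The staged OTA rollout described above (canary → 10% → 50% → 100% with automatic rollback) reduces to a gating loop. This is a sketch of that control flow only; `push_update`, `healthy_fraction`, and `rollback` are hypothetical stand-ins for whatever fleet-management API the real pipeline uses:

```python
import math

STAGES = [0.01, 0.10, 0.50, 1.00]   # canary -> 10% -> 50% -> 100%
HEALTH_THRESHOLD = 0.99             # gate: abort if < 99% of updated devices are healthy

def staged_rollout(devices, push_update, healthy_fraction, rollback):
    """Push an update stage by stage, rolling back every touched device on failure.

    devices               -- ordered list of device IDs
    push_update(ids)      -- hypothetical: deliver the update to these devices
    healthy_fraction(ids) -- hypothetical: fraction of these devices passing health checks
    rollback(ids)         -- hypothetical: revert these devices to the previous version
    """
    updated = []
    done = 0
    for stage in STAGES:
        target = max(1, math.ceil(len(devices) * stage))
        batch = devices[done:target]
        if batch:
            push_update(batch)
            updated.extend(batch)
            done = target
        if healthy_fraction(updated) < HEALTH_THRESHOLD:
            rollback(updated)
            return ("rolled_back", updated)
    return ("complete", updated)
```

The important property is that each stage's health gate bounds the blast radius: a bad image reaches the canary (and at worst the 10% cohort), never the whole fleet.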
Core Technical Responsibilities
- Design and manage Kubernetes clusters: deployment strategies (rolling, blue-green, canary), HPA (Horizontal Pod Autoscaler), cluster autoscaler, node pools, spot/preemptible instances for non-critical workloads
- Build and maintain CI/CD pipelines for 10+ microservices + edge firmware + ML models. Ensure pipeline reliability (no flaky tests), speed (< 15 min), and security (SAST/DAST scanning integrated)
- Implement the OTA update system: delta updates (only changed binaries), cryptographic signing, staged rollout with health check gates, rollback automation, support for intermittent connectivity
- Build the observability stack: define SLOs (Service Level Objectives), create SLI dashboards, configure alerts with proper severity levels (P1: page immediately, P2: page during business hours, P3: ticket)
- Manage cloud costs: reserved instances vs. spot instances vs. on-demand, storage lifecycle policies, right-sizing recommendations, monthly cost reviews with engineering leads
- Implement infrastructure security: VPC design with private subnets for databases, WAF for public APIs, TLS certificate management (cert-manager + Let's Encrypt), Pod Security Standards (pod security policies were removed in Kubernetes 1.25)
- Build runbooks for common incidents: database failover, Kubernetes node failure, edge device offline, API latency spike, storage nearing capacity
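The severity tiers above (P1: page immediately, P2: page during business hours, P3: ticket) map naturally onto Prometheus alerting-rule labels that a PagerDuty routing rule can key on. A config sketch – the `api` job name, metrics, and thresholds are illustrative assumptions:

```yaml
groups:
  - name: api-slo
    rules:
      - alert: APIErrorRateCritical
        expr: |
          sum(rate(http_requests_total{job="api",code=~"5.."}[5m]))
            / sum(rate(http_requests_total{job="api"}[5m])) > 0.05
        for: 5m
        labels:
          severity: P1        # page immediately
        annotations:
          summary: "API 5xx ratio above 5% for 5 minutes"
      - alert: APILatencyDegraded
        expr: |
          histogram_quantile(0.95,
            sum(rate(http_request_duration_seconds_bucket{job="api"}[5m])) by (le)) > 0.5
        for: 15m
        labels:
          severity: P2        # page during business hours
        annotations:
          summary: "API p95 latency above 500ms"
```

Keeping severity as a label (rather than encoding it in alert names) lets the on-call routing change without touching every rule.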
Required Technical Mastery
- Containers & orchestration: Docker (multi-stage builds, buildx, security scanning), Kubernetes (deep knowledge – not just kubectl apply). StatefulSets, DaemonSets, Jobs, CronJobs, ConfigMaps, Secrets, NetworkPolicies, RBAC, Custom Resource Definitions
- Infrastructure as Code: Terraform (modules, state management, workspaces, import, drift detection) or Pulumi. GitOps workflow (ArgoCD or Flux)
- CI/CD: GitHub Actions, GitLab CI, or Jenkins – ability to design complex pipelines with matrix builds, caching, parallelism, approval gates, and environment promotions
- Cloud platforms: AWS (primary) or GCP – production experience with: VPC, EKS/GKE, EC2, ECS, S3, RDS, ElastiCache, SQS/SNS, CloudWatch, IAM, KMS, Route53, ALB/NLB
- Monitoring: Prometheus (PromQL, recording rules, alerting rules), Grafana (dashboard design, alerting), Loki or ELK (log aggregation), Jaeger or Tempo (distributed tracing)
- Networking: Load balancers (L4/L7), DNS, TLS/mTLS, VPN, service mesh (Istio or Linkerd – awareness level), CDN
- Security: OWASP Top 10 awareness, container security (Trivy, Falco), secrets management, network segmentation, compliance frameworks (SOC 2, ISO 27001 – awareness level)
- Scripting: Bash, Python for automation. Golang desirable for custom operators or tooling
- Edge/IoT: OTA update protocols, device fleet management, intermittent connectivity handling (desirable – can learn)
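For the NetworkPolicy item above, the conventional starting point for zero-trust between services is a default-deny policy per namespace, with traffic re-enabled by narrower allow policies. A standard Kubernetes manifest (namespace name illustrative):

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: production
spec:
  podSelector: {}            # empty selector: applies to every pod in the namespace
  policyTypes:
    - Ingress
    - Egress                 # all traffic blocked until explicitly allowed
```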
Production Challenges You'll Solve
- Edge OTA on unreliable networks – A factory in rural Maharashtra has 15 Mbps bandwidth shared across the facility. You need to push a 200MB firmware update to 12 Jetson Orin devices deployed on the shop floor. Delta updates, resumable downloads, and integrity verification are mandatory. One bricked device means a service call.
- Multi-tenant resource isolation – Tenant A's analytics batch job is consuming all Kubernetes cluster CPU, starving Tenant B's real-time API. Implement resource quotas, priority classes, and pod disruption budgets so batch workloads never impact real-time services.
- Alert fatigue – Engineers are ignoring alerts because 80% are noise. Redesign the alerting strategy: derive SLOs from business requirements, create error budget dashboards, alert only on SLO burn rate, not on every metric spike.
- Cost control – Cloud bill jumped 40% last month. Diagnose: was it data transfer, compute scaling, storage growth, or a misconfigured resource? Build automated cost anomaly detection and weekly cost reports broken down by service/team.
- Zero-downtime database migration – The Backend team needs to change the primary key structure of a 500M-row table. You need to design the migration strategy: online schema change, dual writes, shadow reads, cutover validation – all while maintaining API availability.
- Incident response at 2 AM – The API is returning 503s. Is it a Kubernetes node failure, a database connection pool exhaustion, or an upstream dependency outage? Your runbooks and dashboards must let the on-call engineer diagnose and remediate within 15 minutes.
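The "resumable downloads and integrity verification" requirement from the OTA challenge can be sketched as: fetch byte ranges starting from the last confirmed offset, and refuse to install unless the SHA-256 of the assembled image matches the expected digest from the (signed) manifest. `fetch_range` here is a hypothetical stand-in for an HTTP `Range` request:

```python
import hashlib

CHUNK = 1 << 20  # 1 MiB per request; small enough to make progress on flaky links

def resumable_download(total_size, expected_sha256, fetch_range, partial=b""):
    """Resume from len(partial); verify the whole image before accepting it.

    fetch_range(start, end) -- hypothetical: returns bytes [start, end) of the
    image, e.g. via an HTTP `Range: bytes=start-(end-1)` request.
    """
    data = bytearray(partial)
    while len(data) < total_size:
        start = len(data)
        end = min(start + CHUNK, total_size)
        data.extend(fetch_range(start, end))
    digest = hashlib.sha256(bytes(data)).hexdigest()
    if digest != expected_sha256:
        raise ValueError("integrity check failed; refusing to install")
    return bytes(data)
```

If the connection drops, the partially written image is passed back in as `partial` on the next attempt, so a 200MB update over shared rural bandwidth never restarts from byte zero – and a corrupted transfer can never brick a device, because verification precedes installation.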
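The burn-rate approach in the alert-fatigue challenge is simple arithmetic: with a 99.5% SLO the error budget is 0.5%, and burn rate is the observed error ratio divided by that budget. A sketch (the 14.4 fast-burn threshold is the commonly used value that exhausts a 30-day budget in roughly two days, not something mandated here):

```python
def burn_rate(errors: int, requests: int, slo: float = 0.995) -> float:
    """How fast the error budget is being consumed.

    1.0 means errors arrive exactly at the budgeted rate; sustained
    higher values exhaust the budget proportionally faster.
    """
    if requests == 0:
        return 0.0
    budget = 1.0 - slo                  # e.g. 0.005 for a 99.5% SLO
    return (errors / requests) / budget

def should_page(errors: int, requests: int, slo: float = 0.995,
                fast_burn: float = 14.4) -> bool:
    """Page only when the budget is burning fast, not on every metric spike."""
    return burn_rate(errors, requests, slo) >= fast_burn
```

A transient spike that burns budget slowly becomes a ticket; only fast, sustained burn pages anyone – which is exactly how the 80% noise gets cut.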
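For the cost-control challenge, a first-pass anomaly detector can be as crude as flagging any day whose spend sits several standard deviations above a trailing window – enough to catch a 40% jump the week it happens rather than on the invoice. This sketch assumes per-service daily cost figures pulled from the cloud billing export:

```python
from statistics import mean, stdev

def cost_anomalies(daily_costs, window=14, threshold=3.0):
    """Return indices of days whose cost exceeds the trailing
    `window`-day mean by more than `threshold` standard deviations."""
    flagged = []
    for i in range(window, len(daily_costs)):
        history = daily_costs[i - window:i]
        mu = mean(history)
        sigma = stdev(history)
        if sigma > 0 and daily_costs[i] > mu + threshold * sigma:
            flagged.append(i)
    return flagged
```

Running this per service/team, rather than on the aggregate bill, is what makes the weekly report actionable: the anomaly arrives already attributed.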
Success KPIs
| KPI | Target | Measurement |
|---|---|---|
| Platform uptime | ≥ 99.5% | Monthly, measured by external uptime monitor |
| Deployment frequency | ≥ 2 production deploys/week per service | CI/CD pipeline metrics |
| CI/CD pipeline time | < 15 minutes push-to-production | Pipeline duration metrics |
| Mean Time to Recovery (MTTR) | < 30 minutes for P1 incidents | Incident management system |
| OTA success rate | ≥ 99% of devices updated per rollout | Device fleet management dashboard |
| Alert signal-to-noise ratio | ≥ 80% actionable alerts | Alert audit (reviewed monthly) |
| Infrastructure cost efficiency | ≤ 15% month-over-month unplanned increase | Cloud billing dashboard |
| Disaster recovery tested | ≥ 1 DR drill per quarter | Runbook execution records |
Failure If Underperforming
- Platform goes down → every customer is blind. No edge devices syncing, no dashboards loading, no MR sessions working. At seed stage, one extended outage can lose a key pilot customer.
- Slow CI/CD → engineers ship less frequently → bugs accumulate → quality degrades → customers notice. Deployment friction is a tax on the entire engineering team.
- OTA failure → bricked edge device in a customer's factory → requires physical service visit → ₹50K+ cost per incident, plus erosion of customer confidence in remote manageability.
- Alert fatigue → real incident gets buried in noise → P1 issue goes unnoticed for hours → SLA violation → penalty clauses in enterprise contracts.
- Security breach → customer data leak → existential risk for a seed-stage company. One headline about a data breach kills enterprise sales pipeline permanently.
Collaboration Interfaces
| With | Interface |
|---|---|
| Backend Engineer | You deploy and scale their services. Joint ownership of deployment configs, scaling policies, and database operations. |
| Edge AI Engineer | You build and manage the OTA pipeline they use to push firmware and models. Device health telemetry feeds your monitoring stack. |
| Applied AI Engineer | Their model training jobs need GPU compute. You provision and manage GPU nodes (spot instances, scheduling). |
| Full Stack Engineer | You manage the CDN, SSL, and hosting for the web application. CI/CD for frontend builds. |
| All Engineers | You define and enforce the CI/CD standards, Dockerfile conventions, and deployment practices everyone follows. |
Why This Role Is Mission-Critical
The D2R platform spans three distinct deployment environments – edge devices in harsh industrial settings, real-time MR sync layers, and cloud SaaS infrastructure. Each has different reliability requirements, different failure modes, and different security constraints. One engineer must hold the entire operational picture. Without this role filled by someone who can think across all three planes, the platform ships features but doesn't survive contact with production reality.
About Us
Building the D2R (Design-to-Reality) platform – sub-millimetre CAD alignment + edge AI + mixed-reality overlay for industrial field workers. Venture-backed, seed-stage, < 20 engineers.
- Location: Bangalore / Hyderabad
- Stage: Seed / Pre-Series A (venture-backed)
- Industries: Construction, Manufacturing, Infrastructure, Energy