perf(graph): warm schema cache on startup to kill cold-start spikes

Following the schema cache PR, warm pods serve from cache (~24/25 hits on a long-running pod). New pods, however, start cold: the first LatestSchema query per (orgId, ref) still runs the wgc router compose subprocess, which costs 100-300m CPU per call.

That cold-start cost is what kept tripping the HPA into TooManyReplicas: HPA scales up → new pod added → new pod runs wgc on first query → metrics spike → HPA scales up further → cycle repeats. Even after the caching PR landed, we observed pods cycling 2→4→2→4 in production, with fresh pods logging 2 'Fetching latest schema' (cold) entries and 0 cache hits within their first minute.

Add Cache.AllOrgRefs(), which exposes every tracked (orgId, ref) pair, and Resolver.WarmCache(ctx), which iterates over them after the event-sourced caches have been populated. For each ref it fetches the subgraphs, runs sdlmerge, runs CosmoGenerator.Generate, and stores both results in the cache. Per-ref errors are logged and skipped so a single bad ref does not block warming the rest.

Service startup calls WarmCache right after the Resolver is wired, before the HTTP server starts accepting traffic, so the first LatestSchema query a pod receives is already a cache hit.
2026-05-21 17:10:30 +02:00
parent 9d70c0462a
commit 1549538c70
3 changed files with 85 additions and 0 deletions
@@ -102,6 +102,30 @@ func (c *Cache) Services(orgId, ref, lastUpdate string) ([]string, string) {
 	return services, c.lastUpdate[key]
 }

+// OrgRef identifies a single (organizationId, ref) pair that the cache
+// tracks subgraphs for.
+type OrgRef struct {
+	OrgId string
+	Ref   string
+}
+
+// AllOrgRefs returns every (orgId, ref) pair that currently has at least
+// one subgraph in the cache. Used by startup warmup to pre-compute the
+// merged SDL and SchemaUpdate for every known ref before the pod starts
+// serving traffic.
+func (c *Cache) AllOrgRefs() []OrgRef {
+	c.mu.RLock()
+	defer c.mu.RUnlock()
+
+	var out []OrgRef
+	for orgId, refs := range c.services {
+		for ref := range refs {
+			out = append(out, OrgRef{OrgId: orgId, Ref: ref})
+		}
+	}
+	return out
+}
+
 func (c *Cache) SubGraphId(orgId, ref, service string) string {
 	c.mu.RLock()
 	defer c.mu.RUnlock()