
Architecture

High-level and low-level architecture for Phoenix Search

👤 Sai Tharun

Phoenix Search Architecture

Phoenix Search has two tightly coupled planes:

| Plane | Primary Job | Main Runtime |
|---|---|---|
| Read plane | Serve authenticated user search and user-detail lookup | FastAPI API service |
| Sync plane | Keep user_details in Elasticsearch fresh from MySQL changes | Debezium, Redpanda, CDC consumer, user_meta, ES ingest pipeline |

The API reads from Elasticsearch for search, but MySQL remains the source of truth for patient details and search scope resolution. CDC is the write-side projection system that keeps Elasticsearch usable for low-latency search.


High-Level Architecture

System Boundaries

| Boundary | Owned By Phoenix Search | External Dependency |
|---|---|---|
| HTTP API | FastAPI app, auth middleware integration, user search routes, health and metrics | ALB / caller applications |
| Search index | user_details index shape, query builder, routing strategy | Elasticsearch cluster |
| Source data | Read access to userDetails for details and resolver lookups | LiveHealth MySQL schema and application writes |
| CDC materialization | Consumer handlers, user_meta projection contract, ingest pipeline | Debezium Connect and Redpanda infrastructure |
| Sessions | Session model validation and scope extraction | Django session data in Redis |
| Operations | cdc-ctl, backfill, dashboards, runbooks | ECS, EC2, MySQL, Redpanda, ES, HyperDX |

Runtime Topology

| Runtime | Process | Key Entrypoint | Main Dependencies |
|---|---|---|---|
| API | FastAPI / Uvicorn or Gunicorn worker | search.web.application:get_app | Redis, MySQL, Elasticsearch |
| CDC consumer | Python async Kafka consumer | cdc/consumers/consumer.py | Redpanda, MySQL, Elasticsearch |
| Debezium Connect | Kafka Connect worker | Connector JSON in tools/cdc-ctl/connectors/ | MySQL binlog, Redpanda |
| Backfill | Go binary | backfill/main.go | MySQL, Elasticsearch |
| cdc-ctl | Go operator CLI | tools/cdc-ctl/main.go | Kafka Connect REST, Redpanda, MySQL, ES |

The API and CDC consumer are separate runtimes. They share dependencies and the same target index, but they do not call each other directly. The API only observes CDC through Elasticsearch freshness metrics and health checks.


High-Level Data Flow

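Read Path

Caller -> ALB (/phoenix-search prefix) -> StripPrefixMiddleware -> SessionAuthMiddleware (Django session in Redis) -> /api/v1/users/search -> SearchContext (session scope plus MySQL lab/org resolution) -> query builder -> Elasticsearch user_details, routed by lab_id -> response. User-detail lookups read MySQL directly instead of Elasticsearch.

Write / Sync Path

Application writes to MySQL (userDetails, billing, labReportRelation) -> binlog -> Debezium phoenix-source-existing -> Redpanda -> CDC consumer -> user_meta projection in MySQL -> binlog -> Debezium phoenix-source-projection -> Redpanda -> CDC consumer -> Elasticsearch user_details, routed by lab_id. In the running phase, userDetails events are also indexed directly once the field resolver has composed the full document.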


Low-Level API Architecture

Application Assembly

The FastAPI app is assembled in search/web/application.py.

| Step | Code | Responsibility |
|---|---|---|
| 1 | configure_logging() | Configure process logging |
| 2 | sentry_sdk.init(...) | Enable Sentry when SEARCH_SENTRY_DSN exists |
| 3 | FastAPI(..., lifespan=lifespan_setup) | Create the application with startup/shutdown lifecycle |
| 4 | app.include_router(monitoring_router) | Register /, /health, /health/live, /health/ready, /metrics |
| 5 | app.include_router(api_router, prefix="/api") | Register /api/v1/users/search routes |
| 6 | register_exception_handlers(app) | Normalize application errors |
| 7 | SessionAuthMiddleware | Authenticate non-guest routes |
| 8 | CORSMiddleware | Apply CORS policy |
| 9 | StripPrefixMiddleware | Remove /phoenix-search before route matching |
| 10 | trace_id_header middleware | Refresh rate-limit cache, stamp load-test attributes, return X-Trace-ID |

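Starlette runs middleware in reverse order of registration: the last middleware added becomes the outermost layer. Because StripPrefixMiddleware is added after SessionAuthMiddleware and CORSMiddleware, it sees production requests before either of them and removes the ALB's /phoenix-search prefix before authentication and route matching.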
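To make the ordering rule concrete, here is a toy, dependency-free model of how Starlette stacks middleware: adding a middleware inserts it at the front of the list, and the stack is built so the front entry is outermost. MiniApp, recorder, and strip_prefix are hypothetical stand-ins for the real classes, registered in the order of steps 7 to 9 in the table; the trace_id_header middleware is deliberately left out.

```python
from typing import Callable, List

# A handler takes (path, trace) and returns the path the endpoint finally saw.
Handler = Callable[[str, List[str]], str]
Middleware = Callable[[Handler], Handler]


class MiniApp:
    """Toy model of Starlette's ordering rule, not Starlette itself:
    add_middleware() inserts at index 0, so the last middleware added
    becomes the outermost layer and sees the request first."""

    def __init__(self, endpoint: Handler) -> None:
        self.endpoint = endpoint
        self.user_middleware: List[Middleware] = []

    def add_middleware(self, middleware: Middleware) -> None:
        self.user_middleware.insert(0, middleware)

    def build(self) -> Handler:
        app = self.endpoint
        # Wrap from the end of the list so index 0 ends up outermost.
        for middleware in reversed(self.user_middleware):
            app = middleware(app)
        return app


def recorder(name: str) -> Middleware:
    """Middleware that records the path it observed, then calls the inner app."""
    def wrap(inner: Handler) -> Handler:
        def handler(path: str, trace: List[str]) -> str:
            trace.append(f"{name}:{path}")
            return inner(path, trace)
        return handler
    return wrap


def strip_prefix(prefix: str) -> Middleware:
    """Middleware that removes an ALB path prefix before passing the request on."""
    def wrap(inner: Handler) -> Handler:
        def handler(path: str, trace: List[str]) -> str:
            trace.append(f"StripPrefix:{path}")
            if path == prefix or path.startswith(prefix + "/"):
                path = path[len(prefix):] or "/"
            return inner(path, trace)
        return handler
    return wrap


app = MiniApp(endpoint=lambda path, trace: path)
app.add_middleware(recorder("SessionAuth"))          # step 7
app.add_middleware(recorder("CORS"))                 # step 8
app.add_middleware(strip_prefix("/phoenix-search"))  # step 9

trace: List[str] = []
final_path = app.build()("/phoenix-search/api/v1/users/search", trace)
```

The resulting trace shows StripPrefix observing the raw ALB path, while CORS and SessionAuth only ever see the stripped /api/v1/users/search path.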

Startup and Shutdown

search/web/lifespan.py owns process lifecycle.

Route Registration

| File | Route Layer | Registered Paths |
|---|---|---|
| search/web/api/router.py | Versioned API router | /api/v1/* |
| search/domains/user_search/router.py | User search domain router | /api/v1/users/search |
| search/web/api/monitoring/views.py | Monitoring router | /, /health, /health/live, /health/ready, /metrics |

Development and test environments also register debug routes under /api/debug.


Search Request Internals

Search Context

SearchContext is the immutable bundle that moves through the search pipeline:

| Field | Source | Why It Matters |
|---|---|---|
| query | Sanitized request body | Drives query shape classification |
| size | SEARCH_DEFAULT_SIZE | Caps hit count |
| search_key | Request body | Selects a SEARCH_MAPPING field set |
| filters | Session scope | Enforces lab/org/branch/referral access |
| routing | Resolved lab IDs | Targets Elasticsearch shards |
| search_fields | Optional request body | Narrows allowed searchable fields |
| date_format_locale | Session | Parses DOB queries in lab-specific format |
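A minimal sketch of the immutability contract, assuming SearchContext behaves like a frozen dataclass. The field types are guesses and the values are illustrative, not real defaults. Stages that need a variation derive a copy with dataclasses.replace instead of mutating the shared object; note that a frozen dataclass is only shallowly immutable, so the sketch stores tuples rather than lists.

```python
from dataclasses import FrozenInstanceError, dataclass, replace
from typing import Any, Mapping, Optional, Tuple


@dataclass(frozen=True)
class SearchContext:
    """Illustrative sketch only; the real class lives in context.py."""

    query: str                                        # sanitized request body query
    size: int                                         # defaulted from SEARCH_DEFAULT_SIZE
    search_key: str                                   # picks a SEARCH_MAPPING field set
    filters: Tuple[Mapping[str, Any], ...]            # session-derived scope clauses
    routing: Optional[str]                            # resolved lab IDs, e.g. "1" or "1,4"
    search_fields: Optional[Tuple[str, ...]] = None   # optional narrowing from the request
    date_format_locale: Optional[str] = None          # session locale for DOB parsing


ctx = SearchContext(
    query="john",
    size=20,  # example value, not the real SEARCH_DEFAULT_SIZE
    search_key="default",
    filters=({"term": {"lab_id": 1}},),
    routing="1",
)

# A later stage derives a narrowed context; the original is left untouched.
narrowed = replace(ctx, search_fields=("full_name",))
```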

Scope Resolution

| Login / Search Shape | Scope Behavior | Code |
|---|---|---|
| Normal lab user | term lab_id = resolved lab | build_session_filters |
| Doctor login | referral_ids when referral session is present | _extra_filter |
| Branch login | branch_ids or org_ids depending on search_type | _branch_login_filter |
| Collection center org login | org_ids using org plus sub-org expansion | _resolve_org_ids, _extra_filter |
| Multi-center search | terms lab_id across related labs, still with login-specific scope | _resolve_lab_ids, build_session_filters |

The caller cannot choose lab_id directly. Lab and org scope always comes from the authenticated session and MySQL lookup helpers.
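A sketch of how session scope could become Elasticsearch filter clauses. SessionScope and build_scope_filters are hypothetical names, not the real build_session_filters, and the sketch flattens the login-specific rules in the table into "apply whichever ID lists the session carries". Raising on a session with no resolved lab is also an assumption. The point it illustrates is that the request body is never an input.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class SessionScope:
    """Hypothetical scope already extracted from the authenticated session."""

    lab_ids: List[int]                                       # resolved via session + MySQL
    referral_ids: List[int] = field(default_factory=list)    # doctor login
    branch_ids: List[int] = field(default_factory=list)      # branch login
    org_ids: List[int] = field(default_factory=list)         # org / collection-center login


def build_scope_filters(scope: SessionScope) -> List[Dict[str, Any]]:
    """Translate session scope into Elasticsearch bool-filter clauses."""
    if not scope.lab_ids:
        raise ValueError("session has no resolved lab; refusing an unscoped search")
    labs = sorted(set(scope.lab_ids))
    clauses: List[Dict[str, Any]] = [
        {"term": {"lab_id": labs[0]}} if len(labs) == 1 else {"terms": {"lab_id": labs}}
    ]
    # Login-specific narrowing stacks on top of the lab filter.
    for name in ("referral_ids", "branch_ids", "org_ids"):
        values = getattr(scope, name)
        if values:
            clauses.append({"terms": {name: sorted(set(values))}})
    return clauses
```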

Query Builder

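search/domains/user_search/query.py is intentionally shape-routed: it classifies the sanitized query into one of four shapes and builds a different set of clause families for each: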

| Shape | Main Clause Families | Avoids |
|---|---|---|
| phone | Contact exact/prefix, identity, buckets, patient ID, numeric IDs | Broad name matching |
| numeric | Patient ID, identity, bucket IDs, numeric IDs, DOB | Broad name matching |
| alpha | Patient ID, identity, buckets, full name | Numeric-only clauses |
| mixed | Structured IDs, buckets, full name, DOB if parseable | Wildcards |

Important query rules:

  • SEARCH_MAPPING selects the allowed logical fields for each search_key.
  • search_fields intersects with the selected mapping; no overlap returns an empty result.
  • There are no wildcard queries and no fallback catch-all query.
  • Search results sort by _score and then last_updated_time desc.
  • matched_field comes from ES highlights or named queries.
  • Buckets come from named query tags, then the service splits multi-center hits into other_labs.
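The sketch below shows shape classification and the search_fields intersection rule. The classification heuristics (for example, treating a 10-digit number as a phone number) and the SEARCH_MAPPING contents are assumptions for illustration; the real rules live in query.py.

```python
import re
from typing import Dict, List, Optional, Sequence

# Hypothetical mapping; the real SEARCH_MAPPING is defined in the service code.
SEARCH_MAPPING: Dict[str, List[str]] = {
    "default": ["full_name", "contact", "lab_patient_id", "order_numbers", "manual_sample_ids"],
}

PHONE_RE = re.compile(r"\d{10}")  # assumption: 10-digit mobile numbers


def classify_shape(query: str) -> str:
    """Pick one of the four shapes from the query's character classes."""
    compact = query.strip().replace(" ", "")
    if PHONE_RE.fullmatch(compact):
        return "phone"
    if compact.isdigit():
        return "numeric"
    if compact.isalpha():
        return "alpha"
    return "mixed"


def allowed_fields(search_key: str, search_fields: Optional[Sequence[str]]) -> List[str]:
    """Intersect requested search_fields with the mapping; [] means return no hits."""
    mapped = SEARCH_MAPPING.get(search_key, [])
    if search_fields is None:
        return list(mapped)
    requested = set(search_fields)
    return [name for name in mapped if name in requested]
```

Returning an empty list rather than falling back to the full mapping is what keeps "no overlap" from silently widening the search.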

Elasticsearch Repository

search/domains/user_search/repository.py owns the ES call:

| Concern | Behavior |
|---|---|
| Circuit breaker | _es_search is wrapped by es_breaker |
| Trace propagation | Active trace ID is passed as opaque_id |
| Routing | Lab routing is passed to ES when available |
| Timeout | Uses SEARCH_ES_SEARCH_TIMEOUT |
| Metrics | Records query duration, query count, hit count, and zero-result count |
| Circuit open | Returns empty hits rather than failing the API response |
| Other ES failures | Propagate after recording error metrics |
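A stdlib-only sketch of the repository's failure policy: while the breaker is open, the call short-circuits to an empty hit list, and any other Elasticsearch error is counted and re-raised. SimpleBreaker is a hypothetical stand-in for es_breaker (its library and thresholds are not specified here), and es_search is any callable that performs the actual request.

```python
import time
from collections import Counter
from typing import Any, Callable, Dict, Optional


class CircuitOpenError(Exception):
    """Raised instead of calling Elasticsearch while the breaker is open."""


class SimpleBreaker:
    """Minimal consecutive-failure breaker (hypothetical stand-in for es_breaker)."""

    def __init__(self, fail_max: int = 5, reset_timeout: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.fail_max = fail_max
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None

    def call(self, fn: Callable[..., Any], *args: Any, **kwargs: Any) -> Any:
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                raise CircuitOpenError("elasticsearch circuit open")
            self.opened_at = None  # half-open: let one trial call through
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.fail_max:
                self.opened_at = self.clock()
            raise
        self.failures = 0  # a success closes the breaker fully
        return result


def search_users(breaker: SimpleBreaker, es_search: Callable[..., Dict[str, Any]],
                 body: Dict[str, Any], routing: Optional[str],
                 metrics: Counter) -> Dict[str, Any]:
    try:
        return breaker.call(es_search, body, routing=routing)
    except CircuitOpenError:
        metrics["circuit_open"] += 1
        # Degrade to "no results" instead of failing the API response.
        return {"hits": {"total": {"value": 0}, "hits": []}}
    except Exception:
        metrics["es_errors"] += 1
        raise  # other ES failures still propagate
```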

User Detail Lookup Internals

User detail lookup is intentionally MySQL-backed, not ES-backed.

This path uses MySQL so detail views are source-of-truth even if Elasticsearch is temporarily stale.


Low-Level CDC Architecture

Connector Layer

| Connector | Captures | Topic Prefix | Partition Routing | Purpose |
|---|---|---|---|---|
| phoenix-source-existing | userDetails, billing, labReportRelation | phoenix | id or userDetailsId_id | Source table changes into Redpanda |
| phoenix-source-projection | user_meta | phoenix | user_details_id | Projection changes into Redpanda for ES sync |

Production connector properties that define the architecture:

| Property | Why It Matters |
|---|---|
| database.include.list = livehealthapp | Limits binlog capture to the livehealthapp source database |
| table.include.list | Restricts emitted records to search-relevant tables |
| snapshot.mode = when_needed | Snapshots only when stored offsets are missing or no longer usable, recovering schema/offset gaps without routine full table snapshots |
| snapshot.select.statement.overrides ... WHERE 1=0 | Captures schema without letting Debezium own historical data load |
| PartitionRouting SMT | Keeps all events for the same user on the same partition number |
| signal.enabled.channels = source | Enables controlled Debezium signal-table actions |
| heartbeat.action.query | Updates debezium_heartbeat so operators can detect stalled binlog reads |
| errors.deadletterqueue.topic.name | Sends connector/SMT failures to the Kafka Connect DLQ |
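The PartitionRouting property relies on a simple invariant: if every CDC topic has the same partition count and each event is routed by a deterministic hash of the user ID, all events for one user land on the same partition number. The sketch models that with zlib.crc32 as a stand-in hash (Debezium's own hash function differs), and the per-table key-field mapping is an assumption based on the connector table.

```python
import zlib
from typing import Any, Dict

# Field holding the user ID in each table's change event (assumed mapping).
USER_KEY_FIELD: Dict[str, str] = {
    "userDetails": "id",
    "billing": "userDetailsId_id",
    "labReportRelation": "userDetailsId_id",
    "user_meta": "user_details_id",
}


def partition_for(user_id: int, num_partitions: int) -> int:
    """Deterministic hash-mod-N; crc32 is a stand-in, not Debezium's hash."""
    return zlib.crc32(str(user_id).encode("utf-8")) % num_partitions


def route_event(table: str, after: Dict[str, Any], num_partitions: int) -> int:
    """Route a change event by whichever column carries the user ID."""
    return partition_for(after[USER_KEY_FIELD[table]], num_partitions)
```

The invariant only holds while the partition count is identical across topics; changing it on one topic can move a user to a different partition number there and break cross-topic ordering.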

Consumer Layer

The consumer is at-least-once. Offsets are stored only after successful handling or after DLQ publication. Replays are expected, so handlers use idempotent writes and ordering guards.
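A synchronous sketch of the at-least-once contract for one partition, assuming handlers raise on failure: each message is retried with jittered exponential backoff, parked in a DLQ list once retries are exhausted, and only then is its offset recorded. Message, handle_with_retries, and process_partition are hypothetical simplifications of the async consumer in consumer.py, and the retry counts and delays are illustrative.

```python
import random
import time
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List


@dataclass
class Message:
    partition: int
    offset: int
    value: Dict[str, Any] = field(default_factory=dict)


def handle_with_retries(handler: Callable[[Message], None], msg: Message,
                        attempts: int = 3, base_delay: float = 0.2,
                        sleep: Callable[[float], None] = time.sleep) -> bool:
    """Return True on success, False once every attempt has failed."""
    for attempt in range(1, attempts + 1):
        try:
            handler(msg)
            return True
        except Exception:
            if attempt == attempts:
                return False
            # Exponential backoff with jitter so workers do not retry in lockstep.
            sleep(base_delay * 2 ** (attempt - 1) * random.uniform(0.5, 1.5))
    return False


def process_partition(messages: List[Message], handler: Callable[[Message], None],
                      dlq: List[Message], committed: Dict[int, int],
                      sleep: Callable[[float], None] = time.sleep) -> None:
    """Handle one partition's messages strictly in order."""
    for msg in messages:
        if not handle_with_retries(handler, msg, sleep=sleep):
            dlq.append(msg)  # park the poison message instead of blocking the partition
        # The offset is stored only now: after success or after the DLQ write.
        committed[msg.partition] = msg.offset + 1
```

If the process dies between a handler's write and the offset update, the message is delivered again, which is why the handlers themselves must be idempotent.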

Phase Routing

| Phase | Detection | Source Topic Behavior | Projection Topic Behavior |
|---|---|---|---|
| migration | user_meta.full_name exists | Materialize full denormalized rows into user_meta | Forward projection row to ES through ingest pipeline |
| running | user_meta.full_name absent | Keep slim projection updated and resolve identity live | Compose full ES doc from projection + live userDetails |

CDC_PHASE_OVERRIDE can force a phase for recovery. The override must be migration or running.
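A sketch of phase selection, assuming detection is a column-existence check on user_meta (for example, fed from information_schema.COLUMNS) and that the override is read from the environment. detect_phase is a hypothetical name; the real logic is in phase_detector.py.

```python
import os
from typing import Iterable, Mapping, Optional

VALID_PHASES = {"migration", "running"}


def detect_phase(user_meta_columns: Iterable[str],
                 env: Optional[Mapping[str, str]] = None) -> str:
    """Return "migration" or "running", honoring CDC_PHASE_OVERRIDE when set."""
    env = os.environ if env is None else env
    override = env.get("CDC_PHASE_OVERRIDE")
    if override:
        if override not in VALID_PHASES:
            raise ValueError(
                f"CDC_PHASE_OVERRIDE must be one of {sorted(VALID_PHASES)}, got {override!r}"
            )
        return override
    # Wide projection (full_name still present) means migration is in progress.
    return "migration" if "full_name" in set(user_meta_columns) else "running"
```

Failing loudly on an unknown override value is safer than silently falling back to detection during a recovery.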

Source Table Handler Paths

| Topic | Running Handler | MySQL Write | Direct ES Write |
|---|---|---|---|
| userDetails | _process_running_user_details | Upsert lab_id, last_updated_time, and selected projection metadata | Yes, after FieldResolver composes a full document |
| billing | _process_running_source_mysql -> _handle_billing | Append lab_bill_ids, order_numbers, referral_ids, org_ids, branch_ids | No |
| labReportRelation | _process_running_source_mysql -> _handle_lab_report | Append manual_sample_ids | No |
| user_meta | _process_running_projection | None | Yes, after identity is resolved from live userDetails |

Field Resolver

cdc/consumers/field_resolver.py composes ES documents during the running phase.

| Event Source | Identity Fields | Aggregate Fields | Reason |
|---|---|---|---|
| userDetails event | Live userDetails lookup | Current user_meta lookup | Avoid stale envelope identity on DLQ replay |
| user_meta event | Live userDetails lookup | Event after image | Projection event is the aggregate change being indexed |

The resolver treats userDetails.labId_id = -1 as missing. That prevents merged-away patient rows from leaking sentinel data into Elasticsearch.
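The sketch below follows the composition rule from the table: identity always comes from a live userDetails read, while aggregates come from the current user_meta row or from the user_meta event's after image. The loader callables, the column names, and the CSV-to-array conversion (string IDs for bills, orders, and samples; integers for org, referral, and branch IDs, matching the example document under Elasticsearch Document Shape) are assumptions.

```python
from typing import Any, Callable, Dict, List, Optional, Union

Row = Dict[str, Any]

# Hypothetical column names; identity columns are assumed to match ES field names.
IDENTITY_FIELDS = ("full_name", "lab_patient_id", "contact", "last_updated_time")
STRING_LIST_FIELDS = ("lab_bill_ids", "order_numbers", "manual_sample_ids")
INT_LIST_FIELDS = ("org_ids", "referral_ids", "branch_ids")


def csv_to_list(raw: Optional[str], as_int: bool) -> List[Union[int, str]]:
    """Split a user_meta CSV column into an ES array, dropping empty items."""
    items = [part.strip() for part in (raw or "").split(",") if part.strip()]
    return [int(item) for item in items] if as_int else items


def compose_doc(live_user: Optional[Row], aggregates: Optional[Row]) -> Optional[Row]:
    """Build an ES document, or None when the patient must not be indexed."""
    if live_user is None or live_user.get("labId_id") in (None, -1):
        return None  # missing row or merged-away sentinel: never index it
    doc: Row = {"id": live_user["id"], "lab_id": live_user["labId_id"]}
    doc.update({name: live_user.get(name) for name in IDENTITY_FIELDS})
    aggregates = aggregates or {}
    for name in STRING_LIST_FIELDS:
        doc[name] = csv_to_list(aggregates.get(name), as_int=False)
    for name in INT_LIST_FIELDS:
        doc[name] = csv_to_list(aggregates.get(name), as_int=True)
    return doc


def resolve(table: str, after: Row,
            load_user_details: Callable[[int], Optional[Row]],
            load_user_meta: Callable[[int], Optional[Row]]) -> Optional[Row]:
    """Identity always comes from live MySQL; aggregates depend on the event source."""
    if table == "userDetails":
        user_id = after["id"]
        aggregates = load_user_meta(user_id)  # current projection row
    else:  # user_meta projection event
        user_id = after["user_details_id"]
        aggregates = after                    # event after image
    return compose_doc(load_user_details(user_id), aggregates)
```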

Retry and Ordering Model

| Mechanism | Code Path | Purpose |
|---|---|---|
| Partition routing | Debezium PartitionRouting SMT | Same user, same Kafka partition across all CDC topics |
| Per-partition worker | consumer.py::_partition_worker | Sequential handling within a partition |
| Handler retry | consumer.py::_handle_with_retries | Recover transient MySQL/ES/Kafka issues |
| MySQL deadlock retry | router.py::_retry_on_deadlock | Retry OperationalError 1213 with jittered backoff |
| Consumer DLQ | consumer.py::_send_to_dlq | Preserve poison messages after retries |
| CAS guard | handlers.py::_handle_user_details_running | Reject stale identity updates by last_updated_time |
| CSV dedupe | _upsert_csv_field, _append_csv_fields | Make repeated billing/sample events safe |
| Tombstone guard | _is_tombstoned_user, FieldResolver | Prevent merged-away patients from being recreated |
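Two of the idempotency guards can be sketched in a few lines. append_csv_field and should_apply_identity are hypothetical helpers, not the real _append_csv_fields or CAS code: the first makes replayed billing or sample events no-ops, and the second accepts an identity update only when its last_updated_time is not older than the stored one. The sketch does not model the recency ordering used for org, referral, and branch IDs.

```python
from datetime import datetime
from typing import Iterable, Optional


def append_csv_field(current: Optional[str], new_values: Iterable[str]) -> str:
    """Append values not already present; replaying the same event is a no-op."""
    items = [v for v in (current or "").split(",") if v]
    seen = set(items)
    for value in new_values:
        value = value.strip()
        if value and value not in seen:
            items.append(value)
            seen.add(value)
    return ",".join(items)


def should_apply_identity(incoming: datetime, stored: Optional[datetime]) -> bool:
    """CAS-style guard: accept equal or newer last_updated_time, reject older."""
    # Equal timestamps are accepted so an exact replay rewrites the same values.
    return stored is None or incoming >= stored
```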

Data Architecture

Source Tables and Projection

| Store | Object | Role |
|---|---|---|
| MySQL | userDetails | Authoritative identity, demographics, lab routing, patient state |
| MySQL | billing | Order, bill, referral, org, and branch aggregate source |
| MySQL | labReportRelation | Manual sample ID aggregate source |
| MySQL | user_meta | Search projection table used by CDC and backfill |
| Elasticsearch | user_details | Search-optimized document index |

Field Ownership

| Field Family | Source of Truth | Projection / Index Behavior |
|---|---|---|
| Identity and demographics | userDetails | Indexed directly by CDC resolver or backfill join |
| lab_id | userDetails.labId_id | Used as ES routing and access filter |
| Patient IDs and identity IDs | userDetails | Searched through exact, prefix, suffix, or segment fields |
| Billing IDs and order numbers | billing | Aggregated into CSV in user_meta, transformed to arrays for ES |
| Org, referral, branch IDs | billing | Aggregated into recency-ordered CSV fields |
| Manual sample IDs | labReportRelation via billing | Aggregated into manual_sample_ids |
| CDC freshness | ES newest last_updated_time | Probed by API background task and exposed in /health |
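A sketch of the freshness probe's arithmetic. The query body uses standard Elasticsearch size, _source, and sort parameters to fetch the newest last_updated_time; the stale threshold, the naive-timestamp-is-UTC rule, and the result shape are assumptions, not the API's actual /health contract.

```python
from datetime import datetime, timedelta, timezone
from typing import Any, Dict

# Search body for the single newest indexed change.
NEWEST_DOC_QUERY: Dict[str, Any] = {
    "size": 1,
    "_source": ["last_updated_time"],
    "sort": [{"last_updated_time": {"order": "desc"}}],
}


def parse_es_time(value: str) -> datetime:
    # fromisoformat() only accepts a trailing "Z" from Python 3.11 on.
    parsed = datetime.fromisoformat(value.replace("Z", "+00:00"))
    return parsed if parsed.tzinfo else parsed.replace(tzinfo=timezone.utc)


def cdc_freshness(newest: str, now: datetime, stale_after: timedelta) -> Dict[str, Any]:
    """Age of the newest indexed change and a coarse ok/stale verdict."""
    age = now - parse_es_time(newest)
    return {
        "age_seconds": age.total_seconds(),
        "status": "stale" if age > stale_after else "ok",
    }
```

One caveat of this design: the age measures time since the newest indexed change, so a quiet period with few writes can look stale even when CDC is healthy.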

Elasticsearch Document Shape

The ES document combines identity fields with aggregate arrays:

```json
{
  "id": 101,
  "lab_id": 1,
  "full_name": "John Doe",
  "lab_patient_id": "P-1001",
  "contact": "9999999999",
  "manual_sample_ids": ["S-1001"],
  "order_numbers": ["ORD-1001"],
  "lab_bill_ids": ["5001"],
  "org_ids": [20],
  "referral_ids": [77],
  "branch_ids": [10],
  "last_updated_time": "2024-01-02T10:00:00Z"
}
```

The API search path depends on ES routing by lab_id. If a patient moves labs, CDC deletes the old routed document and indexes the new routed document.

Elasticsearch Routing Model

Phoenix Search uses Elasticsearch custom routing on the user_details index. The index mapping declares _routing.required = true, so every write, delete, and point lookup must pass the same routing key that was used when the document was indexed.

| Concern | Behavior |
|---|---|
| ES document ID | user_details_id, stored in ES as id |
| ES routing key | lab_id as a string |
| Source of routing | userDetails.labId_id, denormalized into user_meta.lab_id |
| Normal search routing | Current session lab ID |
| Multi-center search routing | Comma-separated related lab IDs resolved from MySQL |
| Filter endpoint routing | Current session lab ID |
| Detail endpoint | MySQL-backed, not an ES point lookup |

Routing is not the only security boundary. The API also adds lab_id / org / branch / referral filters from the authenticated session. Routing targets the relevant ES shard or shards; filters enforce the allowed result scope.
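A sketch of how the two concerns stay separate in a search request: the routing value (a comma-separated list, which Elasticsearch accepts for multi-value routing) only narrows the shards searched, and because several lab IDs can hash to the same shard it cannot isolate labs by itself; the bool filter carries the session scope. routing_value and build_search_request are hypothetical helpers, and the default size is illustrative.

```python
from typing import Any, Dict, Iterable, List, Tuple


def routing_value(lab_ids: Iterable[int]) -> str:
    """Join resolved lab IDs into one routing parameter, e.g. "1,4"."""
    labs = sorted(set(lab_ids))
    if not labs:
        raise ValueError("no resolved lab IDs")
    return ",".join(str(lab) for lab in labs)


def build_search_request(must: List[Dict[str, Any]], lab_ids: Iterable[int],
                         scope_filters: List[Dict[str, Any]], size: int = 20
                         ) -> Tuple[Dict[str, str], Dict[str, Any]]:
    """Routing narrows the shards searched; the filter clauses enforce scope."""
    params = {"routing": routing_value(lab_ids)}
    body = {"size": size, "query": {"bool": {"must": must, "filter": scope_filters}}}
    return params, body
```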

Write paths must use the same rule:

| Writer | Routing Behavior |
|---|---|
| CDC userDetails handler | Resolves full ES doc, indexes with routing=str(lab_id) |
| CDC user_meta handler | Resolves projection doc, indexes/deletes with routing=str(lab_id) |
| CDC lab move / reroute | Deletes old doc with old lab_id, then indexes new doc with new lab_id |
| CDC merge tombstone | Uses the previous lab_id from the before image to delete the old routed doc |
| Backfill bulk indexer | Sets bulk item Routing from user_meta.lab_id |
| Backfill targeted repair | Calls ES index with WithRouting(lab_id) |
| Backfill verify | Reads GET /user_details/_doc/<id>?routing=<lab_id> |
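The reroute and tombstone rows reduce to one ordering rule, sketched here with a hypothetical plan_es_writes helper that works on userDetails before/after images: delete under the old routing key whenever the lab changes or becomes the -1 sentinel, then index under the new key if the patient is still valid. Treating a missing after image (a row delete) the same way is an assumption.

```python
from typing import Any, Dict, List, Optional, Tuple

Op = Tuple[str, int, str]  # (action, document id, routing)
MISSING_LAB = (None, -1)


def plan_es_writes(before: Optional[Dict[str, Any]],
                   after: Optional[Dict[str, Any]]) -> List[Op]:
    """Order ES operations so no copy survives under a stale routing key."""
    row = after or before
    if row is None:
        return []
    doc_id = row["id"]
    old_lab = before.get("labId_id") if before else None
    new_lab = after.get("labId_id") if after else None
    ops: List[Op] = []
    if old_lab not in MISSING_LAB and old_lab != new_lab:
        ops.append(("delete", doc_id, str(old_lab)))  # old routing key first
    if new_lab not in MISSING_LAB:
        ops.append(("index", doc_id, str(new_lab)))
    return ops
```

Deleting under the old key before indexing under the new one means a failure between the two steps leaves the patient missing (repairable by a retry or backfill) rather than duplicated under two routing keys.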

Debugging lookups must include routing. A document indexed under one routing key can appear missing when fetched with another:

```bash
curl -s -u elastic:<PASSWORD> \
  "https://<ES_HOST>:9200/user_details/_doc/<USER_DETAILS_ID>?routing=<LAB_ID>"
```

Consistency Model

| Concern | Guarantee |
|---|---|
| Search freshness | Eventually consistent from MySQL through CDC into ES |
| Detail lookup | Stronger source-of-truth read from MySQL |
| Per-user ordering | Preserved by partition routing and per-partition workers |
| Global ordering | Not guaranteed across different users or partitions |
| Duplicate delivery | Expected; handlers are designed to be idempotent |
| CDC outage | API can still serve existing ES results, but freshness age increases |
| ES outage | Search API readiness degrades; CDC retries, then DLQs failed writes |
| Redis outage | Auth/session and rate-limit flows are affected |
| MySQL outage | Detail lookup, scope resolution, CDC materialization, and resolver reads are affected |

The important rule is that Elasticsearch is a projection, not the source of truth. When ES and MySQL disagree, MySQL wins and CDC/backfill should repair ES.


Failure Domains

| Failure | Immediate Symptom | First Debug Page |
|---|---|---|
| API dependency down | /health/ready returns 503 | Operations |
| Search stale | /health response body reports stale CDC status | CDC |
| Debezium connector failed | Redpanda topics stop receiving source events | CDC Tools and Backfill |
| Consumer lag high | cdc_consumer_lag rises | Operations |
| Projection corrupt or incomplete | ES disagrees with user_meta / MySQL | CDC Tools and Backfill |
| Query returns unexpected results | Wrong search_key, search_fields, session scope, or ES routing | API Reference |

Source References

| Area | Files |
|---|---|
| App assembly | search/web/application.py, search/web/api/router.py |
| Startup and dependency lifecycle | search/web/lifespan.py, search/services/*/lifespan.py |
| Auth and session context | search/services/auth/middleware.py, search/services/auth/dependencies.py, search/services/auth/schemas.py |
| Search service path | search/domains/user_search/router.py, service.py, repository.py, query.py, filters.py, context.py |
| MySQL lookup path | search/domains/user_search/queries.py |
| CDC consumer orchestration | cdc/consumers/consumer.py |
| CDC routing and handlers | cdc/consumers/router.py, handlers.py, field_resolver.py, phase_detector.py |
| Debezium connector configs | tools/cdc-ctl/connectors/source-connector-existing.production.json, source-connector-projection.production.json |
| Backfill architecture | backfill/main.go, scanner.go, indexer.go, migrate.go, row.go, transform.go |
