
Operations

Run, deploy, monitor, and debug Phoenix Search

👤 Sai Tharun

Phoenix Search Operations

This page collects the day-to-day commands and on-call checks for the Phoenix Search API and its CDC subsystem.

For the full data migration sequence, cdc-ctl, and the Go backfill tool, see CDC Tools and Backfill. For request-to-trace debugging, see API Debugging.


Run Locally

API Only

make install
cp .env.example .env
cp docker/.env.example docker/.env
make up-dev
make seed
make run

Full Local Stack

make up-all
make register-pipeline
make run-cdc
make register-source-connector

Test Infra

make test
make test-docker
make test-docker-up
make test-docker-down

API Health Checks

| Endpoint | Expected Healthy Response | Notes |
|---|---|---|
| GET /health/live | 200 {"status":"alive"} | Process liveness only |
| GET /health/ready | 200 with status: ready | Returns 503 if Elasticsearch, Redis, or MySQL is down |
| GET /health | 200 with status: healthy or degraded | Includes dependency status and CDC freshness |
| GET /metrics | Prometheus text | Exposes API, search, and CDC freshness metrics |

/health intentionally returns HTTP 200 even when degraded. Alert on the body fields, especially dependency statuses and cdc.status.

Example:

curl -s http://localhost:8000/health
curl -s http://localhost:8000/health/ready
curl -s http://localhost:8000/metrics
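
Because /health stays at HTTP 200 when degraded, a monitor has to read the body. The sketch below shows one way to do that. The body layout (a top-level status, a dependencies map of name to {"status": ...}, and a cdc object with a status field) and the literal value "healthy" for dependencies are assumptions based on the field names mentioned above, and find_health_problems is a hypothetical helper, not part of Phoenix Search.

```python
import json


def find_health_problems(body_text: str) -> list[str]:
    """Return human-readable problems found in a /health response body.

    Assumed (not confirmed) body shape:
    {"status": "...", "dependencies": {"redis": {"status": "..."}},
     "cdc": {"status": "..."}}
    """
    body = json.loads(body_text)
    problems = []

    if body.get("status") != "healthy":
        problems.append(f"overall status is {body.get('status')!r}")

    # Flag every dependency that does not report itself as healthy.
    for name, info in sorted(body.get("dependencies", {}).items()):
        if info.get("status") != "healthy":
            problems.append(f"dependency {name} is {info.get('status')!r}")

    cdc_status = body.get("cdc", {}).get("status")
    if cdc_status != "healthy":
        problems.append(f"cdc.status is {cdc_status!r}")

    return problems
```

An empty list means nothing needs attention; anything else is a candidate alert message.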

CDC Health Checks

The CDC consumer exposes its own health server, default port 8080.

| Endpoint | Purpose | Healthy | Unhealthy |
|---|---|---|---|
| GET /health | Liveness | 200 when Kafka polling is recent | 503 |
| GET /ready | Readiness | 200 when processing messages | 503 |
| GET /metrics | Prometheus | Prometheus text format | - |

Example:

curl http://<TASK_IP>:8080/health
curl http://<TASK_IP>:8080/ready
curl http://<TASK_IP>:8080/metrics
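
The two probes answer different questions, so their combination is more informative than either alone. A minimal sketch of that reading follows; classify_consumer and its return strings are illustrative names, not part of the consumer.

```python
def classify_consumer(health_code: int, ready_code: int) -> str:
    """Map the /health and /ready HTTP codes to a coarse consumer state."""
    if health_code != 200:
        # Liveness fails when Kafka polling is no longer recent.
        return "not polling"
    if ready_code != 200:
        # Polling works, but messages are not being processed.
        return "polling but not processing"
    return "healthy"
```

For example, classify_consumer(200, 503) points at message processing rather than Kafka connectivity.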

CDC Runbook Order

When Phoenix Search data looks stale, check the planes in this order. Following the order prevents chasing Elasticsearch symptoms when the problem is actually a connector, Redpanda, or consumer-group issue.

| Step | Question | Command |
|---|---|---|
| 1 | Is the API seeing stale indexed data? | curl -s http://<API_HOST>/health |
| 2 | Are Debezium connectors running? | CDC_CTL_ENV=production ./cdc-ctl status |
| 3 | Are Redpanda topics healthy and replicated? | CDC_CTL_ENV=production ./cdc-ctl topics verify |
| 4 | Is the consumer group stable and caught up? | CDC_CTL_ENV=production ./cdc-ctl lag |
| 5 | Are records failing into DLQ? | CDC_CTL_ENV=production ./cdc-ctl dlq inspect --n 50 --timeout 15s |
| 6 | Are MySQL binlog and heartbeat prerequisites healthy? | CDC_CTL_ENV=production ./cdc-ctl mysql check |
| 7 | Do we need a full incident bundle? | CDC_CTL_ENV=production ./cdc-ctl debug --tarball |

The operator CLI writes artifacts for every run under tools/cdc-ctl/runs/<timestamp>_<subcommand>/. Use those artifacts when handing an incident to another engineer.
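
The value of the runbook order is that the first failing step is usually the one to fix. The sketch below captures that idea with stand-in check functions; the step names and first_failing_step are hypothetical, and real checks would wrap the commands listed in this section.

```python
from typing import Callable, Optional

# Step names mirror the runbook order; they are labels for this sketch only.
RUNBOOK_ORDER = [
    "api health",
    "debezium connectors",
    "redpanda topics",
    "consumer lag",
    "dlq",
    "mysql prerequisites",
]


def first_failing_step(checks: dict[str, Callable[[], bool]]) -> Optional[str]:
    """Run the supplied checks in runbook order and stop at the first failure.

    Each check returns True when its plane is healthy. Steps without a
    supplied check are skipped.
    """
    for step in RUNBOOK_ORDER:
        check = checks.get(step)
        if check is not None and not check():
            return step
    return None
```

Later checks are not executed once a step fails, which keeps attention on the earliest broken plane.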


Redpanda Runbook

Access Pattern

In production, run rpk from a Redpanda broker or another host inside the VPC. Laptop access to the internal SASL listener is not expected.

ssh -i ~/Documents/pem-files/redpanda.pem ubuntu@<REDPANDA_PUBLIC_IP>

export REDPANDA_BROKERS=<BROKER_1_PRIVATE_IP>:9093,<BROKER_2_PRIVATE_IP>:9093,<BROKER_3_PRIVATE_IP>:9093

rpk topic list \
  --user admin \
  --password '<ADMIN_PASSWORD>' \
  --sasl-mechanism SCRAM-SHA-256

| Endpoint | Purpose |
|---|---|
| https://redpanda-in-console.crelio.solutions | Read-only Redpanda Console for consumer groups, topic messages, and lag |
| https://redpanda-in-admin-console.crelio.solutions | Admin Redpanda Console |
| <REDPANDA_PUBLIC_IP>:9644/public_metrics | Redpanda Prometheus metrics scraped by the OTel collector |
| <BROKER_PRIVATE_IP>:9093 | Internal SASL plaintext listener used by services inside the VPC |

Cluster and Topic Checks

# Broker and partition health.
rpk cluster health \
  --user admin \
  --password '<ADMIN_PASSWORD>' \
  --sasl-mechanism SCRAM-SHA-256

# Expected Phoenix topics.
rpk topic list \
  --user admin \
  --password '<ADMIN_PASSWORD>' \
  --sasl-mechanism SCRAM-SHA-256

# Partition, leader, replica, and high-watermark details.
rpk topic describe phoenix.livehealthapp.userDetails -p \
  --user admin \
  --password '<ADMIN_PASSWORD>' \
  --sasl-mechanism SCRAM-SHA-256

Expected Phoenix data topics:

| Topic | Written By | Read By |
|---|---|---|
| phoenix.livehealthapp.userDetails | phoenix-source-existing Debezium connector | CDC consumer MySQL materializer |
| phoenix.livehealthapp.billing | phoenix-source-existing Debezium connector | CDC consumer MySQL materializer |
| phoenix.livehealthapp.labReportRelation | phoenix-source-existing Debezium connector | CDC consumer MySQL materializer |
| phoenix.livehealthapp.user_meta | phoenix-source-projection Debezium connector | CDC consumer Elasticsearch syncer |
| phoenix.cdc.connector-dlq | Kafka Connect / Debezium | Operators |
| phoenix.cdc.dead-letter-queue | Python CDC consumer | Operators |

Production Replication Guardrails

On the 3-node Redpanda cluster, Phoenix CDC topics should use replication factor 3. Critical topics should also use min.insync.replicas=2, so writes fail fast if too many brokers are unavailable.

| Topic Class | Required Guardrail |
|---|---|
| Debezium data topics | replication.factor=3, 6 partitions |
| Debezium schema history topics | replication.factor=3 |
| Kafka Connect internal topics | connect-configs, connect-offsets, and connect-status with RF=3 |
| Critical CDC topics | min.insync.replicas=2 |
| Kafka Connect producer | producer.acks=all; producer.enable.idempotence=true |

Check the expected topic policy with the operator CLI first:

CDC_CTL_ENV=production ./cdc-ctl topics list
CDC_CTL_ENV=production ./cdc-ctl topics verify
CDC_CTL_ENV=production ./cdc-ctl topics describe phoenix.livehealthapp.userDetails

If a cluster was expanded from RF=1 to RF=3, existing topics do not automatically become RF=3. The recovery notes in cdc/CDC_RECOVERY_LOG.md document this as an explicit migration step.
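
The data-topic guardrails in this section can be written down as a small check, which is handy when reviewing topics after a cluster expansion. This is a sketch only: the input dict and its keys are hypothetical (not the output format of cdc-ctl or rpk), and it treats the topic as a critical Debezium data topic so that all three guardrails apply.

```python
# Guardrails for critical Debezium data topics, taken from this section.
EXPECTED = {
    "replication_factor": 3,
    "partitions": 6,
    "min_insync_replicas": 2,
}


def guardrail_violations(topic: dict) -> list[str]:
    """Compare one topic description against the data-topic guardrails.

    `topic` is a hypothetical dict such as
    {"name": "...", "partitions": 6, "replication_factor": 3,
     "min_insync_replicas": 2}.
    """
    return [
        f"{topic.get('name', '?')}: {key} is {topic.get(key)!r}, expected {want}"
        for key, want in EXPECTED.items()
        if topic.get(key) != want
    ]
```

A topic created while the cluster was still RF=1 would show up here with a replication_factor violation.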

Consumer Group and Rebalance Checks

rpk group describe phoenix-cdc-unified \
  --user admin \
  --password '<ADMIN_PASSWORD>' \
  --sasl-mechanism SCRAM-SHA-256

| Field | Healthy Meaning | What to Do if Bad |
|---|---|---|
| STATE | Stable | PreparingRebalance is normal for 10-30 seconds after deploy/scale; if it stays longer than 60 seconds, check task crashes and auth errors |
| MEMBERS | Matches ECS desired count | If lower, a task failed to join the group or is crash-looping |
| TOTAL-LAG | Draining or below alert threshold | If growing, check whether producer rate is snapshot-driven or the consumer is bottlenecked |
| Per-partition LAG | Roughly balanced | One hot partition usually means one user/key stream or one bad consumer member |
| HOST / MEMBER-ID | Maps partitions to a task | Use it to find the exact ECS task logs |

Consumer group rebalance happens when the CDC service scales, a task restarts, or a member stops polling. During rebalance, Redpanda revokes and reassigns partitions, and the Phoenix consumer pauses processing briefly to preserve in-partition ordering.
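
The field checks for rpk group describe can be sketched as one function. The inputs are assumed to be already parsed from the command output, and "roughly balanced" is interpreted here as "no partition above five times the median lag", which is an arbitrary threshold chosen for illustration.

```python
from statistics import median


def group_findings(
    state: str,
    members: int,
    desired_count: int,
    partition_lag: dict[int, int],
    hot_ratio: float = 5.0,
) -> list[str]:
    """Return findings for one consumer group snapshot."""
    findings = []
    if state != "Stable":
        findings.append(f"state is {state}")
    if members < desired_count:
        findings.append(f"{desired_count - members} member(s) missing")

    lags = list(partition_lag.values())
    if lags:
        typical = median(lags)
        for partition, lag in sorted(partition_lag.items()):
            # max(typical, 1) avoids flagging tiny lags when the median is 0.
            if lag > hot_ratio * max(typical, 1):
                findings.append(f"partition {partition} is hot (lag {lag})")
    return findings
```

A single hot partition in the result is the cue to look up its HOST / MEMBER-ID and read that task's logs.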

The safe scaling ceiling is the topic partition count: 6 consumer tasks. Phoenix topics are created with 6 partitions, so tasks beyond the sixth usually sit idle.

aws ecs update-service \
  --cluster phoenix \
  --service cdc-consumer \
  --desired-count 6
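
The scaling ceiling is simple arithmetic, sketched below. plan_scale is a hypothetical helper; it assumes, as stated above, that the per-topic partition count of 6 bounds the number of useful tasks.

```python
PARTITIONS = 6  # Phoenix topics are created with 6 partitions.


def plan_scale(desired_count: int, partitions: int = PARTITIONS) -> tuple[int, int]:
    """Return (active_tasks, idle_tasks) for a desired consumer count."""
    if desired_count < 0:
        raise ValueError("desired_count must be non-negative")
    active = min(desired_count, partitions)
    return active, desired_count - active
```

plan_scale(8) returns (6, 2): two of the eight tasks would have no partitions to read.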

Lag Interpretation

| Pattern | Likely Cause | Next Check |
|---|---|---|
| Lag spikes on only userDetails with many op: "r" messages | Debezium incremental snapshot signal | Check livehealthapp.debezium_signal |
| Lag grows on all topics | Consumer-wide bottleneck, MySQL issue, ES issue, or task crash | Consumer logs, MySQL processlist, ES write thread pool |
| Lag is only on partitions owned by one member | Bad task or poison message loop | ECS logs for that task, DLQ metrics |
| MEMBERS=0 | Consumer service down or cannot join group | ECS service desired/running count and SASL ACLs |
| GROUP_AUTHORIZATION_FAILED | CDC_CONSUMER_GROUP does not match ACL | Set CDC_CONSUMER_GROUP=phoenix-cdc-unified |
| UNKNOWN_TOPIC_OR_PART | Debezium topics were not created | Check connector status and connector task trace |
| NOT_LEADER_FOR_PARTITION | Broker restart or partition movement | Usually transient; if persistent, run rpk cluster health |

Sample messages when you need to confirm whether a lag spike is live traffic or a snapshot:

rpk topic consume phoenix.livehealthapp.userDetails \
  --offset end \
  -n 1 \
  --user admin \
  --password '<ADMIN_PASSWORD>' \
  --sasl-mechanism SCRAM-SHA-256 \
  -f '%t p%p:o%o %v\n'

In Debezium envelopes, op: "r" means snapshot/read, op: "c" means insert, op: "u" means update, and op: "d" means delete.
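
To decide whether a sample is mostly snapshot reads or live changes, the op codes can be tallied. The sketch below assumes each sampled value is a JSON string and that the envelope is either the top-level object or nested under "payload" (the latter is how the JSON converter lays it out when schemas are enabled); summarize_ops is a hypothetical helper.

```python
import json
from collections import Counter

OP_NAMES = {"r": "snapshot", "c": "insert", "u": "update", "d": "delete"}


def summarize_ops(raw_messages: list[str]) -> Counter:
    """Count Debezium operations in a sample of message values."""
    counts = Counter()
    for raw in raw_messages:
        event = json.loads(raw)
        if not isinstance(event, dict):
            # Tombstones and other non-object values carry no op code.
            counts["unknown"] += 1
            continue
        # Use the nested envelope when present, else the top-level object.
        envelope = event.get("payload") or event
        counts[OP_NAMES.get(envelope.get("op"), "unknown")] += 1
    return counts
```

A sample dominated by "snapshot" suggests the lag spike comes from a signal-driven snapshot rather than live traffic.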


Debezium Runbook

Connector Status

Phoenix Search has two Debezium MySQL connectors.

| Connector | Captures | Output |
|---|---|---|
| phoenix-source-existing | userDetails, billing, labReportRelation | Source CDC topics consumed into user_meta |
| phoenix-source-projection | user_meta | Projection topic consumed into Elasticsearch |

CDC_CTL_ENV=production ./cdc-ctl status
CDC_CTL_ENV=production ./cdc-ctl offsets

curl -s http://<CONNECT_HOST>:8083/connectors/phoenix-source-existing/status | jq .
curl -s http://<CONNECT_HOST>:8083/connectors/phoenix-source-projection/status | jq .

Check these fields in the status response:

| Field | Expected |
|---|---|
| connector.state | RUNNING |
| tasks[].state | RUNNING |
| tasks[].trace | Empty unless a task failed |
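
The same field checks can be applied to the JSON returned by the Kafka Connect status endpoint. A minimal sketch, with failed_tasks and connector_is_healthy as hypothetical helper names:

```python
def failed_tasks(status: dict) -> list[tuple[int, str]]:
    """Return (task_id, first line of trace) for every task that is not RUNNING."""
    failures = []
    for task in status.get("tasks", []):
        if task.get("state") != "RUNNING":
            trace = task.get("trace", "")
            first_line = trace.splitlines()[0] if trace else ""
            failures.append((task["id"], first_line))
    return failures


def connector_is_healthy(status: dict) -> bool:
    """True when the connector and all of its tasks report RUNNING."""
    connector_state = status.get("connector", {}).get("state")
    return connector_state == "RUNNING" and not failed_tasks(status)
```

The first line of the trace is usually enough to tell a transient MySQL or network error from a configuration problem.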

Restart failed tasks first:

CDC_CTL_ENV=production ./cdc-ctl restart --only-failed

Use connector recreation only when status or config is corrupted. Connector deletion preserves offsets in connect-offsets; it does not clear them.

CDC_CTL_ENV=production ./cdc-ctl recreate --dry-run
CDC_CTL_ENV=production ./cdc-ctl recreate --yes --settle-seconds 30

Debezium Signals

Debezium signals ask a running connector to re-read rows into Kafka without resetting connector offsets. Phoenix enables source signals through livehealthapp.debezium_signal on both connectors.

SELECT *
FROM livehealthapp.debezium_signal
ORDER BY id DESC
LIMIT 5;

Use an incremental snapshot signal when the connector needs to re-capture rows:

INSERT INTO livehealthapp.debezium_signal (id, type, data)
VALUES (
  'resync-user-details-20260520',
  'execute-snapshot',
  '{"data-collections": ["livehealthapp.userDetails"], "type": "incremental"}'
);

Use a projection-table snapshot when user_meta was repaired and Elasticsearch needs to receive projection events:

INSERT INTO livehealthapp.debezium_signal (id, type, data)
VALUES (
  'resync-user-meta-20260520',
  'execute-snapshot',
  '{"data-collections": ["livehealthapp.user_meta"], "type": "incremental"}'
);

Signals produce snapshot read events into the same Redpanda topics as live binlog events, so lag can spike while the snapshot runs. For normal projection recomputation, prefer the backfill repair modes: signals re-emit rows, but they do not recompute billing/sample aggregates by themselves.

For the detailed signal flow and external Debezium references, see CDC.
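
When signals are issued from a script rather than typed by hand, building the row values in code avoids JSON quoting mistakes. The sketch below reproduces the id, type, and data values used in the INSERT statements above; build_snapshot_signal and the id naming pattern are illustrative, not an existing Phoenix helper.

```python
import json
from datetime import date


def build_snapshot_signal(table: str, label: str, day: date) -> tuple[str, str, str]:
    """Return (id, type, data) for an incremental snapshot signal row."""
    signal_id = f"resync-{label}-{day:%Y%m%d}"
    data = json.dumps({"data-collections": [table], "type": "incremental"})
    return signal_id, "execute-snapshot", data
```

Pass the returned tuple as bound parameters to a parameterized INSERT rather than formatting it into the SQL string.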

Connector Recovery Guardrails

| Situation | Safe First Action | Avoid |
|---|---|---|
| Task is FAILED with a transient MySQL or network error | cdc-ctl restart --only-failed | Deleting offsets |
| Connector config drifted from repo JSON | cdc-ctl apply --dry-run, then cdc-ctl apply | Manual config edits in Connect UI |
| Binlog position is near oldest available binlog | Backfill/repair plan before offset reset | Waiting until MySQL purges the required binlog |
| Schema history topic is missing or corrupt | Follow cdc/CDC_RECOVERY_LOG.md recovery sequence | snapshot.mode=never with empty schema history |
| Kafka Connect internal offsets are corrupt | Use cdc-ctl recover with confirmation | Deleting connect-offsets while Connect is still running |

Before touching offsets, capture connector status, offsets, topic state, MySQL binlogs, and a debug bundle:

CDC_CTL_ENV=production ./cdc-ctl status
CDC_CTL_ENV=production ./cdc-ctl offsets
CDC_CTL_ENV=production ./cdc-ctl topics verify
CDC_CTL_ENV=production ./cdc-ctl mysql check
CDC_CTL_ENV=production ./cdc-ctl debug --tarball

Deployment References

API Runtime

The API starts from the Phoenix Search package entrypoint and reads configuration from SEARCH_* environment variables. In production, the ALB can forward traffic under the /phoenix-search/* path prefix; the service strips that prefix internally.
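
Prefix stripping of this kind can be pictured as follows. This is a sketch of the general technique under the assumption that only an exact /phoenix-search segment is removed; it is not the service's actual middleware, and strip_alb_prefix is a hypothetical name.

```python
PREFIX = "/phoenix-search"


def strip_alb_prefix(path: str, prefix: str = PREFIX) -> str:
    """Remove the ALB path prefix so routes match with or without it."""
    if path == prefix:
        return "/"
    if path.startswith(prefix + "/"):
        return path[len(prefix):]
    # Paths that merely share leading characters are left untouched.
    return path
```

With this behavior, /phoenix-search/health and /health resolve to the same route.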

Core runtime dependencies:

| Dependency | Required For |
|---|---|
| Elasticsearch | User search and CDC freshness probe |
| Redis | Web session lookup and rate-limit state |
| MySQL | User detail lookup and search-scope resolution |
| OTLP endpoint | OpenTelemetry traces, metrics, and logs |
| Sentry DSN | Error reporting |

API Configuration Variables

Defaults are defined in search/settings.py. The settings prefix is SEARCH_, so a field named mysql_host is configured as SEARCH_MYSQL_HOST.

| Variable | Default | Description |
|---|---|---|
| SEARCH_HOST | 127.0.0.1 | Bind host for the API server |
| SEARCH_PORT | 8000 | Bind port |
| SEARCH_WORKERS_COUNT | 1 | Uvicorn/Gunicorn worker count |
| SEARCH_ENVIRONMENT | dev | Runtime environment: dev, e2e, pytest, staging, or production |
| SEARCH_LOG_LEVEL | INFO | Application log level |
| SEARCH_LOG_JSON | true | Emit JSON logs when true |
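
The naming rule (prefix plus upper-cased field name) can be illustrated with the standard library alone. This is not how search/settings.py is implemented, which may well use a settings library; env_name and read_setting are hypothetical helpers that only demonstrate the mapping.

```python
import os
from typing import Mapping, Optional

ENV_PREFIX = "SEARCH_"


def env_name(field: str) -> str:
    """Map a settings field name to its environment variable name."""
    return ENV_PREFIX + field.upper()


def read_setting(field: str, default: str, environ: Optional[Mapping[str, str]] = None) -> str:
    """Read one setting, falling back to the documented default."""
    source = os.environ if environ is None else environ
    return source.get(env_name(field), default)
```

For example, env_name("mysql_host") gives SEARCH_MYSQL_HOST, matching the rule described above.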

Elasticsearch:

| Variable | Default | Description |
|---|---|---|
| SEARCH_ES_URL | http://localhost:9210 | Elasticsearch endpoint |
| SEARCH_ES_USER | elastic | Elasticsearch username |
| SEARCH_ES_PASSWORD | empty | Elasticsearch password |
| SEARCH_ES_API_KEY | empty | Optional API key authentication |
| SEARCH_ES_CA_CERT | /app/certs/ca.crt | CA certificate path for HTTPS clusters |
| SEARCH_ES_MAX_RETRIES | 3 | ES client retry count |
| SEARCH_ES_REQUEST_TIMEOUT | 30 | ES client request timeout in seconds |
| SEARCH_ES_SEARCH_TIMEOUT | 5 | Per-search request timeout |
| SEARCH_ES_MAX_CONNECTIONS | 20 | ES HTTP connection pool size |
| SEARCH_ES_CAPTURE_SEARCH_QUERY | false | Capture search query in ES OTEL instrumentation |

Search behavior and freshness:

| Variable | Default | Description |
|---|---|---|
| SEARCH_DEFAULT_SIZE | 10 | Default hit count returned by user search |
| SEARCH_RECENCY_DECAY_ENABLED | false | Enables function_score recency ranking |
| SEARCH_RECENCY_DECAY_SCALE_DAYS | 90 | Recency decay scale |
| SEARCH_RECENCY_DECAY_OFFSET_DAYS | 7 | Recency grace period |
| SEARCH_RECENCY_DECAY_FACTOR | 0.5 | Recency decay factor |
| SEARCH_RECENCY_DECAY_WEIGHT | 1.5 | Recency scoring weight |
| SEARCH_CDC_STALE_THRESHOLD_SECONDS | 600 | CDC freshness threshold used by /health |
| SEARCH_CDC_PROBE_INTERVAL_SECONDS | 30 | CDC freshness probe interval |
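
To get a feel for what the recency defaults mean, the decay curve can be evaluated by hand. This page does not say which decay function Phoenix Search uses, so the sketch below assumes Elasticsearch's gauss decay and ignores how SEARCH_RECENCY_DECAY_WEIGHT is combined with the text score; gauss_decay is a hypothetical helper.

```python
import math


def gauss_decay(
    age_days: float,
    scale_days: float = 90.0,
    offset_days: float = 7.0,
    decay: float = 0.5,
) -> float:
    """Evaluate a gauss decay multiplier for a document of the given age.

    Documents younger than the offset keep a multiplier of 1.0. At
    offset + scale days the multiplier equals `decay`.
    """
    distance = max(0.0, age_days - offset_days)
    sigma_sq = -(scale_days ** 2) / (2.0 * math.log(decay))
    return math.exp(-(distance ** 2) / (2.0 * sigma_sq))
```

Under these assumptions, a record inside the 7-day grace period is not penalized, and one that is 97 days old gets half the recency multiplier.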

Redis, auth, and rate limiting:

| Variable | Default | Description |
|---|---|---|
| SEARCH_REDIS_HOST | localhost | Redis host for sessions and shared KV |
| SEARCH_REDIS_PORT | 7001 | Redis port |
| SEARCH_REDIS_CONNECTION_TYPE | cluster | cluster or standalone |
| SEARCH_REDIS_USER | empty | Redis username |
| SEARCH_REDIS_PASS | empty | Redis password |
| SEARCH_SESSION_COOKIE_AGE | 28800 | Session max age in seconds |
| SEARCH_JWT_SECRET | empty | Mobile JWT secret |
| SEARCH_PY2_JWT_SECRET | empty | Legacy Python 2 JWT secret |
| SEARCH_EPHEMERAL_JWT_SECRET | empty | Ephemeral web JWT secret |
| SEARCH_EPHEMERAL_JWT_MAX_AGE_SECONDS | 300 | Ephemeral token max age |
| SEARCH_RATE_LIMIT_REDIS_URL | empty | Dedicated Redis URL for dynamic rate limits; empty disables dynamic rate limiting |
| SEARCH_RATE_LIMIT_PER_MINUTE | 120 | Default search route limit |
| SEARCH_RATE_LIMIT_DEGRADED_PER_MINUTE | 30 | Degraded-mode route limit |
| SEARCH_RATE_LIMIT_EMERGENCY_PER_MINUTE | 5 | Emergency-mode route limit |
| SEARCH_RATE_LIMIT_CONFIG_CACHE_TTL | 5 | Seconds to cache the active rate-limit mode |
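
The rate-limit variables describe three limits and a short cache for the active mode. The sketch below shows how such a cache could behave with the default values. The mode names ("normal", "degraded", "emergency"), the CachedMode class, and the fallback to the strictest limit for unknown modes are assumptions made for illustration, not the service's implementation.

```python
import time
from typing import Callable, Optional

LIMITS_PER_MINUTE = {"normal": 120, "degraded": 30, "emergency": 5}


class CachedMode:
    """Cache the active rate-limit mode for a short TTL."""

    def __init__(
        self,
        fetch_mode: Callable[[], str],
        ttl_seconds: float = 5.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch_mode = fetch_mode  # e.g. a Redis lookup in a real system
        self._ttl = ttl_seconds
        self._clock = clock
        self._mode: Optional[str] = None
        self._expires_at = 0.0

    def current_limit(self) -> int:
        now = self._clock()
        if self._mode is None or now >= self._expires_at:
            self._mode = self._fetch_mode()
            self._expires_at = now + self._ttl
        # Unknown modes fall back to the strictest limit (a sketch choice).
        return LIMITS_PER_MINUTE.get(self._mode, LIMITS_PER_MINUTE["emergency"])
```

With a 5-second TTL, a mode change takes up to 5 seconds to reach every worker, in exchange for far fewer lookups per request.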

MySQL and telemetry:

| Variable | Default | Description |
|---|---|---|
| SEARCH_MYSQL_HOST | mysql-db | MySQL host |
| SEARCH_MYSQL_PORT | 3306 | MySQL port |
| SEARCH_MYSQL_USER | livehealth-local | MySQL username |
| SEARCH_MYSQL_PASSWORD | empty | MySQL password |
| SEARCH_MYSQL_DATABASE | livehealthapp | MySQL database |
| SEARCH_MYSQL_POOL_SIZE | 10 | MySQL pool size |
| SEARCH_MYSQL_POOL_RECYCLE | 3600 | MySQL connection recycle interval in seconds |
| SEARCH_OPENTELEMETRY_ENDPOINT | empty | OTLP gRPC endpoint |
| SEARCH_OPENTELEMETRY_API_KEY | empty | OTLP authorization value |
| SEARCH_DEPLOYMENT_REGION | ap-south-1 | Region resource attribute |
| SEARCH_SENTRY_DSN | empty | Sentry DSN |
| SEARCH_SENTRY_SAMPLE_RATE | 1.0 | Sentry tracing sample rate |

CDC Consumer Runtime

The CDC consumer should be deployed as an ECS task with:

| Setting | Value |
|---|---|
| Container image | <ECR_REPO>/phoenix-cdc:latest |
| Health port | 8080 |
| Liveness | GET /health |
| Readiness | GET /ready |
| Desired count | Start at 1; scale up to 6 if lag requires it |
| Environment | CDC_* variables from env file or Secrets Manager |

Operational Make Targets

| Target | Description |
|---|---|
| make up-dev | Start Redis, Elasticsearch, and Kibana |
| make up-all | Start all local services plus HyperDX |
| make down | Stop local containers |
| make run | Start the API |
| make lint | Run Ruff and mypy |
| make test | Unit tests without Docker |
| make test-docker | Full test suite with Docker infra |
| make es-api-key | Generate an Elasticsearch API key into .env |
| make register-pipeline | Register user-search-projection-pipeline |
| make run-cdc | Start Redpanda, Debezium Connect, Redpanda Console, and consumer |
| make down-cdc | Stop CDC containers |
| make destroy-cdc | Stop CDC containers and remove volumes |
| make cdc-logs | Tail CDC container logs |
| make redpanda-console | Open Redpanda Console at local port 8100 |
| make register-source-connector | Register both Debezium source connectors |
| make connector-status | Show Debezium connector health |
| make dashboards-push | Push HyperDX dashboards |
| make backfill-migrate | Populate user_meta from MySQL source tables |
| make backfill-run | Backfill Elasticsearch from the projection |

Metrics and Alerts

API Metrics

| Metric | Description |
|---|---|
| search_data_age_seconds | Age of the newest document in Elasticsearch |
| search_cdc_healthy | 1 when CDC freshness is below threshold, otherwise 0 |
| Search query metrics | Query count, duration, hit count, zero-result count |
| Auth metrics | Session and token auth success/failure counts |
| Bucket metrics | Hit count by response bucket |

CDC freshness is controlled by:

| Variable | Default | Description |
|---|---|---|
| SEARCH_CDC_STALE_THRESHOLD_SECONDS | 600 | Age above which search data is stale |
| SEARCH_CDC_PROBE_INTERVAL_SECONDS | 30 | Probe interval |

CDC Consumer Metrics

| Metric | Description |
|---|---|
| cdc_up | Poll loop is alive |
| cdc_last_poll_seconds_ago | Seconds since the last Kafka poll |
| cdc_last_success_seconds_ago | Seconds since the last processed message |
| cdc_consumer_lag | Total lag across partitions |
| cdc_messages_processed_total | Successfully processed messages |
| cdc_messages_failed_total | Failed messages |
| cdc_messages_dlq_total | Messages sent to DLQ |
| cdc_messages_retried_total | Retry attempts |

Alert Conditions

| Signal | Condition | Severity | First Check |
|---|---|---|---|
| cdc_up | == 0 for more than 2 minutes | Critical | Consumer task and Kafka connectivity |
| search_cdc_healthy / search.cdc.healthy | == 0 for more than 5 minutes | Critical | CDC consumer, connectors, Redpanda, Elasticsearch |
| search_data_age_seconds / search.data.age | Average over 300 seconds for more than 5 minutes | Warning | Consumer lag and processing latency |
| cdc_consumer_lag | Over 10,000 for 10-15 minutes | Warning | Scale consumer or inspect slow handlers |
| cdc_messages_failed_total | Increasing for 5 minutes | Warning | Consumer logs and DLQ |
| cdc_messages_dlq_total | Any increment | Warning | DLQ topic payload and original offset metadata |
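
Most of these conditions have the form "value matches X for more than N minutes". The sketch below shows one way to evaluate that over a list of scraped samples. It is not how the alerting backend evaluates rules; breached_for is a hypothetical helper, and samples are assumed to be (unix_timestamp, value) pairs sorted by time.

```python
from typing import Callable, Sequence


def breached_for(
    samples: Sequence[tuple[float, float]],
    predicate: Callable[[float], bool],
    min_seconds: float,
) -> bool:
    """Return True if `predicate` has held continuously, up to the latest
    sample, for more than `min_seconds`."""
    if not samples:
        return False
    latest_ts = samples[-1][0]
    breach_start = None
    # Walk backwards from the newest sample until the condition stops holding.
    for ts, value in reversed(samples):
        if not predicate(value):
            break
        breach_start = ts
    if breach_start is None:
        return False
    return latest_ts - breach_start > min_seconds
```

For the cdc_up rule, the call would be breached_for(samples, lambda v: v == 0, 120).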

Production Metrics Snapshot

These values are from the Phoenix Search production IN dashboard screenshots. Treat this section as observed Phase 1 production evidence, not a permanent SLO target.

For the full migration outcome, old-vs-new index comparison, request-count reduction, analyzer mapping notes, and OpenTelemetry advantages, see Post-Migration Results.

Success Signals

| Area | Observed Value | Why It Matters |
|---|---|---|
| API request errors | 0% error rate in the HTTP service dashboard | No visible request-level failure rate in the observed window |
| Top endpoint errors | 0 errors/min for the highest traffic endpoints | Search and detail routes were not producing endpoint-level errors |
| Unhandled 500s | No visible unhandled 500 series | No observed server-error spike |
| CDC health | 1 | CDC freshness probe reports healthy |
| ES data freshness | Around 3.6s document age in the dashboard tooltip | Search index is staying close to MySQL updates |
| ES cluster health | GREEN | Cluster is serving with expected shard health |
| ES active data nodes | 3 | All production data nodes are active |
| user_details index status | Open, Healthy | Main search index is available |

API Traffic and Latency

| Metric | Observed Value |
|---|---|
| Dominant endpoint | POST /api/v1/users/search |
| Search endpoint share | About 97.39% of endpoint time in the HTTP service view |
| Search endpoint request rate | About 242.5 req/min in the top endpoints table |
| Search endpoint median latency | About 22.76 ms |
| Search endpoint p95 latency | About 47.67 ms |
| Detail endpoint share | About 2.13% |
| Detail endpoint request rate | About 14.2 req/min |
| Detail endpoint median latency | About 6.99 ms |
| Detail endpoint p95 latency | About 10.11 ms |
| Overall request latency | Median roughly 21-23 ms; p95 roughly 43-48 ms across the shown screenshots |
| Peak request throughput | Periodic peaks near 230K-250K requests per dashboard bucket |

POST /api/v1/users/search is the only endpoint that materially drives API cost in the observed window. Detail lookup traffic is much smaller and faster.

Search, MySQL, and ES Query Performance

| Metric | Observed Value |
|---|---|
| ES query latency | p50 ~2.5 ms, p95 ~4.75 ms, p99 ~4.95 ms |
| MySQL query latency | p50 ~2.5 ms, p95 ~4.75 ms |
| ES hits per query | Around 4.6 in the dashboard tooltip |
| Top search keys by volume | patient_name and multi_center are the largest visible series |
| Zero-result shape | alpha is the largest visible zero-result shape |

The API p95 is higher than raw ES/MySQL query latency because it includes request handling, auth/session work, filter resolution, query building, response shaping, and network/runtime overhead.

Elasticsearch Cluster and Index Capacity

| Metric | Observed Value |
|---|---|
| user_details storage | 67.61 GB primary, 135.75 GB total |
| user_details shard layout | 6 primary shards, 1 replica |
| user_details documents | About 157.44M documents |
| HTTP connections | Periodic range around 15-55 |
| Data node CPU | Around 2-4% in the node load table |
| Coordinator CPU | Around 17-22% in the node load table |
| Search operation rate | Periodic per-node peaks near the 100K-140K chart range |
| Indexing operation rate | Periodic per-node peaks near the 300K-400K chart range |

The index is large enough that routing matters operationally. For point checks, always use the lab_id route:

curl -s -u elastic:<PASSWORD> \
  "https://<ES_HOST>:9200/user_details/_doc/<USER_DETAILS_ID>?routing=<LAB_ID>"

Dashboard Evidence

Screenshots referenced in this section (images not reproduced here):

  • Phoenix Search API production dashboard
  • Phoenix Search HTTP service metrics
  • Phoenix Search Elasticsearch production dashboard
  • user_details index overview


Dashboards

Dashboard definitions are maintained as code in the Phoenix Search repo.

| File | Dashboard |
|---|---|
| dashboards/cdc.json | CDC pipeline dashboard |
| dashboards/cdc-alerts.json | CDC alert definitions |
| dashboards/debezium.json | Debezium Connect dashboard |
| dashboards/redpanda.json | Redpanda broker dashboard |

Common commands:

phoenix-search dashboards push --api-key YOUR_KEY --env production
phoenix-search dashboards alerts-push \
  --api-key YOUR_KEY \
  --env production \
  --alert-channel slack_webhook \
  --alert-webhook-id YOUR_WEBHOOK_ID
phoenix-search dashboards diff --api-key YOUR_KEY

Makefile shortcut:

make dashboards-push API_KEY=your_key ENV=production

Common Debugging

API returns no search results

Check:

  1. Session resolves to the expected lab, branch, organization, and collection-center scope.
  2. The selected search_key includes the field being searched.
  3. search_fields does not filter out every valid field for the selected mapping.
  4. Elasticsearch routing matches the expected lab ID.
  5. The ES circuit breaker is not open.

Useful commands:

curl -s http://localhost:8000/health
curl -s http://localhost:8000/metrics | grep search
curl -s -u elastic:<PASSWORD> "http://localhost:9210/user_details/_count"
curl -s -u elastic:<PASSWORD> \
  "http://localhost:9210/user_details/_doc/<USER_DETAILS_ID>?routing=<LAB_ID>"

Always include routing=<LAB_ID> when checking a specific ES document. The user_details index requires routing, and Phoenix Search indexes each document under its lab_id.
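
When point checks are scripted, building the URL in one place makes it hard to forget the routing parameter. A minimal sketch follows; doc_url is a hypothetical helper, and the index name default comes from this page.

```python
from urllib.parse import quote, urlencode


def doc_url(es_base_url: str, user_details_id, lab_id, index: str = "user_details") -> str:
    """Build the _doc URL for one document, always including routing."""
    doc_id = quote(str(user_details_id), safe="")
    query = urlencode({"routing": str(lab_id)})
    return f"{es_base_url.rstrip('/')}/{index}/_doc/{doc_id}?{query}"
```

Because lab_id is a required argument, a call without routing fails immediately instead of producing a URL that misses the document.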

/health/ready returns 503

Read the dependency section in the response body. The readiness endpoint checks Elasticsearch, Redis, and MySQL.

curl -s http://localhost:8000/health/ready

Typical fixes:

| Dependency | First Check |
|---|---|
| Elasticsearch | Container status, ES credentials, cluster health |
| Redis | Host, port, cluster mode, credentials |
| MySQL | Host, security group, credentials, pool exhaustion |

CDC consumer not processing messages

Symptoms: the consumer /health returns 200 while /ready returns 503, or processed-message counters stop increasing.

curl -s http://<CONNECT_HOST>:8083/connectors/phoenix-source-existing/status
rpk group describe phoenix-cdc-unified
docker logs <container_id> --tail 100

Common causes:

  • Debezium connector is down.
  • Consumer group rebalance is stuck.
  • All messages are failing and retries are exhausting.

Debezium connector failed

curl -s http://<CONNECT_HOST>:8083/connectors/phoenix-source-existing/status
curl -X POST http://<CONNECT_HOST>:8083/connectors/phoenix-source-existing/tasks/0/restart

Common causes:

  • MySQL binlog expired and the connector lost its position.
  • MySQL network or credential failure.
  • Schema change on a captured table.

Messages are going to DLQ

rpk topic consume phoenix.cdc.dead-letter-queue --num 5

Check the message headers for the original topic, partition, and offset. Common causes are malformed Debezium events, MySQL pool exhaustion, and Elasticsearch mapping conflicts.

Consumer lag is increasing

rpk group describe phoenix-cdc-unified
curl -s http://<TASK_IP>:8080/metrics | grep cdc_messages_processed

Possible fixes:

  • Scale the consumer up to the partition count.
  • Increase CDC_MYSQL_POOL_SIZE if MySQL writes are bottlenecked.
  • Check MySQL with SHOW PROCESSLIST.
  • Check Elasticsearch write thread pool and mapping failures.

CDC phase detection is wrong

mysql -e "SHOW COLUMNS FROM user_meta" | grep full_name

Force the phase when needed:

CDC_PHASE_OVERRIDE=running
CDC_PHASE_OVERRIDE=migration

DDL is blocked on captured tables

Debezium can hold metadata locks on captured tables. Stop both connectors before DDL, then resume them after the migration.

curl -X PUT http://<CONNECT_HOST>:8083/connectors/phoenix-source-projection/stop
curl -X PUT http://<CONNECT_HOST>:8083/connectors/phoenix-source-existing/stop

# Run DDL here.

curl -X PUT http://<CONNECT_HOST>:8083/connectors/phoenix-source-projection/resume
curl -X PUT http://<CONNECT_HOST>:8083/connectors/phoenix-source-existing/resume

If the Kafka Connect version does not support stop, delete and recreate the connectors with the same names. Offsets are preserved in the Connect offsets topic as long as they are not purged.
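
If the stop and resume calls are scripted, wrapping them in a context manager ensures the connectors are resumed even when the DDL step raises. The sketch below takes the HTTP call as an injected function so it stays independent of any HTTP library; connectors_stopped and the put parameter are hypothetical, and resuming after a failed DDL is a design choice of this sketch rather than a documented policy.

```python
from contextlib import contextmanager
from typing import Callable, Iterator, Sequence

CONNECTORS = ("phoenix-source-projection", "phoenix-source-existing")


@contextmanager
def connectors_stopped(
    put: Callable[[str], object],
    connect_url: str,
    connectors: Sequence[str] = CONNECTORS,
) -> Iterator[None]:
    """Stop the connectors for the duration of the block, then resume them.

    `put` is any function that issues an HTTP PUT to the given URL.
    """
    stopped = []
    try:
        for name in connectors:
            put(f"{connect_url}/connectors/{name}/stop")
            stopped.append(name)
        yield
    finally:
        # Resume whatever was stopped, even if the DDL raised.
        for name in stopped:
            put(f"{connect_url}/connectors/{name}/resume")
```

The DDL then runs inside a with connectors_stopped(...) block, and the resume calls follow automatically.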


Backfill and Recovery

Use the Go backfill tool for initial migration, full reindexing, or recovery when CDC is offline.

make backfill-build
make backfill-migrate
make backfill-run BACKFILL_ENV=dev
make backfill-run BACKFILL_ARGS="--verify --verify-sample 50"

For a fresh migration:

make backfill-migrate-reset
make backfill-run-tuned BACKFILL_ENV=production

The migration first populates user_meta, then the ES backfill pushes projected rows to Elasticsearch with preflight checks. CDC should be coordinated around backfill so live changes do not fight the rebuild.


Clean Restart of CDC

Use this procedure only when CDC state is corrupted and normal connector or consumer restarts do not recover processing.

aws ecs update-service \
  --cluster phoenix \
  --service cdc-consumer \
  --desired-count 0

curl -X DELETE http://<CONNECT_HOST>:8083/connectors/phoenix-source-existing
curl -X DELETE http://<CONNECT_HOST>:8083/connectors/phoenix-source-projection

rpk group delete phoenix-cdc-unified

curl -X POST http://<CONNECT_HOST>:8083/connectors \
  -H "Content-Type: application/json" \
  -d @tools/cdc-ctl/connectors/source-connector-existing.production.json

curl -X POST http://<CONNECT_HOST>:8083/connectors \
  -H "Content-Type: application/json" \
  -d @tools/cdc-ctl/connectors/source-connector-projection.production.json

aws ecs update-service \
  --cluster phoenix \
  --service cdc-consumer \
  --desired-count 1

Source References

| Runbook Area | Source |
|---|---|
| API startup, health, metrics | search/web/application.py, search/web/lifespan.py, search/web/api/monitoring/views.py |
| API settings | search/settings.py |
| CDC deployment and troubleshooting | cdc/RUNBOOK.md |
| CDC config | cdc/consumers/config.py |
| CDC reference docs | cdc/docs/README.md |
| Backfill commands | README.md, backfill/RUNBOOK.md |
