| Tool |
Description |
list_clusters |
List all managed clusters (name, version, health) |
list_services |
List services on a cluster |
get_service_logs |
Extract service logs with time range filtering |
get_alerts |
Get cluster alerts and events |
get_service_metrics |
Time-series metrics via tsquery |
get_config |
Read service configuration |
update_config |
Write service configuration |
run_service_command |
Execute async CM command (restart, start, stop, deploy config, etc.) |
get_command_status |
Poll async command status |
get_host_status |
Host health, roles, rack info |
get_audit_events |
CM audit log (login, config changes, command executions) |
list_datahubs |
Enumerate DataHub clusters |
refresh_cluster_map |
Rebuild cluster→CM mapping after failover or new cluster |
get_mgmt_service |
CM Management Service health and role status (Host Monitor, Service Monitor, Alert Publisher, etc.) |
delete_service |
Delete a stale/orphaned service from a cluster — irreversible |
delete_role |
Delete a stale/decommissioned role instance — irreversible |
Destructive operations
delete_service and delete_role call DELETE on the CM API and cannot be undone.
The target must be in stopped state before deletion — CM will return an error otherwise.
Use run_service_command with command="stop" first if needed.
| Tool |
Description |
registry_list |
List registered CM instances (passwords excluded) |
registry_stats |
Statistics: total, active/inactive count, by environment |
registry_add |
Register a new CM instance at runtime |
registry_deactivate |
Soft-delete a CM instance (keeps it in registry) |
registry_update_field |
Update a single field (e.g. password, port) |
registry_reload |
Hot-reload registry from YAML/Iceberg without restart |
| Tool |
Description |
get_yarn_app |
Application details, diagnostics, resource usage and timing |
list_yarn_apps |
List applications filtered by state / queue / user |
get_yarn_queue |
Scheduler queue capacity and active/pending applications |
Spark History Server Tools
| Tool |
Description |
get_spark_app |
Spark application summary (duration, executor time, attempt count) |
get_spark_stages |
Stage details including failure reason (truncated to 300 chars) |
list_spark_apps |
List Spark applications filtered by status |
| Tool |
Description |
get_namenode_status |
NameNode health (HEALTHY / DEGRADED / CRITICAL), capacity, corrupt/missing blocks, HA state |
| Tool |
Description |
get_oozie_job |
Workflow or coordinator job details with action list and YARN app IDs |
list_oozie_jobs |
List jobs filtered by status / type / user |
Diagnostic Workflows
Job failed — why?
list_yarn_apps with state=FAILED → find the app_id
get_yarn_app → read diagnostics field
get_spark_stages with status=FAILED → find failureReason
get_service_logs → deep dive into YARN / Spark logs
HDFS issues?
get_namenode_status → check health_summary, corrupt_blocks, missing_blocks
get_alerts with severity=CRITICAL → related alerts
get_service_logs for HDFS → NameNode log details
Resource contention?
get_yarn_queue → check used_capacity vs capacity
list_yarn_apps with state=RUNNING → who is consuming resources
get_service_metrics → trend over time
CM internal health?
get_mgmt_service → Host Monitor, Service Monitor, Alert Publisher status
get_alerts → any unacknowledged critical alerts
get_host_status → per-host health and role assignment
Cleanup — remove stale service or role?
list_services → confirm the service name
run_service_command with command="stop" → stop the service first
delete_service → remove it from CM
For a single stale role (e.g. orphaned HiveServer2):
list_services → identify the service
- Stop the specific role via
run_service_command or CM UI
delete_role with the full role name (e.g. hive-HIVESERVER2-abc123def456)