Apache Mesos is an open-source project to manage computer clusters. It abstracts CPU, memory, storage and other compute resources away from machines (physical or virtual), enabling fault-tolerant and elastic distributed systems to be built and run effectively.

Why use the Apache Mesos Telegraf Plugin?

The Apache Mesos Telegraf Plugin allows you to collect observability metrics provided by the Mesos master and agent nodes and insert them into your InfluxDB instance. The plugin can collect a set of metrics that enable cluster operators to monitor resource usage and detect issues before they become a problem.

How to monitor Apache Mesos using the Telegraf plugin

The Apache Mesos Telegraf Plugin collects metrics from Apache Mesos and inserts them into InfluxDB. By default, it is not configured to gather metrics from any Mesos node, since a cluster can be deployed in numerous ways. You need to specify the master and slave (agent) nodes that the plugin should gather metrics from.
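
As a rough illustration of what specifying those nodes looks like, here is a minimal sketch of a Telegraf configuration section for this plugin. The host names and ports are placeholders (5050 and 5051 are the usual Mesos master and agent ports), and the option names and collection names are written as they appear in the plugin's sample configuration as best recalled, so check them against the sample configuration shipped with your Telegraf version before relying on them.

```toml
# Sketch only: hosts, ports, and the timeout value are placeholder assumptions.
[[inputs.mesos]]
  ## Timeout for each metrics request, in milliseconds.
  timeout = 100

  ## Mesos master endpoints to query.
  masters = ["http://localhost:5050"]

  ## Master metric groups to collect; remove entries you do not need.
  master_collections = [
    "resources",
    "master",
    "system",
    "agents",
    "frameworks",
    "framework_offers",
    "tasks",
    "messages",
    "evqueue",
    "registrar",
    "allocator",
  ]

  ## Mesos agent (slave) endpoints to query.
  slaves = ["http://localhost:5051"]

  ## Agent metric groups to collect; remove entries you do not need.
  slave_collections = [
    "resources",
    "agent",
    "system",
    "executors",
    "tasks",
    "messages",
  ]
```

Trimming the two collection lists is a simple way to reduce the number of series written to InfluxDB when only a few metric groups matter to you.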

Key Apache Mesos metrics to use for monitoring

Some of the important Apache Mesos metrics that you should proactively monitor are listed below.

Mesos master metric groups

Resources

  • master/cpus_percent Percentage of allocated CPUs
  • master/cpus_used Number of allocated CPUs
  • master/cpus_total Number of CPUs
  • master/cpus_revocable_percent Percentage of allocated revocable CPUs
  • master/cpus_revocable_total Number of revocable CPUs
  • master/cpus_revocable_used Number of allocated revocable CPUs
  • master/disk_percent Percentage of allocated disk space
  • master/disk_used Allocated disk space in MB
  • master/disk_total Disk space in MB
  • master/disk_revocable_percent Percentage of allocated revocable disk space
  • master/disk_revocable_total Revocable disk space in MB
  • master/disk_revocable_used Allocated revocable disk space in MB
  • master/gpus_percent Percentage of allocated GPUs
  • master/gpus_used Number of allocated GPUs
  • master/gpus_total Number of GPUs
  • master/gpus_revocable_percent Percentage of allocated revocable GPUs
  • master/gpus_revocable_total Number of revocable GPUs
  • master/gpus_revocable_used Number of allocated revocable GPUs
  • master/mem_percent Percentage of allocated memory
  • master/mem_used Allocated memory in MB
  • master/mem_total Memory in MB
  • master/mem_revocable_percent Percentage of allocated revocable memory
  • master/mem_revocable_total Revocable memory in MB
  • master/mem_revocable_used Allocated revocable memory in MB

Master

  • master/elected Whether this is the elected master
  • master/uptime_secs Uptime in seconds

System

  • system/cpus_total Number of CPUs available on this master node
  • system/load_15min Load average for the past 15 minutes
  • system/load_5min Load average for the past 5 minutes
  • system/load_1min Load average for the past minute
  • system/mem_free_bytes Free memory in bytes
  • system/mem_total_bytes Total memory in bytes

Slaves

  • master/slave_registrations
  • master/slave_removals
  • master/slave_reregistrations
  • master/slave_shutdowns_scheduled
  • master/slave_shutdowns_canceled
  • master/slave_shutdowns_completed
  • master/slaves_active
  • master/slaves_connected
  • master/slaves_disconnected
  • master/slaves_inactive
  • master/slave_unreachable_canceled
  • master/slave_unreachable_completed
  • master/slave_unreachable_scheduled
  • master/slaves_unreachable

Frameworks

  • master/frameworks_active
  • master/frameworks_connected
  • master/frameworks_disconnected
  • master/frameworks_inactive
  • master/outstanding_offers

Framework offers

  • master/frameworks/subscribed
  • master/frameworks/calls_total
  • master/frameworks/calls
  • master/frameworks/events_total
  • master/frameworks/events
  • master/frameworks/operations_total
  • master/frameworks/operations
  • master/frameworks/tasks/active
  • master/frameworks/tasks/terminal
  • master/frameworks/offers/sent
  • master/frameworks/offers/accepted
  • master/frameworks/offers/declined
  • master/frameworks/offers/rescinded
  • master/frameworks/roles/suppressed

Tasks

  • master/tasks_error
  • master/tasks_failed
  • master/tasks_finished
  • master/tasks_killed
  • master/tasks_lost
  • master/tasks_running
  • master/tasks_staging
  • master/tasks_starting
  • master/tasks_dropped
  • master/tasks_gone
  • master/tasks_gone_by_operator
  • master/tasks_killing
  • master/tasks_unreachable

Messages

  • master/invalid_executor_to_framework_messages
  • master/invalid_framework_to_executor_messages
  • master/invalid_status_update_acknowledgements
  • master/invalid_status_updates
  • master/dropped_messages
  • master/messages_authenticate
  • master/messages_deactivate_framework
  • master/messages_decline_offers
  • master/messages_executor_to_framework
  • master/messages_exited_executor
  • master/messages_framework_to_executor
  • master/messages_kill_task
  • master/messages_launch_tasks
  • master/messages_reconcile_tasks
  • master/messages_register_framework
  • master/messages_register_slave
  • master/messages_reregister_framework
  • master/messages_reregister_slave
  • master/messages_resource_request
  • master/messages_revive_offers
  • master/messages_status_update
  • master/messages_status_update_acknowledgement
  • master/messages_unregister_framework
  • master/messages_unregister_slave
  • master/messages_update_slave
  • master/recovery_slave_removals
  • master/slave_removals/reason_registered
  • master/slave_removals/reason_unhealthy
  • master/slave_removals/reason_unregistered
  • master/valid_framework_to_executor_messages
  • master/valid_status_update_acknowledgements
  • master/valid_status_updates
  • master/task_lost/source_master/reason_invalid_offers
  • master/task_lost/source_master/reason_slave_removed
  • master/task_lost/source_slave/reason_executor_terminated
  • master/valid_executor_to_framework_messages
  • master/invalid_operation_status_update_acknowledgements
  • master/messages_operation_status_update_acknowledgement
  • master/messages_reconcile_operations
  • master/messages_suppress_offers
  • master/valid_operation_status_update_acknowledgements

Event queue (evqueue)

  • master/event_queue_dispatches
  • master/event_queue_http_requests
  • master/event_queue_messages
  • master/operator_event_stream_subscribers

Registrar

  • registrar/state_fetch_ms
  • registrar/state_store_ms
  • registrar/state_store_ms/max
  • registrar/state_store_ms/min
  • registrar/state_store_ms/p50
  • registrar/state_store_ms/p90
  • registrar/state_store_ms/p95
  • registrar/state_store_ms/p99
  • registrar/state_store_ms/p999
  • registrar/state_store_ms/p9999
  • registrar/state_store_ms/count
  • registrar/log/ensemble_size
  • registrar/log/recovered
  • registrar/queued_operations
  • registrar/registry_size_bytes

Allocator

  • allocator/allocation_run_ms
  • allocator/allocation_run_ms/count
  • allocator/allocation_run_ms/max
  • allocator/allocation_run_ms/min
  • allocator/allocation_run_ms/p50
  • allocator/allocation_run_ms/p90
  • allocator/allocation_run_ms/p95
  • allocator/allocation_run_ms/p99
  • allocator/allocation_run_ms/p999
  • allocator/allocation_run_ms/p9999
  • allocator/allocation_runs
  • allocator/allocation_run_latency_ms
  • allocator/allocation_run_latency_ms/count
  • allocator/allocation_run_latency_ms/max
  • allocator/allocation_run_latency_ms/min
  • allocator/allocation_run_latency_ms/p50
  • allocator/allocation_run_latency_ms/p90
  • allocator/allocation_run_latency_ms/p95
  • allocator/allocation_run_latency_ms/p99
  • allocator/allocation_run_latency_ms/p999
  • allocator/allocation_run_latency_ms/p9999
  • allocator/roles/shares/dominant
  • allocator/event_queue_dispatches
  • allocator/offer_filters/roles/active
  • allocator/quota/roles/resources/offered_or_allocated
  • allocator/quota/roles/resources/guarantee
  • allocator/resources/cpus/offered_or_allocated
  • allocator/resources/cpus/total
  • allocator/resources/disk/offered_or_allocated
  • allocator/resources/disk/total
  • allocator/resources/mem/offered_or_allocated
  • allocator/resources/mem/total

Mesos slave metric groups

  • resources
    • slave/cpus_percent
    • slave/cpus_used
    • slave/cpus_total
    • slave/cpus_revocable_percent
    • slave/cpus_revocable_total
    • slave/cpus_revocable_used
    • slave/disk_percent
    • slave/disk_used
    • slave/disk_total
    • slave/disk_revocable_percent
    • slave/disk_revocable_total
    • slave/disk_revocable_used
    • slave/gpus_percent
    • slave/gpus_used
    • slave/gpus_total
    • slave/gpus_revocable_percent
    • slave/gpus_revocable_total
    • slave/gpus_revocable_used
    • slave/mem_percent
    • slave/mem_used
    • slave/mem_total
    • slave/mem_revocable_percent
    • slave/mem_revocable_total
    • slave/mem_revocable_used
  • agent
    • slave/registered
    • slave/uptime_secs
  • system
    • system/cpus_total
    • system/load_15min
    • system/load_5min
    • system/load_1min
    • system/mem_free_bytes
    • system/mem_total_bytes
  • executors
    • containerizer/mesos/container_destroy_errors
    • slave/container_launch_errors
    • slave/executors_preempted
    • slave/frameworks_active
    • slave/executor_directory_max_allowed_age_secs
    • slave/executors_registering
    • slave/executors_running
    • slave/executors_terminated
    • slave/executors_terminating
    • slave/recovery_errors
  • tasks
    • slave/tasks_failed
    • slave/tasks_finished
    • slave/tasks_killed
    • slave/tasks_lost
    • slave/tasks_running
    • slave/tasks_staging
    • slave/tasks_starting
  • messages
    • slave/invalid_framework_messages
    • slave/invalid_status_updates
    • slave/valid_framework_messages
    • slave/valid_status_updates

You can learn more about Apache Mesos metrics on the Mesos documentation page.

For more information, please check out the documentation.

Related Resources

Customer Case Study: Oracle

How Oracle created their Mesos stack monitoring using InfluxDB to monitor cluster utilization and thereby control cost.

Infrastructure and application monitoring

How to monitor your entire infrastructure stack, including servers, containers, databases and cloud services.

Why InfluxDB for Kubernetes monitoring?

Real-time visibility into your entire container-based environment to unify all your metrics and events for faster root cause analysis.