Monitoring | AllegroGraph 8.5.0

Overview

AllegroGraph provides server monitoring via a set of HTTP endpoints. The server exposes information about internal worker processes (backends and sessions), active queries (jobs), and storage reports for each triple store. Additionally, the server exposes a log of critical server events (audit log).

Prometheus metrics

AllegroGraph's main monitoring/observability endpoint is /metrics, which exposes various metrics in Prometheus format. This endpoint requires superuser permissions.

To reduce computational load on the system, the /metrics endpoint can return only metrics of a particular kind (category). The currently supported categories are:

system: comprehensive system metrics including CPU usage, memory utilization, disk I/O, network activity, and AllegroGraph-specific connection counts (backends, sessions, etc.);
jobs: information about active SPARQL queries (jobs) including total count and age statistics (maximum, minimum, and average age in seconds);
queries: information about queries, such as the total number of queries executed per active triple store, cumulative time, and the number of running queries;
indices: reports on repository index health including total indexed triples and optimization scores for each index by class; only returns metrics for the repositories currently in operation to avoid starting the dormant ones;
replication: monitors Multi-Master Replication status including commits behind primary, ingest queue length, controlling status, and replication state for each repository; like indices, only returns metrics for repositories currently in operation.

New metric kinds (categories) will be added in the future. Multiple kinds of metrics can be retrieved at once. For example, the following call

curl -u test:xyzzy 'https://<ag-host>:<ag-port>/metrics?kind=queries'

will only return queries metrics, while the call

curl -u test:xyzzy 'https://<ag-host>:<ag-port>/metrics?kind=system&kind=indices&kind=queries'

will return system, indices and queries metrics in a single response.
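When the request is built from a script rather than typed by hand, the repeated kind parameter can be generated from a list. The following is a minimal Python sketch; the host name ag.example.com is a placeholder, and the helper name metrics_url is made up for this example.

```python
from urllib.parse import urlencode

def metrics_url(base_url, kinds=()):
    """Return the /metrics URL, repeating the kind parameter once per category."""
    url = base_url.rstrip("/") + "/metrics"
    if kinds:
        # A list of (key, value) pairs keeps the order and repeats the key.
        url += "?" + urlencode([("kind", kind) for kind in kinds])
    return url

# ag.example.com is a placeholder host, not a real server.
print(metrics_url("https://ag.example.com:10035", ["system", "indices", "queries"]))
# prints https://ag.example.com:10035/metrics?kind=system&kind=indices&kind=queries
```

Building the query string this way also avoids shell quoting problems, since an unquoted & in a shell command line would otherwise end the command early.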

Tokens obtained from external authenticators like OIDC or LDAP can be used to avoid putting user credentials into Prometheus config or scripts. The same example as above but with an external token:

curl -H 'Authorization: Basic <OIDC or LDAP token>' 'https://<ag-host>:<ag-port>/metrics?kind=queries'
curl -H 'Authorization: Bearer <OIDC token>' 'https://<ag-host>:<ag-port>/metrics?kind=queries'
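The Bearer variant of the call above could be prepared from a script roughly as follows. This is only a sketch: the environment variable name AG_METRICS_TOKEN and the host ag.example.com are assumptions made for illustration, and the request is built but not sent.

```python
import os
from urllib.request import Request

def bearer_request(url, token):
    """Build (but do not send) a GET request carrying the token as a Bearer header."""
    return Request(url, headers={"Authorization": f"Bearer {token}"})

# AG_METRICS_TOKEN is a hypothetical variable name used only in this sketch.
token = os.environ.get("AG_METRICS_TOKEN", "<OIDC token>")
request = bearer_request("https://ag.example.com:10035/metrics?kind=queries", token)
print(request.get_method(), request.full_url)
```

Reading the token from the environment keeps it out of the script itself, which matches the motivation given above for using external tokens.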

All metrics follow Prometheus naming conventions and include labels for catalog and repository filtering where applicable. For example, indices metrics are reported for each repository, so the labels catalog and repository can be used to filter metrics for a particular repository.
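To illustrate label-based filtering outside of Prometheus itself, the sketch below parses a response in the Prometheus text format and keeps only the samples for one repository. The sample text, including the catalog and repository names and the values, is made up for this example, and the parser is deliberately simplified (it does not handle escaped quotes inside label values).

```python
import re

# One sample line: metric name, optional {labels}, then the value.
SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="([^"]*)"')

def parse_samples(text):
    """Yield (name, labels, value) for each sample line, skipping comments."""
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = SAMPLE_RE.match(line)
        if match:
            name, raw_labels, value = match.groups()
            yield name, dict(LABEL_RE.findall(raw_labels or "")), float(value)

def for_repository(samples, catalog, repository):
    """Keep only samples whose catalog and repository labels match."""
    return [s for s in samples
            if s[1].get("catalog") == catalog and s[1].get("repository") == repository]

# Made-up response body; real label values depend on the server.
EXAMPLE = """\
# TYPE allegrograph_indexed_triples gauge
allegrograph_indexed_triples{catalog="root",repository="demo"} 1200
allegrograph_indexed_triples{catalog="root",repository="other"} 50
"""

for name, labels, value in for_repository(parse_samples(EXAMPLE), "root", "demo"):
    print(name, labels["repository"], value)
```

In a Prometheus or Grafana setup the same selection would normally be written as a label matcher in the query rather than done in client code.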

List of exposed Prometheus metrics

The following is a list of all Prometheus-compatible metrics exposed by AllegroGraph, grouped by kind:

indices:
- allegrograph_index_oscore, gauge - OScore for each triple index in the repository.
- allegrograph_indexed_triples, gauge - Total number of indexed triples in the repository.
jobs:
- allegrograph_active_jobs, gauge - Total number of active jobs.
- allegrograph_active_jobs_age_seconds, gauge - Maximum, minimum and average age of jobs in seconds.
queries:
- allegrograph_queries_cache_hits_total, counter - Number of SPARQL queries read from results cache.
- allegrograph_queries_cache_misses_total, counter - Number of SPARQL queries bypassing the results cache.
- allegrograph_queries_duration_seconds_total, counter - Total duration of executed SPARQL queries.
- allegrograph_queries_failed_total, counter - Number of failed SPARQL queries.
- allegrograph_queries_total, counter - Total number of executed SPARQL queries.
replication:
- allegrograph_replication_commits_behind, gauge - Number of commits the replica is behind primary.
- allegrograph_replication_controlling, gauge - Whether this repository is controlling (1) or not (0).
- allegrograph_replication_ingest_queue_length, gauge - Number of items waiting in the replication ingest queue.
- allegrograph_replication_state_info, gauge - Current state of replication with state as a label.
system:
- allegrograph_backends_count, gauge - Number of backends.
- allegrograph_cpu_count, gauge - Number of CPUs.
- allegrograph_http_connections, gauge - HTTP connections.
- allegrograph_http_workers, gauge - HTTP workers.
- allegrograph_https_workers, gauge - HTTPS workers.
- allegrograph_memory_anon_bytes, gauge - Anonymous memory in bytes.
- allegrograph_proxy_connections, gauge - Proxy connections.
- allegrograph_server_timestamp, gauge - Server timestamp in milliseconds.
- allegrograph_sessions_count, gauge - Number of sessions.
- allegrograph_vmstat_context_switches_rate, gauge - Context switches per second.
- allegrograph_vmstat_cpu_idle_percent, gauge - CPU idle time percentage.
- allegrograph_vmstat_cpu_steal_percent, gauge - CPU steal time percentage.
- allegrograph_vmstat_cpu_system_percent, gauge - CPU system time percentage.
- allegrograph_vmstat_cpu_user_percent, gauge - CPU user time percentage.
- allegrograph_vmstat_cpu_wait_percent, gauge - CPU wait time percentage.
- allegrograph_vmstat_disk_blocks_in, gauge - Block input rate.
- allegrograph_vmstat_disk_blocks_out, gauge - Block output rate.
- allegrograph_vmstat_interrupts_rate, gauge - Interrupts per second.
- allegrograph_vmstat_memory_buffer_bytes, gauge - Buffer memory in bytes.
- allegrograph_vmstat_memory_cache_bytes, gauge - Cache memory in bytes.
- allegrograph_vmstat_memory_free_bytes, gauge - Free memory in bytes.
- allegrograph_vmstat_memory_swap_used_bytes, gauge - Used swap memory in bytes.
- allegrograph_vmstat_processes_blocked, gauge - Blocked processes.
- allegrograph_vmstat_processes_running, gauge - Running processes.
- allegrograph_vmstat_swap_in_rate, gauge - Swap-in rate.
- allegrograph_vmstat_swap_out_rate, gauge - Swap-out rate.
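Because the queries metrics are cumulative counters, useful figures usually come from the difference between two scrapes. The following Python sketch derives a cache hit ratio and a mean query duration from two snapshots. It assumes that each snapshot is a plain dictionary of metric name to value for one repository, and that hits plus misses is a meaningful denominator; neither assumption is confirmed by the list above.

```python
def increase(prev, curr, name):
    """Increase of a counter between two scrapes; a drop is treated as a reset."""
    diff = curr[name] - prev[name]
    return diff if diff >= 0 else curr[name]

def cache_hit_ratio(prev, curr):
    """Fraction of cache lookups that were hits, or None if there were no lookups."""
    hits = increase(prev, curr, "allegrograph_queries_cache_hits_total")
    misses = increase(prev, curr, "allegrograph_queries_cache_misses_total")
    total = hits + misses
    return hits / total if total else None

def mean_query_seconds(prev, curr):
    """Average duration of queries executed between the scrapes, or None if none ran."""
    count = increase(prev, curr, "allegrograph_queries_total")
    seconds = increase(prev, curr, "allegrograph_queries_duration_seconds_total")
    return seconds / count if count else None
```

In Prometheus the same idea is typically expressed with rate() or increase() over a time window instead of manual snapshots.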

Example Grafana dashboard for AllegroGraph

An importable JSON description of an example Grafana dashboard for AllegroGraph can be found in the agraph-examples repository on GitHub. Below are a few screenshots.

Grafana System Section

Grafana Server Section

Grafana Repository Section

The included Prometheus configuration, which collects metrics of all supported kinds every 5 seconds, is shown below:

scrape_configs:  
  - job_name: allegrograph  
    scrape_interval: 5s  
    metrics_path: /metrics  
    basic_auth:  
      username: test  
      password: xyzzy  
    static_configs:  
      - targets: ['localhost:10035']  
        labels: { kind: 'jobs' }  
      - targets: ['localhost:10035']  
        labels: { kind: 'replication' }  
      - targets: ['localhost:10035']  
        labels: { kind: 'indices' }  
      - targets: ['localhost:10035']  
        labels: { kind: 'system' }  
      - targets: ['localhost:10035']  
        labels: { kind: 'queries' }  
    relabel_configs:  
      - source_labels: [kind]  
        target_label: __param_kind
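The configuration above relies on a Prometheus convention: a target label named __param_<name> becomes the URL parameter <name> of the scrape request. The Python sketch below is only a simplified model of that behaviour, not Prometheus code; it shows which URL each target entry ends up requesting, assuming the default http scheme.

```python
from urllib.parse import urlencode

def scrape_url(target, labels, metrics_path="/metrics", scheme="http"):
    """Model the effective scrape URL for one target after the relabel rule."""
    relabeled = dict(labels)
    # Mimic the relabel rule: copy the "kind" label into "__param_kind".
    if "kind" in relabeled:
        relabeled["__param_kind"] = relabeled["kind"]
    # Every __param_<name> label turns into a URL query parameter <name>.
    params = [(key[len("__param_"):], value)
              for key, value in relabeled.items() if key.startswith("__param_")]
    query = "?" + urlencode(params) if params else ""
    return f"{scheme}://{target}{metrics_path}{query}"

for kind in ["jobs", "replication", "indices", "system", "queries"]:
    print(scrape_url("localhost:10035", {"kind": kind}))
```

Listing the same address once per kind therefore results in one scrape request per category, and the kind label stays attached to the collected series.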

Standalone monitoring endpoints

AllegroGraph provides standalone endpoints for monitoring certain parts of the system. The most important of these is the Audit Log endpoint /auditLog, which is not available through the Prometheus-compatible /metrics feature. To read more about the structured system audit log, which tracks important changes to the server and its triple stores, please see Auditing.

Other standalone endpoints are legacy endpoints and are either already fully integrated into the Prometheus /metrics endpoint or will be in the future. These include:

/processes. Useful for listing all processes spawned by AllegroGraph and checking their resource (CPU/memory) usage. Many other tools can also report process statistics. AllegroGraph additionally provides systemstat information for a given pid via /systemstat.json?requestTree=sys.processes.pid<pid id>&startUt=<time-in-milliseconds>&id=<integer>. This endpoint is undocumented because it is used only by WebView to render charts on the "Processes" (/webview/admin/processes) and "Server stats" (/webview/utils/systemstat) pages.
/jobs. A job is a SPARQL query execution. The most important field in the response is ageInSeconds, which can be used to detect queries that run abnormally long. Optionally, a DELETE /jobs?jobId=<job-id> call can be used to cancel a query.
/session. To read more about AllegroGraph sessions please see the Sessions section.
/catalogs/catalog-name/repositories/repository-name/reports. This is a legacy repository reports system which will eventually be fully integrated into the /metrics endpoint and deprecated. If you send this request without the path parameter, you will get the list of supported path values. An example of a path for retrieving the storage layer summary: /repositories/<repo-name>/reports?path=storage.
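As an illustration of the /jobs use case described above, the sketch below picks out long-running jobs from an already parsed response and builds the path for the cancelling DELETE call. It assumes that the response decodes to a list of objects, each with an ageInSeconds field; the jobId field name in the response is a guess based on the DELETE parameter and may differ on a real server.

```python
from urllib.parse import urlencode

def long_running(jobs, max_age_seconds):
    """Return jobs older than the threshold, oldest first."""
    slow = [job for job in jobs if float(job.get("ageInSeconds", 0)) > max_age_seconds]
    return sorted(slow, key=lambda job: float(job["ageInSeconds"]), reverse=True)

def cancel_path(job_id):
    """Path and query string for the DELETE call that cancels a job."""
    return "/jobs?" + urlencode({"jobId": job_id})

# Made-up response data; the "jobId" field name is an assumption.
jobs = [
    {"jobId": "a1", "ageInSeconds": 3},
    {"jobId": "b2", "ageInSeconds": 900},
    {"jobId": "c3", "ageInSeconds": 120},
]
for job in long_running(jobs, max_age_seconds=60):
    print(job["ageInSeconds"], cancel_path(job["jobId"]))
```

The threshold of 60 seconds is arbitrary; a suitable value depends on the workload.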

Please contact Franz Support if you would like monitoring for a feature that is not covered in the sections above.
