> For the complete documentation index, see [llms.txt](https://upsolver.gitbook.io/content/llms.txt). Markdown versions of documentation pages are available by appending `.md` to page URLs; this page is available as [Markdown](https://upsolver.gitbook.io/content/reference-1/monitoring/job-status/stream-and-file-sources/monitoring-v1/cluster.md).

# Cluster

The following metrics help you diagnose and troubleshoot performance issues with the cluster on which your job is running.

### Utilization Percent

{% tabs %}
{% tab title="Metric" %}


|  |  |
| --- | --- |
| Metric type | Warning |
| About this metric | Represents how much of the server's processing capacity is in use. |
| Limits | Error when > 90%; warn when > 70% |
| Timeframe | Now |
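
The limits above amount to a two-tier threshold check. As a minimal sketch, assuming the metric is held as a fraction between 0 and 1 (as the comparisons against `0.9` and `0.7` in the query on the SQL tab suggest), the same logic can be written in Python. The function name `utilization_severity` is hypothetical and not part of Upsolver.

```python
def utilization_severity(utilization: float) -> str:
    """Classify a utilization fraction (0.0 to 1.0) using this page's limits."""
    if utilization > 0.9:  # Error when > 90%
        return "ERROR"
    if utilization > 0.7:  # Warn when > 70%
        return "WARNING"
    return "OK"


# Illustrative values only; they are not taken from a real cluster.
for value in (0.65, 0.70, 0.85, 0.95):
    print(f"{value:.0%} -> {utilization_severity(value)}")
```

Both comparisons are strict, so a value of exactly 70% is still reported as `OK` and exactly 90% as `WARNING`.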

#### More information

Represents how much of the server's processing capacity is in use.
{% endtab %}

{% tab title="See All Events (SQL Syntax)" %}
Run the following SQL command in a query window in Upsolver, replacing **\<cluster_id>** with the ID of your cluster. For additional columns, alter this statement to use `SELECT *`.

{% code overflow="wrap" %}
```sql
SELECT
  STRING_FORMAT('{0,number,#,###.##}%', utilization_percent * 100) AS utilization_percent,
  IF_ELSE(utilization_percent > 0.9, 'ERROR',
    IF_ELSE(utilization_percent > 0.7, 'WARNING', 'OK')) AS utilization_percent_severity
FROM system.monitoring.clusters
WHERE cluster_id = '<cluster_id>';
```
{% endcode %}
{% endtab %}

{% tab title="Troubleshooting" %}
Consider increasing the server limit or moving larger jobs to a separate compute cluster.
{% endtab %}
{% endtabs %}

### Tasks in Queue

{% tabs %}
{% tab title="Metrics" %}


|  |  |
| --- | --- |
| Metric type | Informational |
| About this metric | The number of job tasks pending execution in the cluster queue. |
| Timeframe | Now |

#### More information

The number of job tasks pending execution in the cluster queue. This represents the amount of work not currently being processed because the cluster is at high utilization.

This number can be above 0 even when [Utilization Percent](#utilization-percent) is below 100%, because of how work is distributed between servers. Work may be allocated to a specific server that is at 100% utilization while other servers in the cluster have free slots but are not going to pick up that particular work, so the Tasks in Queue value might be greater than 0. Upsolver rebalances the work to ensure this doesn't continue over time.

The Tasks in Queue value can be smaller than the number of [Job executions currently in queue](/content/reference-1/monitoring/job-status/stream-and-file-sources/monitoring-v1/job-execution-status.md#job-executions-currently-in-queue) because the cluster adds tasks to the queue in chunks rather than all at once. For example, if you replay 1,000,000 tasks, [Job executions currently in queue](/content/reference-1/monitoring/job-status/stream-and-file-sources/monitoring-v1/job-execution-status.md#job-executions-currently-in-queue) shows 1,000,000, but Tasks in Queue shows only the next chunk of work, e.g. 1,000.
{% endtab %}

{% tab title="See All Events (SQL Syntax)" %}
Run the following SQL command in a query window in Upsolver, replacing **\<cluster_id>** with the ID of your cluster. For additional columns, alter this statement to use `SELECT *`.

{% code overflow="wrap" %}
```sql
SELECT
  STRING_FORMAT('{0,number,#,###}', tasks_in_queue) AS tasks_in_queue,
  'OK' AS tasks_in_queue_severity
FROM system.monitoring.clusters
WHERE cluster_id = '<cluster_id>';
```
{% endcode %}
{% endtab %}
{% endtabs %}

### Memory Load Index (%GC)

{% tabs %}
{% tab title="Metric" %}


|  |  |
| --- | --- |
| Metric type | Warning |
| About this metric | The percentage of time that the server is doing garbage collection rather than working. |
| Limits | Error when > 10% |
| Timeframe | Now |

#### More information

The percentage of time that the server spends on garbage collection rather than working should generally be under 10%. If this value is red while the cluster is not doing much work, or the cluster is crashing, it often indicates that you need larger servers with more memory.
{% endtab %}

{% tab title="See All Events (SQL Tasks)" %}
Run the following SQL command in a query window in Upsolver, replacing **\<cluster_id>** with the ID of your cluster. For additional columns, alter this statement to use `SELECT *`.

{% code overflow="wrap" %}
```sql
SELECT
  STRING_FORMAT('{0,number,#,###.##}%', memory_load_percent * 100) AS memory_load_percent,
  IF_ELSE(memory_load_percent > 0.1, 'ERROR', 'OK') AS memory_load_percent_severity
FROM system.monitoring.clusters
WHERE cluster_id = '<cluster_id>';
```
{% endcode %}
{% endtab %}
{% endtabs %}

### Cluster Crashes

{% tabs %}
{% tab title="Metric" %}


|  |  |
| --- | --- |
| Metric type | Warning |
| About this metric | How many server crashes happened in the job’s cluster today. |
| Limits | Error when > 0 |
| Timeframe | Today (midnight UTC to now) |

#### More information

How many server crashes happened in the job's cluster today.
{% endtab %}

{% tab title="See All Events (SQL Syntax)" %}
Run the following SQL command in a query window in Upsolver, replacing **\<cluster_id>** with the ID of your cluster. For additional columns, alter this statement to use `SELECT *`.

{% code overflow="wrap" %}
```sql
SELECT
  STRING_FORMAT('{0,number,#,###}', crashes) AS crashes,
  IF_ELSE(crashes > 0, 'ERROR', 'OK') AS crashes_severity
FROM system.monitoring.clusters
WHERE cluster_id = '<cluster_id>';
```
{% endcode %}
{% endtab %}

{% tab title="Troubleshooting" %}
Crashes are typically caused by memory issues when loading large lookup tables. Consider changing the cluster to a type with more RAM.
{% endtab %}
{% endtabs %}

### Reload from disk percent

{% tabs %}
{% tab title="Metric" %}


|  |  |
| --- | --- |
| Metric type | Warning |
| About this metric | The percent of bytes re-loaded into memory from disk. |
| Limits | Error when > 200%; warn when > 30% |
| Timeframe | Today (midnight UTC to now) |

#### More information

The percent of bytes re-loaded into memory from disk. High values indicate that more memory is required, because many page faults result in slow processing.

The metric represents how much extra work is done loading data because there is not enough memory on the server. Each server has an in-memory cache; Reload from disk percent reflects how many operations are served by the cache versus how many have expired from the cache and need to be reloaded. Values above 0% are fine, but if the value is significant and over 200%, the cluster is working too hard due to a lack of memory.
{% endtab %}

{% tab title="See All Events (SQL Syntax)" %}
Run the following SQL command in a query window in Upsolver, replacing **\<cluster_id>** with the ID of your cluster. For additional columns, alter this statement to use `SELECT *`.

{% code overflow="wrap" %}
```sql
SELECT
  STRING_FORMAT('{0,number,#,###.##}%', reload_from_disk_percent * 100) AS reload_from_disk_percent,
  IF_ELSE(reload_from_disk_percent > 2, 'ERROR',
    IF_ELSE(reload_from_disk_percent > 0.3, 'WARNING', 'OK')) AS reload_from_disk_percent_severity
FROM system.monitoring.clusters
WHERE cluster_id = '<cluster_id>';
```
{% endcode %}
{% endtab %}

{% tab title="Troubleshooting" %}
Change the cluster to a type with more RAM.
{% endtab %}
{% endtabs %}
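
The limits for the cluster metrics on this page can also be collected into one lookup table, which may be convenient if you export the values to your own alerting script. The following Python sketch is an illustration under stated assumptions, not part of Upsolver: the keys reuse the column names from the queries on this page, the names `LIMITS`, `severity`, and `summarize` are hypothetical, and `sample_row` contains made-up values. Tasks in Queue is informational and has no limits, so it is left out of the table.

```python
# (warning_limit, error_limit) per column, expressed as fractions where the
# page gives percentages. None means the page defines no limit at that level.
LIMITS = {
    "utilization_percent": (0.7, 0.9),
    "memory_load_percent": (None, 0.1),
    "crashes": (None, 0),
    "reload_from_disk_percent": (0.3, 2.0),
}

# Ordering used to pick the worst severity across metrics.
RANK = {"OK": 0, "WARNING": 1, "ERROR": 2}


def severity(column, value):
    """Return 'OK', 'WARNING', or 'ERROR' for one metric value."""
    warn, error = LIMITS[column]
    if error is not None and value > error:
        return "ERROR"
    if warn is not None and value > warn:
        return "WARNING"
    return "OK"


def summarize(row):
    """Return the worst severity and the per-metric severities for a row."""
    per_metric = {column: severity(column, row[column]) for column in LIMITS}
    worst = max(per_metric.values(), key=RANK.__getitem__)
    return worst, per_metric


# Hypothetical values for illustration; not output from a real cluster.
sample_row = {
    "utilization_percent": 0.82,
    "tasks_in_queue": 12,
    "memory_load_percent": 0.04,
    "crashes": 0,
    "reload_from_disk_percent": 0.1,
}

worst, per_metric = summarize(sample_row)
print(worst)
for column, level in per_metric.items():
    print(f"{column}: {level}")
```

Keeping the limits in a single table makes it easy to compare them against this page if the documented thresholds change.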