Optimization and AI Touchpoints – AI Metrics

SOURCE: Kay Sever | January 29, 2026

After several months on an AI learning curve, I have decided it’s time to write about my observations and insights, and about ways to selectively apply AI within the optimization framework to create a better result. Therefore, our topic for 2026 will be the intersecting points of Optimization and AI as they relate to productive assets, the organization and the management system. We are told that AI will “know all”, so we expect it to be helpful no matter the question asked or problem to be solved. The more you learn about AI models (LLMs: large language models) and their strengths and weaknesses, the clearer your vision will be for incorporating an AI model into your workflow. 

Listening to AI experts discuss AI capabilities, system elements and performance metrics can be almost like listening to a foreign language. In this article and in future articles, I will do my best to define AI-specific terminology in simple terms. Here’s a partial list of LLMs in the news: Grok, Claude (Sonnet and Opus), DeepSeek, Gemini, GPT and Qwen. 

This month we will review some of the KPIs currently used to quantify LLM capabilities and rank LLMs from best to worst. 

Assessing AI models using KPIs

Any time you consider an equipment purchase, you look at a set of metrics to determine which manufacturer will best meet your needs. Without knowing a unit’s productivity, capacity, expected life and operating cost, you could not be confident that your purchasing decision was the best match for your needs. If you decide to incorporate AI into workflows, understanding the metrics experts use to rank LLMs is an important step in that decision. 

LLMs are measured using four kinds of metrics:

  • Comparison KPIs that evaluate LLMs as they compete with each other using pre-determined datasets.
  • Accuracy KPIs that measure the accuracy of responses compared to available reference data.
  • Processing KPIs that measure processing speed and throughput.     
  • System Capacity KPIs that measure system maximums/constraints.  

The following sections list the KPIs under each category with a brief explanation. 

Comparison KPIs

There are three KPIs used to assess speed and “predictive capability” when LLMs compete against each other. These KPIs are used to rank LLMs from best to worst in performance. Matches between competing LLMs are scheduled periodically to test new LLM features and determine whether their scores have improved.  

Arena Score/Elo 

Arena Score and Elo are metrics originally used to measure competitiveness in games (that’s right… games, not AI). Each system (LLM) starts with an Arena Score of 1500. Depending on how the models score against each other and how many matches they play, points are added or subtracted (roughly +/-30 per match) based on mistakes made or consistent wins over the number of matches played. Elo is the adjustment calculation for an Arena Score. You may see Arena Scores mentioned in press releases about who is leading the AI race.    
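The update described above follows the standard Elo rule from competitive games. Here is a minimal sketch of that rule; the 1500 starting score matches the article, while the K-factor of 32 and the 400-point scale are common chess conventions, and real AI leaderboards use their own variations of this formula.

```python
def expected_score(rating_a, rating_b):
    """Probability that model A beats model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def elo_update(rating_a, rating_b, result_a, k=32):
    """Return both models' new ratings after one match.
    result_a: 1.0 if A wins, 0.0 if A loses, 0.5 for a tie."""
    exp_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (result_a - exp_a)
    new_b = rating_b + k * ((1.0 - result_a) - (1.0 - exp_a))
    return new_a, new_b

# Two models start at 1500; model A wins one head-to-head match.
a, b = elo_update(1500, 1500, 1.0)
print(a, b)  # 1516.0 1484.0
```

Note how the winner gains exactly what the loser drops: Elo is zero-sum, so an LLM's Arena Score only rises by consistently beating opponents rated near or above it.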

R-Squared 

R-Squared tests AI “knowledge” (the definition of AI knowledge seems to morph through time). LLMs process datasets and “predict” the answers. The higher the R-Squared score (a decimal less than 1), the more accurate the LLM is reported to be. (Example: if one LLM scored 0.8 and another LLM scored 0.5, the 0.8 score would be viewed as higher/better.)  

Accuracy KPIs

When assessing the accuracy of an LLM, several KPIs apply. These KPIs fall into one of three categories: Accuracy, Regression and Generative.  

Accuracy KPIs

If an LLM gets most predictions right but struggles with specific errors, the following KPIs would measure that problem:

  • Accuracy – proportion (ratio) of correct predictions (true positives and true negatives) to all predictions
  • Precision – proportion of positive predictions that are actually correct (true positives over all positive predictions)
  • Recall (sensitivity) – proportion of actual positives that are captured; reflects thoroughness of searches
  • F1 Score – balances precision and recall. F1 = 2 x (Precision x Recall)/(Precision + Recall)    
  • AUC-ROC – Area Under the Receiver Operating Characteristic Curve. AUC reflects overall performance (1.0 is perfect, 0.5 is random) 
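The first four KPIs above can all be computed from the four counts of a confusion matrix. A minimal sketch (the tp/fp/tn/fn counts are illustrative, not from any real benchmark):

```python
def classification_kpis(tp, fp, tn, fn):
    """Compute the KPIs above from confusion-matrix counts:
    tp/fp = true/false positives, tn/fn = true/false negatives."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)          # a.k.a. sensitivity
    f1 = 2 * (precision * recall) / (precision + recall)
    return accuracy, precision, recall, f1

# Illustrative run: 80 hits, 10 false alarms, 90 correct rejections, 20 misses.
acc, prec, rec, f1 = classification_kpis(tp=80, fp=10, tn=90, fn=20)
print(f"accuracy={acc:.2f} precision={prec:.3f} recall={rec:.2f} f1={f1:.3f}")
```

Notice that precision and recall pull in opposite directions (fewer false alarms vs. fewer misses); F1 is the harmonic mean that penalizes neglecting either one.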

Regression KPIs

These KPIs measure the variance between predicted continuous values and actual values.

  • MAE – Mean Absolute Error (the average of the absolute differences between predicted and actual values)
  • MSE – Mean Squared Error – squares errors before averaging
  • RMSE – Root Mean Squared Error – square root of MSE
  • R-Squared (Coefficient of Determination) – how much of the variance is explained by the model
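All four regression KPIs above can be sketched in a few lines on toy data (the actual/predicted values below are illustrative):

```python
import math

actual    = [3.0, 5.0, 7.0, 9.0]
predicted = [2.5, 5.5, 6.5, 9.5]

n = len(actual)
errors = [p - a for p, a in zip(predicted, actual)]

mae  = sum(abs(e) for e in errors) / n   # Mean Absolute Error
mse  = sum(e * e for e in errors) / n    # Mean Squared Error
rmse = math.sqrt(mse)                    # Root Mean Squared Error

# R-Squared: share of the actuals' variance explained by the model.
mean_actual = sum(actual) / n
ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
ss_tot = sum((a - mean_actual) ** 2 for a in actual)
r2 = 1 - ss_res / ss_tot

print(mae, mse, rmse)  # 0.5 0.25 0.5
```

Because MSE squares each error before averaging, it punishes large misses much harder than MAE does; RMSE converts that back into the original units.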

Generative KPIs (texts or images)

  • MAP – Mean Average Precision – averages precision across recall levels in detection exercises
  • IOU – Intersection over Union – quantifies overlap between predicted and “ground-truth” bounding boxes (i.e., boxes from pre-established test datasets)
  • Perplexity – measures how well text is predicted
  • BLEU – Bilingual Evaluation Understudy – measures n-gram overlap (comparison of sequences of words)
  • ROUGE – Recall-Oriented Understudy for Gisting Evaluation – measures recall of n-grams in summaries      
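Of these, IOU is the easiest to picture: it is the overlap area divided by the combined area of two boxes. A minimal sketch for axis-aligned bounding boxes given as (x1, y1, x2, y2); the coordinates are illustrative:

```python
def iou(box_a, box_b):
    """Intersection over Union of two (x1, y1, x2, y2) bounding boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Overlap rectangle (zero width/height if the boxes don't intersect).
    iw = max(0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union

# A predicted box vs. a ground-truth box shifted halfway along x:
print(iou((0, 0, 10, 10), (5, 0, 15, 10)))  # ~0.333
```

A score of 1.0 means a perfect match; detection benchmarks commonly count a prediction as correct only above some IOU threshold (0.5 is a frequently used cutoff).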

Processing KPIs

These ten metrics reflect input/output stats, speed of processing and productivity. 

  • Latency – the time between receiving an input and producing a complete output, measured in seconds or milliseconds
  • Throughput – TPS (tokens per second) or requests per minute. Tokens are “units of text”.
  • Perplexity – measures the accuracy of the next token predicted vs. the actual next token, lower values = higher accuracy
  • Token Efficiency – number of tokens per task, verbose responses affect this value
  • Inference Time – processing time for computations and “forward passes” through the LLM (excluding preprocessing). Forward passes are processes for passing data through layers of a neural network to generate an output.
  • Memory Usage – Gigabytes of RAM required for processing.
  • FLOPs – Floating Point Operations – Number of computations needed per inference. Often expressed in trillions.
  • BLEU Score – When translating or summarizing text, this measures the n-gram overlap between generated and reference text.
  • ROUGE Score – Measures recall of overlapping n-grams, words or sequences in generated text.
  • Human Evaluation Metrics (Fluency, Relevance, Coherence) – Researchers and analysts use Likert scales to rate qualities such as these, and the COSMIN system to measure comprehensiveness for health research.   
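Perplexity deserves a worked example, since it appears in both the Generative and Processing lists. It is the exponential of the average negative log-probability the model assigned to each actual next token; the probabilities below are illustrative, not from a real model:

```python
import math

def perplexity(token_probs):
    """exp of the average negative log-probability; lower = better."""
    n = len(token_probs)
    return math.exp(-sum(math.log(p) for p in token_probs) / n)

# A model that gives every correct token probability 0.25 scores a
# perplexity of about 4 -- as if it were guessing uniformly among
# 4 candidate tokens at every step.
print(perplexity([0.25, 0.25, 0.25, 0.25]))
```

That "effective number of choices" reading is why lower perplexity indicates a sharper, more confident predictor.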

System Capacity KPIs

These metrics reflect system processing maximums. 

  • Context Window Size – Maximum number of tokens an LLM can process in a single request. Often measured in thousands of tokens (4K, 128K, etc.)
  • Maximum Throughput – Maximum tokens processed under peak load. Tokens/second in batch mode.
  • Parameter Count – Total number of “trainable weights” that reflect informational capacity (in billions). Note on trainable weights: LLMs contain layers of information used to respond to tokens. Some layers are trainable (changeable/can be updated); others are not. The weighting of layers controls how LLMs rank reference information as most important, less important or not important.
  • Memory Footprint – RAM required to load and run an LLM.
  • FLOPs per Inference – The Floating Point Operations for running one forward pass (expressed in teraFLOPs or petaFLOPs).
  • Scalability Under Load – Queries per minute before response time slows.
  • Token Limit per Generation – A cap on the number of tokens allowed in a single response.
  • Batch Size Capacity – Maximum number of inputs processed in parallel (hardware limited).
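Parameter Count and Memory Footprint are directly linked: loading the weights alone needs roughly parameters times bytes per parameter. A back-of-envelope sketch (the 7-billion-parameter model and 2-byte precision are illustrative assumptions, and real deployments add overhead for activations and caches):

```python
def weight_memory_gb(parameters_billions, bytes_per_param):
    """Rough GB of RAM needed just to hold the model weights."""
    return parameters_billions * 1e9 * bytes_per_param / 1e9

# A hypothetical 7-billion-parameter model stored at 16-bit
# (2 bytes per weight) precision:
print(weight_memory_gb(7, 2))  # 14.0 GB of weights alone
```

This is one reason quantization (storing weights at 1 byte or less each) matters: it shrinks the Memory Footprint without changing the Parameter Count.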

In summary, LLMs are measured in many ways. This introduction to the quantification of LLM performance gives you a window into what AI developers focus on and how that focus could affect your choice of which LLM to adopt. You will not need to understand all of these metrics to make that decision; however, a few on this list may help you evaluate your needs in a different way, such as:

  • Accuracy of responses compared to other LLMs.
  • How weighting could affect the responses you get from an inquiry about certain topics.
  • Maximum tokens output and how that limit could reduce the value from your research.
  • Number of parallel requests that may need to be handled in your company. 

Next month: Our assumptions about AI capabilities are shaped by the information currently output by different AI systems. Are our expectations for how AI can be used to increase profit valid?

Kay Sever is an Expert on Achieving “Best Possible” Results. Kay helps executive and management teams tap their hidden profit potential and reach their optimization goals. Kay has developed a LIVESTREAM management training/coaching system for Optimization Management called MiningOpportunity – NO TRAVEL REQUIRED. See MiningOpportunity.com for her contact information and training information.



Kay has worked side by side with corporate and production sites in a management/leadership/consulting role for 35+ years. She helps management teams improve performance, profit, culture and change, but does it in a way that connects people and the corporate culture to their hidden potential. Kay helps companies move “beyond improvement” to a state of “sustained optimization”. With her guidance and the MiningOpportunity system, management teams can measure the losses caused by weaknesses in their current culture, shift to a Loss Reduction Culture to reduce the losses, and “manage” the gains from the new culture as a second income stream.