In my last post, I laid out a vision: an AI agent that monitors a server fleet, pulls BMC logs through Redfish when something goes wrong, diagnoses whether it’s a hardware or software issue, and generates an RMA-ready report with the exact failed component, serial number, and recommended action. I said I’d write it up when I had something working.
It’s working.
I built a system that simulates a 55-server datacenter fleet, serves it through a standards-compliant Redfish API, and runs a diagnostic agent powered by a local LLM. It correctly identified every injected failure — memory ECC errors, PSU degradation, NVLink failures, disk predictive alerts, CPU thermal events — with the right component, the right serial number, and the right recommendation. Nine out of nine. No cloud APIs. Everything running on my homelab.
Here’s how.
Simulating a Datacenter
My homelab has four nodes, not four hundred. To test whether an LLM can actually diagnose server hardware failures from BMC data, I needed a realistic fleet — and I needed to control exactly what failures were in it so I could verify the diagnoses.
The solution: a fleet generator that creates a complete Redfish-compliant mock datacenter, and a mockup server that serves it over HTTP exactly the way a real BMC would.
The generator builds systems from a catalog of 10 real-world server templates — Dell PowerEdge R760 and XE9680, Supermicro GPU nodes, Lenovo ThinkSystem, HPE ProLiant. Each template includes the correct BMC type (iDRAC9, IPMI, XCC, iLO6), accurate firmware versions, CPU options, RAM configurations, DIMM counts, and BIOS versions. These aren’t generic placeholders. If you’ve worked with these servers, the specs look right because they are right.
Each generated system gets assigned one of 12 datacenter roles — compute, gpu-training, gpu-inference, database, storage, web, edge, ci-runner, monitoring, cache, message-queue, backup. GPU roles only get assigned to GPU-capable templates, because the generator knows which server models actually have GPUs. It also gives every system a full DIMM inventory — 8 to 32 DIMMs depending on the model, DDR5, with manufacturers (Samsung, SK Hynix, Micron) and realistic part numbers and serial number formats.
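The role-assignment constraint can be sketched like this. The function and field names (`pick_role`, `has_gpu`) are illustrative, not the actual generator's API; the role list is the one from the post:

```python
import random

GPU_ROLES = {"gpu-training", "gpu-inference"}
ALL_ROLES = [
    "compute", "gpu-training", "gpu-inference", "database", "storage",
    "web", "edge", "ci-runner", "monitoring", "cache", "message-queue",
    "backup",
]

def pick_role(template: dict, rng: random.Random) -> str:
    """Pick a datacenter role, excluding GPU roles for non-GPU templates."""
    eligible = [
        r for r in ALL_ROLES
        if template.get("has_gpu") or r not in GPU_ROLES
    ]
    return rng.choice(eligible)

rng = random.Random(42)  # deterministic seeding, as in the generator
cpu_node = {"model": "Dell PowerEdge R760", "has_gpu": False}
gpu_node = {"model": "Supermicro 4U GPU", "has_gpu": True}
role = pick_role(cpu_node, rng)
```

Seeding the `Random` instance rather than the module keeps the draw sequence reproducible regardless of what else uses `random`.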
Then it injects failures.
Failures That Look Real
This is where the generator earns its keep. It doesn’t just flip a status flag to “Critical.” It builds realistic BMC event log sequences — the kind of escalating pattern that a real hardware engineer would read and immediately recognize.
Seven failure types, each with its own signature:
Memory ECC uncorrectable. Correctable single-bit errors at a specific memory address, count increasing over hours, trending toward the threshold — then an uncorrectable error. The event log tells a story: this DIMM was dying for two days before it finally failed. The specific DIMM in the memory collection gets its health set to Critical, so the serial number and part number are there for the agent to find.
Memory ECC correctable (trending). Same pattern, but caught earlier. The error count is climbing but hasn’t hit uncorrectable yet. This is a warning, not a critical — the right call is proactive replacement, not emergency RMA.
PSU degraded. Voltage drops from 12.0V nominal to 11.5V. Efficiency drops from 94% to 82%. The PSU isn’t dead — it’s failing. The event log shows the voltage drift over time.
Fan failure. Intermittent RPM fluctuations, then speed below threshold, then complete failure with system throttling. The BMC is watching the server cook itself and documenting every step.
Disk predictive failure. A SMART alert with reallocated sector count at 286 — well past the threshold of 50. The drive is redistributing data around bad sectors faster than it should. It’s not dead yet, but the writing is on the wall.
CPU thermal. Temperature warning at 85°C, escalating to critical at 95°C, frequency throttling engaged. This one’s important because it’s not a hardware failure — it’s an environmental or cooling issue. The CPU itself is probably fine. The right answer isn’t RMA, it’s “check the fans, check the heatsink, check the airflow.”
NVLink degradation. GPU CRC errors exceeding threshold, link bandwidth cut to 50%, XID error 74 with 4,172 CRC errors per hour. If you work with multi-GPU servers, you know what XID 74 means. The question is whether the agent does.
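As a rough sketch of how one of these escalating sequences might be built, here is the memory ECC case. Field names follow the Redfish LogEntry schema; the counts, timings, threshold, and helper name are invented for illustration:

```python
from datetime import datetime, timedelta

def ecc_event_sequence(dimm: str, start: datetime, uncorrectable: bool) -> list[dict]:
    """Build an escalating Redfish-style event log for a dying DIMM.

    Correctable single-bit errors whose count climbs toward a threshold,
    optionally ending in an uncorrectable error.
    """
    events = []
    for hours, count in [(0, 12), (6, 157), (18, 1043), (30, 4980)]:
        events.append({
            "Severity": "Warning",
            "Created": (start + timedelta(hours=hours)).isoformat(),
            "Message": (
                f"Correctable ECC error on {dimm}: "
                f"count {count}, threshold 5000"
            ),
        })
    if uncorrectable:
        events.append({
            "Severity": "Critical",
            "Created": (start + timedelta(hours=36)).isoformat(),
            "Message": f"Uncorrectable ECC error on {dimm}",
        })
    return events

log = ecc_event_sequence("DIMM B7", datetime(2026, 3, 1), uncorrectable=True)
```

The point is the shape of the data: a time-ordered story the diagnostic agent can read, not a single flipped flag.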
For each system, the generator outputs 7 Redfish JSON files — System, Chassis, Manager, log service indexes, event log entries, and memory collection. The default generation: 50 systems on top of 5 hand-crafted originals, totaling 55 servers and 386 JSON files. Deterministic seeding (--seed 42) means the same fleet generates every time, making the diagnostic pipeline regression-testable.
The Architecture: Three Nodes, One Pipeline
The system runs across my homelab, with each node doing what it’s best at:
Mac (control center). The orchestration hub. All source code, configuration, and the fleet generator live here. A CLI tool handles deploying to other nodes, triggering diagnostics, and viewing reports. Nothing executes here — it just coordinates.
Framework laptop (Ubuntu). Hosts the Redfish mockup server. The DMTF’s official Redfish Mockup Server runs in Docker, reading the generated JSON filesystem and serving it as a standards-compliant Redfish API on port 5000. The -D flag enables auto-discovery — any new system directory added to the mockup filesystem appears as a new endpoint automatically. This is the same server the DMTF provides for testing real Redfish client implementations. If the diagnostic agent works against this, it’ll work against real BMCs.
Spark (GPU node). Runs the diagnostic pipeline. Python scripts fetch data from the mockup server over HTTP, perform health triage, and when issues are detected, send the full Redfish payload to Ollama running a 120-billion-parameter model locally. The DGX Spark’s GB10 chip and 128GB of unified memory make running a model this size feasible without any cloud inference costs.
The separation matters. The mockup server simulates the datacenter. The GPU node diagnoses it. The control center ties them together. You could replace the mockup server with real BMC endpoints tomorrow and the diagnostic pipeline wouldn’t change.
How the Diagnostic Agent Works
The agent has two paths, and the split between them is the first key design decision.
The fast path: healthy systems. For each system, the agent first checks Status.Health and Status.HealthRollup on the system endpoint. If both are “OK” and the event log has no Warning or Critical entries, the system is classified as healthy immediately. No LLM call. Done in milliseconds.
This matters because in a real datacenter, 85-95% of servers are healthy at any given time. Sending every healthy system’s full Redfish payload to an LLM would be slow and wasteful. The fast path means the agent only spends inference time on systems that actually need it.
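A minimal sketch of that triage check, assuming Redfish-shaped payloads (the helper name is mine, not the pipeline's actual code):

```python
def needs_diagnosis(system: dict, log_entries: list[dict]) -> bool:
    """Fast-path triage: route to the LLM only when something is off.

    A system hits the slow path if Status.Health or Status.HealthRollup
    is anything other than "OK", or the event log contains any Warning
    or Critical entries.
    """
    status = system.get("Status", {})
    if status.get("Health") != "OK" or status.get("HealthRollup") != "OK":
        return True
    return any(
        entry.get("Severity") in ("Warning", "Critical")
        for entry in log_entries
    )
```

Checking the event log as well as the rollup catches systems whose status flags lag behind logged events.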
The slow path: unhealthy systems. When a system fails the health check, the agent collects everything. Five Redfish endpoints per system: system identity and specs, chassis state, BMC controller info, the full BMC event log, and the complete DIMM inventory with every slot’s health status, serial number, and part number.
All of this gets bundled into a single prompt and sent to the LLM. The system prompt instructs the model to act as a datacenter hardware engineer and respond with structured JSON — classification (hardware/software/environmental), severity, failed component type and location, serial number, part number, manufacturer, description, evidence array, recommendation, and whether RMA is required.
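A sketch of how the model's reply might be validated before report generation. The post lists the schema fields informally, so the exact key names below are assumptions, as is the validator itself:

```python
import json

# Assumed key names for the diagnosis schema described in the post.
REQUIRED = {
    "classification", "severity", "component_type", "component_location",
    "serial_number", "part_number", "manufacturer", "description",
    "evidence", "recommendation", "rma_required",
}

def parse_diagnosis(reply: str) -> dict:
    """Parse and sanity-check the LLM's JSON diagnosis."""
    diag = json.loads(reply)
    missing = REQUIRED - diag.keys()
    if missing:
        raise ValueError(f"diagnosis missing fields: {sorted(missing)}")
    if diag["classification"] not in ("hardware", "software", "environmental"):
        raise ValueError("unexpected classification")
    if not isinstance(diag["evidence"], list):
        raise ValueError("evidence must be a list")
    return diag
```

Failing fast on a malformed reply is what lets everything downstream (reports, tickets, dashboards) trust the structure.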
The model runs locally through Ollama — a 120B parameter model on the Spark’s GPU, temperature 0.1 for consistent outputs. Each diagnosis takes about 14 seconds. At a roughly 15% failure rate across 55 systems, 9 systems hit the slow path, and the full fleet scan completes in about two and a half minutes.
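For reference, a request body for Ollama's /api/generate endpoint with those settings might look like this. The `system`, `format`, `stream`, and `options` parameters are standard Ollama API fields; the model name in the usage note is a placeholder, not necessarily what this pipeline runs:

```python
import json

def build_ollama_request(model: str, system_prompt: str, payload: dict) -> dict:
    """Build a request body for Ollama's /api/generate endpoint.

    Low temperature for consistent output; format="json" asks the
    server to constrain the reply to valid JSON.
    """
    return {
        "model": model,
        "system": system_prompt,
        "prompt": json.dumps(payload, indent=2),
        "stream": False,
        "format": "json",
        "options": {"temperature": 0.1},
    }
```

Usage would be something like `build_ollama_request("some-120b-model", system_prompt, redfish_bundle)`, POSTed to the Ollama server's /api/generate route.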
What the LLM Actually Does
This is the part I was most uncertain about going in. Rule-based alerting can check if Status.Health == "Critical". But can an LLM actually diagnose like a hardware engineer?
It can. Here’s what it does that goes beyond simple threshold checking:
Cross-references data sources. When the event log mentions “DIMM B7,” the LLM connects that to the specific DIMM in the memory collection endpoint. It extracts the serial number, part number, and manufacturer. No single Redfish endpoint contains this correlation — you have to read the event log and the memory inventory together. The LLM does this naturally.
Reads escalation patterns. It understands that correctable ECC → count increasing → threshold exceeded → uncorrectable is a dying DIMM, not a transient glitch. This is the kind of pattern recognition that separates “something is wrong” from “here’s exactly what’s wrong and what’s going to happen next.”
Makes engineering judgment calls. CPU thermal gets rma_required: false because overheating is a cooling problem, not a dead component. PSU voltage drop gets rma_required: true because the unit is failing. NVLink CRC errors get specific recommendations about verifying physical connections and updating firmware before replacing the GPU. These aren’t template responses — they reflect the kind of reasoning a hardware engineer applies.
Produces actionable output. Not “contact support” but “Replace the P1-DIMMB7 DIMM immediately and submit it for RMA; run full memory diagnostics after replacement.” The report includes everything a vendor needs: server serial number, failed component, part number, evidence from the BMC logs, and a clear description of the failure.
The Results
From the test run against the full 55-system fleet:
| Metric | Value |
|---|---|
| Total systems | 55 (5 hand-crafted + 50 generated) |
| Healthy systems | 46 (instant, no LLM call) |
| Unhealthy systems | 9 (diagnosed via LLM) |
| LLM inference time | ~14 seconds per unhealthy system |
| Total diagnostic run | ~2.5 minutes for all 55 systems |
| Correct diagnoses | 9/9 |
| Cloud API cost | $0 |
Every failure type was correctly diagnosed with the right component, the right location, and the right action:
- Memory uncorrectable (3 systems) — Identified exact DIMM slot, pulled serial number and manufacturer from memory collection, flagged RMA, cited the escalation from correctable to uncorrectable.
- Memory correctable trending (2 systems) — Classified as warning (not critical), noted the error count trend, recommended proactive replacement before it goes uncorrectable.
- NVLink degradation — Identified GPU 3 specifically, cited XID error 74 and CRC error count, recommended verifying NVLink connections and firmware update alongside GPU replacement.
- Disk predictive failure — Caught the SMART alert with reallocated sector count (286 vs threshold of 50), flagged for replacement.
- PSU degraded — Noted specific voltage (11.5V vs 12.0V) and efficiency (82% vs 94%), flagged RMA.
- CPU thermal — Correctly set rma_required: false (cooling problem, not dead hardware), recommended checking fans, heatsink, airflow, and thermal paste.
The CPU thermal diagnosis is the one I find most telling. A rule-based system would see “Critical” and generate an alert. The LLM understood why it was critical and knew the CPU itself wasn’t the problem. That’s diagnosis, not just detection.
The Reports
The system generates three types of output:
Per-system JSON reports. Machine-readable, structured for integration with ticketing systems or dashboards. Includes server metadata (vendor, model, serial, hostname, asset tag), the full diagnosis object, and timestamps.
Per-system text reports. Human-readable, formatted for operations teams or RMA paperwork. Sections: System Information, Diagnosis, Failed Component (with serial/part/manufacturer for ordering replacements), Evidence (the specific log entries the LLM cited), full BMC Event Log for context, and Recommendation.
Fleet summary. A single-page overview of every system:
[ . ] Dell-R760-Compute-1 Healthy
[ . ] SM-GPU-Train-1 Healthy
[!!!] Dell-R660-DB-3 Memory ECC Uncorrectable (RMA)
[ ! ] HPE-DL380-Web-7 Memory ECC Correctable Trending
[!!!] SM-4U-GPU-2 NVLink Degradation (RMA)
[ . ] Lenovo-SR650-Cache-1 Healthy
Reports are organized by date: reports/2026-03-02/SM-Edge-1U.json, reports/2026-03-02/summary.txt. Walk away, come back, and the reports are waiting.
Continuous Monitoring
A one-time fleet scan is useful. Continuous monitoring is what makes it operational.
The watcher daemon polls all systems at a configurable interval (default 30 seconds), maintaining state in a JSON file that tracks last-known health for every system. When a system’s health degrades between polls — OK to Warning, OK to Critical, Warning to Critical — the watcher triggers a diagnostic run for that specific system. It doesn’t re-diagnose every system on every cycle, only the ones that changed.
On first startup, it checks for any already-unhealthy systems and diagnoses those. After that, it’s event-driven: something changes, the agent investigates, a report appears.
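The health-transition check at the heart of the watcher can be sketched as a pure diff over two polls. This is a simplified sketch with an assumed OK < Warning < Critical ordering; the real daemon also persists the state to a JSON file and handles the first-startup sweep separately:

```python
def degraded_systems(prev: dict[str, str], curr: dict[str, str]) -> list[str]:
    """Return system IDs whose health got worse between two polls.

    Only downward transitions (OK -> Warning, OK -> Critical,
    Warning -> Critical) trigger a diagnosis; recoveries and
    unchanged systems do not.
    """
    rank = {"OK": 0, "Warning": 1, "Critical": 2}
    return [
        sysid for sysid, health in curr.items()
        if sysid in prev
        and rank.get(health, 0) > rank.get(prev[sysid], 0)
    ]
```

Keeping this as a pure function over two snapshots makes the event-driven behavior trivial to regression-test alongside the seeded fleet.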
Start the watcher, walk away, come back to reports for anything that went wrong — with the exact timestamp of when the health changed. This is the “2 AM scenario” from the previous post, actually working.
Design Decisions Worth Explaining
Local LLM, zero cloud costs. Every diagnosis runs on Spark’s GPU through Ollama. No API keys, no rate limits, no per-token billing. The tradeoff is ~14 seconds per diagnosis versus ~2 seconds with a cloud API. But at 15% failure rate, only 9 out of 55 systems hit the LLM. The healthy majority completes instantly. For the unhealthy minority, 14 seconds to get a correct diagnosis with no ongoing cost is a trade I’ll make every time.
Structured JSON output. The system prompt constrains the LLM to respond with a specific JSON schema. This makes the output machine-parseable for report generation, ticketing integration, or dashboard display — not just human-readable text. It’s the difference between building a prototype and building something that could plug into real infrastructure tooling.
Deterministic test fleet. The generator uses a seeded random number generator. --seed 42 always produces the same 55 systems with the same failures. This makes the diagnostic pipeline regression-testable: change the LLM model, adjust the prompt, modify the triage logic, and compare results against a known baseline. Without this, you’re testing against a moving target.
Standards-based throughout. The entire mockup follows the DMTF Redfish schema. The diagnostic agent talks to a standard Redfish API. If you replaced the mockup server with real BMC endpoints — real iDRAC, real iLO, real XCC — the agent code wouldn’t change. That’s the point. This isn’t a demo that only works against its own test data. It’s built on the same standard that every major server vendor implements.
What This Proves
The question I was trying to answer was simple: can an LLM reliably diagnose server hardware failures from real BMC data?
The answer is yes. Not “sort of” or “with caveats.” Nine out of nine failures correctly identified, with the right component, the right serial number, the right recommended action, and the right judgment about what does and doesn’t need an RMA. The LLM doesn’t just detect that something is wrong — it understands what is wrong, why it’s wrong, and what to do about it.
This is a simulated fleet with injected failures. The next step is real BMC endpoints — machines that are actually failing in unpredictable ways with messy, real-world event logs. That’s a harder problem. But the foundation is here: the Redfish client, the diagnostic pipeline, the report generator, the watcher daemon. The architecture doesn’t change. The data gets messier.
The previous post asked “what if the first responder wasn’t a person?” Now I have a prototype that shows what that actually looks like. The first responder pulls the BMC logs, reads them like a hardware engineer would, identifies the failed DIMM down to the serial number, and has the RMA report ready before anyone’s finished their coffee.
I’ll open-source this when it’s cleaned up. In the meantime — if you manage servers and you’ve filed an RMA, you know how much of that process is just collecting information that already exists in the BMC. That’s exactly the kind of work an LLM should be doing.