Everyone’s building AI agents that write code, search the web, and answer questions. Nobody’s building the one I actually want: an AI that manages physical infrastructure.
I’m not talking about dashboards or alerting rules. I’m talking about an agent that knows every server in a fleet, notices when one stops responding, pulls the BMC logs through Redfish, diagnoses whether it’s a hardware failure or an OS issue, reboots it if it makes sense, creates an RMA if it doesn’t, documents everything, and tracks the pattern over time. An agent that operates infrastructure the way an experienced engineer does — but doesn’t sleep, doesn’t forget, and doesn’t let tickets fall through the cracks.
The pieces to build this exist today. The fact that nobody has stitched them together is the opportunity.
The Problem at Scale
When you manage a handful of servers, a failure is an event. You notice it, you investigate, you fix it. When you manage thousands, failures aren’t events — they’re a constant background process. At any given moment, some percentage of your fleet is degraded, failing, or offline. The question isn’t whether something is broken, it’s what’s broken right now and how bad it is.
The current state of the art for handling this is a patchwork: monitoring tools that fire alerts, ticketing systems that track issues, runbooks that describe diagnostic steps, and engineers who stitch it all together manually. When a node drops off, someone gets paged. That someone SSHs in — if it’s still reachable — or walks over to a console. They check logs. They look at hardware indicators. They make a judgment call: is this a software issue I can fix, or a hardware issue that needs an RMA?
This workflow is well-understood. It’s also incredibly labor-intensive, and it scales poorly. The number of failures grows linearly with fleet size, but the engineer’s ability to context-switch and track parallel investigations doesn’t. Things get missed. Patterns go unnoticed. RMAs take days to file because someone has to manually collect serial numbers, error codes, and failure descriptions. Nodes sit in a degraded state longer than they should because the queue of problems is always longer than the queue of people.
What if the first responder wasn’t a person?
Redfish Changes the Game
Here’s what makes this buildable now rather than five years from now: Redfish.
Redfish is a standardized REST API for server management, developed by the DMTF and adopted by basically every major server vendor — Dell, HPE, Lenovo, Supermicro, all of them. It exposes the BMC (baseboard management controller) over HTTP, which means you can programmatically access the same out-of-band management interface that engineers use to diagnose hardware remotely.
Through Redfish, you can:
- Query hardware health. CPU status, memory DIMM status, power supply state, fan speeds, temperatures, drive health. If a component is degraded or failed, Redfish knows about it.
- Pull system event logs. The BMC records hardware events independently of the OS. If a server crashes and the OS is unreachable, the BMC log still has the story — uncorrectable memory errors, PCIe link failures, power faults, thermal shutdowns.
- Check power state and boot status. Is the server on, off, or stuck in a boot loop? Is it sitting at a BIOS error screen?
- Perform actions. Power cycle, force restart, boot to a specific target. No physical access required, no KVM console session needed.
The key insight: Redfish gives an AI agent the same diagnostic access that a hardware engineer has, without requiring the server’s OS to be functional. Even if the machine is completely hung — kernel panic, boot failure, whatever — the BMC is still running, still logging, still accessible over the network.
An LLM doesn’t need a Redfish SDK or a custom integration. It needs to make HTTP requests and interpret JSON responses. That’s it. These are things language models are already excellent at.
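To make that concrete, here’s a minimal sketch of the kind of request involved, using only the Python standard library. The `/redfish/v1/Systems/1` path and the BMC credentials are assumptions — real code should walk the `/redfish/v1/Systems` collection, since system IDs vary by vendor.

```python
import base64
import json
import ssl
import urllib.request


def summarize_system(payload: dict) -> dict:
    """Reduce a Redfish ComputerSystem resource to the fields an agent needs."""
    return {
        "power_state": payload.get("PowerState"),                 # "On", "Off", ...
        "health": payload.get("Status", {}).get("Health"),        # "OK", "Warning", "Critical"
        "model": payload.get("Model"),
        "serial": payload.get("SerialNumber"),
    }


def fetch_system(bmc_ip: str, user: str, password: str) -> dict:
    """Fetch one system resource from the BMC over HTTPS with basic auth.
    The system ID "1" is an assumption; discover IDs via /redfish/v1/Systems."""
    ctx = ssl.create_default_context()
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE  # BMCs commonly ship self-signed certs; pin certs in production
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    req = urllib.request.Request(
        f"https://{bmc_ip}/redfish/v1/Systems/1",
        headers={"Authorization": f"Basic {token}"},
    )
    with urllib.request.urlopen(req, context=ctx, timeout=10) as resp:
        return summarize_system(json.load(resp))
```

That summary dict — power state, health, model, serial — is exactly the kind of compact, structured context an LLM can reason over.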
One Agent Can’t Do This
Before I walk through the workflow, I need to address something: this isn’t a single-agent problem.
Think about what’s being asked. Constant monitoring across potentially thousands of nodes. Deep diagnostic reasoning for each individual failure. Careful decision-making about when to reboot vs. escalate. Long-term pattern analysis across the fleet. Capacity forecasting. These are fundamentally different cognitive modes operating on completely different timescales.
Asking one agent to simultaneously watch a fleet, deeply investigate a DIMM failure, and analyze quarterly failure trends is like asking one engineer to do NOC monitoring, incident response, and capacity planning all at the same time. You wouldn’t do that to a person. You shouldn’t do it to an agent.
This is a multi-agent problem. Here’s how I’d break it down:
The Watcher. Lightweight, always running, focused on one thing: detecting that something went wrong. It monitors heartbeats, health checks, and alerting signals across the fleet. It doesn’t diagnose anything — it doesn’t have the context or the time. When it detects a failure, it dispatches. Think of it as the NOC operator who notices a red light on the board and picks up the phone.
Diagnostic Agents. Spun up on demand, one per incident. When the Watcher flags a node, a diagnostic agent is created with a single job: figure out what’s wrong with this specific server. It hits the Redfish API, pulls BMC logs, checks OS-level logs if reachable, and produces a diagnosis. It has deep focus on one problem. When it’s done, it reports back and terminates. Ten nodes go down at once? Ten diagnostic agents spin up in parallel. No queue, no context-switching.
The Fleet Intelligence Agent. This one operates on a completely different timescale. It doesn’t care about individual incidents in real time — it cares about patterns over weeks and months. It ingests the output from every diagnostic agent: what failed, where, when, what the root cause was. It’s looking for things like “this drive model is failing at 3x the normal rate” or “servers deployed in Q3 are showing early PSU degradation.” It also handles capacity questions: how much headroom do we have, when do we need to order, what happens if we lose a rack.
The Action Agent. This one is deliberately separated from diagnosis because the consequences of actions are different from the consequences of analysis. The diagnostic agent says “this node has a dead DIMM and should be RMA’d.” The action agent is the one that actually files the RMA, creates the ticket, updates the inventory, and tracks the replacement lifecycle. Separating action from diagnosis means you can put different trust boundaries and approval gates around each.
The agents communicate through shared state — an incident log, a fleet inventory, a failure database. The Watcher writes “node X is down.” The diagnostic agent writes “node X has uncorrectable memory errors on DIMM A3, serial number XYZ.” The action agent writes “RMA filed with vendor, case number 12345, ETA 5 business days.” The fleet intelligence agent reads all of it and writes “memory failures in rack 7 are trending 40% above baseline this quarter.”
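As a sketch of what that shared state could look like: an append-only event log that every agent writes to and reads from. The field names and the append-only design here are my assumptions, not a prescribed schema.

```python
import time
from dataclasses import dataclass, field


@dataclass
class IncidentEvent:
    """One entry in the shared log. Field names are illustrative assumptions."""
    node: str      # e.g. "srv-0847"
    agent: str     # "watcher", "diagnostic", "action", "fleet-intel"
    kind: str      # "detected", "diagnosis", "rma_filed", "trend"
    detail: dict   # agent-specific structured payload
    ts: float = field(default_factory=time.time)


class IncidentLog:
    """Append-only shared state: each agent writes events, any agent can read."""

    def __init__(self):
        self.events: list[IncidentEvent] = []

    def append(self, event: IncidentEvent) -> None:
        self.events.append(event)

    def for_node(self, node: str) -> list[IncidentEvent]:
        """Everything the fleet has recorded about one node, in order."""
        return [e for e in self.events if e.node == node]
```

The Watcher appends a "detected" event, the diagnostic agent appends a "diagnosis", the action agent appends an "rma_filed", and the fleet intelligence agent reads the whole stream.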
Each agent is simple. The system is powerful.
2 AM: A Node Goes Down
Let me walk through what this actually looks like in practice.
It’s 2 AM. Node srv-0847 in rack 12 stops responding to health checks. Here’s what happens:
00:00 — The Watcher detects the failure. Three consecutive health checks missed. The Watcher logs the event and spins up a diagnostic agent for srv-0847.
00:01 — The diagnostic agent starts with Redfish. Before trying SSH — which probably won’t work if the node is truly down — it queries the BMC at the node’s out-of-band management IP. The server is powered on but the OS isn’t responding. The agent pulls the system event log.
00:02 — BMC logs tell the story. The event log shows a sequence: correctable memory errors on DIMM B7 escalating over the past 48 hours, then an uncorrectable error 12 minutes ago, then a machine check exception, then the system went unresponsive. The diagnostic agent classifies this as a hardware failure — memory, DIMM B7, uncorrectable. Rebooting won’t fix this. Even if the system comes back, it’ll fail again.
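The escalation pattern in that log — correctable errors trending up, then an uncorrectable one — can be checked mechanically before any LLM gets involved. A sketch, with one big caveat: the entries mimic Redfish LogEntry resources, but the actual message wording varies by vendor, so the keyword matching below is an assumption.

```python
def classify_memory_failure(entries: list[dict]) -> str:
    """Classify a BMC event log as a memory hardware failure vs inconclusive,
    using the escalation pattern: repeated correctable ECC errors followed by
    an uncorrectable error. Message keywords are vendor-dependent assumptions."""
    messages = [e["Message"].lower() for e in entries]
    # Count correctable errors, taking care that "correctable" is a
    # substring of "uncorrectable".
    correctable = sum(1 for m in messages
                      if "correctable" in m and "uncorrectable" not in m)
    uncorrectable = any("uncorrectable" in m for m in messages)
    if uncorrectable and correctable >= 2:
        return "hardware-memory"          # escalating pattern: replace the DIMM
    if uncorrectable:
        return "hardware-memory-single"   # one-off UE: still hardware, less history
    return "inconclusive"                 # hand off to deeper (LLM) analysis
```

The point isn’t that a regex replaces the diagnostic agent — it’s that cheap structural checks can pre-classify the easy cases and leave the ambiguous ones for deeper reasoning.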
00:03 — The diagnostic agent reports. It writes a structured incident report: hardware failure, memory, DIMM B7, serial number HMA84GR7CJR4N-WM-T4, part number P03052-091. The BMC event log is attached. Diagnosis: the DIMM needs to be replaced. Recommendation: do not reboot, file RMA, remove node from active pool.
00:04 — The action agent picks it up. It reads the diagnosis, validates that the failure classification meets the threshold for automated RMA (hardware failure with clear component identification — this isn’t a maybe). It pulls the server’s asset information from Redfish: serial number, model, warranty status. It compiles the RMA package: failure description, component details, BMC logs, timestamps.
00:05 — RMA filed, node tracked. The RMA ticket is created. The node is flagged in the fleet inventory as “offline — pending hardware replacement.” The action agent logs the full lifecycle entry: when the failure was detected, what was found, what action was taken, what the next step is.
Meanwhile — the fleet intelligence agent takes note. This is the third DIMM failure in rack 12 this month. The first two were different DIMMs in different servers, but same rack, same DIMM manufacturer, same purchase batch. Not enough data to draw a conclusion yet, but it’s flagged as a trend to watch. If a fourth failure hits, it’ll escalate a proactive review of all nodes in that batch.
Total time from failure to filed RMA: five minutes. No human woke up. The engineer finds a clean incident report in the morning with a diagnosis, an RMA case number, and a note that the fleet intelligence agent is watching for a potential batch issue.
Compare that to the status quo: alert fires at 2 AM, engineer wakes up, spends 20 minutes getting context, runs diagnostics, realizes it’s hardware, files the RMA the next business day because the vendor portal is annoying and it can wait. Three days later the node is still sitting in the “known broken” pile.
Where the Human Fits
I should be honest about this: I wouldn’t want a fully autonomous version of this system on day one. And I don’t think you should either.
There’s a spectrum of autonomy, and different parts of this workflow warrant different levels of trust:
High autonomy — let the agent handle it. Detection, diagnostics, log collection, report generation. These are read-only operations. The worst case if the agent gets it wrong is a bad diagnosis, which a human can catch during review. Let the agent do this unsupervised.
Supervised autonomy — agent proposes, human approves. Reboots, RMA filings, removing nodes from the active pool. These have real consequences. A bad reboot can disrupt a service. A wrong RMA wastes vendor relationship capital and time. For these, the agent should do all the work — compile the diagnosis, draft the RMA, recommend the action — and then present it for human approval. One click to approve, not thirty minutes of manual work.
Human-only — agent advises. Decisions about proactive fleet-wide actions. “I think we should replace all DIMMs from batch X based on failure trends” is a recommendation that involves cost, planning, and risk assessment. The agent surfaces the data and the pattern. The human makes the call.
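Those three tiers are simple enough to encode as an explicit policy gate that the action agent consults before doing anything. The action names and tier assignments below are illustrative assumptions mirroring the spectrum above; a real deployment would tune these boundaries as trust grows.

```python
from enum import Enum


class Tier(Enum):
    AUTONOMOUS = "run without asking"
    APPROVAL = "propose, wait for human sign-off"
    ADVISORY = "surface data only; human decides"


# Illustrative policy table -- action names are assumptions, not a real API.
POLICY = {
    "collect_logs": Tier.AUTONOMOUS,
    "run_diagnostics": Tier.AUTONOMOUS,
    "generate_report": Tier.AUTONOMOUS,
    "reboot": Tier.APPROVAL,
    "file_rma": Tier.APPROVAL,
    "remove_from_pool": Tier.APPROVAL,
    "fleet_wide_replacement": Tier.ADVISORY,
}


def gate(action: str) -> Tier:
    """Look up the autonomy tier; unknown actions default to the most restrictive."""
    return POLICY.get(action, Tier.ADVISORY)
```

Making the policy a data structure rather than scattered if-statements is what lets the boundaries shift over time: promoting "file_rma" to autonomous after three months of clean diagnoses is a one-line change with an audit trail.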
Over time, as you build trust, the boundaries shift. The agent that reliably diagnoses DIMM failures for three months earns the right to file RMAs without asking. The agent that correctly predicts drive failures earns a longer leash on proactive maintenance recommendations. Trust is earned incrementally, not granted upfront.
The point isn’t full automation. The point is taking the tedious, time-consuming parts — the diagnostics, the data collection, the RMA paperwork, the pattern detection — off the engineer’s plate so they can focus on the decisions that actually need human judgment.
Seeing Patterns Humans Miss
This is where the fleet intelligence agent earns its keep. A single server failing is an incident. A hundred servers failing the same way is a trend. But trends are hard to spot when incidents are tracked as isolated tickets by different engineers at different times.
An agent tracking every failure across a fleet has something no individual engineer has: the complete picture. It can notice things like:
- “Twelve servers in rack 7 have had memory errors in the last month. That’s 3x the rate of other racks. Is there an environmental issue — temperature, vibration, power quality?”
- “We’ve replaced the same drive model 40 times this quarter across the fleet. The failure rate is significantly above baseline. Is this a bad batch? Should we proactively replace the remaining units before they fail?”
- “Nodes that were deployed in a specific batch six months ago are starting to show PSU failures at a higher rate. The warranty window closes in three months.”
- “Every time we push this specific OS update, we see a spike in kernel panics on servers with this particular NIC firmware version.”
Pattern detection across large fleets is a data analysis problem, and it’s one that LLMs combined with structured failure data are well suited for. The diagnostic agents produce structured incident reports. The fleet intelligence agent reads all of them and thinks on a longer timescale.
The difference between reactive and proactive maintenance is exactly this: knowing that something is going to fail based on fleet-wide patterns, rather than waiting for it to fail and then responding.
Capacity Planning Falls Out Naturally
If the agent system has a live inventory of every node in the fleet — what’s healthy, what’s degraded, what’s offline, what’s underutilized, what’s maxed out — it can answer questions that are currently hard to get answers to:
- “How much spare capacity do we actually have right now, accounting for the 47 nodes currently down for maintenance or waiting on RMAs?”
- “At our current growth rate and our current failure rate, when do we need to order more hardware?”
- “If we lose another rack due to a power event, do we have enough headroom to absorb the load?”
- “These 200 nodes are running at 15% utilization. These 50 are running at 95%. What should we rebalance?”
Today, answering these questions requires pulling data from multiple systems, normalizing it, and doing analysis that usually lives in a spreadsheet somewhere. An agent system that’s already tracking fleet state in real time can just answer. It has the inventory, it knows the health status, it can calculate utilization. Capacity planning isn’t a separate project — it’s a natural output of the same data the agents are already collecting for failure management.
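A sketch of what "it can just answer" means in practice: the headroom and what-if questions above fall out of a single pass over the inventory. The `status`, `capacity`, and `utilization` fields are assumed shapes, not a real schema.

```python
def headroom(inventory: list[dict], rack_to_lose=None) -> dict:
    """Answer the capacity questions above from a live inventory. Optionally
    simulate losing a whole rack. Record fields are illustrative assumptions."""
    usable = [n for n in inventory
              if n["status"] == "healthy"
              and (rack_to_lose is None or n["rack"] != rack_to_lose)]
    total = sum(n["capacity"] for n in usable)
    used = sum(n["capacity"] * n["utilization"] for n in usable)
    return {
        "usable_nodes": len(usable),
        "spare_capacity": total - used,  # in whatever units "capacity" is tracked
        "down_nodes": sum(1 for n in inventory if n["status"] != "healthy"),
    }
```

Calling `headroom(inventory, rack_to_lose="rack-12")` answers the "if we lose another rack" question directly, because the agents already maintain the inventory it reads.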
Why Nobody’s Built This Yet
The components all exist. Redfish is a mature standard. LLMs can interpret logs and make HTTP calls. Monitoring systems generate the signals. So why hasn’t someone put this together?
Part of it is organizational. Infrastructure teams and AI teams rarely overlap. The people who understand Redfish and BMC logs aren’t typically the ones experimenting with AI agents. And the people building AI agents are focused on developer tools, customer support, and knowledge work — not server hardware.
Part of it is trust. Giving an AI agent the ability to power-cycle production servers is a big ask. The consequences of getting it wrong are tangible and immediate, which makes people conservative — even when the consequences of doing nothing (nodes sitting broken for days) are worse.
But I think the biggest reason is that this problem space isn’t visible. Most of the tech industry interacts with servers through APIs and cloud consoles. The physical layer — the BMC, the DIMM errors, the RMA process — is hidden behind abstractions. If you’ve never been the person who has to figure out why 30 nodes dropped at 2 AM, you don’t know this problem exists.
It exists. And it’s a problem that scales linearly with fleet size while the humans managing it don’t.
What I Want to Build
I want to prototype the core loop: detect a failure, pull BMC data through Redfish, diagnose the issue, and generate an actionable report.
The minimum version:
- A watcher that detects a node going unresponsive.
- A diagnostic agent that queries Redfish, pulls BMC event logs, and classifies the failure.
- Structured output: hardware vs. software, failed component, serial/part numbers, recommended action.
- If hardware: a generated RMA-ready report with everything the vendor needs.
- If software: attempt a reboot, monitor the result, escalate if it recurs.
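The wiring for that minimum version is deliberately thin. A sketch of one pass through the loop, where all the interesting work lives in the injected callables — the health check, the Redfish-backed diagnosis, the RMA filing, the reboot — which are placeholders the prototype would supply:

```python
def core_loop(nodes, is_responsive, diagnose, file_rma, reboot):
    """One pass of the minimum version: detect, diagnose, act.
    All callables are placeholders for the prototype's real implementations
    (health checks, Redfish diagnosis, RMA filing, Redfish reset action)."""
    reports = []
    for node in nodes:
        if is_responsive(node):
            continue                     # healthy: nothing to do
        report = diagnose(node)          # pulls BMC logs, classifies the failure
        if report["class"] == "hardware":
            file_rma(node, report)       # RMA-ready package; node leaves the pool
        else:
            reboot(node)                 # software: try a reboot, monitor, escalate on recurrence
        reports.append(report)
    return reports
```

Everything hard lives inside `diagnose` — which is exactly the point: the loop is trivial, the diagnostic reasoning is the research question.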
My homelab is the starting point. Not all of my nodes have full Redfish/BMC support — but the architecture should work the same whether it’s four nodes or four thousand. The hard part isn’t the scale. It’s getting the diagnostic reasoning right — can an LLM reliably read BMC event logs and correctly identify that DIMM B7 threw an uncorrectable ECC error versus a transient PCIe timeout?
If I can get an agent to reliably diagnose a hardware failure from real BMC logs, pull the right serial numbers, and produce a report that an engineer would trust enough to approve — that’s the proof of concept. Everything else is scaling it up and connecting more data sources.
I’ll write it up when I have something working.