Is AI ready for DevOps?

Bret: Welcome to DevOps and Docker
Talk, and I'm your host, Bret.

This episode is a special one.

It's actually the first episode from a
totally new podcast I launched called

Agentic DevOps, and that podcast is
gonna run in parallel with this one.

So this one, the goal is still
what it's been for the last six years:

everything related to containers,
cloud native, Kubernetes, and Docker,

and the DevOps workloads around that.

And I don't plan on changing any of that.

We're gonna still have the same guests.

A certain amount of those will be
AI-related guests, but I was seeing a trend

that I'll talk about in the show.

And I thought that Agentic DevOps was
going to be a big thing here in 2025.

So a few months back we started working on
content, episodes, theming, and branding for

a whole new podcast that I recommend
you check out at agenticdevops.fm.

Links in the show notes.

And this is the first episode from
that podcast that I'm just presenting

here so that you can check it out.

Nirmal and I talk theory around what
we see coming and what might be

a huge shift in how we use AI to
do our jobs as DevOps engineers.

And the intention for that show is
to have more guests and to really

dial in and focus on that very
niche topic, at least for this year.

Who knows?

It might be a bigger deal than this show,

so if you enjoy this episode,
subscribe to that second podcast of

mine, and now I'm gonna have two.

So I hope you enjoy.

Welcome to the first episode of
my new podcast, Agentic DevOps.

This episode is kicking off what I think
is going to be a big topic for my entire year,

probably for the next few years, around
wrangling AI into some usable format for DevOps.

You've probably heard of AI
agents by now, or the MCP protocol.

I guess I should just say MCP,
since P stands for protocol.

And these two things together are
creating potentially something

very useful for platform
engineering, DevOps, and that stuff.

It has so much potential that,

in the first quarter of 2025, I kind of
thought this was gonna be a big deal.

This was gonna be... if we can
figure out how to keep these things

from hallucinating and going crazy
in our infrastructure, this could

potentially be the AI shift for
infrastructure that I was waiting for.

So I started this podcast.

We recorded our first episode at
KubeCon at the beginning of April 2025,

and this is gonna be a series of
very specific episodes around getting

AIs to do useful automation and work
for DevOps, platform engineering,

infrastructure management,
cloud, you know, all those things

beyond just writing YAML, right?

So, the intro for this podcast: there's
a separate episode for the intro.

It kind of goes into my whole theory of
why I think this is gonna be a thing.

And in this episode we really try to break
down the basics and fundamentals for

those of you that are catching up.

Because it's a lot.

There's a lot going on.

It seems like we have announcements
every day this year around AI agents or

agentic AI, however you wanna call it.

I am calling it Agentic DevOps,
and hoping that name will stick.

Now, this episode is from
the beginning of April,

and it is technically now just getting
released at the beginning of June.

We're a little bit behind on
launching this new podcast.

I think everything
in it is still relevant.

There's just been a lot more since.

And I don't know the frequency yet.

I don't know how often this
podcast is gonna happen.

It could be potentially every other week.

It could be weekly.

I just don't know yet because we
are not gonna do the same thing

here as on my usual podcast.

If you're someone who knows the other
one, DevOps and Docker Talk, that I've

been doing the last seven years: that
one is still gonna have AI in it.

But this one is very specific, and
there might be a few episodes that get

syndicated, or whatever you wanna
call it, on both podcasts.

But most of the time we're gonna
keep the focus of everything

DevOps, everything containers, on
the DevOps and Docker Talk show.

And this one is gonna be very
specific around implementing useful

AI-related things for Agentic DevOps,
or automating our DevOps with robots.

So I hope you enjoyed this episode
with Nirmal from KubeCon London.

Hey, I'm Bret.

And we're at KubeCon.

We are! Hi, Nirmal.

Nirmal: I'm Nirmal Mehta.

I'm a Principal Specialist Solutions
Architect at AWS, and these are

my views and not those of my employer.
But this episode is all about

Bret: AI

Nirmal: agents

Bret: for DevOps and platform engineering.

Ooh.

So let's just start off real
quick with what is an AI agent?

Okay.

So we've heard of AI, we
know AI: gen AI, ChatGPT.

We've talked about

running LLMs, running
inference on platforms.

Yep.

And we are managing the workloads
that provide other people's services.

Absolutely.

So how are AI agents different than that?

Nirmal: This is an area, in
terms of bleeding edge...

Yeah.

This is it, right?

Yeah.

Like, a year ago,

no one

Bret: had this

Nirmal: term.

Bret: Six months ago,

I don't think anybody's

Nirmal: talking about it.

Hmm, very few people.

Yeah, very few people.

And we've seen it in the news, a
lot of vendors and big companies

announcing agentic AI, that's another
term. So: AI agents, agentic AI.

It's giving your LLM, like your ChatGPT
or your Claude or a local LLM like Llama,

Yeah.

access to run commands

on your behalf.

Or on its behalf.

Bret: Yeah.

And we call those tools, in
case you hear that word:

Tools.

Yeah.

That's like the generic term.
Like, I guess a shell

could be a tool.

Correct.

Reading a file could be a tool.

Accessing a remote API of
a web service is a tool.

Yep.

Searching could be a tool.

And so these tools: what makes
that different than what we've

been seeing in our code editors?

Yeah.

How is that different?

Nirmal: I'm a platform engineer
and I want to build out an

EKS cluster using Terraform.

That's what we use.

So I'll ask, let's say, Claude or ChatGPT.

Yeah.

I'm a platform engineer and I want to
build a production-ready EKS cluster,

please create the assets I need.

And it will spit out some Terraform.

YAML, right?

Yeah.

Bret: And it's writing text.

Nirmal: It's writing text.

And it'll give you a little button.

I copy that, put it in. Or, if you're
using Cursor or all these other tools,

you can put it into some TF file.

Yeah.

I can then take that and I can ask
the LLM: what's the command that I

need to run to apply this Terraform?
To actually stand up what's

described in this Terraform?

It'll spit out: okay, you wanna
do terraform plan and then

terraform apply and all that,

terraform init or whatever. And
I'll just copy those commands and

check 'em and write them myself.

So the LLM is not executing
anything on my behalf.

On, on your behalf.

An agent would be defining a tool set.

So I could define a tool called
Terraform, or a tool called Shell, and

I could describe what that
tool does in natural language.

Bret: Okay.

Nirmal: And then I can give
the LLM system a list of these

tools and their descriptions.

And tell it.

Okay?

Back to the same scenario:

I'm a platform engineer and I want
to create an EKS production cluster

using Terraform, and I want you
to create it for me. Because

it has access to those tools,

now it internally reasons: okay,
I need to create some Terraform,

I need to validate it in some
kind of way, and then I need

to execute this Terraform.

Are there any tools that
I have in my toolbox...

Bret: In this case, sorry, the
"I" is... you're referring

to yourself as the AI, right?

Yeah.

Sorry.

It's no longer the
human doing this, right?

No.

We gave it instructions and we sit back

Nirmal: From the perspective of
the LLM, the gen AI tool itself, the

LLM system: that's the "I" in this scenario.

Yeah.

I, the LLM, is deciding.

The gen AI tool is looking at its list of
available tools and matching what it needs

to them. It's reasoning about
what the end goal is, and it looks and

says: there's this tool called Terraform
that allows me to use infrastructure as

code to deploy resources on the cloud.

That sounds like what I need.

Maybe.

And it generates the Terraform just like
it did the first time around.

It knows what command to run.

It generates the command, and then,
the magic here: a little box will

show up and says, do you want me
to execute this on your behalf?

You click the button, and then it
executes that terraform apply.

And it sounds very simple, but it's
a very different paradigm in terms of

thinking about how we interact with
infrastructure, or systems in general.

Like broadly systems in general.

Because, like, in this way of looking
at it or thinking about it, I, as the

human, am no longer executing those commands.

I am trusting, to a certain extent, that
the LLM can figure out what it needs

to do, and giving it a guardrailed
set of tools to use and execute.
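
To make that concrete, here is a minimal sketch, in Python, of the loop Nirmal is describing: the LLM gets a list of natural-language tool descriptions, picks one, and a human approves before anything executes. The `ask_llm` function is a hypothetical stand-in for a real model API call, and the tool set and prompt are illustrative only, not any particular vendor's API.

```python
# A minimal sketch (not any vendor's API) of an agent loop: hand the LLM
# a list of tool descriptions, let it pick one, and require human
# confirmation before anything executes.
import subprocess

TOOLS = {
    "terraform": {
        "description": "Deploy infrastructure as code with Terraform.",
        "run": lambda args: subprocess.run(["terraform", *args], check=True),
    },
    "shell": {
        "description": "Run an arbitrary shell command.",
        "run": lambda args: subprocess.run(args, check=True),
    },
}

def ask_llm(prompt: str) -> dict:
    """Hypothetical stand-in for a real model call. A real implementation
    would return the model's chosen tool and arguments, for example:
    {"tool": "terraform", "args": ["plan"]}."""
    raise NotImplementedError("wire this up to your LLM provider of choice")

def agent_step(goal: str) -> None:
    # The LLM reasons over the goal plus the natural-language tool
    # descriptions, exactly as described above.
    tool_list = {name: t["description"] for name, t in TOOLS.items()}
    choice = ask_llm(f"Goal: {goal}\nAvailable tools: {tool_list}")

    # The guardrail: the agent proposes, a human approves.
    cmd = f"{choice['tool']} {' '.join(choice['args'])}"
    if input(f"Execute `{cmd}` on your behalf? [y/N] ").lower() == "y":
        TOOLS[choice["tool"]]["run"](choice["args"])

# Example (once ask_llm is wired up):
# agent_step("Create a production-ready EKS cluster using Terraform")
```

The key design point is that confirmation prompt: the LLM proposes the command, but a human, or a stricter policy engine, approves it before it touches real infrastructure.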

Bret: Yeah.

And so we're giving the, we're
giving the chaos monkey... I

mean, it's automation, right?

We could actually classify
this as just automation.

It just happens to be

figuring out what to
automate in real time,

rather than the traditional automation
where we have a very deterministic plan

of steps that are repeated over and
over again by a GitHub Actions runner

or a CI/CD platform or something.

Yeah.

Nirmal: And the agent part is the
piece of software that enables

the LLM to execute.

Bret: Yeah.

Nirmal: and pulls this all
together. So, back to what I was

talking about with the infrastructure:
there was a part where I said,

okay, how do we define what tools are
available for the agent system to use?

And how do I want the
agent to call those tools,

and reason about them? Well,
there's a protocol called

MCP, Model Context Protocol,

just outlining a standard way of
defining the tools, the system prompt

for that tool, and a description.

Bret: And this is like an API where
you like define the spec of an API.

Nirmal: It's a defined spec of an
API and the adoption of that API is

Bret: just exploding right now,

Nirmal: essentially.

Bret: Yeah.
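
For anyone who wants to see what that spec looks like in practice, here is a rough sketch of an MCP server exposing a single tool, using the FastMCP helper from the official MCP Python SDK (API as of early 2025; check the SDK docs for current details). The `terraform_plan` tool is purely illustrative; the point is that the tool's name, its natural-language description (the docstring), and its input schema (derived from the type hints) are exactly the pieces Nirmal just listed.

```python
# A rough sketch of an MCP server exposing one tool via the official
# MCP Python SDK's FastMCP helper. The terraform_plan tool is an
# illustration, not a real published server.
import subprocess

from mcp.server.fastmcp import FastMCP

mcp = FastMCP("infra-tools")

@mcp.tool()
def terraform_plan(directory: str) -> str:
    """Run `terraform plan` in the given directory and return its output.

    This docstring is the natural-language description an LLM reads
    when deciding whether this tool matches its goal.
    """
    result = subprocess.run(
        ["terraform", "plan", "-no-color"],
        cwd=directory, capture_output=True, text=True,
    )
    return result.stdout or result.stderr

if __name__ == "__main__":
    mcp.run()  # serves the tool over stdio by default
```

Any MCP client, whether an agent framework or an AI-enabled editor, can then list this server's tools and decide when to call them.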

So, to under... okay, sorry,
lemme back up a second.

That's a very valid point, because
that's the reason I wanted to record this.

I don't wanna be a hype machine.

Correct.

But I'm super excited right now.

If you could see inside my
enthusiastic brain: I've

only been paying attention to
this for a little over a month.

If you asked me two months ago
what an AI agent was, I'd say:

I don't know, a robot that's AI?

I don't know.

I now think I've got a
much better handle on this.

I've been spending so much of my life
right now deep diving into this, to

the point that you and I are talking
about changing some of the focus

this year on all these topics.

Absolutely.

Because I think this is gonna
dominate the conversation.

There's gonna be a lot of predictions
in this, and we're not gonna talk

forever, 'cause it's gonna need to
be multiple episodes to

really break down what's going on here.

But we now have the definitions.

AI agents, what are tools?

The protocol behind it is
essentially MCP right now,

although that's not necessarily gonna be
the only thing. It's just the thing

we're all agreeing on right
now, made by one company.

Exactly.

Nirmal: We have to caveat this with: this
is early, like early Docker days.

This is like

Bret: Docker at day 60, right?

Yes.

Like we were right after
PyCon in 2013, when he gave

that demo. Solomon.

Like, we all saw it and didn't
understand it fully, but it

felt like something, right?

And, like, you and I both, that's why
we were early Docker Captains: we

saw that as a platform shift.

We've seen these waves before over
our careers of many decades; we earned

this gray-beard status
with effort and toil.

And I feel like this is maybe the moment,

like that moment of 2013. And
yeah, I'm not alone in that feeling.

yes.

Nirmal: And, just to be clear,
there's massive differences between

the paradigm shifts in terms of
virtualization, cloud, containers,

and the tooling of software
development and systems development

and systems operations. It's
still in that same vein, but

yeah, we're not replacing...

Bret: This is not replacing infrastructure
or containers or anything like that.

This is just gonna change the way we work.

Nirmal: Correct.

And also it's broader than
just IT infrastructure.

Like, this has implications for
software design, or for applications,

like what an application does.

And I want to think of
this as a teaser trailer

to a subsequent new series of episodes.

A new series.

Yeah, absolutely.

We're gonna have to

Bret: come up with a name.

I'm toying around with the idea of
Agentic DevOps, and just classifying

that, absolutely, as the theme of
a certain set of podcast episodes.

You've heard it here first.

Heard it here first.

This

Nirmal: is Agentic DevOps.

Another term we're seeing is AI four ops.

Again, this is early days.

None of this is like

Bret: Yeah.

Set in stone at all.

Yeah, and if you're at KubeCon today
with us, if you were here at this

conference all week: AI was a constant
topic, but it wasn't about this.

Actually, there was only one talk in
an entire week that even touched on the

idea of using AI to do the job of a DevOps
engineer or operator or platform engineer.

Like, what we've been talking
about at KubeCon for the last three

years has been how to run the
inference and build the LLM models.

And so we are just still using
human effort to do that work.

But I feel like I'm gonna draw the
line in the sand and say this is the

month, or definitely the
year, that kicks off

what will be a multi-year effort of
figuring out how we use automated

LLMs, essentially, with access to all
the tools we want to give them, with

the proper permissions, and only
the permissions we want to give

them, to do our work for us.

In a less chaos-monkey way, right?

Like, a less chaotic way.

Potentially.

Potentially.

It could, this thing can
easily go off the rails.

Absolutely.

I will probably reference in the show
notes Solomon Hykes's recent talks about

how they're now using Dagger, which
is primarily a CI/CD pipeline tool.

So he's talking, and a lot of my
language is actually from him, iterating

on his idea of what this might look
like when we're throwing a bunch

of crazy hallucinating AI into what
we consider a deterministic world.

Correct.

Nirmal: I think with containers and cloud
and the infrastructure APIs we have,

we were chipping away and really
aiming at deterministic behavior

with respect to infrastructure.

Ironically, maybe not
ironically, I don't know.

Now we're introducing a paradigm
shift that reintroduces a lot

of non-determinism right into

a place where we have been fighting
non-determinism for a long time.

Bret: We have been working
to get rid of all that.

And now we're, that's why I keep
saying Chaos monkey, because we're

throwing a wrench into the system.

That in some ways feels like we're
going back to a world of: I don't

know, what's the status of the system?

I don't know.

And this will probably be another
episode: I feel like this agentic

approach, where we actually
have the potential to pit

the LLMs against each other, right?

And have different
personas of these agents.

One is the validator, one is the tester.

One is the builder.

And they can fight amongst each other.

And it all works out.

It actually happens to
work out better.

And so if you're like me: for
the last three years of

understanding, ever since GPT-3.5
or whatever came out,

we all saw ChatGPT as a product, and
then we started with GitHub Copilot, and

we started down this road. As a DevOps
person, I haven't had a lot to talk about,

because I'm not interested in which model
is the fastest or the most accurate.

'cause you know what?

They all hallucinate still,
even today, years later.

Code agents... and you can see
this on YouTube, you can watch basically

thousands of videos on YouTube of
people trying to use these models to

write perfect code, and they just don't.

And so we in ops, we look at
that, I think, and the people I

talk to, even for years now, are like:
we're never gonna use that for ops.

But now my opinion has changed.

Yeah.

Nirmal: Yeah.

And if you're listening to this
and your gut reaction is: wait, we

have, like, APIs that are deterministic.

Like, you just

Bret: Yeah.

Nirmal: We can just call an API.

We can have an automation tool call an
API to stand up infrastructure. And, like,

why do we need to recreate another
layer that makes it non-deterministic,

and looks like an API but isn't an API,
and you don't really know what it might

do or which direction it might go?

Yeah.

And you're feeling: I don't know,

that doesn't seem like it would
solve any problems for me,

and it seems like it might
introduce a lot of problems...

You're in the right place, because
that's exactly what we're gonna explore.

Bret: Yeah.

Nirmal: one thing for sure
though is it's here, right?

I and so I feel like as good engineers,
as good system admins and operators

Bret: We enjoy, we love our crafts.

We look at this as an

art form of brain power. And, right,

reaching for perfectionism in our
YAML and in our infrastructure

optimization and our security.

Nirmal: And we have a healthy
sense of skepticism on new tools,

new processes, new mechanisms.

Yeah.

When availability of your services
is paramount, and reliability,

you want to introduce new things

in a prudent manner.

And so we're gonna take that
approach, but we're not going

to dismiss that this exists.

Clearly there's a lot of interest and
energy, integration happening,

experimentation happening, and some
people are already starting to see value.

Yeah.

And we're gonna explore
with you where that goes.

Bret: Yeah.

This, just to be clear, is
KubeCon, April 2025, and almost

no one is talking about this yet.

It feels like it's right under the
surface of a lot of conversations, and

a lot of people maybe are thinking
about it, but I'm not even sure that

we're being honest with ourselves

that this is coming,
whether we like it or not.

And, yeah, not only, but one
of the large reasons is business.

Okay.

Lemme back up.

You know how, in a lot of organizations,
Kubernetes became a mandate, right?

So there's lots of stories that came
out over the course of Kubernetes'

lifetime of teams being told that
they need to implement Kubernetes.

It didn't come from a systems engineering
approach of solving a known problem.

It came down

because an executive read a
CIO magazine article

that said Kubernetes was a cool
new thing, and they did it, right?

I hear this all the time.

I confirmed this multiple times this
week with other people, and I now feel

like we're just not talking about it yet.

But I did hear multiple analysts say the
organizations that they're working with

expect that we are going to reduce the
number of personnel in infrastructure

because of AI.

The only way that's possible
is if we use agents to our

advantage, because we can't otherwise. Yeah.

I still don't believe
we're replacing ourselves.

I don't think the agents will,
in the near term,

and as far as we can see out, let's
say five years... they

won't be running all infrastructure
in the world by themselves.

They can't turn on servers.

Maybe you can actually PXE boot and
do a power-on over PoE or whatever, but...

Like, we still need someone to give them
orders and rules and guidelines to go

do the work. But to me, I'm starting
to wonder if, very quickly, especially

for those bleeding-edge organizations that
are looking to squeeze out every cost

optimization they can out of their staff,
they're going to be mandated to

not just take AI as a code gen for YAML,
but to start using these agents to

increase the velocity of their work.

And one of my stories, I do this in
talks, is that over the last 30 years,

every major shift has been about
speed and cost reduction.

Sometimes we get 'em
both at the same time.

Sometimes they're one or the other.

We get a cost reduction, but we
don't go any faster, which is

fine, or we're going faster, but
it's not necessarily cheaper yet.

Nirmal: Right.

Bret: And I feel like this is maybe the
next one, where we're gonna be feeling the

pressure, because all the devs are
gonna be writing code with AI, which

in theory is going to improve their
performance, which means they're writing

more code, shipping more, or needing or
wanting to ship more code, potentially.

And if we're not using AI ourselves

to automate more of these platform
designs, platform build-outs,

troubleshooting when we're in production
and things are problematic and we

don't wanna spend three hours trying
to find the source of the problem...

if we're not starting to use agents
to automate a lot of that and reduce the

time to market, so to speak, for a certain
feature or platform feature, then I don't

think these teams are gonna hire more
of us to help enable the devs to deploy.

What could end up happening is we
end up with more shadow ops, where

the developers are so fed up with us
not speeding up to them. If they're

gonna go 10x, we have to go 10x. Yeah.

If they're gonna go 3x, or whatever
the number ends up being in the reports

that Gartner puts out, like: AI
makes it more efficient

for developers to code.

And the models get better, and
the way they use it is better.

And so they're shipping code faster, and
they can do the same speed with three

times fewer developers, or they can just

produce three times more work, which I
think is more likely, because if it's

the common denominator and everyone
has it, then that means every company

can execute faster and they're gonna,
they're gonna want to do that because

their competitors are doing that.

So that's, that's a very
loaded and long prediction.

Nirmal: That's a hypothesis.

I think there's a
lot of prediction in here.

It's gonna take some time for us to
even chip away at that hypothesis,

but it's a good starting point.

But assuming that is the
hypothesis that organizations

are looking at to adopt these
tools, that's a great starting point

for us to help you figure out

what they are, why they are, what they do.

Yeah.

And how to use them.

Bret: By the way, this is a
little bit of that opinion

of mine, and there's more to come,
'cause I've got a lot more written

down that we're never gonna get to.

But a significant portion of that is
actually coming from what I've learned

this week from analysts whose job it
is to figure this stuff out for their

organizations and their customers.

Interesting.

And so I am a little weighted by their

almost unrealistic expectations
of how fast we can do this,

'cause we are still humans.

An organization can't adopt AI until
the humans learn how to adopt AI and

the humans have to go at human speed.

So we can't just flip a switch
and suddenly AI is here and

running everything for us.

At least not until we
have Iron Man's Jarvis.

Or whatever.

Like until we have that, we still have
to learn these tools and still have

to adapt our platforms to use them.

Yes.

And adapt our learning to use them.

And that's gonna take some time

Nirmal: And I'd like the parting
thought for this to be: okay,

like you said, there's an
under-the-surface kind of thing happening.

Yeah.

So, whispers...

Bret: It's almost like murmurs under

Nirmal: the surface.

Yeah.

AI agents, AI agents, Agentic

Bret: DevOps.

Ooh.

This is our ASMR
moment of the podcast.

Nirmal: Like MCP protocol.

Bret: Yeah.

Nirmal: You mentioned HAProxy on
the previous podcast, about load

balancing and figuring out, like,
the token utilization of

GPUs and tokens and all that stuff.

And we had a conversation at the Solo.io
booth, and they were talking about having

a proxy, an MCP gateway. One of
the things that we're seeing the early

signs of is these new workloads, right?

This agentic kind of thinking, around
even just executing the agentic platform,

if you will. Everything from
looking at the tokens and optimizing

load balancing to inference endpoints.
And MCP doesn't behave the same

way as, like, just an HTTP connection.

Necessarily.

And Solo.io, we were talking to
them, and they have an MCP gateway.

We're seeing a little bit more
of a trend on AI gateways.

Istio, the project, has an AI gateway,
and so this is not just another workload

that looks like just a web server.

And the networking and
everything is gonna be different.

Not dramatically different, but
different enough that we need to be aware.

'Cause even if you're not using
any of these tools, someone in your

organization is probably gonna say:
oh, we need to integrate this stuff

into our software, into, right,

whatever we're delivering.

And we'll need to know
it even at that layer.

So we're gonna also cover that
component as it relates to

the Kubernetes ecosystem, right?

And cloud native.

Bret: Yeah.

I think, if we had to do like an
elevator pitch for this podcast, it would

be: we now have an industry idea around
this term, agents, and they use an API

called MCP to allow us to give more work

to these crazy robot texting things
that we have to talk to in human

language and not with code, right?

It's running code, but we're
not talking to it with code.

And it can now understand all the
tools we need to use, and we can just give

it a list of everything we want it to use:

here's my Kubernetes API, here's
all my other things that you have

access to, and here's my problem.

Go solve it.

And that paradigm, three months ago,
two months ago for me, I didn't know existed.

And that's why I've been sitting
on the sidelines with AI.

Like it's cool for writing programs
that mostly work in a demo.

It's cool for adding a feature to
something I already have, but it's

not doing my job as a platform
engineer or DevOps engineer.

It's just helping me write text faster

than I can type into my keyboard.

And that was not that interesting.

That's why you didn't see a lot of
me talking about that on this show:

it just wasn't that interesting.

This is an interesting topic for ops, and,
absolutely, for engineers on the platform.

Nirmal: Yep.

Bret: So

Nirmal: stay tuned.

Yeah.

And I, I love crazy texting robots.

Crazy

Bret: texting robots.

Maybe that's the title.

TBD.

Alright.

Alright.

See you soon, man.

See

Nirmal: you.

See you.

Bye.

Bye.

Creators and Guests

Bret Fisher
Host
Cloud native DevOps Dude. Course creator, YouTuber, Podcaster. Docker Captain and CNCF Ambassador. People person who spends too much time in front of a computer.

Nirmal Mehta
Host
Principal Specialist Solutions Architect at Amazon Web Services (AWS)

Beth Fisher
Producer
Producer of the DevOps and Docker Talk and Agentic DevOps podcasts. Assistant producer on Bret Fisher Live show on YouTube. Business and proposal writer by trade.

Cristi Cotovan
Editor
Video editor and educational content producer. Descript and Camtasia coach.