What a Platform Engineer is supposed to know

The title is relatively new, but the problem it solves is not. As engineering organizations grow, the gap between "infrastructure people" and "product people" quietly becomes a bottleneck. Deployments slow down. On-call becomes unsustainable. Every team reinvents the same CI pipeline. Platform engineering emerged as a direct answer to that friction, not by adding more process, but by building better tools from the inside out.

Platform engineering is one of those roles that sounds straightforward until you try to explain it at the dinner table with your family. You touch infrastructure, but you are not a sysadmin. You write code, but you are not a software engineer. You care deeply about developer experience, but you are not a product manager. The truth is you are a bit of all of those things, and that blend is exactly what makes the role both challenging and genuinely exciting.

This article shares what I have learned working with Kubernetes, GitLab, ArgoCD, Backstage, Datadog, and cloud platforms like GCP and AWS. It is not a definitive guide. It is an honest picture of what the job actually demands, and a look at how AI is starting to reshape some of those expectations.

The foundation: you need to understand how things run

Before you can build a platform for other engineers, you need a solid mental model of the infrastructure underneath it. In practice, this means you cannot treat Kubernetes as a black box. You need to understand how pods are scheduled, what resource limits actually do to your workloads, how networking flows between services, and why a node can be "Ready" but still misbehaving.
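
To make "what resource limits actually do" a little more concrete: a container that exceeds its memory limit is killed, a container that hits its CPU limit is throttled, and the relationship between requests and limits decides the pod's QoS class, which influences eviction order when a node runs short of resources. The sketch below approximates that classification in Go. The Resources and Container types are simplified stand-ins I made up for illustration, not the real Kubernetes API types.

```go
package main

import "fmt"

// Resources holds CPU in millicores and memory in MiB for one container.
// Zero means "not set". Simplified stand-in, not the real Kubernetes types.
type Resources struct {
	CPUMilli  int
	MemoryMiB int
}

// Container pairs the requests and limits of a single container.
type Container struct {
	Requests Resources
	Limits   Resources
}

// QoSClass approximates the Kubernetes QoS classification of a pod.
func QoSClass(containers []Container) string {
	anySet := false
	guaranteed := true
	for _, c := range containers {
		req, lim := c.Requests, c.Limits
		// A missing request defaults to the limit, as Kubernetes does.
		if req.CPUMilli == 0 {
			req.CPUMilli = lim.CPUMilli
		}
		if req.MemoryMiB == 0 {
			req.MemoryMiB = lim.MemoryMiB
		}
		if req != (Resources{}) || lim != (Resources{}) {
			anySet = true
		}
		// Guaranteed needs both limits set and equal to the requests.
		if lim.CPUMilli == 0 || lim.MemoryMiB == 0 || req != lim {
			guaranteed = false
		}
	}
	switch {
	case !anySet:
		return "BestEffort"
	case guaranteed:
		return "Guaranteed"
	default:
		return "Burstable"
	}
}

func main() {
	pod := []Container{{Requests: Resources{100, 128}, Limits: Resources{500, 256}}}
	fmt.Println(QoSClass(pod)) // prints "Burstable"
}
```

A pod with no requests or limits at all lands in BestEffort and is among the first candidates for eviction under node pressure, which is one reason platform templates usually set sensible defaults.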

On GCP, that means understanding how GKE manages node pools, how IAM permissions propagate, and how services like Pub/Sub, Cloud SQL, or Memorystore fit into a production architecture. On AWS, the same principle applies: knowing that an S3 bucket policy and an IAM role policy are evaluated together, or how VPC peering actually routes traffic, saves hours of debugging.
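
The "evaluated together" point can be sketched as a small decision function. This is a deliberately reduced model of AWS policy evaluation, assuming only an identity-based policy and a resource-based policy (such as a bucket policy) are in play. It ignores service control policies, permission boundaries, and session policies, and the Effect type and Evaluate function are names I invented for the example.

```go
package main

import "fmt"

// Effect is the outcome one policy produces for a given request.
type Effect int

const (
	NoStatement Effect = iota // the policy says nothing about the request
	Allow
	Deny
)

// Evaluate combines an identity-based and a resource-based policy result.
// Simplified model: an explicit Deny always wins; within one account an
// Allow from either side is enough; across accounts both must Allow.
func Evaluate(identity, resource Effect, sameAccount bool) bool {
	if identity == Deny || resource == Deny {
		return false
	}
	if sameAccount {
		return identity == Allow || resource == Allow
	}
	return identity == Allow && resource == Allow
}

func main() {
	// Role allows s3:GetObject, but the bucket policy explicitly denies it.
	fmt.Println(Evaluate(Allow, Deny, true)) // prints "false"
	// Role is silent, bucket policy allows, same account.
	fmt.Println(Evaluate(NoStatement, Allow, true)) // prints "true"
	// Same situation across accounts: the role must allow it too.
	fmt.Println(Evaluate(NoStatement, Allow, false)) // prints "false"
}
```

Most of the "why is this AccessDenied" debugging I have done comes down to one of these three cases.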

The point is not to memorize every API. The point is to reason confidently about why something is broken, and to design systems that fail predictably rather than mysteriously.

Infrastructure-as-code is not optional at this level. Terraform is a de facto standard for a reason: it gives you a reproducible, reviewable, and auditable description of your infrastructure. But knowing how to write resources is only the beginning. The more important skill is understanding state management, how to handle drift, and how to structure modules so that other teams can consume them without stepping on each other.

CI/CD is the heartbeat of the platform

If infrastructure is the skeleton, CI/CD pipelines are the heartbeat. Every deployment, every test run, every release goes through them. And when they are slow, flaky, or hard to understand, the entire engineering organization feels it.

GitLab CI/CD is where I have spent a lot of time. The power of the platform comes from its composability: includes, extends, parent-child pipelines, rules, and artifacts can be combined to build pipelines that are both DRY and flexible. But that same power makes it easy to create something unmaintainable. The discipline of a platform engineer is to design pipeline templates that product teams can use confidently without needing to understand all the internals.

ArgoCD adds another dimension to this. With GitOps, the pipeline is no longer just about building and testing. It becomes the mechanism that reconciles what is declared in a repository with what is running in the cluster. This shifts how you think about deployments: instead of running a script that pushes changes, you update a manifest and let the system converge. That model brings enormous reliability benefits, but it also requires you to think carefully about environment promotion, secret management, and what "drift" means in a GitOps context.
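
The "declare and converge" model is easier to reason about with a toy version of the comparison step. The sketch below is not how ArgoCD is implemented. It assumes desired and live state can be reduced to a map from resource name to a version string (an image tag or manifest hash, for example), and the Diff function is a hypothetical helper.

```go
package main

import (
	"fmt"
	"sort"
)

// Diff compares desired state (what Git declares) with live state (what
// the cluster runs) and returns the actions needed to converge them.
func Diff(desired, live map[string]string) []string {
	var actions []string
	for name, want := range desired {
		got, exists := live[name]
		switch {
		case !exists:
			actions = append(actions, "create "+name)
		case got != want:
			actions = append(actions, "update "+name)
		}
	}
	// Anything live but no longer declared is drift that needs pruning.
	for name := range live {
		if _, declared := desired[name]; !declared {
			actions = append(actions, "delete "+name)
		}
	}
	sort.Strings(actions) // deterministic output; map order is random
	return actions
}

func main() {
	desired := map[string]string{"api": "v2", "worker": "v1"}
	live := map[string]string{"api": "v1", "cron": "v1"}
	for _, action := range Diff(desired, live) {
		fmt.Println(action)
	}
	// prints:
	// create worker
	// delete cron
	// update api
}
```

The "delete" case is the one worth thinking hardest about. In ArgoCD, pruning resources that disappeared from Git is opt-in, so deciding whether to enable it is part of defining what drift means for your platform.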

The practical expectation here is that a platform engineer can design a deployment workflow end to end, from a developer pushing a commit to a container running in production, with clear rollback paths, good visibility into what is happening, and minimal manual intervention.

Backstage: building the internal developer portal

One of the more underappreciated investments a platform team can make is an internal developer portal. Backstage, the open-source framework from Spotify, is the most common foundation for this today.

At its core, Backstage is a catalog. It gives every service, library, pipeline, and piece of infrastructure a home: a page where you can see who owns it, what it depends on, where it is deployed, and how healthy it is. For an organization managing hundreds of repositories, that alone is transformative.

But the real value comes from plugins. A GitLab plugin surfaces merge requests and pipeline status directly in the catalog. A Kubernetes plugin shows live pod health. A custom plugin can expose whatever internal tool your organization relies on. The platform engineer's job is to wire these things together in a way that reduces context switching for developers and makes the right information visible at the right time.

Building Backstage plugins means writing TypeScript and React, which is a different muscle from infrastructure work. That is intentional. Platform engineering at scale requires you to be comfortable across multiple layers of the stack.

Observability is not monitoring

This distinction matters more than it might seem. Monitoring is checking whether known things are working. Observability is building systems you can interrogate when something unexpected happens.

Datadog is where I have done most of this work. Metrics, logs, and traces are the three pillars, and a mature observability setup connects all three. A spike in latency should lead you directly to the affected traces, which should point to the relevant logs, which should give you enough context to understand the root cause without needing to SSH into a server.

In practice, this means instrumentation is a first-class concern. Platform engineers should define the standards for how services emit metrics and structured logs, so that when something breaks at 2am, the person on call has something useful to look at. This includes defining SLIs and SLOs: concrete, measurable targets for reliability that the platform team and product teams agree on together.
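
The arithmetic behind an SLO is simple enough to show directly. Here is a minimal sketch, assuming a request-based availability SLI (good requests divided by total requests) over some agreed window. The function names and the example numbers are mine, not from any particular tool.

```go
package main

import "fmt"

// Availability returns the fraction of successful requests (the SLI).
func Availability(total, failed int) float64 {
	if total == 0 {
		return 1 // no traffic, nothing failed
	}
	return float64(total-failed) / float64(total)
}

// BudgetRemaining returns the share of the error budget still unspent.
// slo is a target such as 0.999. A negative result means the budget is
// overspent for the window.
func BudgetRemaining(slo float64, total, failed int) float64 {
	if total == 0 {
		return 1
	}
	allowed := (1 - slo) * float64(total) // failures the SLO tolerates
	if allowed <= 0 {
		return 0 // a 100% target leaves no budget to spend
	}
	return 1 - float64(failed)/allowed
}

func main() {
	// A 99.9% target over one million requests tolerates about 1000 failures.
	total, failed := 1_000_000, 250
	fmt.Printf("SLI: %.5f\n", Availability(total, failed))
	fmt.Printf("budget remaining: %.0f%%\n", BudgetRemaining(0.999, total, failed)*100)
}
```

Framing reliability as a budget is what makes the conversation with product teams productive: while budget remains, shipping continues, and when it runs out, reliability work takes priority.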

Alerts should be actionable. An alert that fires without a clear path to resolution is noise, and noise leads to alert fatigue, which leads to real incidents being ignored. Building a healthy alert culture is as much a cultural challenge as a technical one.

Custom tooling: when you need to build what does not exist

Sometimes the right answer is to build something. Internal CLI tools, automation scripts, custom Kubernetes operators, pipeline utilities: these are the kinds of things a platform team reaches for when off-the-shelf solutions do not fit.

Go has become my language of choice for this. It compiles to a single binary, which makes distribution trivial. The standard library is rich enough for most infrastructure tooling needs. The concurrency model is practical for the kinds of workloads that come up: watching Kubernetes resources, calling APIs in parallel, processing event streams.
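
As an illustration of the "calling APIs in parallel" case, here is a minimal fan-out sketch using only the standard library. The service names and the check function are placeholders. In a real tool the check would be an HTTP call or a Kubernetes API request, probably with a context and a timeout.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// ErrUnhealthy is a placeholder error for the example.
var ErrUnhealthy = errors.New("unhealthy")

// CheckAll runs check for every target concurrently and collects the
// result for each one. It returns once all checks have finished.
func CheckAll(targets []string, check func(string) error) map[string]error {
	results := make(map[string]error, len(targets))
	var mu sync.Mutex // guards results, since maps are not goroutine-safe
	var wg sync.WaitGroup
	for _, t := range targets {
		wg.Add(1)
		go func(target string) {
			defer wg.Done()
			err := check(target)
			mu.Lock()
			results[target] = err
			mu.Unlock()
		}(t)
	}
	wg.Wait()
	return results
}

func main() {
	// Hypothetical check: pretend one service is failing.
	check := func(target string) error {
		if target == "payments" {
			return ErrUnhealthy
		}
		return nil
	}
	for name, err := range CheckAll([]string{"api", "payments", "search"}, check) {
		fmt.Println(name, err)
	}
}
```

Twenty lines of standard library replace what would be a thread pool and a fair amount of ceremony elsewhere, which is a large part of why I keep reaching for Go.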

The expectation is not that every platform engineer is a senior software engineer. But you should be comfortable writing code that other people will maintain. That means readable structure, useful error messages, and documentation that explains not just what the tool does but why it exists and when to use it.
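
"Useful error messages" is the part people skip, so here is a small sketch of what I mean in Go: wrap the underlying error with context about what the tool was trying to do, and keep the original error in the chain so callers can still react to it. The deploy.yaml file name and the function names are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
	"io/fs"
	"os"
)

// ReadDeployConfig reads a config file and wraps any failure with context.
func ReadDeployConfig(path string) ([]byte, error) {
	data, err := os.ReadFile(path)
	if err != nil {
		// %w keeps the original error in the chain for errors.Is / errors.As.
		return nil, fmt.Errorf("read deploy config %q: %w", path, err)
	}
	return data, nil
}

// IsMissing reports whether err was caused by a file that does not exist.
func IsMissing(err error) bool {
	return errors.Is(err, fs.ErrNotExist)
}

func main() {
	_, err := ReadDeployConfig("deploy.yaml") // hypothetical file name
	switch {
	case err == nil:
		fmt.Println("config loaded")
	case IsMissing(err):
		// Tell the user what to do next, not just what went wrong.
		fmt.Fprintln(os.Stderr, err)
		fmt.Fprintln(os.Stderr, "hint: create deploy.yaml or pass a different path")
	default:
		fmt.Fprintln(os.Stderr, err)
	}
}
```

The difference between "no such file or directory" and a message that names the file and suggests a next step is often the difference between a support ticket and a developer unblocking themselves.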

Security and secrets management are not someone else's problem

In smaller organizations, security tends to be deferred. In a platform engineering context, it becomes part of the fabric of everything you build. The platform is the leverage point: if you bake security practices into the golden paths, teams get them for free without having to think about them.

Secrets management is the clearest example. Hardcoded credentials in CI pipelines are one of the most common sources of security incidents. A platform team that integrates HashiCorp Vault or a cloud-native equivalent like GCP Secret Manager into the standard deployment workflow removes that risk at scale.
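
On the consuming side, the pattern is small. The sketch below assumes the secret has already been injected into the process environment by the pipeline or a Vault/Secret Manager integration. It does not call any Vault or GCP API, and DEPLOY_TOKEN is a made-up variable name. The lookup function is passed in so the logic can be exercised without touching the real environment.

```go
package main

import (
	"fmt"
	"os"
)

// RequireSecret fetches a secret through lookup and fails loudly if it is
// absent or empty, instead of silently falling back to a default.
func RequireSecret(name string, lookup func(string) (string, bool)) (string, error) {
	value, ok := lookup(name)
	if !ok || value == "" {
		return "", fmt.Errorf("secret %s is not set; it should be injected at deploy time, not committed", name)
	}
	return value, nil
}

// Redact returns a log-safe placeholder that reveals only the length.
func Redact(secret string) string {
	return fmt.Sprintf("<redacted, %d bytes>", len(secret))
}

func main() {
	// os.LookupEnv has the signature lookup expects.
	token, err := RequireSecret("DEPLOY_TOKEN", os.LookupEnv) // hypothetical name
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		return
	}
	fmt.Println("deploy token loaded:", Redact(token))
}
```

When the golden path makes this the default, nobody has a reason to paste a credential into a pipeline file in the first place.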

RBAC, image scanning, policy enforcement with tools like OPA or Kyverno, supply chain security with Renovate for automated dependency updates: these are not exotic concerns. They are the baseline hygiene that a platform team is responsible for encoding into the system.
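
To show what "encoding hygiene into the system" looks like, here is the logic of one common rule: reject container images that are untagged or use the mutable latest tag. In practice this would be written as a Kyverno policy or OPA Rego and enforced at admission time. The Go version below is only an illustration of the rule itself, and CheckImage is a hypothetical helper with deliberately simple reference parsing.

```go
package main

import (
	"fmt"
	"strings"
)

// CheckImage returns an error if the image reference is not pinned to a
// specific tag or digest. Parsing is simplified for illustration.
func CheckImage(image string) error {
	// A digest pins the exact content, so it always passes.
	if strings.Contains(image, "@sha256:") {
		return nil
	}
	// Only look at the last path segment, so a registry port such as
	// registry.local:5000 is not mistaken for a tag.
	last := image[strings.LastIndex(image, "/")+1:]
	i := strings.LastIndex(last, ":")
	if i < 0 {
		return fmt.Errorf("image %q has no tag; pin a version or digest", image)
	}
	if tag := last[i+1:]; tag == "" || tag == "latest" {
		return fmt.Errorf("image %q uses a mutable or empty tag; pin a version or digest", image)
	}
	return nil
}

func main() {
	images := []string{
		"nginx",
		"nginx:latest",
		"registry.local:5000/app",
		"registry.local:5000/app:1.4.2",
	}
	for _, image := range images {
		if err := CheckImage(image); err != nil {
			fmt.Println("REJECT:", err)
			continue
		}
		fmt.Println("OK:    ", image)
	}
}
```

Only the last image in the list passes. The value of the rule is less about the check and more about where it runs: enforced by the platform, it applies to every team without anyone having to remember it.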

The skill that is hardest to teach: product thinking

Everything described above is learnable. The hardest part of platform engineering is developing genuine empathy for the developers who use what you build.

The platform is a product. It has users. Those users have frustrations, workflows, and deadlines. A platform team that builds technically impressive infrastructure that nobody understands or uses has failed at its core job, regardless of how elegant the Terraform modules are.

This means talking to developers regularly. It means running feedback sessions, tracking adoption of new tooling, and being honest about when something you built is not working for people. It means writing documentation that assumes the reader is smart but not omniscient. It means being willing to simplify something you spent weeks building because the complexity is not worth the cognitive load it imposes.

The measure of a good platform team is not the tools it ships. It is the reduction in friction that developers experience over time.

How AI is changing the picture

This is where things get genuinely interesting, and genuinely uncertain.

AI coding assistants are already changing how platform engineers work. Generating Terraform modules, writing Kubernetes manifests, scaffolding Go services, drafting runbooks: tasks that used to take hours can now take minutes with a good prompt and a capable model. That is not a threat to the role. It is an acceleration of it.

The more interesting shift is in what becomes possible. AI-assisted incident response, where a model can correlate signals across metrics, logs, and traces and surface a hypothesis, is already emerging. So are AI-generated pipeline templates that adapt to a project's language and structure, and internal chatbots that can answer "how do I deploy to staging" using your actual documentation as context.

The platform engineer of the near future is not just someone who builds tools. They are someone who understands how to embed intelligence into the platform itself, reducing the cognitive load on developers even further than was previously possible.

That requires the same foundation described throughout this article: the infrastructure knowledge, the CI/CD craft, the observability instincts, and the security discipline. AI does not replace that foundation. It builds on top of it. The engineers who will thrive are those who combine deep operational experience with curiosity about what these new tools make possible.

What it actually takes

Platform engineering is not a role you can fake with certifications alone. It is built from experience: the accumulated understanding that comes from debugging a GKE cluster at midnight, from tracing a latency issue through three services in Datadog, from realizing that your beautifully designed pipeline template is unusable because you forgot to document one required variable.

What it takes, in short:

A genuine curiosity about how distributed systems work and fail
The discipline to automate what should not require human intervention
The empathy to build tools that other engineers actually want to use
The engineering craft to write code that lasts longer than the sprint it was written in
The openness to keep learning, because the landscape keeps shifting

The role is hard to put in a box. That is also what makes it one of the most interesting places to be in engineering right now.