Training AI agents to optimize cloud infrastructure is tricky when the feedback loop requires real cloud spend. We've been working on a simulation environment for exactly this: multi-cloud (AWS, GCP, Azure, OCI, DigitalOcean), chaos injection, autoscaling, and cost modeling, all accessible via a REST API so agents can run episodes without touching real resources.
Curious if anyone else is working on agentic infra management or has thoughts on how to structure the reward signal for cost vs. reliability tradeoffs. Happy to share more about how the simulation engine works.
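
To make the reward question concrete, here is one possible shape for the signal, purely as a conversation starter: a normalized cost term plus a penalty that only kicks in when availability drops below an SLO target. Everything here is a hypothetical assumption for illustration (`StepMetrics`, `budget_usd`, `slo`, `penalty`, and the metric names); none of it comes from the simulator's actual API.

```python
from dataclasses import dataclass


@dataclass
class StepMetrics:
    """Hypothetical per-step metrics an environment might report."""
    cost_usd: float      # simulated spend during this step
    availability: float  # fraction of requests served, 0.0 to 1.0


def reward(m: StepMetrics, budget_usd: float = 10.0,
           slo: float = 0.999, penalty: float = 5.0) -> float:
    """Cost vs. reliability reward; all weights are illustrative assumptions."""
    # Cost term: spend as a fraction of a per-step budget (always <= 0).
    cost_term = -m.cost_usd / budget_usd
    # Reliability term: zero while the SLO is met, then grows with the
    # shortfall measured in units of the error budget (1 - slo).
    shortfall = max(0.0, slo - m.availability)
    reliability_term = -penalty * shortfall / (1.0 - slo)
    return cost_term + reliability_term


if __name__ == "__main__":
    print(reward(StepMetrics(cost_usd=5.0, availability=1.0)))
    print(reward(StepMetrics(cost_usd=1.0, availability=0.99)))
```

With these made-up weights, a cheap step that blows through the SLO scores far worse than a pricier step that stays within it, which keeps the agent from "saving money" by shedding traffic. Whether a hard constraint or a soft penalty like this works better probably depends on how the chaos injection is tuned.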