Particularly in this dawning era of generative AI, cloud costs are at an all-time high. But that’s not merely because enterprises are using more compute; it is also because they’re not using it efficiently. In fact, just this year, enterprises are expected to waste $44.5 billion on unnecessary cloud spending.
The problem is amplified for Akamai Technologies: The company runs a large and complex infrastructure across multiple clouds and must meet numerous strict security requirements.
To resolve this, the cybersecurity and content delivery provider turned to the Kubernetes automation platform Cast AI, whose AI agents help optimize cost, security and speed across cloud environments.
Ultimately, the platform helped Akamai cut between 40% and 70% of cloud costs, depending on workload.
“We needed a continuous way to optimize our infrastructure and reduce our cloud costs without sacrificing performance,” Dekel Shavit, senior director of cloud engineering at Akamai, told VentureBeat. “We’re the ones processing security events. Delay is not an option. If we’re not able to respond to a security attack in real time, we have failed.”
Specialized agents that monitor, analyze and act
Kubernetes manages the infrastructure that runs applications, making it easier to deploy, scale and manage them, particularly in cloud-native and microservices architectures.
Cast AI has integrated into the Kubernetes ecosystem to help customers scale their clusters and workloads, select the best infrastructure and manage compute lifecycles, explained founder and CEO Laurent Gil. Its core platform is Application Performance Automation (APA), which operates through a team of specialized agents that continuously monitor, analyze and take action to improve application performance, security, efficiency and cost. Companies provision only the compute they need from AWS, Microsoft, Google or others.
APA is powered by several machine learning (ML) models with reinforcement learning (RL) based on historical data and learned patterns, enhanced by an observability stack and heuristics. It is coupled with infrastructure-as-code (IaC) tools on several clouds, making it a completely automated platform.
Gil explained that APA was built on the tenet that observability is just a starting point; as he put it, observability is “the foundation, not the goal.” Cast AI also supports incremental adoption, so customers don’t have to rip out and replace their stack; they can integrate the platform into existing tools and workflows. Further, nothing ever leaves customer infrastructure; all analysis and actions occur within their dedicated Kubernetes clusters, providing more security and control.
Gil also emphasized the importance of human-centricity. “Automation complements human decision-making,” he said, with APA maintaining human-in-the-middle workflows.
Akamai’s unique challenges
Shavit explained that Akamai’s large and complex cloud infrastructure powers content delivery network (CDN) and cybersecurity services delivered to “some of the world’s most demanding customers and industries” while complying with strict service level agreements (SLAs) and performance requirements.
He noted that for some of the services Akamai consumes, it is probably its vendor’s largest customer, adding that the team has done “tons of core engineering and reengineering” with its hyperscaler to support its needs.
Further, Akamai serves customers of various sizes and industries, including large financial institutions and credit card companies. The company’s services are directly related to its customers’ security posture.
Ultimately, Akamai needed to balance all this complexity with cost. Shavit noted that real-life attacks on customers could drive capacity 100X or 1,000X on specific components of its infrastructure. But “scaling our cloud capacity by 1,000X in advance just isn’t financially feasible,” he said.
His team considered optimizing on the code side, but the inherent complexity of their business model required focusing on the core infrastructure itself.
Automatically optimizing the entire Kubernetes infrastructure
What Akamai really needed was a Kubernetes automation platform that could optimize the costs of running its entire core infrastructure in real time on several clouds, Shavit explained, and scale applications up and down based on constantly changing demand. But all this had to be done without sacrificing application performance.
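To make "scale applications up and down based on constantly changing demand" concrete, here is a minimal sketch of the replica calculation documented for the Kubernetes Horizontal Pod Autoscaler: desired replicas equal the current replica count times the ratio of the observed metric to the target metric, rounded up. This is standard Kubernetes behavior and is not a description of how Cast AI's agents work; the function name, the clamping bounds and the sample numbers are illustrative assumptions.

```python
import math


def desired_replicas(
    current_replicas: int,
    current_metric: float,
    target_metric: float,
    min_replicas: int = 1,
    max_replicas: int = 1000,
    tolerance: float = 0.1,
) -> int:
    """Replica count following the documented HPA formula.

    desired = ceil(current_replicas * current_metric / target_metric)
    The bounds and the 10% tolerance mirror HPA defaults and settings;
    the parameter names here are assumptions for this sketch.
    """
    if target_metric <= 0:
        raise ValueError("target_metric must be positive")

    ratio = current_metric / target_metric

    # Skip scaling when the metric is already close enough to the target.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas

    desired = math.ceil(current_replicas * ratio)

    # Clamp to the configured range so a traffic spike cannot scale without limit.
    return max(min_replicas, min(max_replicas, desired))


if __name__ == "__main__":
    # Hypothetical numbers: 4 replicas at 90% average CPU against a 60% target.
    print(desired_replicas(4, current_metric=90, target_metric=60))
```

The `max_replicas` clamp is the part that matters for the cost side of the story: a 100X spike in the metric asks for 100X the replicas, and the upper bound is what keeps that request within budget.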
Shavit noted that before implementing Cast AI, Akamai’s DevOps team manually tuned all its Kubernetes workloads just a few times a month. Given the scale and complexity of its infrastructure, this was challenging and costly. By analyzing workloads only sporadically, the team missed any real-time optimization potential.
“Now, hundreds of Cast agents do the same tuning, except they do it every second of every day,” said Shavit.
The core APA features Akamai uses are autoscaling, in-depth Kubernetes automation with bin packing (placing workloads on as few nodes as possible), automatic selection of the most cost-efficient compute instances, workload rightsizing, Spot instance automation throughout the entire instance lifecycle and cost analytics capabilities.
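Bin packing is easiest to see in a small example. The sketch below uses first-fit decreasing, a classic textbook heuristic, to place pods on as few equally sized nodes as possible by CPU request alone. It is an illustration of the general idea only, not Cast AI's algorithm; the `Node` class, the `pack_pods` function and the pod names and sizes are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A node with a fixed CPU capacity, measured in millicores."""
    cpu_capacity: int
    cpu_used: int = 0
    pods: list = field(default_factory=list)

    def fits(self, cpu: int) -> bool:
        return self.cpu_used + cpu <= self.cpu_capacity

    def place(self, name: str, cpu: int) -> None:
        self.pods.append(name)
        self.cpu_used += cpu


def pack_pods(pod_cpu_requests: dict, node_cpu: int) -> list:
    """First-fit decreasing: largest pods first, each into the first node with room."""
    nodes = []
    # Placing big pods first leaves small pods to fill the remaining gaps.
    ordered = sorted(pod_cpu_requests.items(), key=lambda kv: kv[1], reverse=True)
    for name, cpu in ordered:
        if cpu > node_cpu:
            raise ValueError(f"pod {name!r} does not fit on any node")
        for node in nodes:
            if node.fits(cpu):
                node.place(name, cpu)
                break
        else:
            # No existing node has room, so open a new one.
            new_node = Node(cpu_capacity=node_cpu)
            new_node.place(name, cpu)
            nodes.append(new_node)
    return nodes


if __name__ == "__main__":
    # Hypothetical CPU requests in millicores.
    pods = {"api": 2000, "spark-exec-1": 3000, "spark-exec-2": 3000,
            "cache": 1000, "ingest": 2500, "metrics": 500}
    for i, node in enumerate(pack_pods(pods, node_cpu=4000), start=1):
        print(f"node {i}: {node.pods} ({node.cpu_used}/{node.cpu_capacity}m)")
```

A real scheduler has to weigh memory, affinity rules and disruption budgets as well, which is part of why doing this by hand a few times a month leaves savings on the table.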
“We got insight into cost analytics two minutes into the integration, which is something we’d never seen before,” said Shavit. “Once active agents were deployed, the optimization kicked in automatically, and the savings started to come in.”
Spot instances, which let enterprises access unused cloud capacity at discounted prices, obviously made business sense, but they turned out to be complicated due to Akamai’s complex workloads, particularly Apache Spark, Shavit noted. This meant the team needed to either overengineer workloads or put more working hands on them, which turned out to be financially counterintuitive.
With Cast AI, they were able to use Spot instances on Spark with “zero investment” from the engineering team or operations. The value of Spot instances was “super clear”; they just needed to find the right tool to be able to use them. This was one of the reasons they moved forward with Cast, Shavit noted.
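The trade-off behind Spot instances can be sketched as a simple selection rule: among the offers that satisfy a workload's CPU and memory needs, take the cheapest one, and fall back to on-demand capacity when Spot is unavailable or not allowed. This is a simplified illustration under stated assumptions, not Cast AI's selection logic; the `Offer` type, the `cheapest_offer` function, the instance names and the prices are invented for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Offer:
    """One purchasable capacity option (hypothetical shape for this sketch)."""
    instance_type: str
    vcpus: int
    memory_gib: int
    hourly_price: float
    is_spot: bool
    available: bool = True


def cheapest_offer(
    offers: list,
    vcpus: int,
    memory_gib: int,
    allow_spot: bool = True,
) -> Optional[Offer]:
    """Return the lowest-priced available offer that meets the requirements.

    Returns None when nothing fits, so the caller decides how to handle it.
    """
    candidates = [
        o for o in offers
        if o.available
        and o.vcpus >= vcpus
        and o.memory_gib >= memory_gib
        and (allow_spot or not o.is_spot)
    ]
    return min(candidates, key=lambda o: o.hourly_price, default=None)


if __name__ == "__main__":
    # Made-up catalog; real prices vary by provider, region and time.
    catalog = [
        Offer("general-4x16", 4, 16, 0.20, is_spot=False),
        Offer("general-4x16", 4, 16, 0.06, is_spot=True),
        Offer("general-8x32", 8, 32, 0.12, is_spot=True),
    ]
    print(cheapest_offer(catalog, vcpus=4, memory_gib=16))
    print(cheapest_offer(catalog, vcpus=4, memory_gib=16, allow_spot=False))
```

The hard part Shavit describes is not this lookup but handling interruptions when the provider reclaims Spot capacity mid-job, which is where lifecycle automation comes in.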
While saving 2X or 3X on their cloud bill is great, Shavit pointed out that automation without manual intervention is “priceless.” It has resulted in “massive” time savings.
Before implementing Cast AI, his team was “constantly moving around knobs and switches” to ensure that their production environments and customers were up to par with the service they needed to invest in.
“Hands down the biggest benefit has been the fact that we don’t need to manage our infrastructure anymore,” said Shavit. “The team of Cast’s agents is now doing this for us. That has freed our team up to focus on what matters most: Releasing features faster to our customers.”
source: https://venturebeat.com/data-infrastructure/cutting-cloud-waste-at-scale-akamai-saves-70-using-ai-agents-orchestrated-by-kubernetes/


