Cutting cloud waste at scale: Akamai saves 70% using AI agents orchestrated by Kubernetes




Particularly in this dawning era of generative AI, cloud costs are at an all-time high. But that's not merely because enterprises are using more compute; they're also not using it efficiently. In fact, just this year, enterprises are expected to waste $44.5 billion on unnecessary cloud spending.

This is an amplified problem for Akamai Technologies: The company has a large and complex cloud infrastructure on multiple clouds, not to mention numerous strict security requirements.

To resolve this, the cybersecurity and content delivery provider turned to the Kubernetes automation platform Cast AI, whose AI agents help optimize cost, security and speed across cloud environments.

Ultimately, the platform helped Akamai cut between 40% and 70% of cloud costs, depending on workload.

"We needed a continuous way to optimize our infrastructure and reduce our cloud costs without sacrificing performance," Dekel Shavit, senior director of cloud engineering at Akamai, told VentureBeat. "We're the ones processing security events. Delay is not an option. If we're not able to respond to a security attack in real time, we have failed."

Specialized agents that monitor, analyze and act

Kubernetes manages the infrastructure that runs applications, making it easier to deploy, scale and manage them, particularly in cloud-native and microservices architectures.

Cast AI has integrated into the Kubernetes ecosystem to help customers scale their clusters and workloads, select the best infrastructure and manage compute lifecycles, explained founder and CEO Laurent Gil. Its core platform is Application Performance Automation (APA), which operates through a team of specialized agents that continuously monitor, analyze and take action to improve application performance, security, efficiency and cost. Companies provision only the compute they need from AWS, Microsoft, Google or others.

APA is powered by several machine learning (ML) models with reinforcement learning (RL) based on historical data and learned patterns, enhanced by an observability stack and heuristics. It is coupled with infrastructure-as-code (IaC) tools on several clouds, making it a completely automated platform.

Gil explained that APA was built on the tenet that observability is just a starting point; as he called it, observability is "the foundation, not the goal." Cast AI also supports incremental adoption, so customers don't have to rip out and replace; they can integrate into existing tools and workflows. Further, nothing ever leaves customer infrastructure; all analysis and actions occur within their dedicated Kubernetes clusters, providing more security and control.

Gil also emphasized the importance of human-centricity. "Automation complements human decision-making," he said, with APA maintaining human-in-the-middle workflows.

Akamai's unique challenges

Shavit explained that Akamai's large and complex cloud infrastructure powers content delivery network (CDN) and cybersecurity services delivered to "some of the world's most demanding customers and industries" while complying with strict service level agreements (SLAs) and performance requirements.

He noted that for some of the services it consumes, Akamai is probably the vendor's largest customer, adding that his team has done "tons of core engineering and reengineering" with its hyperscaler to support its needs.

Further, Akamai serves customers of various sizes and industries, including large financial institutions and credit card companies. The company's services are directly related to its customers' security posture.

Ultimately, Akamai needed to balance all this complexity with cost. Shavit noted that real-life attacks on customers could drive capacity 100X or 1,000X on specific components of its infrastructure. But "scaling our cloud capacity by 1,000X in advance just isn't financially feasible," he said.

His team considered optimizing on the code side, but the inherent complexity of their business model required focusing on the core infrastructure itself.

Automatically optimizing the entire Kubernetes infrastructure

What Akamai really needed was a Kubernetes automation platform that could optimize the costs of running its entire core infrastructure in real time on several clouds, Shavit explained, and scale applications up and down based on constantly changing demand. But all this had to be done without sacrificing application performance.

Shavit noted that before implementing Cast AI, Akamai's DevOps team manually tuned all its Kubernetes workloads just a few times a month. Given the scale and complexity of its infrastructure, this was challenging and costly. And by analyzing workloads only sporadically, the team missed any real-time optimization potential.

"Now, hundreds of Cast agents do the same tuning, except they do it every second of every day," said Shavit.

The core APA features Akamai uses are autoscaling, in-depth Kubernetes automation with bin packing (minimizing the number of bins used), automatic selection of the most cost-efficient compute instances, workload rightsizing, Spot instance automation throughout the entire instance lifecycle and cost analytics capabilities.
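To make the bin packing idea concrete, here is a minimal Python sketch of the classic first-fit decreasing heuristic, which places pods on as few nodes as it can. It is an illustration only and is not Cast AI's algorithm, which the article does not describe. The names `Node` and `first_fit_decreasing`, the single CPU dimension and the millicore figures are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """A hypothetical node with a fixed CPU capacity in millicores."""
    cpu_capacity: int
    pods: List[int] = field(default_factory=list)

    @property
    def cpu_used(self) -> int:
        return sum(self.pods)


def first_fit_decreasing(pod_cpu_requests: List[int], node_cpu_capacity: int) -> List[Node]:
    """Place pods on as few identical nodes as the heuristic can manage.

    Pods are sorted from largest to smallest request, and each pod goes on
    the first node that still has room; a new node is opened only when no
    existing node fits. This is a heuristic, so the result is not
    guaranteed to be optimal.
    """
    nodes: List[Node] = []
    for request in sorted(pod_cpu_requests, reverse=True):
        if request > node_cpu_capacity:
            raise ValueError(f"pod requesting {request}m cannot fit on any node")
        for node in nodes:
            if node.cpu_used + request <= node.cpu_capacity:
                node.pods.append(request)
                break
        else:
            # No existing node had room, so open a new one.
            nodes.append(Node(cpu_capacity=node_cpu_capacity, pods=[request]))
    return nodes


if __name__ == "__main__":
    # Six pods (CPU requests in millicores) packed onto 4000m nodes.
    packed = first_fit_decreasing([2000, 1000, 3000, 2000, 1000, 3000], 4000)
    for i, node in enumerate(packed, start=1):
        print(f"node {i}: pods={node.pods} used={node.cpu_used}m")
```

A production scheduler would also have to weigh memory, affinity rules, disruption budgets and instance pricing, none of which this single-dimension sketch models.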

"We got insight into cost analytics two minutes into the integration, which is something we'd never seen before," said Shavit. "Once active agents were deployed, the optimization kicked in automatically, and the savings started to come in."

Spot instances, where enterprises can access unused cloud capacity at discounted prices, obviously made business sense, but they turned out to be complicated due to Akamai's complex workloads, particularly Apache Spark, Shavit noted. This meant the team needed to either overengineer workloads or put more working hands on them, which turned out to be financially counterintuitive.

With Cast AI, they were able to use spot instances on Spark with "zero investment" from the engineering team or operations. The value of spot instances was "super clear"; they just needed to find the right tool to be able to use them. This was one of the reasons they moved forward with Cast, Shavit noted.

While saving 2X or 3X on their cloud bill is great, Shavit pointed out that automation without manual intervention is "priceless." It has resulted in "massive" time savings.

Before implementing Cast AI, his team was "constantly moving around knobs and switches" to ensure that their production environments and customers got the level of service they needed.

"Hands down the biggest benefit has been the fact that we don't need to manage our infrastructure anymore," said Shavit. "The team of Cast's agents is now doing this for us. That has freed our team up to focus on what matters most: Releasing features faster to our customers."


source: https://venturebeat.com/data-infrastructure/cutting-cloud-waste-at-scale-akamai-saves-70-using-ai-agents-orchestrated-by-kubernetes/
