Skip to main content

Establishing Sustainable Right-Sizing Practices for Cloud Resources

A FinOps-aligned, cloud-agnostic framework for balancing financial efficiency and operational resilience

Authors: Ismael Rivera Álvarez, Bernd Kaponig, Michiel de Vos, Tim Rab Date: January 2026


Summary

This paper presents a FinOps-aligned, cloud-agnostic framework for establishing a sustainable practice for right-sizing cloud resources.

It is intended to help organizations:

  • Systematically identify and remove wasteful over-provisioning without compromising reliability.
  • Move from ad-hoc tuning to a repeatable, measurable capability embedded in day-to-day engineering work.
  • Integrate right-sizing with FinOps maturity, personas, and governance, rather than treating it as a one-off cost-cutting exercise.
  • Align right-sizing activities across compute, storage, databases, containers/Kubernetes, and network in multi-cloud environments.

After reading this paper, a practitioner should be able to:

  • Design and run a right-sizing program mapped to Crawl / Walk / Run maturity.
  • Identify the highest-value optimization levers across major cloud resource types.
  • Define KPIs, guardrails, and workflows that support sustainable optimization instead of one-time cost reductions.
  • Connect right-sizing decisions to FinOps capabilities such as Resource Utilization & Efficiency, Cost Allocation, Forecasting, and Workload Management & Automation.

Who Should Read This Paper

This paper is written for:

  • FinOps practitioners and teams
    Those responsible for cloud cost visibility, optimization, and cross-functional collaboration.
  • Engineering and platform teams
    Developers, SREs, platform engineers, architects, and operations teams who design, run, and scale services.
  • Product and business owners
    Owners of services or portfolios who are accountable for service quality and cost outcomes.
  • Finance and FP&A
    Teams that translate right-sizing and optimization outcomes into budgets, forecasts, and business decisions.
  • Technology and business leadership
    Leaders sponsoring FinOps as a practice who need a structured way to align cost efficiency with strategic priorities.

Prerequisites

Readers will get the most value out of this paper if they have:

  • A basic understanding of cloud infrastructure and services (IaaS, PaaS, managed services).
  • Familiarity with the FinOps Framework, particularly:
    • Resource Utilization & Efficiency
    • Cost Allocation & Showback
    • Workload Management & Automation
    • Forecasting & Budgeting
  • Access to at least a minimal set of:
    • Cloud usage and cost data
    • Monitoring/observability metrics
    • Tagging or account structure to map resources to products, teams, or cost centers.

1. Introduction

1.1 Background and Significance

The widespread adoption of cloud computing has fundamentally changed how organizations provision, consume, and pay for infrastructure. Instead of fixed capacity in centralized data centers, teams now work with effectively infinite, elastic capacity available on demand.

While this has unlocked enormous flexibility and speed, it has also introduced new forms of waste, fragmentation, and complexity:

  • Over-provisioned instances, clusters, and databases sized for "worst case" scenarios.
  • Idle or zombie resources that are no longer needed but continue to incur cost.
  • Overly conservative architectures that trade large amounts of cost for small theoretical increases in resilience.
  • Underused high-performance storage, networking, and premium features that deliver little incremental value.

How to Think About Right-Sizing

Traditionally, organisations view right-sizing as a reactive "spring cleaning" event. A periodic purge of waste triggered by budget overruns.

In a FinOps context, you should instead think of right-sizing as continuous inventory velocity. Just as a retailer does not want inventory sitting on a shelf gathering dust, a specialised engineering organisation does not want compute capacity sitting idle. Right-sizing is the active discipline of ensuring every unit of infrastructure rented is converting directly into business value. It balances the "Iron Triangle" of Cost, Speed, and Quality.

Right-sizing is the discipline of aligning resource capacity with actual workload needs over time. When systematically applied, it becomes one of the most powerful FinOps levers to:

  • Unlock "hidden" capacity and budget without asking for new investment.
  • Enhance sustainability by minimising unnecessary infrastructure and energy consumption.
  • Create tangible feedback loops between engineering and finance.

1.2 Objectives and Scope

This paper focuses on:

  • Establishing sustainable right-sizing practices for:
    • Compute (VMs, serverless, managed compute)
    • Storage (block, object, file)
    • Databases (managed and self-hosted)
    • Containers and Kubernetes
    • Network components (load balancers, gateways, connectivity)
  • Integrating right-sizing with:
    • FinOps maturity (Crawl / Walk / Run)
    • Key personas and their responsibilities
    • KPIs and indicators of success
    • Tooling and automation strategies
  • Providing practitioner-oriented examples, patterns, and visuals to support implementation.

We focus on cloud-agnostic concepts and practices that can be applied across providers (AWS, Azure, GCP, and others), while acknowledging provider-specific details where useful.


2. Right-Sizing in the Context of FinOps

2.1 FinOps Foundations: Resource Utilisation & Efficiency

The FinOps Framework defines Resource Utilisation & Efficiency as the capability to ensure cloud resources are configured and used optimally to deliver business value.

Core principles relevant to right-sizing include:

  • Visibility
    Having a consistent, trusted view of what resources exist, what they cost, and how they are used.
  • Measurement
    Translating utilization metrics (CPU, memory, storage, IOPS, network, requests, etc.) into meaningful signals of efficiency or waste.
  • Optimization
    Taking targeted actions to improve the relationship between cost, performance, and risk, not just one of them.
  • Iteration
    Treating right-sizing as an ongoing practice, not a one-time clean-up.

In practice, right-sizing integrates tightly with the Inform → Optimize → Operate FinOps lifecycle:

Right-Sizing Lifecycle Loop

2.2 The Evolution of Right-Sizing

Most organizations follow a similar evolutionary path:

1

Ad-hoc clean-ups

Someone notices an expensive instance or cluster and manually resizes or shuts it down.

2

Periodic, manual reviews

Teams begin to analyze dashboards monthly or quarterly and apply one-off resizing or scheduling.

3

Centralized FinOps-led analysis

A FinOps team starts generating reports and recommendations across accounts or subscriptions, often using provider-native or FinOps tooling.

4

Standardized governance and processes

Right-sizing is built into change management, design reviews, and operational runbooks.

5

Automated, policy-driven optimization

Actions (e.g., downsize, terminate, schedule off) are executed automatically within clearly defined guardrails.

This paper aims to help practitioners accelerate that journey while maintaining trust, safety, and reliability.

3. Personas and Collaboration

Right-sizing is not a purely technical exercise; it is a cross-functional practice.

3.1 Personas Involved

PersonaPrimary Responsibilities in Right-Sizing
FinOps Practitioner / TeamRun analyses, define policies, coordinate across teams, quantify savings/impact.
Engineering / Platform / SREDesign and operate services, apply changes, validate performance and reliability.
Product / Service OwnerAccept trade-offs, prioritize optimization vs feature work, own business value.
Finance / FP&AIntegrate optimization into budgets and forecasts, validate financial outcomes.
Leadership (Tech & Business)Sponsor the practice, set expectations, resolve conflicts, and align incentives.
Supporting Personas (ITAM, Security)Ensure compliance, license alignment, and appropriate guardrails.

3.2 Collaboration Patterns

Successful right-sizing practices share the following collaboration traits:

  • Shared language:
    Teams agree on what "right-sized" means per workload type (e.g., target utilization ranges, SLOs).
  • Transparent data:
    Cost and performance data is shared broadly, with self-service access.
  • Joint decision-making:
    Engineers, product owners, and FinOps jointly decide on significant changes to critical workloads.
  • Feedback loops:
    Changes are reviewed post-implementation to confirm impact on cost and reliability, and lessons are shared.

4. Maturity Model for Right-Sizing

Right-sizing maturity can be expressed as Crawl / Walk / Run, aligned with FinOps maturity:

4.1 Maturity Overview

StageCharacteristicsTypical Behaviors & Activities
CrawlReactive, ad-hoc, limited visibilityManual clean-ups, isolated efforts, limited reporting.
WalkStructured, repeatable, cross-team awarenessScheduled reviews, defined playbooks, partial automation.
RunIntegrated, policy-driven, automatedContinuous optimization, embedded in SDLC (Software Development Lifecycle) and operations.

4.2 Detailed Maturity Characteristics

Crawl – Initial & Defined

Focus:
Basic visibility and low-risk savings.

Activities:

  • Implementing cost and utilization dashboards per account/subscription.
  • Tagging or grouping resources by owner, environment, or product.
  • Manual identification and clean-up of:
    • Idle instances
    • Unattached storage volumes
    • Zombie load balancers and IPs

Tooling:
Native cloud cost explorers, basic custom reporting, spreadsheets.

Walk – Rolled Out & Developed

Focus:
Establish right-sizing as a recurring practice.

Activities:

  • Monthly or bi-weekly right-sizing review meetings for top spenders.
  • Standard operating procedures (SOPs) for resizing, scheduling, and clean-ups.
  • Integrating recommendations from:
    • Cloud native recommendation tools
    • Third-party optimization platforms
  • Using right-sizing signals as input into commitment decisions (RIs, savings plans).

Tooling:
Enhanced dashboards, anomaly detection, and recommendation engines.

Run – Standardized & Sustained

Focus:
Automation, guardrails, and continuous improvement.

Activities:

  • Policy-based automation for:
    • Instance resizing and scheduling
    • Storage lifecycle transitions
    • Container and cluster auto-scaling
  • Embedding right-sizing checks into:
    • CI/CD pipelines
    • Architectural review processes
    • Platform APIs ("paved roads")
  • Systematically tracking KPIs, SLOs, and incident impact.

Tooling:
FinOps platforms, infrastructure-as-code, event-driven automation, ML-powered analytics.

5. Foundational Practices

5.1 Data and Observability

A right-sizing practice depends on consistent, reliable data:

  • Cost and usage at a minimally granular level (per resource or service).
  • Utilization metrics relevant to each resource type (CPU, memory, disk, network, IOPS, requests).
  • Performance and reliability metrics (latency, error rate, availability, saturation).

Best practice is to provide a single pane of glass where teams can see:

Data Observability Dashboard

User story: for this "service/workload", in this "environment", here is its utilization, performance, and cost over time.

5.2 Automation and Decision Support

Manual right-sizing does not scale. Automation should be introduced in stages:

1

Automated Data Collection

Centralized ingestion of metrics from cloud providers, APM tools, and log systems.

2

Recommendation Engines

Rule-based or ML-based engines that suggest:

  • Down/upsizing
  • Alternative instance types
  • Storage tier transitions
  • Schedule optimizations
3

Policy-Driven Execution

Declarative policies that define:

  • Conditions under which actions are automatically taken.
  • Constraints (e.g., only for non-prod, only certain tags, limits on percentage change).
4

Feedback and Learning

Capture the impact of changes and feed that back into rules, models, and governance.

Right-Sizing Optimization Loop

5.3 Balancing Cost, Reliability, and Performance

Right-sizing is not purely about cutting cost. Sustainable practices:

  • Start from business outcomes and SLOs, not cost targets alone.
  • Use workload profiles to understand:
    • Baseline load
    • Burst characteristics
    • Seasonality
  • Incorporate testing and validation:
    • Load tests, stress tests, chaos experiments.
  • Design for graceful degradation and elasticity:
    • Allow services to scale or degrade predictably instead of failing abruptly.
  • Maintain rollback mechanisms:
    • Easy reversals of changes if the impact is negative.

6. Right-Sizing Opportunities by Layer

This section highlights common right-sizing levers across five layers. It is not exhaustive, but it focuses on high-impact opportunities that frequently appear in FinOps work.

6.1 Compute Resources (VMs, Managed Compute, Serverless)

6.1.1 Key Levers

Focus AreaWhat to Look ForTypical Actions
Instance size vs utilizationCPU/Memory < 20–30% over long periodsDownsize to smaller instance, split workload, or consolidate.
Instance family alignmentMismatch between workload profile & instance familyMove to compute, memory, or general-purpose family as needed.
Over-provisioned non-productionIdle or low-use dev/test/stage environmentsImplement schedules, suspend or terminate unused resources.
Underused autoscaling groupsGroups always at max size or rarely scaling downAdjust scaling policies, cooldowns, and metrics.
Low-density workloadsMany small instances with low utilizationConsolidate workloads onto fewer instances.
Over-reliance on on-demandStable, predictable workloads fully on-demandCombine with RIs/savings plans; right-size then commit.
Misused burstable instancesBurstable instances constantly at or above baselineMove to non-burstable types; adjust workloads or architecture.
"Always-on" utility servicesJump boxes, bastion hosts, build agents running 24/7Schedule or replace with on-demand / serverless alternatives.

6.1.2 Practical Considerations

  • Right-sizing compute often yields fast, visible savings, but can also introduce risk if:
    • Performance patterns are not well understood.
    • SLOs are not clearly defined and monitored.
  • Start with non-production and low-risk workloads to build confidence and refine processes.

6.2 Storage Resources (Block, Object, File)

6.2.1 Key Levers

Focus AreaWhat to Look ForTypical Actions
Tiering & lifecycle policiesLarge data volumes rarely accessedApply lifecycle rules; move cold data to cheaper tiers.
Volume over-provisioningVolumes significantly larger than used capacityResize volumes where supported or migrate to appropriately sized.
Unused or orphaned storageVolumes not attached, old snapshots, stale bucketsIdentify and delete after validation.
Redundant or duplicated dataMultiple copies of similar datasetsDeduplicate, compress, or consolidate datasets.
Overkill performance settingsHigh IOPS/throughput volumes with low I/O usageDowngrade to cheaper storage tiers or lower performance configs.
Excessive snapshot/backup frequencyVery frequent snapshots with no clear retention policyRationalize retention, frequency, and retention tier.
Inefficient log and telemetry storageLogs retained indefinitely in hot storageMove to cold/archive tiers; shorten retention where appropriate.

6.2.2 Practical Considerations

  • Changes to storage may affect RPO (Recovery Point Objective) / RTO (Recovery Time Objective), legal retention requirements, or auditability.
  • Involve security, compliance, and data owners when adjusting policies.
  • Track storage costs by lifecycle stage (hot, cool, archive) to show benefits.

6.3 Database Services

6.3.1 Key Levers

Focus AreaWhat to Look ForTypical Actions
Instance size & classSustained low CPU/Memory, I/O well below limitsDownsize DB instance; adjust storage throughput provisioned.
Replica countsMany replicas with low read trafficReduce number of replicas; consider read scaling patterns.
Storage allocation & auto-scalingLarge allocated storage vs actual usageReduce allocated storage (where supported); tune autoscaling rules.
Backup & snapshot strategyExcessive backup copies and long retentionRationalize retention; move older backups to cheaper storage tiers.
Query & index inefficienciesConstant performance complaints despite large DBsTune queries and indexes before scaling up hardware.
Caching and connection poolingHigh DB load from repeated reads or poor connection useIntroduce caching layers; fix connection pooling configuration.

6.3.2 Practical Considerations

  • Right-sizing databases is high-stakes:
    • Latency, error rates, and availability may be highly sensitive to capacity.
  • Begin with read replicas, dev/stage databases, or non-critical workloads.
  • Use changes as opportunities to pay down technical debt:
    • Index cleanup
    • Query optimization
    • Partitioning and archiving strategies

6.4 Container / Kubernetes Workloads

6.4.1 Key Levers

Focus AreaWhat to Look ForTypical Actions
Pod CPU requests/limitsRequests much higher than actual usageLower CPU requests; adjust limits with safe headroom.
Pod memory requests/limitsHigh requests with no OOM (Out Of Memory) events and low usageReduce memory requests; tune GC (Garbage Collection), caching, and app memory settings.
Node types and sizesHigh fragmentation or low node utilizationConsolidate to fewer node sizes; choose node types that fit workloads.
Cluster auto-scaling configurationClusters rarely scaling down or stuck at high capacityAdjust scale-up/down policies, timeouts, and metrics.
Namespace quotas and limitsNo quotas; "noisy neighbor" issuesImplement quotas; align to team/workload budgets and expectations.
Image and resource garbage collectionLarge numbers of unused images, pods, or PVCs (Persistent Volume Claims)Implement regular cleanup jobs and policies.
Use of spot/preemptible nodesSuitable workloads running only on on-demand nodesMove batch/CI/stateless workloads to spot/preemptible nodes.
Logging verbosityExcessive log volume consuming CPU, storage, networkReduce log level; route certain logs to cheaper destinations.

6.4.2 Practical Considerations

  • Containers often hide over-provisioning:
    • Pods look small, but aggregate resource requests per node or cluster may be high.
  • Make cluster-level utilization visible to teams:
    • Map cost and metrics back to namespaces, deployments, and services.

6.5 Network and Connectivity

6.5.1 Key Levers

Focus AreaWhat to Look ForTypical Actions
Load balancer and gateway countMany small or idle load balancersConsolidate; remove unused; right-size throughput tiers.
Data transfer patternsHigh cross-zone / cross-region trafficRe-architect to minimize cross-boundary traffic where feasible.
Idle or oversized VPN/direct connectLow utilization on expensive dedicated connectionsDowngrade, consolidate, or schedule connections based on demand.
NAT gateways and public IPsMany NAT gateways/public IPs with low or no usageConsolidate NATs; remove unused public IPs; consider private access.
CDN and caching usageBackends serving all traffic; low cache hit ratioIntroduce or tune CDNs; adjust cache policies.
Network appliancesFirewall, IDS/IPS, WAF appliances with low throughputRight-size appliance tiers; autoscale where possible.
DNS and routing configurationsComplex, redundant routing setupsSimplify DNS and routing; tune TTL and caching.

6.5.2 Practical Considerations

  • Network costs can be harder to attribute; invest in clear cost allocation.
  • Work closely with security and networking teams (often separate from application teams).

7. Operating Model and Governance

7.1 Core Processes

A sustainable right-sizing practice typically includes these recurring activities:

1

Discovery & Analysis

  • Identify top N accounts/products by spend.
  • Analyze utilization and performance for their key workloads.
  • Generate a prioritized list of right-sizing candidates.
2

Review & Decision

  • Conduct regular right-sizing review sessions with Engineering, Product owners, and FinOps.
  • Decide which changes to implement, defer, or reject.
3

Implementation & Validation

  • Apply changes via Infrastructure-as-code, Orchestration tools, or Cloud consoles (ideally only for emergencies or exceptions).
  • Validate Performance, Reliability, and Cost impact.
4

Reporting & Feedback

  • Report savings and impact to Teams and Leadership.
  • Capture lessons learned and feed back into policies and automation.

FinOps Right-Sizing Collaboration

7.2 Governance and Guardrails

Key governance mechanisms:

  • Policies that define:
    • Which environments can be automatically right-sized.
    • Minimum notice or review process for production changes.
    • Required performance validation steps.
  • Exception handling:
    • Documented process for workloads that must deviate from standard policies (e.g., due to regulatory or business constraints).
  • Alignment with other frameworks:
    • ITIL / ITSM change management
    • SRE practices (error budgets, SLOs)
    • Security and compliance controls

8. Indicators of Success (KPIs)

A right-sizing practice should be measured and improved over time. Example KPIs:

8.1 Utilization and Waste

KPIDescriptionExample Target
% of compute spend right-sized or reviewedPortion of compute spend under active governance> 70% of compute spend in 12 months
Average CPU utilization of key workloadsCPU utilization in primary hours for right-sized services40–70% for steady workloads
% of storage in appropriate tiersHot vs cool vs archive mix aligned with access patterns> 60% of rarely accessed data in cheap tiers

8.2 Financial Outcomes

KPIDescriptionExample Target
Monthly savings from right-sizingGross or net savings per month from changes5–15% of addressable spend
Savings as % of total cloud spendContribution of right-sizing to overall FinOps outcomes3–10% of total cloud spend over a year
Avoided cost through right-sizingEstimated cost avoided vs "no optimization" baselineIncrease year over year

8.3 Process & Adoption

KPIDescriptionExample Target
Regularity of right-sizing review sessionsFrequency of sessions per key domain/productMonthly or bi-weekly
% of changes implemented via automationPortion of changes executed automatically> 50% for non-prod, > 25% for prod
Average lead time from recommendation to changeTime from identification to production implementation< 4 weeks for prioritized actions

8.4 Reliability & Risk

KPIDescriptionExample Target
Incidents caused by right-sizingNumber of incidents where right-sizing contributedZero or near-zero
SLO compliance before vs after changeComparative SLO adherence for right-sized workloadsNo degradation

9. Exceptions and Considerations

Not all workloads should be aggressively right-sized. Consider the following exceptions:

  • Highly variable or unpredictable workloads
    Where traffic patterns or data volumes are inherently unpredictable and hard to forecast.
  • Mission-critical, latency-sensitive services
    Where the cost of an incident vastly outweighs possible savings.
  • Regulatory or contractual obligations
    Where over-provisioning is explicitly required (e.g., for certain regulated industries).
  • Third-party or legacy systems
    Where limited observability or control prevents safe changes.

In these cases, right-sizing might focus more on:

  • Observability improvements first.
  • Architectural changes to enable elasticity.
  • Non-critical tiers (dev/test, staging, analytics) before touching production.

The following checklist can be used as a practical starting point.

10.1 Quick Wins (First 3–6 Months)

  • Identify top 5–10 products or accounts by cloud spend.
  • Implement basic cost and utilization dashboards for these areas.
  • Clean up obvious waste:
    • Idle instances
    • Unattached storage
    • Zombie load balancers, IPs, NAT gateways
  • Introduce schedules for dev/test environments.
  • Hold your first right-sizing review session with engineering and product.

10.2 Building the Practice

  • Document right-sizing guidelines:
    • Target utilization ranges per workload type.
    • Standard instance/storage types to prefer.
  • Define roles and responsibilities per persona.
  • Introduce structured recommendation pipelines:
    • Right-sizing candidate reports.
    • Review workflows.
  • Start tracking right-sizing KPIs and savings.

10.3 Moving Toward Automation

  • Implement policy-based automation for:
    • Dev/test shutdown schedules.
    • Non-critical instance downsizing.
    • Storage lifecycle transitions.
  • Integrate right-sizing checks into:
    • CI/CD pipelines
    • Design / architecture reviews
  • Establish guardrails and exception processes.
  • Regularly communicate wins (both savings and reliability outcomes) to maintain support.

11. Future Directions

As cloud and FinOps practices evolve, right-sizing will continue to intersect with:

  • AI-driven optimization
    More sophisticated ML models recommending configurations, patterns, and architectural changes.
  • Sustainability and ESG reporting
    Using right-sizing outcomes to show CO₂ and energy reductions alongside cost savings.
  • Advanced workload management
    Integrated views of:
    • Cost
    • Performance
    • Reliability
    • Sustainability

This paper is intended as a foundation upon which further playbooks, case studies, and domain-specific papers can build—for example, right-sizing for AI/ML training clusters, data platforms, or SaaS multi-tenant architectures.

Right-sizing connects directly to several FinOps capabilities, including:

  • Resource Utilization & Efficiency
    The primary capability covered in this paper.
  • Workload Management & Automation
    Embedding right-sizing into orchestration, CI/CD, and platform tooling.
  • Data Analysis & Showback
    Providing transparent cost and utilization data to teams.
  • Managing Commitment-based Discounts
    Using right-sized baselines to inform commitment decisions.
  • Forecasting
    Incorporating right-sized baselines into forward-looking cost models.

Future work can explore detailed playbooks and case studies that build on this paper, focusing on specific providers, industries, or workload patterns.


References

FinOps Foundation. (n.d.). FinOps Framework Overview (updated 2025 with Scopes addition). https://www.finops.org/framework
(Core operating model, including Phases, Personas, Capabilities such as Resource Utilization & Efficiency, Workload Management & Automation, and Cloud Sustainability.)

FinOps Foundation. (2025). State of FinOps 2025. https://data.finops.org/library
(Annual survey insights on FinOps maturity, adoption, waste trends, and emerging practices like AI and sustainability.)


FinOps Paper Metadata (Internal Use Only)

KeyValue
Size of OrganizationAny
Size of Cloud UseAny
Organizational ComplexityAny
Cloud Use ComplexityAny
Maturity of Cloud AdoptionAny
Maturity of FinOps practiceAny
FinOps team modelAny
IndustryAny
CapabilityResource Utilization & Efficiency, Workload Management & Automation, Cost Allocation
Persona InvolvedEngineering, Finance, Product, Leadership
Cloud UsedAny
Level of Technical DetailMedium
FinOps Tooling UsedN/A