Why 80% of Your Data Should Never Hit the Cloud
Let's be honest: your data pipelines are likely costing you more than they should. Not just in dollars, even though the cloud bills are certainly painful enough. They are costing you in operational drag, architectural fragility, and missed opportunities. There is a silent inefficiency in how most organizations handle data in distributed systems today: a default pattern that forces you to ship and store vast quantities of low-value information simply because the tooling offers no alternative. At Expanso, we have seen it across several industries:
- Security teams drowning in terabytes of logs they cannot analyze.
- Observability platforms generating eye-watering invoices for metrics that are rarely consulted.
- Data lakes swelling with redundant, noisy, or outright useless information.
If you are an engineer facing pressure for cloud cost optimization while needing to maintain system visibility, this scenario will resonate. You are paying for data egress, ingestion, storage, and query costs for a significant portion of data that provides minimal or zero ROI. It is time to recognize that this model is broken and transition towards a smarter approach to data handling: Compute-Over-Data.
This article covers the following:
- The escalating costs of “ship everything” pipelines
- Why most of this data should not be in hot storage (or cloud at all)
- The real reason teams ship everything
- What a smarter model looks like
- The payoff
- The solution: Bacalhau
Let’s dive in!
The Escalating Costs of Inefficient "Ship Everything" Data Pipelines
The traditional approach to analyzing distributed data is to centralize it: pull all data from every node into a single location, typically a data warehouse, an observability platform, or a log-aggregation service, before performing any analysis. This approach triggers a cascading series of costs. Every single byte of data generated at the edge incurs charges that inflate your infrastructure spending, such as:
- Data egress costs: Incurred when moving data out of its source environment.
- Data ingestion costs: Fees charged by your observability platform, log aggregation service, or data warehouse to receive the incoming data.
- Data storage costs: Ongoing expenses for retaining the data, frequently stored in expensive hot or warm storage tiers designed for rapid access.
- Query costs: Queries are not free either. Every query consumes compute, and when query times are no longer acceptable, the cost typically spreads to more hardware or additional computing capacity.
Industry analyses indicate that the volume of data that is never queried or acted upon after being stored is growing at a 19.2% compound annual growth rate (CAGR) between 2018 and 2022, projected to rise to a 21.3% CAGR between 2023 and 2033. In other words, organizations acquire data and store it, never use roughly 20% of it, and still bear the full cost of every byte.
This represents more than just a financial waste. It is an operational burden. When you move large datasets to a central location, you have to face challenges like:
- High data transfer costs: Moving large volumes of data across networks (especially to the cloud) incurs significant bandwidth and storage costs.
- Latency and time: Transferring terabytes or petabytes of data can take hours or days, delaying insights.
- Engineering complexity: Integrating heterogeneous data sources and maintaining data pipelines is resource-intensive.
- Security and compliance risks: Moving sensitive data increases exposure to regulatory and privacy challenges.
Weigh those challenges against the fact that roughly 20% of your data will never be used, and it becomes clear that something has to change.
Why Raw Data Overwhelms Hot Storage and Cloud Infrastructure
Hot storage tiers available in cloud data warehouses or observability platforms provide rapid data access, but they come at a premium price. For example, currently Azure Blob Storage charges $0.018 per GB per month for the first 50 terabytes.
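As a rough back-of-the-envelope example, at that rate keeping 50 TB in such a tier costs about 50,000 GB × $0.018 ≈ $900 per month for storage alone, before a single dollar of egress, ingestion, or query spend, and the meter runs again every month the data sits there unread.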
Cloud object storage is a more economical alternative, but it is still an inefficient repository for data that will never be analyzed. Say it out loud: if data is known to be redundant, or needs transformation before it provides any value, forcing it through the entire costly "ship-and-store" pipeline makes neither logical nor financial sense.
However, knowing which data is redundant requires at least some preprocessing, and sometimes even analytics, before aggregation.
But what if you do not have the computational power to do that at the source? Ultimately, that is exactly why you are shipping everything to a central location, and it is the heart of the issue.
The Root Cause: Why Teams Are Forced to Ship All Data
So, why does this inefficient pattern of shipping all data persist so widely? The primary reason is that most existing data pipelines and associated tooling provide no viable alternative. The current landscape of data tools operates under the assumption of reliable, high-bandwidth network connections, virtually unlimited storage capacity, and boundless budgets.
Consider these common tools and their typical roles in this pattern:
- Log shippers: They collect logs locally, but their primary function is forwarding, so they send raw or only lightly processed data upstream for further analysis.
- Message queues: Excellent for buffering high-volume streams. They act as central conduits, meaning data passes through them untransformed.
- Observability platforms: Powerful analysis tools, but their value scales with ingested data, and their pricing models assume (and incentivize) that all data arrives centrally for processing.
- Data Warehouses/Lakes: Designed for massive central storage and analysis. They require data to be landed first, and processing happens after storage.
These systems are architected around the principle of centralizing data before processing. Moreover, engineering and operations teams often cannot execute computation directly where the data is generated, for a simple reason: they do not have sufficient computing resources at the network edge. As a result, teams are forced to accept the resulting increase in cost and complexity.
Introducing a Smarter Architecture: Compute-Over-Data
The alternative to this costly paradigm is a shift towards Compute-Over-Data. This approach inverts the traditional data flow model by bringing computation to the data source, enabling data processing at the node and edge level.
Imagine being able to execute specific processing logic (filtering, data transformation, enrichment with local context, compression, even analytical tasks) directly on the servers generating logs, within the IoT gateways aggregating sensor data, or alongside the applications producing telemetry streams. You get to make decisions about your data before it leaves its point of origin and enters the costly segments of your pipeline, including the decision of whether the data needs to move any further at all.
Applying the Compute-Over-Data paradigm enables you to:
- Filter verbose logs, retaining only critical error states or significant security events.
- Compress data using algorithms tailored to the specific data type or domain.
- Enrich events with local context (e.g., instance IDs, geo-location) before they are centralized.
- Route data based on its content or the outcome of edge processing.
All of this processing happens before the data incurs the significant costs of movement, and only the data you actually need gets moved. It is a shift in paradigm that saves both money and time.
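To make this concrete, here is a minimal sketch of what edge-side processing can look like: a small Python script that keeps only error-level events, enriches them with local context, and compresses the result before anything leaves the node. The file paths, field names, and log format are hypothetical, chosen purely for illustration.

```python
import gzip
import json
import socket

# Hypothetical paths and log format, purely for illustration.
RAW_LOG = "/var/log/app/events.log"        # newline-delimited JSON events
OUTBOX = "/var/spool/edge/errors.json.gz"  # compressed file staged for shipping

KEEP_LEVELS = {"ERROR", "CRITICAL"}        # filter: drop everything below ERROR


def preprocess() -> bool:
    """Filter, enrich, and compress local events; return True if anything is worth shipping."""
    kept = []
    with open(RAW_LOG) as f:
        for line in f:
            try:
                event = json.loads(line)
            except json.JSONDecodeError:
                continue  # skip malformed lines instead of paying to ship them
            if event.get("level") not in KEEP_LEVELS:
                continue  # noise never leaves the node
            # Enrich with local context before the event is centralized.
            event["host"] = socket.gethostname()
            kept.append(event)

    # Compress the (much smaller) remainder for cheap transfer.
    with gzip.open(OUTBOX, "wt") as out:
        for event in kept:
            out.write(json.dumps(event) + "\n")

    # Route: only ship if something meaningful survived the filter.
    return len(kept) > 0


if __name__ == "__main__":
    print(f"ship={preprocess()}")
```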
Adopting the Compute-Over-Data strategy provides organizations with significant advantages like:
- Reduced data costs: You can reduce data costs by avoiding the shipment and storage of unnecessary, low-value data.
- Improved signal visibility: By filtering out noise early in the pipeline, you enhance the signal-to-noise ratio, making it considerably easier to identify anomalies within your systems.
- Compliance and data security: Keeping sensitive data where it resides improves data security. You transmit only anonymized, aggregated, or specifically required subsets of data, strengthening your security posture and compliance adherence.
Bacalhau: The Solution for Implementing Compute-Over-Data
The shift towards Compute-Over-Data is not a theoretical concept. It is a practical reality achievable with Bacalhau. Bacalhau is an open-source framework designed to execute compute jobs where data resides, breaking the dependency on centralized data processing architectures.
Consider Bacalhau as a coordination and orchestration layer capable of managing computations packaged as Docker containers or WASM modules across a distributed network of compute nodes. These nodes can encompass your data center servers, edge devices deployed in the field, or workstations with GPUs. Instead of dealing with the inefficient process of pulling massive datasets to a central compute cluster, Bacalhau deploys the computation where the data is.
Here is how Bacalhau addresses the pain points and costs associated with traditional data pipelines:
- Enables edge processing: Bacalhau lets you execute tasks at the data source, before any costly network egress and platform ingestion fees are incurred. This is a key building block for effective edge computing.
- Reduces data volume and cost: Run filtering, aggregation, or downsampling jobs via Bacalhau on raw logs or metrics before transmission (see the sketch after this list). This cuts the volume of data sent to expensive downstream systems, data warehouses, or observability platforms, and the bill along with it.
- Processes different kinds of jobs: Bacalhau supports batch jobs (run to completion on a specified number of nodes), service jobs (run continuously on a specified number of nodes), ops jobs (run on every node that matches the job specification, but otherwise behave like batch jobs), and daemon jobs (run continuously on every matching node). Whatever shape your workload takes, the same orchestration layer handles it.
- Disconnected execution: Bacalhau is designed for disconnected or intermittently connected environments, such as the edge. Its execution framework lets tasks run without a persistent network connection to a central controller, so you no longer have to worry about re-running jobs every time a node drops off the network.
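Here is a minimal sketch of what submitting such a filtering job can look like, wrapping the Bacalhau CLI from Python. It assumes the bacalhau binary is installed and pointed at a running cluster, and that the filtering script above has been packaged into a hypothetical container image called my-registry/edge-log-filter:latest; the exact CLI syntax can vary between Bacalhau versions, so treat this as a starting point rather than a canonical invocation.

```python
import subprocess

# Hypothetical image that packages the edge-side filtering script shown earlier.
IMAGE = "my-registry/edge-log-filter:latest"


def submit_filter_job() -> None:
    """Submit a containerized filtering job through the Bacalhau CLI.

    `bacalhau docker run` asks the orchestrator to schedule the container on a
    node in the cluster, so raw logs are reduced where they live instead of
    being shipped upstream first.
    """
    cmd = [
        "bacalhau", "docker", "run",
        IMAGE,
        "--",                       # everything after this is the container command
        "python", "/app/preprocess.py",
    ]
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    # The CLI reports the submitted job (including its ID) on stdout.
    print(result.stdout)


if __name__ == "__main__":
    submit_filter_job()
```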
With Bacalhau, you can apply compute resources at the source and make decisions about what data to ship, what to store long-term, and what to discard.
Conclusion
The traditional "ship everything" approach to data pipelines is unsustainable, burdening organizations with different kinds of costs.
A more intelligent and cost-effective paradigm exists: Compute-Over-Data. By bringing computation directly to the data source, organizations can filter, transform, aggregate, and analyze information before it enters expensive pipeline stages. This approach reduces the volume of data transferred and stored, and cuts associated costs.
The solution to implement this paradigm is Bacalhau. It provides the orchestration layer needed to deploy containerized or WASM-based compute jobs across distributed nodes, directly where data resides.
What's Next?
To start using Bacalhau, install it and give it a shot.
However, if you don’t have a network and you would still like to try it out, we recommend using Expanso Cloud. Also, if you would like to set up a cluster on your own, you can do that too (we have setup guides for AWS, GCP, Azure, and many more 🙂).
Get Involved!
We welcome your involvement in Bacalhau. There are many ways to contribute, and we’d love to hear from you. Please reach out to us at any of the following locations:
- Expanso’s Website
- Bacalhau’s Website
- Bacalhau’s Bluesky
- Bacalhau’s Twitter
- Expanso’s Twitter
- TikTok
- YouTube
- Slack
Commercial Support
While Bacalhau is open-source software, the Bacalhau binaries go through the security, verification, and signing build process lovingly crafted by Expanso. You can read more about the difference between open-source Bacalhau and commercially supported Bacalhau in our FAQ. If you would like to use our pre-built binaries and receive commercial support, please contact us or get your license on Expanso Cloud!
Ready to get started?
Create an account instantly to get started or contact us to design a custom package for your business.


