Why My Data Pipeline Needed a Data Warehouse

Context

The platform combines real-time cryptocurrency market data from the CoinGecko API with sentiment analysis derived from cryptocurrency-related YouTube discussions. Apache Kafka, Spark Structured Streaming, PostgreSQL, Airflow and Neon PostgreSQL are used to process, store and analyse the resulting data.

The full project overview can be found here:
Building a Real-Time Crypto Analytics Platform | GitHub Repository

Designing for Ingestion

When I built the first version of my crypto analytics platform, I thought the architecture was complete.

Data was flowing from the CoinGecko API into Kafka, Spark Structured Streaming was calculating rolling metrics, PostgreSQL was storing the results. The pipeline was working exactly as intended.

CoinGecko API
      ↓
Kafka
      ↓
Spark
      ↓
PostgreSQL

From an engineering perspective, I had achieved my original objective: ingest, process and store real-time data. What I didn't appreciate at the time was that I had designed the system entirely around data production. I hadn't really thought about data consumption. That distinction ended up changing the architecture of the entire platform.

The Architecture Was Optimised for the Wrong Problem

The operational PostgreSQL database worked perfectly. Spark wrote aggregated metrics. Queries returned results. Nothing was broken. The problem appeared when I started building dashboards.

I wanted to explore questions such as:

How do sentiment trends compare with market behaviour?
How does Bitcoin perform over longer time horizons?
Which metrics should be surfaced for reporting?
How can multiple datasets be analysed together?

The challenge wasn't collecting data anymore. The challenge was making that data easy to analyse. Those are very different problems.

Operational Models Reflect Pipelines

The operational schema naturally evolved around the requirements of the streaming system.

Tables were structured around:

ingestion events
streaming outputs
processing timestamps
operational persistence

This made perfect sense for the pipeline itself. The schema reflected how data moved through the system, but analytical questions are rarely organised around ingestion pipelines. They're organised around analytical concepts. Dates. Assets. Channels. Sentiment categories. Reporting periods.

As analytical requirements increased, I found myself repeatedly reshaping operational data into analytical structures before I could answer relatively simple questions. In effect, I was building a warehouse every time I wrote a reporting query.

The architecture was doing exactly what it had been designed to do. The problem was that the requirements had changed.

The Warehouse Was an Architectural Decision

At this point, the obvious question was:

Why not just keep adding more queries to PostgreSQL?

Technically, I could have.

The issue wasn't scale. The issue was responsibility.

The operational PostgreSQL database had been designed to support streaming ingestion and persistence. It was excellent at that job.

What I needed now was a dedicated analytical layer.

I wanted a system that could organise data around reporting requirements rather than ingestion workflows, provide a consistent analytical grain, simplify dashboard consumption, and support future analytical transformations.

I also had a practical problem to solve.

At the time, PostgreSQL was running locally inside Docker as part of the streaming platform. While this worked perfectly for development, it made downstream analytics more awkward. Connecting business intelligence tools directly to a local database introduces additional networking complexity and makes dashboards dependent on a developer machine being available.

I needed an analytical layer that was both architecturally separate and accessible to downstream reporting tools.

Neon turned out to be a natural fit.

Because Neon is PostgreSQL-compatible, I could reuse existing SQL skills and modelling approaches while introducing a cloud-hosted warehouse layer that Looker Studio could connect to directly.

The warehouse therefore solved two problems simultaneously.

It established a clear separation between operational and analytical workloads, while also making the analytical layer accessible outside the local development environment.

To populate the warehouse, I introduced a set of Airflow-orchestrated ELT workflows.

The operational PostgreSQL database remained the source of truth for streaming outputs. Airflow extracted the required data, transformed it into reporting-friendly aggregates, and loaded the results into analytical fact tables hosted in Neon.

For example, cryptocurrency metrics were aggregated by asset and date before being loaded into fact_crypto_price_daily, while sentiment metrics were grouped by date, channel, source query and sentiment label before being loaded into fact_youtube_sentiment_daily.

The streaming platform continued serving operational workloads.

The warehouse became responsible for analytics.

Designing for Analytics

Introducing the warehouse forced me to think differently about the data.

Until this point, most design decisions had been driven by how data moved through the system. Now I had to think about how people would actually consume it.

I modelled the warehouse as a simple star schema because the reporting requirements were aggregation-heavy and dashboard-driven. Most analytical queries involved filtering, grouping and comparing metrics across common dimensions such as dates, assets, channels and sentiment categories. The fact tables stored measurable metrics, while dimensions such as dates, channels, source queries and sentiment labels provided analytical context and simplified joins.

The goal was not to build the most sophisticated dimensional model possible. The goal was to make common analytical questions straightforward to answer.

Once the warehouse was in place, the architecture became much easier to reason about.

Streaming Layer
        ↓
Operational PostgreSQL
        ↓
Neon Warehouse
        ↓
Analytical Views
        ↓
Looker Studio

Each layer now had a distinct responsibility. The streaming layer produced data. The operational database persisted it. The warehouse organised it for analysis. The dashboards consumed it.

The most important change was not the introduction of a new technology. It was the separation of concerns.

The Real Lesson

Before this project, I thought of data warehouses primarily as a storage solution. After building the platform, I started thinking of them as an architectural boundary.

The warehouse wasn't introduced because the data volume became bigger. It wasn't introduced because PostgreSQL was insufficient. It was introduced because the questions being asked of the system had evolved. The architecture that was ideal for producing data was no longer ideal for consuming it. That distinction fundamentally changed how I think about analytics platforms.

Good data engineering isn't only about moving data from one place to another. It's about designing systems that allow data to become useful once it arrives. And sometimes that means recognising when a working architecture is no longer the right architecture.

The Moment I Realised a Database Is Not a Data Warehouse

Context

Designing for Ingestion

The Architecture Was Optimised for the Wrong Problem

Operational Models Reflect Pipelines

The Warehouse Was an Architectural Decision

Designing for Analytics

The Real Lesson

Comments

Building a Crypto Analytics Platform

Building a Real-Time Crypto Analytics Platform

More from this blog

Building a Real-Time Crypto Analytics Platform

Week 1: Building the Dataset (Harder Than Expected)

I Thought I Understood Databases — Then I Built One Properly

Building an ETL Pipeline in Bash (and What I Learned)

Command Palette

Context

Designing for Ingestion

The Architecture Was Optimised for the Wrong Problem

Operational Models Reflect Pipelines

The Warehouse Was an Architectural Decision

Designing for Analytics

The Real Lesson

Comments

Building a Crypto Analytics Platform

Building a Real-Time Crypto Analytics Platform

More from this blog