Building a Real-Time Crypto Analytics Platform

From Streaming Pipeline to Analytics Platform

When I started learning data engineering, I wanted a project that would force me to use the technologies I was studying in a realistic setting. Tutorials are useful for understanding concepts, but there is a significant difference between following a guided example and designing a system from scratch.

I decided to build a real-time cryptocurrency analytics platform.

At the time, I thought the project would be relatively straightforward:

CoinGecko API
    ↓
Kafka
    ↓
Spark
    ↓
PostgreSQL

The goal was simple. Stream cryptocurrency market data, process it in real time, and store the results in a database.

What I didn't realise was that the most valuable lessons would come from everything that happened after that initial pipeline worked.

Project Resources

The complete source code, documentation, architecture diagrams, deployment instructions, and dashboard designs for this project are available on GitHub:

GitHub: Repository

The repository includes:

Apache Kafka streaming pipelines
Spark Structured Streaming transformations
PostgreSQL operational storage
Neon PostgreSQL warehouse design
Apache Airflow orchestration workflows
Monitoring and validation systems
Dashboard architecture and documentation

Choosing the Problem

I chose cryptocurrency data for three reasons.

First, it is naturally real-time. Prices, trading volume and market activity change continuously, which makes this data an excellent candidate for streaming architectures.

Second, high-quality public APIs are available, allowing me to focus on engineering challenges rather than data collection.

Finally, cryptocurrency markets create interesting analytical questions. Beyond simple price tracking, I wanted to explore how market behaviour might relate to social sentiment and public discussion.

The project eventually expanded to include both market data and sentiment analysis from YouTube comments discussing cryptocurrency topics.

The Initial Architecture

The first version of the platform focused entirely on streaming market data.

Historical cryptocurrency prices were collected from the CoinGecko API and streamed through Apache Kafka. Spark Structured Streaming processed incoming events, calculated rolling metrics, and published the results to downstream consumers.

The processed metrics were stored in PostgreSQL for operational reporting and persistence.
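The ingestion step described above can be sketched in a few lines. This is a minimal illustration rather than the repository's code: the function name to_kafka_records, the event fields and the sample values are all hypothetical, and the payload shape assumes CoinGecko's /simple/price endpoint, which may not be the endpoint the project calls.

```python
import json
from datetime import datetime, timezone


def to_kafka_records(payload, fetched_at):
    """Turn a CoinGecko-style price payload into (key, value) byte pairs.

    Hypothetical helper: field names are illustrative, not the project's schema.
    """
    records = []
    for coin_id, quotes in sorted(payload.items()):
        event = {
            "coin_id": coin_id,
            "price_usd": float(quotes["usd"]),
            "event_time": fetched_at.isoformat(),
        }
        # Key by coin id so all events for one coin share a partition.
        key = coin_id.encode("utf-8")
        value = json.dumps(event, sort_keys=True).encode("utf-8")
        records.append((key, value))
    return records


# Made-up payload in the shape {"<coin id>": {"usd": <price>}}.
sample = {"bitcoin": {"usd": 64000.5}, "ethereum": {"usd": 3100}}
fetched_at = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
records = to_kafka_records(sample, fetched_at)

for key, value in records:
    # A real producer would send each pair to a topic instead of printing it.
    print(key.decode(), value.decode())
```

Keying by coin id is a design choice: with Kafka's default partitioner, records that share a key go to the same partition, so each coin's prices stay in order. With a client such as kafka-python, each pair could then be passed to producer.send(topic, key=key, value=value).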

The core technologies included:

Apache Kafka
Apache Spark Structured Streaming
PostgreSQL
Docker Compose

At this stage, I believed I had built a complete data pipeline.

I was wrong.

Expanding Beyond Streaming

As the project grew, I began adding new analytical requirements.

I wanted to:

analyse cryptocurrency-related YouTube discussions
calculate sentiment metrics
build dashboards
compare market activity with social sentiment
create automated monitoring and validation workflows

Each new requirement exposed limitations in my original design.

The project gradually evolved from a streaming pipeline into a broader analytics platform.

The resulting architecture looks very different from the system I originally set out to build.

The Architecture Today

The platform now consists of multiple layers.

Streaming Layer

Real-time cryptocurrency prices and YouTube comments are ingested through Kafka topics.

Spark Structured Streaming processes incoming events and performs real-time transformations and aggregations.

Operational Storage Layer

Processed outputs are stored in PostgreSQL.

It serves as the persistent operational store for streaming outputs.

Analytical Warehouse Layer

Daily aggregates are loaded into a dedicated analytical warehouse hosted on Neon PostgreSQL.

The warehouse uses a dimensional model consisting of:

fact tables
dimension tables
dashboard-facing analytical views
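To make the three pieces of the dimensional model concrete, here is a small sketch of a star schema. Everything in it is an assumption for illustration: the table names, columns and sample rows are hypothetical, and an in-memory SQLite database stands in for Neon PostgreSQL so the example runs anywhere.

```python
import sqlite3

# SQLite stands in for the warehouse here; the real platform uses Neon PostgreSQL.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_coin (
    coin_key INTEGER PRIMARY KEY,
    coin_id  TEXT NOT NULL UNIQUE,
    symbol   TEXT NOT NULL
);
CREATE TABLE dim_date (
    date_key  INTEGER PRIMARY KEY,   -- e.g. 20240101
    full_date TEXT NOT NULL UNIQUE
);
CREATE TABLE fact_daily_market (
    coin_key      INTEGER NOT NULL REFERENCES dim_coin (coin_key),
    date_key      INTEGER NOT NULL REFERENCES dim_date (date_key),
    avg_price_usd REAL NOT NULL,
    volume_usd    REAL NOT NULL,
    PRIMARY KEY (coin_key, date_key)  -- grain: one row per coin per day
);
CREATE VIEW vw_dashboard_daily_market AS
SELECT d.full_date, c.symbol, f.avg_price_usd, f.volume_usd
FROM fact_daily_market AS f
JOIN dim_coin AS c ON c.coin_key = f.coin_key
JOIN dim_date AS d ON d.date_key = f.date_key;
""")

# Made-up sample rows, only to show the view resolving the joins.
conn.executemany("INSERT INTO dim_coin VALUES (?, ?, ?)",
                 [(1, "bitcoin", "BTC"), (2, "ethereum", "ETH")])
conn.execute("INSERT INTO dim_date VALUES (20240101, '2024-01-01')")
conn.executemany("INSERT INTO fact_daily_market VALUES (?, ?, ?, ?)",
                 [(1, 20240101, 64000.0, 1.5e9), (2, 20240101, 3100.0, 8.0e8)])

rows = conn.execute(
    "SELECT full_date, symbol, avg_price_usd "
    "FROM vw_dashboard_daily_market ORDER BY symbol"
).fetchall()
print(rows)
```

The dashboard only ever reads the view, so the joins and the grain are defined once in the warehouse rather than repeated in each chart.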

Monitoring Layer

Apache Airflow orchestrates warehouse loading workflows and performs platform-wide health checks.

The monitoring system validates:

pipeline freshness
data quality
warehouse population
dashboard readiness
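Two of these checks, freshness and warehouse population, are simple enough to sketch as plain functions. The function names, thresholds and result format below are hypothetical, not the project's actual checks; in an Airflow task, a failed check would typically raise an exception so that the task run is marked as failed.

```python
from datetime import datetime, timedelta, timezone


def check_freshness(latest_event_time, now, max_lag):
    """Pass when the newest record is no older than max_lag."""
    lag = now - latest_event_time
    return {
        "check": "freshness",
        "passed": lag <= max_lag,
        "lag_seconds": lag.total_seconds(),
    }


def check_population(row_count, min_rows):
    """Pass when a warehouse table holds at least min_rows rows."""
    return {
        "check": "population",
        "passed": row_count >= min_rows,
        "row_count": row_count,
    }


# Fixed "now" and made-up inputs so the example is deterministic.
now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
results = [
    check_freshness(now - timedelta(minutes=3), now, max_lag=timedelta(minutes=10)),
    check_freshness(now - timedelta(hours=2), now, max_lag=timedelta(minutes=10)),
    check_population(row_count=0, min_rows=1),
]

failed = [r["check"] for r in results if not r["passed"]]
print(failed)
```

Returning a small result record instead of a bare boolean keeps the measured lag and row count available for logging when a check fails.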

Analytics Layer

Looker Studio dashboards provide reporting for:

cryptocurrency market metrics
social sentiment metrics
exploratory sentiment-market relationships

The Biggest Lessons

Building the platform taught me far more than how to configure individual technologies.

Several assumptions I had at the beginning turned out to be incorrect.

A Database Is Not a Data Warehouse

I already understood in theory that operational databases and analytical warehouses serve different purposes. The project showed me when that distinction starts to matter in practice.

Initially, PostgreSQL was sufficient for storing and querying processed streaming data. However, as I added dashboards, historical analysis, and more complex reporting requirements, I found myself needing structures that were designed specifically for analytics rather than operational workloads.

That practical need led to the introduction of a dedicated warehouse layer, dimensional modelling, and dashboard-facing analytical views.

Streaming Is Not Batch Processing

Coming from a Python and pandas background, I initially approached Spark Structured Streaming as if it were simply batch processing performed continuously.

Attempting to apply batch-style analytical techniques quickly revealed that streaming systems require a different way of thinking.

State management, event windows and streaming constraints fundamentally shape how transformations must be designed.
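A toy model helps show why state and event windows change the design. The sketch below is plain Python, not Spark code, and it simplifies how a streaming engine behaves: it keeps per-window state, tracks a watermark (the latest event time seen minus an allowed lateness), emits a window only once the watermark has passed its end, and drops events that arrive for windows already closed. The class name, window size and lateness values are illustrative assumptions.

```python
from collections import defaultdict

WINDOW_SECONDS = 60      # tumbling window length (assumed)
ALLOWED_LATENESS = 30    # how far behind the newest event we still accept data


class TumblingAverage:
    """Average price per tumbling event-time window, with a watermark."""

    def __init__(self):
        self.state = defaultdict(lambda: [0.0, 0])  # window_start -> [sum, count]
        self.max_event_time = 0
        self.dropped = 0

    def add(self, event_time, price):
        """Process one event; return any windows that are now final."""
        self.max_event_time = max(self.max_event_time, event_time)
        watermark = self.max_event_time - ALLOWED_LATENESS
        window_start = event_time - event_time % WINDOW_SECONDS
        if window_start + WINDOW_SECONDS <= watermark:
            # The window was already finalised, so this event is too late.
            self.dropped += 1
            return []
        bucket = self.state[window_start]
        bucket[0] += price
        bucket[1] += 1
        return self._emit_closed(watermark)

    def _emit_closed(self, watermark):
        """Remove and return (window_start, average) for closed windows."""
        closed = sorted(s for s in self.state if s + WINDOW_SECONDS <= watermark)
        return [(s, self.state[s][0] / self.state.pop(s)[1]) for s in closed]


agg = TumblingAverage()
emitted = []
# (event_time_seconds, price): out of order on purpose, values made up.
for event_time, price in [(5, 100.0), (20, 110.0), (65, 120.0),
                          (50, 90.0), (130, 130.0), (10, 999.0)]:
    emitted.extend(agg.add(event_time, price))

print(emitted, agg.dropped, sorted(agg.state))
```

The event at time 50 arrives after the one at time 65 but still counts, because its window has not closed yet. The event at time 10 arrives after its window was emitted and is discarded. In pandas neither case exists, because the whole dataset is available before the aggregation runs.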

Dashboards Depend on Data Models

Several dashboard issues initially appeared to be visualisation problems.

In reality, they were data modelling problems.

Analytical grain, timestamp selection and warehouse design had a much greater impact on dashboard behaviour than any chart configuration.
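Timestamp selection is easy to underestimate, so here is a small worked example. It assumes each record carries two timestamps, hypothetically named event_time (when the comment was posted) and ingested_at (when the pipeline received it), and it aggregates the same made-up rows to a daily grain using each one.

```python
from collections import Counter
from datetime import datetime, timezone


def daily_counts(rows, timestamp_field):
    """Count rows per UTC calendar day, bucketed by the chosen timestamp."""
    counts = Counter()
    for row in rows:
        counts[row[timestamp_field].date().isoformat()] += 1
    return dict(counts)


def ts(day, hour, minute):
    """Helper for readable sample timestamps in January 2024 (UTC)."""
    return datetime(2024, 1, day, hour, minute, tzinfo=timezone.utc)


# Two comments posted just before midnight were ingested just after it.
comments = [
    {"event_time": ts(1, 23, 50), "ingested_at": ts(2, 0, 5)},
    {"event_time": ts(1, 23, 58), "ingested_at": ts(2, 0, 5)},
    {"event_time": ts(2, 9, 0), "ingested_at": ts(2, 9, 1)},
]

by_event = daily_counts(comments, "event_time")
by_ingest = daily_counts(comments, "ingested_at")
print(by_event)
print(by_ingest)
```

The same three rows give two comments on 1 January and one on 2 January when bucketed by event time, but three on 2 January when bucketed by ingestion time. A chart built on the second version looks wrong, yet no chart setting can correct it.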

Monitoring Matters

A pipeline that runs is not necessarily a pipeline that works.

One of the most valuable additions to the project was the monitoring layer.

Freshness checks, quality validation and warehouse integrity testing significantly improved confidence in the platform's outputs and helped surface issues that would otherwise have gone unnoticed.
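As an illustration of the quality validation side, the sketch below scans a batch of metric rows for three common problems: missing key fields, duplicate keys and implausible prices. The rules, field names and sample rows are assumptions chosen for the example, not the checks the platform actually runs.

```python
def validate_rows(rows):
    """Return (row_index, problem) pairs for a batch of metric rows."""
    problems = []
    seen = set()
    for i, row in enumerate(rows):
        key = (row.get("coin_id"), row.get("window_start"))
        if None in key:
            problems.append((i, "missing key field"))
            continue
        if key in seen:
            problems.append((i, "duplicate key"))
        seen.add(key)
        price = row.get("avg_price_usd")
        if price is None or price <= 0:
            problems.append((i, "non-positive or missing price"))
    return problems


# Made-up batch: one clean row followed by three different problems.
rows = [
    {"coin_id": "bitcoin", "window_start": "2024-01-01T12:00", "avg_price_usd": 64000.0},
    {"coin_id": "bitcoin", "window_start": "2024-01-01T12:00", "avg_price_usd": 64010.0},
    {"coin_id": "ethereum", "window_start": "2024-01-01T12:00", "avg_price_usd": -1.0},
    {"coin_id": None, "window_start": "2024-01-01T12:00", "avg_price_usd": 10.0},
]

problems = validate_rows(rows)
for index, problem in problems:
    print(index, problem)
```

None of these rows would stop a pipeline from running, which is why they need an explicit check.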

What the Platform Does Today

Today, the platform provides:

real-time cryptocurrency market processing
YouTube sentiment analysis
operational data persistence
analytical warehousing
dashboard reporting
automated monitoring and validation

The project uses:

Python
SQL
Apache Kafka
Apache Spark Structured Streaming
PostgreSQL
Neon PostgreSQL
Apache Airflow
Docker Compose
Looker Studio

Looking Ahead

The next phase of development will focus on expanding the platform's data engineering capabilities.

Current areas of interest include:

MongoDB integration for sentiment data
dbt for warehouse modelling and transformation management
additional sentiment sources
more advanced analytical workflows

Final Thoughts

I started this project to improve my knowledge of Kafka and Spark.

What I ended up learning was much broader.

The most important lessons were not about individual technologies. They were about data modelling, system design, monitoring, analytical thinking and the realities of building systems that continue working after the initial implementation.

The project began as a streaming pipeline.

It became an analytics platform.

More importantly, it taught me that building data systems comes down to understanding how data moves, evolves and is ultimately used.

And in the process, it became the most valuable learning experience of my transition into data engineering.