Building a Real-Time Crypto Analytics Platform

Chemistry PhD transitioning into data engineering. Building data systems with Python, SQL, and end-to-end pipelines, while exploring databases, analytics, and data workflows.

From Streaming Pipeline to Analytics Platform

When I started learning data engineering, I wanted a project that would force me to use the technologies I was studying in a realistic setting. Tutorials are useful for understanding concepts, but there is a significant difference between following a guided example and designing a system from scratch.

I decided to build a real-time cryptocurrency analytics platform.

At the time, I thought the project would be relatively straightforward:

CoinGecko API
    ↓
Kafka
    ↓
Spark
    ↓
PostgreSQL

The goal was simple. Stream cryptocurrency market data, process it in real time, and store the results in a database.

What I didn't realise was that the most valuable lessons would come from everything that happened after that initial pipeline worked.


Project Resources

The complete source code, documentation, architecture diagrams, deployment instructions, and dashboard designs for this project are available on GitHub:

GitHub: Repository

The repository includes:

  • Apache Kafka streaming pipelines

  • Spark Structured Streaming transformations

  • PostgreSQL operational storage

  • Neon PostgreSQL warehouse design

  • Apache Airflow orchestration workflows

  • Monitoring and validation systems

  • Dashboard architecture and documentation


Choosing the Problem

I chose cryptocurrency data for three reasons.

First, it is naturally real-time. Prices, trading volume and market activity are continuously changing, making it an excellent candidate for streaming architectures.

Second, high-quality public APIs are available, allowing me to focus on engineering challenges rather than data collection.

Finally, cryptocurrency markets create interesting analytical questions. Beyond simple price tracking, I wanted to explore how market behaviour might relate to social sentiment and public discussion.

The project eventually expanded to include both market data and sentiment analysis from YouTube comments discussing cryptocurrency topics.


The Initial Architecture

The first version of the platform focused entirely on streaming market data.

Historical cryptocurrency prices are collected from the CoinGecko API and streamed through Apache Kafka. Spark Structured Streaming processes incoming events, calculates rolling metrics, and publishes the results to downstream consumers.

The processed metrics are stored in PostgreSQL for operational reporting and persistence.
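
To make the ingestion step concrete, here is a minimal sketch of how a price payload could be turned into keyed messages ready for a Kafka producer. It is not taken from the project repository. It assumes a payload shaped like `{"bitcoin": {"usd": 64000.5}}`, and the function name `build_price_events` and the record fields are hypothetical. No network or Kafka client is involved; the function only prepares the bytes a producer would send.

```python
import json
from datetime import datetime, timezone


def build_price_events(payload, fetched_at):
    """Turn one price payload into (key, value) byte pairs for a Kafka topic.

    Assumed payload shape: {"<coin id>": {"usd": <price>}}.
    """
    events = []
    for coin_id, quotes in sorted(payload.items()):
        record = {
            "coin_id": coin_id,
            "price_usd": float(quotes["usd"]),
            # Event time travels with the record so downstream
            # windowing does not depend on arrival time.
            "event_time": fetched_at.isoformat(),
        }
        key = coin_id.encode("utf-8")
        value = json.dumps(record, sort_keys=True).encode("utf-8")
        events.append((key, value))
    return events


# Hypothetical sample data, not real market prices.
sample_payload = {"bitcoin": {"usd": 64000.5}, "ethereum": {"usd": 3100}}
fetched_at = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
events = build_price_events(sample_payload, fetched_at)

for key, value in events:
    print(key, value)
```

Keying each message by coin identifier is a deliberate choice in this sketch: with Kafka's default partitioner, messages that share a key land in the same partition, so events for one coin keep their order.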

The core technologies included:

  • Apache Kafka

  • Apache Spark Structured Streaming

  • PostgreSQL

  • Docker Compose

At this stage, I believed I had built a complete data pipeline.

I was wrong.


Expanding Beyond Streaming

As the project grew, I began adding analytical requirements.

I wanted to:

  • analyse cryptocurrency-related YouTube discussions

  • calculate sentiment metrics

  • build dashboards

  • compare market activity with social sentiment

  • create automated monitoring and validation workflows

Each new requirement exposed limitations in my original design.

The project gradually evolved from a streaming pipeline into a broader analytics platform.

The resulting architecture looks very different from the system I originally set out to build.


The Architecture Today

The platform now consists of multiple layers.

Streaming Layer

Real-time cryptocurrency prices and YouTube comments are ingested through Kafka topics.

Spark Structured Streaming processes incoming events and performs real-time transformations and aggregations.

Operational Storage Layer

Processed outputs are stored in PostgreSQL.

It serves as the operational persistence store for the streaming outputs.

Analytical Warehouse Layer

Daily aggregates are loaded into a dedicated analytical warehouse hosted on Neon PostgreSQL.

The warehouse uses a dimensional model consisting of:

  • fact tables

  • dimension tables

  • dashboard-facing analytical views
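
As a rough illustration of that structure, the sketch below builds one dimension table, one fact table at a daily grain, and one dashboard-facing view. It uses Python's built-in `sqlite3` as a stand-in for PostgreSQL so it can run anywhere. All table, column, and view names (`dim_coin`, `fact_daily_market`, `vw_dashboard_daily_market`) are hypothetical and are not the project's actual schema, and the inserted rows are made-up sample values.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE dim_coin (
        coin_key INTEGER PRIMARY KEY,
        coin_id  TEXT NOT NULL UNIQUE,
        symbol   TEXT NOT NULL
    );

    -- Grain: exactly one row per coin per day.
    CREATE TABLE fact_daily_market (
        coin_key  INTEGER NOT NULL REFERENCES dim_coin (coin_key),
        date_key  TEXT NOT NULL,
        avg_price REAL NOT NULL,
        volume    REAL NOT NULL,
        PRIMARY KEY (coin_key, date_key)
    );

    -- Dashboards read from the view, not from the raw tables.
    CREATE VIEW vw_dashboard_daily_market AS
    SELECT d.symbol, f.date_key, f.avg_price, f.volume
    FROM fact_daily_market AS f
    JOIN dim_coin AS d ON d.coin_key = f.coin_key;
    """
)

conn.executemany(
    "INSERT INTO dim_coin VALUES (?, ?, ?)",
    [(1, "bitcoin", "BTC"), (2, "ethereum", "ETH")],
)
conn.executemany(
    "INSERT INTO fact_daily_market VALUES (?, ?, ?, ?)",
    [(1, "2024-01-01", 42000.0, 1.5e9), (2, "2024-01-01", 2300.0, 8.0e8)],
)

rows = conn.execute(
    "SELECT symbol, date_key, avg_price "
    "FROM vw_dashboard_daily_market ORDER BY symbol"
).fetchall()
print(rows)
```

The composite primary key on the fact table enforces the daily grain, so a second load for the same coin and day fails loudly instead of silently doubling the numbers a dashboard would show.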

Monitoring Layer

Apache Airflow orchestrates warehouse loading workflows and performs platform-wide health checks.

The monitoring system validates:

  • pipeline freshness

  • data quality

  • warehouse population

  • dashboard readiness
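
The checks in this list can be sketched as plain Python functions of the kind an orchestrated task might call. This is an illustration under stated assumptions, not the project's Airflow code: the function names, the row fields (`event_time`, `price_usd`), and the ten-minute freshness threshold are all hypothetical.

```python
from datetime import datetime, timedelta, timezone


def check_freshness(latest_event_time, now, max_lag):
    """Pass when the newest record is no older than max_lag."""
    return (now - latest_event_time) <= max_lag


def check_quality(rows):
    """Return a list of human-readable problems found in the rows."""
    problems = []
    for i, row in enumerate(rows):
        if row.get("price_usd") is None:
            problems.append(f"row {i}: missing price_usd")
        elif row["price_usd"] <= 0:
            problems.append(f"row {i}: non-positive price_usd")
    return problems


def run_health_checks(rows, now, max_lag=timedelta(minutes=10)):
    """Combine population, freshness, and quality checks into one report."""
    if not rows:
        return {"populated": False, "fresh": False, "problems": []}
    latest = max(row["event_time"] for row in rows)
    return {
        "populated": True,
        "fresh": check_freshness(latest, now, max_lag),
        "problems": check_quality(rows),
    }


# Made-up sample rows for illustration only.
now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
rows = [
    {"event_time": now - timedelta(minutes=2), "price_usd": 42000.0},
    {"event_time": now - timedelta(minutes=5), "price_usd": None},
    {"event_time": now - timedelta(minutes=45), "price_usd": -1.0},
]
report = run_health_checks(rows, now)
stale_report = run_health_checks(rows[2:], now)
print(report)
print(stale_report)
```

Inside an orchestrator, a task wrapping `run_health_checks` could raise an exception when the report is not clean, which marks the task as failed and makes the problem visible instead of letting a stale dashboard pass unnoticed.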

Analytics Layer

Looker Studio dashboards provide reporting for:

  • cryptocurrency market metrics

  • social sentiment metrics

  • exploratory sentiment-market relationships


The Biggest Lessons

Building the platform taught me far more than how to configure individual technologies.

Several assumptions I had at the beginning turned out to be incorrect.

A Database Is Not a Data Warehouse

I already understood in theory that operational databases and analytical warehouses serve different purposes, but the project showed me when that distinction starts to matter in practice.

Initially, PostgreSQL was sufficient for storing and querying processed streaming data. However, as I added dashboards, historical analysis, and more complex reporting requirements, I found myself needing structures that were designed specifically for analytics rather than operational workloads.

That practical need led to the introduction of a dedicated warehouse layer, dimensional modelling, and dashboard-facing analytical views.

Streaming Is Not Batch Processing

Coming from a Python and pandas background, I initially approached Spark Structured Streaming as if it were simply batch processing performed continuously.

Attempting to use traditional analytical techniques quickly revealed that streaming systems require a different way of thinking.

State management, event windows and streaming constraints fundamentally shape how transformations must be designed.
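
To show why state and windows change the design, here is a small plain-Python sketch of an event-time tumbling window with a watermark. It deliberately does not use the Spark API, and Spark's actual semantics differ in detail. It only mimics the idea: averages are kept as running state per window, a window is emitted once the watermark passes its end, and events arriving after that are dropped. The class name, window size, and lateness allowance are hypothetical.

```python
from collections import defaultdict

WINDOW_SECONDS = 60    # tumbling window length (assumed)
ALLOWED_LATENESS = 30  # how far the watermark trails the newest event (assumed)


class TumblingAverage:
    def __init__(self):
        # (coin, window_start) -> [running sum, count]
        self.state = defaultdict(lambda: [0.0, 0])
        self.max_event_time = 0

    def process(self, coin, event_time, price):
        """Fold one event into state; return windows finalised by the watermark."""
        self.max_event_time = max(self.max_event_time, event_time)
        watermark = self.max_event_time - ALLOWED_LATENESS
        window_start = event_time - event_time % WINDOW_SECONDS

        # Accept the event only if its window is still open.
        if window_start + WINDOW_SECONDS > watermark:
            bucket = self.state[(coin, window_start)]
            bucket[0] += price
            bucket[1] += 1

        # Emit and forget every window the watermark has passed.
        closed = []
        for key in sorted(self.state):
            if key[1] + WINDOW_SECONDS <= watermark:
                total, count = self.state.pop(key)
                closed.append((key[0], key[1], total / count))
        return closed


# (coin, event time in seconds, price); made-up values.
stream = [
    ("btc", 5, 100.0),
    ("btc", 50, 110.0),
    ("btc", 70, 120.0),
    ("btc", 20, 90.0),    # out of order, but window [0, 60) is still open
    ("btc", 130, 130.0),  # moves the watermark to 100, closing window [0, 60)
    ("btc", 40, 999.0),   # too late: its window is already closed, so it is dropped
]

agg = TumblingAverage()
results = []
for coin, event_time, price in stream:
    results.extend(agg.process(coin, event_time, price))
print(results)
```

Unlike a pandas `groupby` over a complete table, the result for a window cannot be revised once it has been emitted, which is why lateness and state retention have to be decided up front.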

Dashboards Depend on Data Models

Several dashboard issues initially appeared to be visualisation problems.

In reality, they were data modelling problems.

Analytical grain, timestamp selection and warehouse design had a much greater impact on dashboard behaviour than any chart configuration.
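
A small example of the timestamp issue: the sketch below buckets the same three comments by day twice, once by when they were written and once by when they were ingested. The field names `event_time` and `ingested_at` and the sample rows are hypothetical.

```python
from collections import Counter
from datetime import datetime, timezone


def utc(*args):
    """Build a timezone-aware UTC datetime."""
    return datetime(*args, tzinfo=timezone.utc)


def daily_counts(rows, timestamp_field):
    """Count rows per UTC calendar day using the chosen timestamp column."""
    return Counter(row[timestamp_field].date().isoformat() for row in rows)


# Two comments written just before midnight but ingested just after it.
comments = [
    {"event_time": utc(2024, 1, 1, 23, 50), "ingested_at": utc(2024, 1, 2, 0, 5)},
    {"event_time": utc(2024, 1, 1, 23, 58), "ingested_at": utc(2024, 1, 2, 0, 5)},
    {"event_time": utc(2024, 1, 2, 9, 0), "ingested_at": utc(2024, 1, 2, 9, 1)},
]

by_event = daily_counts(comments, "event_time")
by_ingest = daily_counts(comments, "ingested_at")
print(dict(by_event))
print(dict(by_ingest))
```

The two results disagree about how many comments belong to each day, and no chart setting can reconcile them. The choice of timestamp and grain has to be made in the data model.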

Monitoring Matters

A pipeline that runs is not necessarily a pipeline that works.

One of the most valuable additions to the project was the monitoring layer.

Freshness checks, quality validation and warehouse integrity testing significantly improved confidence in the platform's outputs and helped surface issues that would otherwise have gone unnoticed.


What the Platform Does Today

Today, the platform provides:

  • real-time cryptocurrency market processing

  • YouTube sentiment analysis

  • operational data persistence

  • analytical warehousing

  • dashboard reporting

  • automated monitoring and validation

The project uses:

  • Python

  • SQL

  • Apache Kafka

  • Apache Spark Structured Streaming

  • PostgreSQL

  • Neon PostgreSQL

  • Apache Airflow

  • Docker Compose

  • Looker Studio


Looking Ahead

The next phase of development will focus on expanding the platform's data engineering capabilities.

Current areas of interest include:

  • MongoDB integration for sentiment data

  • dbt for warehouse modelling and transformation management

  • additional sentiment sources

  • more advanced analytical workflows


Final Thoughts

I started this project to improve my knowledge of Kafka and Spark.

What I ended up learning was much broader.

The most important lessons were not about individual technologies. They were about data modelling, system design, monitoring, analytical thinking and the realities of building systems that continue working after the initial implementation.

The project began as a streaming pipeline.

It became an analytics platform.

More importantly, it taught me that building data systems is far less about individual technologies and far more about understanding how data moves, evolves and is ultimately used.

And in the process, it became the most valuable learning experience of my transition into data engineering.

Building a Crypto Analytics Platform

Part 1 of 2

A series documenting the design, architecture, debugging challenges, and lessons learned while building a real-time cryptocurrency analytics platform using Kafka, Spark Structured Streaming, PostgreSQL, Airflow, and modern analytics engineering practices.
