I Thought I Understood Databases — Then I Built One Properly

I thought I understood databases.
I had written SQL queries, built tables, and used them in projects. But everything I knew was limited to querying data - not managing it.
So I decided to build a complete PostgreSQL system from scratch. And that’s when things started breaking.
The Approach
I started with a retail dataset from Kaggle to mimic something closer to a real-world system - a database that is actively used, updated, and queried.
Instead of treating this as a simple data storage exercise, I approached it as if I were setting up a small production database. That meant making decisions not just about how to store data, but how to structure, protect, and maintain it over time.
Repository: Retail Database System
What I Learned
Lesson 1: Schema design is more than just creating a table
The first step was designing the schema properly.
Initially, I thought this would be straightforward, but I quickly ran into issues. For example, I defined customer_id as an integer, only to realise the dataset used alphanumeric values like 'CUST001'.
This forced me to think more carefully about data types and structure.
I also considered normalisation principles, particularly Third Normal Form (3NF), which aims to reduce redundancy and ensure that each piece of data is stored in a logical, non-repetitive way. While the dataset was relatively simple, this mindset influenced how I structured the table and prepared for potential extensions.
Later on, I explored partitioning the data by transaction date. This isn’t strictly part of normalisation, but it reflects how real systems evolve - separating large datasets into manageable segments for performance and maintenance.
Lesson 2: A backup is only useful if you can restore it
Once the data was in place, I started thinking about reliability.
If this were a real retail system, data would be constantly updated throughout the day. Losing even a few hours of data could be costly, so relying on a single daily or weekly backup wouldn’t be sufficient.
To begin with, I implemented a simple backup strategy using PostgreSQL’s pg_dump, generating timestamped backups through a shell script. But more importantly, I tested the recovery process by restoring the backup into a separate database. This was a key moment for me - creating a backup is easy, but knowing it actually works is what makes it valuable.
As I thought more about how a real system would operate, I realised that a more robust approach would be needed. In practice, I would design a hybrid backup strategy combining periodic full backups (for example, weekly) with continuous Write-Ahead Logging (WAL), enabling point-in-time recovery.
This would allow the database to be restored not just to the last backup, but to a specific moment before a failure occurred; significantly reducing potential data loss.
Even though this project only implemented simple logical backups, thinking in terms of recovery objectives --how much data can be lost and how quickly the system must recover-- made the problem feel much more realistic.
Lesson 3: Query optimisation is about plans, not time
I also wanted to understand how PostgreSQL actually executes queries in practice.
Using EXPLAIN ANALYZE, I compared query plans before and after adding an index.
Because the dataset was relatively small, the execution time difference was appreciable, but not significant . However, the more important insight came from the query plan itself.
Before indexing, PostgreSQL performed a sequential scan across the entire table, filtering out non-matching rows. After indexing, it switched to an index scan, directly locating the relevant rows.
This highlighted something I hadn’t fully appreciated before: query optimisation is less about the absolute execution time in small test environments, and more about how the database chooses to access data. As datasets grow, these differences in execution strategy become critical for performance.
It’s also worth noting that the timings reported by EXPLAIN ANALYZE primarily reflect execution within PostgreSQL itself, and do not always capture external factors such as client-side latency. This reinforces the idea that understanding query plans is more meaningful than focusing on raw timing alone.
Lesson 4: Databases don’t clean up after themselves
Another aspect I hadn’t fully appreciated before this project was database maintenance.
PostgreSQL uses a model where updates and deletes don’t immediately remove old data, but instead create “dead tuples”. Over time, these can accumulate and affect performance. This is why processes like VACUUM exist to reclaim space and keep the database efficient.
In practice, this is handled automatically by PostgreSQL’s autovacuum system, which runs in the background to maintain performance without manual intervention. But understanding what it does (and why it matters) gave me a much better sense of how databases behave over time and what it means to “maintain” a database.
Lesson 5: You only understand a system when it breaks
When I implemented partitioning, I hit an error I didn’t expect.
PostgreSQL uses inclusive lower bounds and exclusive upper bounds for partitions. A single date ('2024-01-01') didn’t fit anywhere and caused the insert to fail.
That’s not something you really appreciate until it breaks your code.
Final Thoughts
This project didn’t just improve my familiarity with PostgreSQL - it changed how I approach databases.
I no longer see them as tables to query, but as systems that need to be designed, maintained, and relied on over time. From schema decisions to backup strategies and performance considerations, everything is connected.
More than anything, it made me realise that understanding databases isn’t about knowing the syntax, it’s about understanding how they behave in practice.





