Week 1: Building the Dataset (Harder Than Expected)

Chemistry PhD transitioning into data engineering. Building data systems with Python, SQL, and end-to-end pipelines, while exploring databases, analytics, and data workflows.

We’re still figuring out what this project is going to become.

I’m working with Zoe (data scientist), and my role is to build the dataset — daily stock data for a selection of energy companies so we can explore how different parts of the sector behave. Oil, renewables, material providers (supply chain), emerging tech, etc. The goal is to get enough contrast to make the analysis interesting.

I assumed getting a test dataset for week 1 would be easy. I mean, it’s stock data — how hard can it be to access something so… accessible?

Well… as with everything else in data engineering, things didn’t quite work out on the first attempt.

First attempt: Yahoo Finance

Naturally, I started with Yahoo Finance.

It felt like the obvious choice — widely used, simple interface, and there’s a Python library (yfinance) that lets you pull data in a couple of lines. Ideal.

So I did what any reasonable person would do: I tried to download everything in one go.

And for a moment… it worked.

Then it didn’t.

Some tickers returned data, others didn’t. Some gave me empty DataFrames. Others came back with messages like “possibly delisted” — which is a bit alarming when you’re looking at companies like ExxonMobil.

At first, I thought it was something I was doing wrong. Maybe the tickers were off, maybe the request was too big, maybe I needed to tweak something in the parameters.

So I tried:

  • downloading tickers one by one

  • changing the request structure

  • checking the outputs more carefully

Same result. Inconsistent behaviour, partial data, and no clear explanation.
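For illustration, the one-by-one attempt could be sketched roughly like this. It is not the exact code from the project: `download_one_by_one` and the injected `fetch` callable are hypothetical names, and the yfinance call mentioned in the docstring is an assumed usage. The point is to record which tickers fail instead of letting them vanish silently.

```python
from typing import Callable

import pandas as pd


def download_one_by_one(tickers, fetch: Callable[[str], pd.DataFrame]):
    """Fetch each ticker separately and record which ones came back empty.

    `fetch` is any callable that takes a ticker and returns a DataFrame,
    for example (assumed usage):
        lambda t: yfinance.download(t, start="2019-01-01")
    """
    frames, failed = {}, []
    for ticker in tickers:
        try:
            df = fetch(ticker)
        except Exception as exc:  # network errors, bad symbols, and so on
            failed.append((ticker, repr(exc)))
            continue
        if df is None or df.empty:
            failed.append((ticker, "empty DataFrame"))
        else:
            frames[ticker] = df
    return frames, failed
```

Keeping the failures in a list makes it easy to see whether the same tickers fail on every run or whether the gaps move around, which is what points to the source rather than the code.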

At that point, it became clear this wasn’t a coding issue.

It was a data source reliability issue.

And that was the first small reality check of the project: just because a tool is popular doesn’t mean it’s dependable — especially when you start scaling beyond a quick test.

Second attempt: Alpha Vantage

So I moved on to something a bit more “serious”: Alpha Vantage.

This time, I decided to do things properly — no shortcuts, no black-box libraries. Just build the pipeline myself and understand what’s actually going on.

That meant:

  • constructing API requests manually

  • sending them with requests.get()

  • parsing JSON responses

  • turning them into a usable table
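A minimal sketch of those four steps, assuming the `TIME_SERIES_DAILY` function and the response layout I remember from the Alpha Vantage docs (a "Time Series (Daily)" object with keys like "1. open"). The helper names `fetch_daily` and `parse_daily` are made up for this example, and the API key is whatever you registered for.

```python
import pandas as pd

BASE_URL = "https://www.alphavantage.co/query"


def fetch_daily(symbol, api_key):
    """Request daily prices for one symbol and return the decoded JSON."""
    import requests  # imported here so parse_daily can be used without it

    params = {"function": "TIME_SERIES_DAILY", "symbol": symbol, "apikey": api_key}
    response = requests.get(BASE_URL, params=params, timeout=30)
    response.raise_for_status()
    return response.json()


def parse_daily(payload, symbol):
    """Turn the nested JSON into a flat table with one row per trading day."""
    series = payload.get("Time Series (Daily)")
    if series is None:
        # Error and rate-limit responses arrive as JSON without this key.
        raise ValueError(f"no daily series for {symbol}: {list(payload)}")
    df = pd.DataFrame.from_dict(series, orient="index").astype(float)
    # Keys look like "1. open"; keep only the name after the numeric prefix.
    df.columns = [col.split(". ", 1)[1] for col in df.columns]
    df.index = pd.to_datetime(df.index)
    df = df.sort_index().rename_axis("date").reset_index()
    df.insert(0, "ticker", symbol)
    return df
```

Looping over tickers is then just `pd.concat(parse_daily(fetch_daily(t, key), t) for t in tickers)`, with a pause between requests to stay under the rate limit.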

And honestly, this part was fun. It felt like I finally understood what tools like yfinance were doing behind the scenes.

For the first time, everything worked exactly as expected:

  • I could pull the data

  • reshape it

  • clean it

  • loop over multiple tickers

It was clean, controlled, and — most importantly — reproducible.

So naturally, I thought: great, problem solved.

Then I checked the data.

About 100 rows per ticker.

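Which, for a project that’s supposed to include daily stock data since 2019, is… not right. With roughly 250 trading days a year, that history should run to a thousand rows or more per ticker, not a hundred.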

Turns out the free version of Alpha Vantage:

  • only returns a limited number of recent observations

  • has strict rate limits

  • and locks full historical data behind a paid plan
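A check like the one below would have caught this straight away. It is only a sketch: `coverage_report`, the `date` column name, and the one-week tolerance (to allow for holidays around the start date) are my assumptions for the example.

```python
import pandas as pd


def coverage_report(df, expected_start, date_col="date", tolerance_days=7):
    """Summarise how much history a table holds relative to the expected start."""
    dates = pd.to_datetime(df[date_col])
    first = dates.min()
    # Allow a few days of slack: markets are closed on the first days of a year.
    latest_ok = pd.Timestamp(expected_start) + pd.Timedelta(days=tolerance_days)
    return {
        "rows": len(df),
        "first_date": first,
        "last_date": dates.max(),
        "covers_start": bool(first <= latest_ok),
    }
```

Running it on every ticker right after download turns "the code ran" into "the data actually goes back to 2019".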

So technically, I had built a working pipeline. But practically, I didn’t have the data I needed. And that was the second, slightly bigger realisation:

working code doesn’t mean you have usable data.

At that point, it felt like I was solving the wrong problem. The pipeline was fine — the source wasn’t.

The turning point: Stooq

I stopped trying to force APIs and took a step back. The problem wasn’t my code anymore. It was the data source.

So instead of asking “which API should I use?”, I asked a simpler question:

Where can I actually get the data I need?

That’s how I ended up on Stooq.

On paper, it was exactly what I was looking for:

  • full historical data

  • daily frequency

  • simple tabular format

No rate limits, no JSON parsing, no API keys (at least in theory).

The only catch: the interface is… not great. And it's in Polish.

At first, I ran into:

  • confusing buttons

  • misleading “Download” links (ads everywhere)

  • and a couple of attempts where I accidentally tried to read HTML as CSV (which, unsurprisingly, did not go well)

But once I figured out how to access the actual CSV endpoint directly, everything became a lot simpler.
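To avoid repeating the HTML-as-CSV mistake, a small guard in front of the parser helps. This is a sketch under assumptions: the URL pattern and the `Date`/`Close` column names are what I believe Stooq's daily CSV export uses, and `read_price_csv` is a hypothetical helper name.

```python
import io

import pandas as pd

# Assumed endpoint pattern: symbols like "xom.us", i=d for daily frequency.
STOOQ_CSV_URL = "https://stooq.com/q/d/l/?s={symbol}&i=d"


def read_price_csv(text):
    """Parse downloaded text as a price CSV, refusing anything that looks like HTML."""
    head = text.lstrip()[:200].lower()
    if head.startswith("<") or "<html" in head:
        raise ValueError("response looks like HTML, not CSV")
    df = pd.read_csv(io.StringIO(text))
    # Assumed header: Date,Open,High,Low,Close,Volume
    if "Date" not in df.columns or "Close" not in df.columns:
        raise ValueError(f"unexpected columns: {list(df.columns)}")
    return df
```

The same function works on a saved file (`read_price_csv(path.read_text())`) or on the body of a response fetched from the URL above.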

The solution

I downloaded the data for each ticker from Stooq and built a clean ingestion pipeline on top of it.

The workflow was straightforward:

  • store all CSVs in a folder

  • read them programmatically

  • extract the ticker from the filename

  • combine everything into a single dataset

  • clean and structure the data
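The workflow above can be sketched as follows. The filename convention is an assumption (something like `xom_us_d.csv`, where the ticker is the part before the first dot or underscore), and `load_folder` is a name invented for the example.

```python
import re
from pathlib import Path

import pandas as pd


def load_folder(folder):
    """Read every CSV in `folder` and stack them into one long table."""
    frames = []
    for path in sorted(Path(folder).glob("*.csv")):
        df = pd.read_csv(path)
        # Assumed naming: "xom_us_d.csv" or "xom.us.csv" -> "XOM".
        ticker = re.split(r"[._]", path.stem)[0].upper()
        df.insert(0, "ticker", ticker)
        frames.append(df)
    if not frames:
        raise FileNotFoundError(f"no CSV files found in {folder}")
    combined = pd.concat(frames, ignore_index=True)
    # Lower-case column names so later steps do not depend on the source's casing.
    combined.columns = [col.lower() for col in combined.columns]
    return combined
```

One caveat with this naming rule: a ticker that itself contains a dot or underscore would be cut short, so the filenames need to stay simple.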

From there, I added a couple of things to make it analysis-ready:

  • converted dates to proper datetime format

  • sorted the data by ticker and date

  • mapped each company to a sector

  • computed daily returns
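Those four steps might look like the sketch below. The sector mapping is a hypothetical stand-in (the real company list is longer), the column names `ticker`, `date` and `close` are assumed, and the return is the simple one-day percentage change in the close.

```python
import pandas as pd

# Hypothetical mapping for illustration; not the project's actual list.
SECTORS = {"XOM": "oil", "NEE": "renewables", "QS": "emerging tech"}


def make_analysis_ready(df, sectors):
    """Parse dates, sort, attach sectors, and add simple daily returns."""
    out = df.copy()
    out["date"] = pd.to_datetime(out["date"])
    out = out.sort_values(["ticker", "date"]).reset_index(drop=True)
    out["sector"] = out["ticker"].map(sectors)
    # close_t / close_{t-1} - 1, with the previous close taken within each
    # ticker so the first row of one company never uses another's price.
    previous_close = out.groupby("ticker")["close"].shift(1)
    out["daily_return"] = out["close"] / previous_close - 1
    return out
```

Sorting before computing returns matters: the shift relies on rows being in date order within each ticker, and the first row per ticker is left as NaN on purpose.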

I ran checks at each step to make sure the data looked right.

One of them was simply counting the number of rows per ticker, just to confirm that each download had worked properly. That’s when I noticed something interesting: not all companies had the same history length.

QuantumScape, for example, had fewer observations. I double-checked, and it made sense — it’s a newer company and only became publicly listed later on.
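The row-count check is easy to reproduce. Here is one possible version, assuming `ticker` and `date` columns; `history_summary` and the 90% threshold for flagging a short history are choices made for this example, not something the project prescribes.

```python
import pandas as pd


def history_summary(df, min_fraction=0.9):
    """Count rows per ticker and flag tickers with noticeably shorter histories."""
    summary = (
        df.groupby("ticker")["date"]
        .agg(rows="count", first_date="min", last_date="max")
        .sort_values("rows")
    )
    # Flag anything with fewer than `min_fraction` of the longest history.
    summary["short_history"] = summary["rows"] < min_fraction * summary["rows"].max()
    return summary
```

A flagged ticker is not necessarily a failed download. The first date in the summary is what tells a late listing apart from a truncated file.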

At that stage, I finally had what I needed: a clean, consistent dataset covering every company from 2019 (or from its listing date, for later arrivals like QuantumScape) to today.

What I actually learned

The main takeaway from this week wasn’t about pandas or Python. It was realising that building the pipeline was only part of the job.

In this case, the harder part was finding a data source that actually fit what I needed — reliable, complete, and usable at the scale I had in mind.

Once that was in place, the rest of the pipeline became relatively straightforward.

What's next

The dataset is now ready for analysis. The current version automates processing once the raw files are available, but the extraction step is still manual. A later iteration should replace that with a fully automatable data source.

Next week, Zoe will start working with it, and we’ll see what kind of questions come out of it — and whether I need to refine or extend the pipeline.

Figuring Out a Market Analysis Project

Part 1 of 1

A week-by-week log of building a market analysis project from scratch — a data engineer and a data scientist collaborating, figuring things out as we go.