Building a Market Analysis Project with Python

We’re still figuring out what this project is going to become.

I’m working with Zoe (data scientist), and my role is to build the dataset — daily stock data for a selection of energy companies so we can explore how different parts of the sector behave. Oil, renewables, material providers (supply chain), emerging tech, etc. The goal is to get enough contrast to make the analysis interesting.

I assumed getting a test dataset for week 1 would be easy. I mean, it’s stock data — how hard can it be to access something so… accessible?

Well… as with everything else in data engineering, things didn’t quite work out on the first attempt.

First attempt: Yahoo Finance

Naturally, I started with Yahoo Finance.

It felt like the obvious choice — widely used, simple interface, and there’s a Python library (yfinance) that lets you pull data in a couple of lines. Ideal.

So I did what any reasonable person would do: I tried to download everything in one go.

And for a moment… it worked.

Then it didn’t.

Some tickers returned data, others didn’t. Some gave me empty DataFrames. Others came back with messages like “possibly delisted” — which is a bit alarming when you’re looking at companies like ExxonMobil.

At first, I thought it was something I was doing wrong. Maybe the tickers were off, maybe the request was too big, maybe I needed to tweak something in the parameters.

So I tried:

downloading tickers one by one
changing the request structure
checking the outputs more carefully

Same result. Inconsistent behaviour, partial data, and no clear explanation.
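For reference, here is a minimal sketch of what the one-by-one check could look like. It assumes `yf.download` as the fetcher, which the post doesn't show. The helper names (`fetch_with_yfinance`, `split_by_outcome`) are made up for illustration.

```python
from typing import Callable, Iterable


def fetch_with_yfinance(ticker: str, start: str = "2019-01-01"):
    """Download daily prices for one ticker (assumes yfinance is installed)."""
    import yfinance as yf  # imported lazily so the helper below works without it

    return yf.download(ticker, start=start, progress=False)


def split_by_outcome(tickers: Iterable[str], fetch: Callable[[str], object]):
    """Fetch tickers one at a time and record which ones returned usable rows."""
    ok, empty, failed = [], [], {}
    for ticker in tickers:
        try:
            frame = fetch(ticker)
        except Exception as exc:
            # Keep going so one bad ticker doesn't hide the others
            failed[ticker] = repr(exc)
            continue
        # Empty results are the silent failure mode: no error, but no rows either
        if frame is None or len(frame) == 0:
            empty.append(ticker)
        else:
            ok.append(ticker)
    return ok, empty, failed
```

Calling `split_by_outcome(["XOM", "QS"], fetch_with_yfinance)` a few times in a row is a quick way to see whether the failures follow the tickers or the source: if the `empty` list changes between runs, the code probably isn't the problem.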

At that point, it became clear this wasn’t a coding issue.

It was a data source reliability issue.

And that was the first small reality check of the project: just because a tool is popular doesn’t mean it’s dependable — especially when you start scaling beyond a quick test.

Second attempt: Alpha Vantage

So I moved on to something a bit more “serious”: Alpha Vantage.

This time, I decided to do things properly — no shortcuts, no black-box libraries. Just build the pipeline myself and understand what’s actually going on.

That meant:

constructing API requests manually
sending them with requests.get()
parsing JSON responses
turning them into a usable table
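The four steps above can be sketched roughly like this. It is a simplified version, not the exact code from the project: it assumes the `TIME_SERIES_DAILY` function with `outputsize=compact`, and the function names are mine.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.alphavantage.co/query"


def build_daily_url(symbol: str, api_key: str) -> str:
    """Construct the request URL for the daily time series."""
    params = {
        "function": "TIME_SERIES_DAILY",
        "symbol": symbol,
        "outputsize": "compact",
        "apikey": api_key,
    }
    return f"{BASE_URL}?{urlencode(params)}"


def parse_daily(payload: dict, symbol: str) -> list[dict]:
    """Flatten the nested JSON into one row per trading day, oldest first."""
    series = payload.get("Time Series (Daily)")
    if series is None:
        # Errors and rate-limit notices arrive as JSON without the series key
        raise ValueError(f"No daily series for {symbol}: keys={list(payload)}")
    rows = []
    for day, values in sorted(series.items()):
        rows.append({
            "ticker": symbol,
            "date": day,
            "open": float(values["1. open"]),
            "high": float(values["2. high"]),
            "low": float(values["3. low"]),
            "close": float(values["4. close"]),
            "volume": int(values["5. volume"]),
        })
    return rows


def fetch_daily(symbol: str, api_key: str) -> list[dict]:
    """Send the request with requests.get() and parse the response."""
    import requests  # third-party; imported here so the parser runs without it

    response = requests.get(build_daily_url(symbol, api_key), timeout=30)
    response.raise_for_status()
    return parse_daily(response.json(), symbol)
```

The list of dictionaries returned by `parse_daily` can go straight into `pandas.DataFrame(rows)` to get the usable table. Keeping the parser separate from the network call also means it can be tested on a saved response without spending any API quota.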

And honestly, this part was fun. It felt like I finally understood what tools like yfinance were doing behind the scenes.

For the first time, everything worked exactly as expected:

I could pull the data
reshape it
clean it
loop over multiple tickers

It was clean, controlled, and — most importantly — reproducible.

So naturally, I thought: great, problem solved.

Then I checked the data.

About 100 rows per ticker.

Which, for a project that’s supposed to include daily stock data since 2019, is… not right.

Turns out the free version of Alpha Vantage:

only returns a limited number of recent observations
has strict rate limits
and locks full historical data behind a paid plan

So technically, I had built a working pipeline. But practically, I didn’t have the data I needed. And that was the second, slightly bigger realisation:

working code doesn’t mean you have usable data.
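In hindsight, a small coverage check right after the first download would have caught this immediately. Here is a sketch of the idea, using only the standard library. The function name, the input shape (ISO date strings per ticker) and the seven-day tolerance are assumptions for illustration.

```python
from datetime import date, timedelta


def coverage_report(dates_by_ticker: dict, required_start: date, tolerance_days: int = 7) -> dict:
    """Report row count, first date, and whether each ticker reaches back far enough."""
    # Allow a few days of slack: markets are closed on weekends and holidays
    latest_acceptable = required_start + timedelta(days=tolerance_days)
    report = {}
    for ticker, dates in dates_by_ticker.items():
        parsed = sorted(date.fromisoformat(d) for d in dates)
        first = parsed[0] if parsed else None
        report[ticker] = {
            "rows": len(parsed),
            "first_date": first,
            "covers_start": first is not None and first <= latest_acceptable,
        }
    return report
```

With a required start of `date(2019, 1, 1)`, a ticker that only has about 100 recent rows comes back with `covers_start` set to `False`, however cleanly the request itself ran.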

At that point, it felt like I was solving the wrong problem. The pipeline was fine — the source wasn’t.

The turning point: Stooq

I stopped trying to force APIs and took a step back. The problem wasn’t my code anymore. It was the data source.

So instead of asking “which API should I use?”, I asked a simpler question:

Where can I actually get the data I need?

That’s how I ended up on Stooq.

On paper, it was exactly what I was looking for:

full historical data
daily frequency
simple tabular format

No rate limits, no JSON parsing, no API keys (at least in theory).

The only catch: the interface is… not great. And it's in Polish.

At first, I ran into:

confusing buttons
misleading “Download” links (ads everywhere)
and a couple of attempts where I accidentally tried to read HTML as CSV (which, unsurprisingly, did not go well)

But once I figured out how to access the actual CSV endpoint directly, everything became a lot simpler.
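A rough sketch of what going to the CSV endpoint directly can look like, with a guard against the HTML-as-CSV mistake. The URL pattern and the `Date,` header check are assumptions based on how the download link appeared to work, so verify both against the site before relying on them. The function names are illustrative.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed endpoint pattern for the direct CSV download
STOOQ_CSV_URL = "https://stooq.com/q/d/l/"


def stooq_csv_url(symbol: str) -> str:
    """Build the direct CSV link for a daily series, e.g. symbol 'xom.us'."""
    return f"{STOOQ_CSV_URL}?{urlencode({'s': symbol.lower(), 'i': 'd'})}"


def looks_like_price_csv(text: str) -> bool:
    """True only if the first line looks like a CSV header starting with Date."""
    lines = text.lstrip("\ufeff").strip().splitlines()
    first_line = lines[0].strip() if lines else ""
    return first_line.startswith("Date,")


def download_csv(symbol: str) -> str:
    """Fetch the CSV text and fail loudly if an HTML page came back instead."""
    with urlopen(stooq_csv_url(symbol), timeout=30) as response:
        text = response.read().decode("utf-8", errors="replace")
    if not looks_like_price_csv(text):
        raise ValueError(f"{symbol}: response is not a price CSV (starts with {text[:40]!r})")
    return text
```

The header check is cheap, and it turns a confusing parsing error later on into a clear message at download time.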

The solution

I downloaded the data for each ticker from Stooq and built a clean ingestion pipeline on top of it.

The workflow was straightforward:

store all CSVs in a folder
read them programmatically
extract the ticker from the filename
combine everything into a single dataset
clean and structure the data
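In pandas, that workflow fits in one short function. This is a sketch rather than the exact project code: the file naming scheme (`xom_us_d.csv`) and the function name are assumptions.

```python
from pathlib import Path

import pandas as pd


def load_folder(folder: str) -> pd.DataFrame:
    """Read every CSV in `folder` and stack them, tagging rows with the ticker."""
    frames = []
    for path in sorted(Path(folder).glob("*.csv")):
        frame = pd.read_csv(path)
        # Normalise headers: "Date" -> "date", "Close" -> "close", ...
        frame.columns = [col.strip().lower() for col in frame.columns]
        # Assumed naming scheme: "xom_us_d.csv" or "xom.us.csv" -> "XOM"
        frame["ticker"] = path.stem.replace(".", "_").split("_")[0].upper()
        frames.append(frame)
    if not frames:
        raise FileNotFoundError(f"No CSV files found in {folder}")
    return pd.concat(frames, ignore_index=True)
```

Taking the ticker from the filename means the raw files never need editing. Adding a company is just a matter of dropping one more CSV into the folder.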

From there, I added a couple of things to make it analysis-ready:

converted dates to proper datetime format
sorted the data by ticker and date
mapped each company to a sector
computed daily returns
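A minimal sketch of those four steps, assuming a combined table with `ticker`, `date` and `close` columns. The sector mapping below is a placeholder, not the real one used in the project.

```python
import pandas as pd

# Placeholder mapping for illustration only
SECTOR_BY_TICKER = {"XOM": "oil", "QS": "emerging tech"}


def make_analysis_ready(prices: pd.DataFrame, sectors: dict) -> pd.DataFrame:
    """Parse dates, sort, attach sectors and compute simple daily returns."""
    out = prices.copy()
    out["date"] = pd.to_datetime(out["date"])
    out = out.sort_values(["ticker", "date"]).reset_index(drop=True)
    out["sector"] = out["ticker"].map(sectors)
    # Compute returns within each ticker so the first row of one company
    # never uses the last close of another
    out["daily_return"] = out.groupby("ticker")["close"].pct_change()
    return out
```

The first row of each ticker has no previous close, so its `daily_return` is `NaN`. That is expected, and it is safer to leave it that way than to fill it with zero.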

I ran checks at each step to make sure the data looked right.

One of them was simply counting the number of rows per ticker, just to confirm that each download had worked properly. That’s when I noticed something interesting: not all companies had the same history length.

QuantumScape, for example, had fewer observations. I double-checked, and it made sense — it’s a newer company and only became publicly listed later on.
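The row-count check can be as small as this. It is a sketch that assumes the combined table has `ticker` and `date` columns; the function name and the `shorter_history` flag are mine.

```python
import pandas as pd


def history_summary(prices: pd.DataFrame) -> pd.DataFrame:
    """One row per ticker: number of observations, first and last date."""
    summary = prices.groupby("ticker")["date"].agg(
        rows="count", first_date="min", last_date="max"
    )
    # Flag tickers whose history is shorter than the longest one in the set
    summary["shorter_history"] = summary["rows"] < summary["rows"].max()
    return summary.sort_values("rows")
```

Sorting by row count puts the shortest histories at the top. Looking at `first_date` next to the count is what separates a late listing from a download that failed halfway.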

At that stage, I finally had what I needed: a clean, consistent dataset covering each company from 2019 (or from its listing date, if later) to today.

What I actually learned

The main takeaway from this week wasn’t about pandas or Python. It was realising that building the pipeline was only part of the job.

In this case, the harder part was finding a data source that actually fit what I needed — reliable, complete, and usable at the scale I had in mind.

Once that was in place, the rest of the pipeline became relatively straightforward.

What's next

The dataset is now ready for analysis. The current version automates processing once the raw files are available, but the extraction step is still manual. A later iteration should replace that with a fully automatable data source.

Next week, Zoe will start working with it, and we’ll see what kind of questions come out of it — and whether I need to refine or extend the pipeline.

Week 1: Building the Dataset (Harder Than Expected)
