Impact, Use Cases, and How to Share your Work
Kapitalo Investimentos
It was 2015, and I was pursuing a Master’s degree in Economics.
At the time, my plan was to build an academic career.
Teachers went on strike!
My first thought: spend all day at the beach.
As tempting as it was, I needed to be responsible.
Back then, programming was mostly something for CS people.
Free content was scarce, and worst of all — no ChatGPT! Luckily, Stack Overflow had my back.
I’m currently leading the Data Science team within the Economic Research department at Kapitalo.
My background is in applying quantitative methods and data science to macroeconomic research.
Data available in (almost) real time.
Greater granularity.
But often messy!
With the right skills, however, we can extract value (we’ll see more later).
The COVID period accelerated the adoption of these sources.
Adoption of methods like Random Forest, LASSO, XGBoost, etc.
Advantages: variable selection, flexible functional form, nonlinearity, and improved accuracy.
Ideal for large datasets.
Traditional (closed-source) software is slow to incorporate these methods.
New methods are mostly developed in open-source languages — typically Python and/or R (more on this later).
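As a minimal illustration of what these methods look like in practice (the data below are simulated, not taken from the actual research), a LASSO with a cross-validated penalty can be fit in a few lines of R with the glmnet package:

```r
library(glmnet)

# Simulated example: many candidate predictors, only a few of them relevant
set.seed(123)
x <- matrix(rnorm(100 * 50), nrow = 100, ncol = 50)  # 50 candidate indicators
y <- x[, 1] - 0.5 * x[, 2] + rnorm(100)              # target depends on only two of them

# LASSO (alpha = 1) with cross-validation to choose the penalty
cv_fit <- cv.glmnet(x, y, alpha = 1)

# Most coefficients are shrunk exactly to zero: built-in variable selection
coef(cv_fit, s = "lambda.min")
```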
Sample of over 5,000 papers in the economics and finance literature between 1990 and 2021 containing selected keywords.
Source: Warin & Stojkov (2021), Journal of Financial Risk Management.
A series of copy-and-paste steps.
Limited ability to scale.
Reproducibility issues are common.
Source: https://r4ds.had.co.nz/introduction.html
Integrated environments create direct and efficient communication between tasks.
Continuous updates and package releases bring new features and improvements.
Better error handling and debugging.
Open-source software usually runs smoothly on any OS.
Better forecasting methods and the ability to handle more data sources → improved accuracy.
Automation: models, reports, and scenario revisions.
Scale: analyses can be easily expanded to cover more countries/industries.
But creating an automation is one thing; making sure everything runs smoothly every day is another.
Production is often seen as a distant ideal — built for big infrastructure, not simple scripts.
But as Alex K. Gold brilliantly puts it: “If you’re a data scientist putting your work in front of someone else’s eyes, you’re in production.”
Packages are updated frequently.
Sometimes, functions from a previous version behave differently or are no longer available.
We need to make sure that the user (or our future self) can restore the same package versions we used when writing the code.
The renv package was built exactly for this purpose.
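A minimal sketch of the usual renv workflow (run inside the project, with default settings assumed):

```r
# Create a project-local library and lockfile
renv::init()

# After installing or updating packages, record the exact versions in renv.lock
renv::snapshot()

# On another machine (or for our future self), reinstall those exact versions
renv::restore()
```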
When scraping data from websites, for example, we often need to interact with a web browser and its extensions. If something changes, the workflow may fail.
In such cases, Docker is the go-to tool. It lets us build a self-contained environment, called an image, with the desired OS and all necessary software.
This is also useful for hosting applications like Shiny apps and Plumber APIs.
An additional benefit is that most cloud providers support deploying applications directly from Docker images.
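As an illustrative sketch only (the base image, file names, and port are assumptions, not taken from an actual project), a Dockerfile serving a Plumber API from a renv-managed project could look like this:

```dockerfile
# Base image with a fixed R version
FROM rocker/r-ver:4.3.2

# Restore the package versions recorded in the renv lockfile
RUN R -e "install.packages('renv')"
COPY renv.lock renv.lock
RUN R -e "renv::restore()"

# Copy the API code and expose it on port 8000
COPY api.R api.R
EXPOSE 8000
CMD ["R", "-e", "plumber::plumb('api.R')$run(host = '0.0.0.0', port = 8000)"]
```

Once built, the same image can be deployed unchanged to any cloud provider that accepts Docker images.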
Processes unrelated to the incoming data are left unchanged. This is critical for efficiency.
Numbers: > 60 variables; 18 equations; > 100 targets. Handling this manually is inefficient, error-prone, and doesn’t scale well.
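One common way to get this behavior in R is a pipeline tool such as the targets package; the sketch below is illustrative (the package choice, file names, and model are assumptions, not the actual production pipeline):

```r
# _targets.R: each step is a target; only outdated targets are rebuilt
library(targets)
tar_option_set(packages = c("readr"))

list(
  tar_target(raw_file, "data/cpi.csv", format = "file"),  # tracked input file
  tar_target(raw_data, readr::read_csv(raw_file)),        # re-runs only if the file changes
  tar_target(model, lm(cpi ~ ., data = raw_data)),        # re-runs only if raw_data changes
  tar_target(fcast, predict(model))                       # otherwise served from cache
)
```

Running targets::tar_make() then rebuilds only the targets whose upstream dependencies have changed.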
FGV/EAESP - May 21st, 2025. Slides available at: http://eaesp2025.rleripio.com. Built with Quarto.