Reimagining the Codebase
Dear Data Teams,
In a world where you’re expected to derive insights, support business logic, and drive innovation, we understand the unique challenges you face with existing codebases.
Data teams need specialized platforms for their code processes. JupyterHub offers a dedicated IDE for data teams. However, no suitable codebase exists for what happens to the code after it leaves the IDE.
We believe that data teams need a dedicated codebase, because accumulating code-related know-how within a data team is very difficult. After running many experiments, you still have to go through many extra steps (configuration, packaging, etc.) just to send the results of your work to the codebase in a reusable form.
Through hundreds of interviews, we discovered that data teams struggle with collaborative work. The reason is not poor development skills, but the fact that the codebase systems they use were designed for developers.
Based on these interviews, we developed principles for a new codebase for data teams:
It should be IDE-agnostic
The codebase should support the .ipynb notebooks that data teams already work in, without requiring any additional tools. Data teams should be able to share their functions without changing their working environment.
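To make the idea of sharing functions straight from a notebook more concrete, here is a minimal sketch of how a codebase might discover the functions defined in a notebook. It relies only on the fact that an .ipynb file is a JSON document with a list of cells. The function name functions_in_notebook and the sample notebook are hypothetical and exist only for illustration; this is not a description of any existing product.

```python
import ast
import json


def functions_in_notebook(notebook_json: str) -> list[str]:
    """Return the names of top-level functions defined in a notebook's code cells."""
    notebook = json.loads(notebook_json)
    names = []
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") != "code":
            continue
        source = cell.get("source", "")
        # The notebook format stores source as a single string or a list of lines.
        if isinstance(source, list):
            source = "".join(source)
        try:
            tree = ast.parse(source)
        except SyntaxError:
            # Cells containing IPython magics (%, !) are not plain Python; skip them.
            continue
        names.extend(
            node.name for node in tree.body if isinstance(node, ast.FunctionDef)
        )
    return names


# A tiny hypothetical notebook, written inline so the sketch is self-contained.
example_notebook = json.dumps({
    "cells": [
        {"cell_type": "markdown", "source": ["# Cleaning experiments\n"]},
        {"cell_type": "code", "source": ["def clean(x):\n", "    return x.strip()\n"]},
        {"cell_type": "code", "source": "%matplotlib inline\n"},
        {"cell_type": "code", "source": "def score(a, b):\n    return a + b\n"},
    ],
    "nbformat": 4,
    "nbformat_minor": 5,
})

print(functions_in_notebook(example_notebook))
```

The point of the sketch is that the notebook itself stays untouched: the codebase reads it as data, so the team does not need to convert it to a script or switch tools before sharing.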
It should be easy to use
Traditional codebases are made for developers, and working collaboratively on them requires substantial development knowledge. However, the main job of data teams is to extract insights from the right data and support business logic. Data teams are not developers; they only need to do development as part of their job. We believe that the codebase should inherently support collaborative work. By default, it should handle configuration, packaging, and many other tasks in the background.
Code should be stored in a portable, shareable, and reusable way
Each piece of software should be able to run on its own and be easily transferred. The codebase should automatically handle configuration and package dependencies when code is sent to it, and when that code is loaded by someone else, it should work just like a dockerized component. This way, data teams can easily share and reuse their functions across different environments.
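One way to picture a portable unit of code is a small bundle that carries the source together with everything needed to reuse it. The sketch below is an assumption about what such a bundle could contain, not a specification: the pack and load functions and the manifest fields (entry_point, python, requirements, sha256, source) are hypothetical names. It also does not install dependencies or isolate execution, which a real system would need to do to behave like a dockerized component.

```python
import hashlib
import json
import sys


def pack(source: str, entry_point: str, requirements: list[str]) -> str:
    """Bundle source code with the metadata needed to reuse it elsewhere."""
    manifest = {
        "entry_point": entry_point,
        "python": f"{sys.version_info.major}.{sys.version_info.minor}",
        "requirements": sorted(requirements),
        # The hash lets the receiver detect a modified or corrupted bundle.
        "sha256": hashlib.sha256(source.encode("utf-8")).hexdigest(),
        "source": source,
    }
    return json.dumps(manifest, indent=2)


def load(bundle: str):
    """Verify a bundle and return its entry-point function."""
    manifest = json.loads(bundle)
    digest = hashlib.sha256(manifest["source"].encode("utf-8")).hexdigest()
    if digest != manifest["sha256"]:
        raise ValueError("bundle source does not match its recorded hash")
    namespace = {}
    # exec runs arbitrary code: only acceptable here for trusted bundles.
    exec(manifest["source"], namespace)
    return namespace[manifest["entry_point"]]


# Hypothetical usage: one teammate packs a function, another loads it.
bundle = pack("def double(x):\n    return 2 * x\n", "double", ["numpy>=1.26"])
double = load(bundle)
print(double(21))  # 42
```

Keeping the declared requirements and Python version next to the source is what lets the receiving side recreate a matching environment instead of guessing.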
The codebase should be able to present the data it holds in a meaningful way
The codebase should be able to store, and make visible, user statistics, static and dynamic dependency analyses, and hardware data collected at runtime.
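As a small illustration of what a static dependency analysis could look like, the sketch below collects the top-level modules a piece of Python code imports, without running it. The function name imported_modules is hypothetical, and this covers only plain import statements; dynamic imports and runtime behavior would need the dynamic analysis mentioned above.

```python
import ast


def imported_modules(source: str) -> set[str]:
    """Return the top-level module names imported by the given source code."""
    modules = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            for alias in node.names:
                # "import os.path" depends on the top-level package "os".
                modules.add(alias.name.split(".")[0])
        elif isinstance(node, ast.ImportFrom):
            # level > 0 means a relative import, which is not an external dependency.
            if node.level == 0 and node.module:
                modules.add(node.module.split(".")[0])
    return modules


sample = (
    "import numpy as np\n"
    "import os.path\n"
    "from sklearn.linear_model import LinearRegression\n"
    "from . import helpers\n"
)
print(sorted(imported_modules(sample)))  # ['numpy', 'os', 'sklearn']
```

Because the analysis works on the syntax tree alone, it can run on every piece of code the codebase receives, even when the dependencies themselves are not installed.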
The codebase should host artificial intelligence agents
The codebase should have an internal LLM and teams of agents that execute various tasks, automate processes, and improve code. The codebase should also have an internal playground environment, where both the agents and developers can test and try out new ideas. Finally, the codebase should be able to write, in a fully automated manner, the explanations that humans normally have to provide (commit messages and release notes), and to establish the relationships between them.