When working on a data science project, we usually start with raw data that must be preprocessed before it can be fed into a model. The data file might be large, and you probably don’t store it locally. You may have written various functions to transform the data, e.g. a(), b(), c() and d(), where each produces a dataset that is the input to the next function, and d() produces the final dataset you want to use.
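As a minimal sketch of this chain, here is the pattern in plain Python. The function bodies and the list-based "dataset" are illustrative assumptions, not part of any real pipeline:

```python
def a(raw):
    # e.g. drop invalid records
    return [x for x in raw if x is not None]

def b(data):
    # e.g. scale values
    return [x * 2 for x in data]

def c(data):
    # e.g. derive a feature
    return [x + 1 for x in data]

def d(data):
    # e.g. final ordering into the dataset we train on
    return sorted(data)

raw = [3, None, 1, 2]
final = d(c(b(a(raw))))
print(final)  # [3, 5, 7]
```

The problem discussed below is exactly this hard-wired composition: there is no cheap way to reuse just one intermediate result.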
[TODO: insert a diagram of the pipeline: raw data → a() → b() → c() → d().]
Option I
Now maybe a colleague wants to use the output of c() and then do their own processing.
You could take an approach similar to cookiecutter: create a folder for each intermediate output and write code that checks whether the expected data is already in the corresponding folder, and creates it if not. This implies you might need to go back all the way to the raw data. That is a lot of error-prone boilerplate code that will keep you busy for a while.
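To see why this gets tedious, here is a sketch of that folder-checking boilerplate. The folder layout, the `cached` helper, and the transformation bodies are all hypothetical; a temporary directory stands in for a real data folder:

```python
from pathlib import Path
import json
import tempfile

BASE = Path(tempfile.mkdtemp())  # stand-in for the project's data folder

def cached(name, build):
    """Load `name` from its folder if present, otherwise build and store it."""
    path = BASE / name / "data.json"
    if path.exists():
        return json.loads(path.read_text())
    result = build()
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(result))
    return result

def output_a():
    return cached("a", lambda: [1, 2, 3])  # pretend: cleaned raw data

def output_b():
    return cached("b", lambda: [x * 2 for x in output_a()])

def output_c():
    return cached("c", lambda: [x + 1 for x in output_b()])

print(output_c())  # recomputes a and b only if their folders are empty
```

Every new intermediate output means another folder, another cache check, and another place where stale data can silently survive a code change.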
Better: use Hamilton together with DVC. https://www.tryhamilton.dev/intro
conda install -c conda-forge sf-hamilton
(If you want the visualization extra for rendering the DAG, it is available via pip: pip install "sf-hamilton[visualization]".)
conda install -c conda-forge mamba
Official docs: hamilton.dagworks.io (the most comprehensive guide).
The Hub: hub.dagworks.io (code examples of how people use it for LLMs or pandas).
Slack community: most Hamilton users hang out in the dedicated Slack; since the project is incubating, the creators usually answer questions there personally.
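The core idea of Hamilton is that you express the pipeline as plain functions: each function is named after the dataset it produces, and its parameter names declare the upstream datasets it depends on, so the DAG is built for you. The following is a sketch under assumed names (cleaned_data, scaled_data, etc. are illustrative, not from the original pipeline); the driver call requires sf-hamilton to be installed and is not executed here:

```python
def cleaned_data(raw_data: list) -> list:
    """Like a(): drop invalid records."""
    return [x for x in raw_data if x is not None]

def scaled_data(cleaned_data: list) -> list:
    """Like b(): scale values."""
    return [x * 2 for x in cleaned_data]

def featurized_data(scaled_data: list) -> list:
    """Like c(): derive a feature."""
    return [x + 1 for x in scaled_data]

def final_data(featurized_data: list) -> list:
    """Like d(): the final dataset."""
    return sorted(featurized_data)

def run_with_hamilton():
    # Sketch only -- requires `pip install sf-hamilton`.
    # Hamilton wires the functions above into a DAG by matching
    # parameter names to function names, so a colleague can request
    # just `featurized_data` instead of `final_data`.
    import sys
    from hamilton import driver
    dr = driver.Driver({}, sys.modules[__name__])
    return dr.execute(["featurized_data"], inputs={"raw_data": [3, None, 1, 2]})
```

Because requesting any node resolves only its upstream dependencies, the "colleague wants the output of c()" case needs no extra code at all.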
DVC install instructions: https://dvc.org/doc/install/linux#install-with-conda
Point DVC at your external drive as the default remote, then push the tracked data there:
dvc remote add -d local_storage /path/to/your/external/hdd
dvc push
Register the processing script as a DVC stage, declaring its code dependencies (-d) and its output (-o); afterwards dvc repro re-runs the stage only when a dependency has changed:
dvc stage add -n process_data \
  -d data_pipeline/run_pipeline.py \
  -d data_pipeline/transformations.py \
  -o data/processed.csv \
  python data_pipeline/run_pipeline.py
Option II
Another option, when using TensorFlow, is to make the preprocessing a layer of the model itself (e.g. with Keras preprocessing layers), so the exact same transformations are applied at training and at serving time.
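A sketch of this idea, assuming TensorFlow/Keras is installed; the choice of a Normalization layer, the toy training data, and the shapes are all illustrative:

```python
def build_model():
    # Imports kept inside the function so the sketch can be read
    # without TensorFlow installed.
    import numpy as np
    import tensorflow as tf

    # Adapt a Normalization layer to the training data so its
    # mean/variance are baked into the model itself.
    train_x = np.array([[1.0], [2.0], [3.0]])
    norm = tf.keras.layers.Normalization(axis=-1)
    norm.adapt(train_x)

    model = tf.keras.Sequential([
        tf.keras.Input(shape=(1,)),
        norm,                        # preprocessing lives inside the model
        tf.keras.layers.Dense(1),
    ])
    return model
```

The payoff is that whoever loads the saved model gets the preprocessing for free, with no risk of training/serving skew.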
