Using dbt in a machine learning pipeline @ dbt Meetup
In May of 2019, I gave a talk at the dbt NYC Meetup about how we were using dbt as our ML feature engineering layer at Bowery Farming.
My main points were:
- schemas in a dbt project can be described by their purpose and their relationship to up- and down-stream processes
- feature engineering for ML projects is a pain point for many teams, with options that each present tough choices
  - 1st party ML platforms: great if you’ve got the team for it
  - ML-as-a-service offerings: SaaS lock-in, platform limitations
  - Deploy the notebook: limited re-usability, engineering
- We use a pattern of 3 models in our predict schema for batch ML models
  - m_[model]_obs: 1 record per observation with features as columns
  - m_[model]_preds: 1 record per production inference (potentially many-to-one to _obs), differentiated by timestamp and model hash
  - m_[model]: observations joined to “current” inference for downstream use
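To make the three-model pattern concrete, here is a minimal sketch using Python's built-in sqlite3 module in place of a warehouse. It is not the project's actual code: the model name `yield`, every column name, and the rule that "current" means the most recent inference per observation are assumptions made for illustration. In a dbt project each table or view would be its own SQL model in the predict schema.

```python
import sqlite3

# Hypothetical example: the model name "yield" and all columns are assumed.
conn = sqlite3.connect(":memory:")
conn.executescript("""
-- m_[model]_obs: one record per observation, features as columns
CREATE TABLE m_yield_obs (
    obs_id INTEGER PRIMARY KEY,
    feature_a REAL,
    feature_b REAL
);

-- m_[model]_preds: one record per production inference,
-- differentiated by timestamp and model hash (many-to-one to _obs)
CREATE TABLE m_yield_preds (
    obs_id INTEGER,
    predicted_at TEXT,
    model_hash TEXT,
    prediction REAL
);

INSERT INTO m_yield_obs VALUES (1, 0.5, 10.0), (2, 0.75, 12.0);
INSERT INTO m_yield_preds VALUES
    (1, '2019-05-01T00:00:00', 'abc123', 4.0),
    (1, '2019-05-02T00:00:00', 'def456', 4.5),
    (2, '2019-05-02T00:00:00', 'def456', 6.25);

-- m_[model]: observations joined to the "current" inference.
-- Assumption: "current" = latest predicted_at for that observation.
CREATE VIEW m_yield AS
SELECT o.obs_id, o.feature_a, o.feature_b,
       p.prediction, p.predicted_at, p.model_hash
FROM m_yield_obs AS o
LEFT JOIN m_yield_preds AS p
  ON p.obs_id = o.obs_id
 AND p.predicted_at = (
       SELECT MAX(p2.predicted_at)
       FROM m_yield_preds AS p2
       WHERE p2.obs_id = o.obs_id
     );
""")

rows = conn.execute(
    "SELECT obs_id, prediction, model_hash FROM m_yield ORDER BY obs_id"
).fetchall()
print(rows)
```

Downstream consumers would query only the joined view, so they see one row per observation regardless of how many inferences have been logged. If two inferences could share a timestamp, this sketch would need a tie-breaker (for example, on model hash) to keep that guarantee.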