In May of 2019, I gave a talk at the dbt NYC Meetup about how we were using dbt as our ML feature engineering layer at Bowery Farming.
My main points were:
- schemas in a dbt project can be described by their purpose and their relationship to upstream and downstream processes
- feature engineering for ML projects is a pain point for many teams, and the common approaches all present tough trade-offs
  - first-party ML platforms: great if you've got the team for it
  - ML-as-a-service offerings: SaaS lock-in, platform limitations
  - deploy the notebook: limited re-usability, engineering overhead
- We use a pattern of 3 models in our `predict` schema for batch ML models (a minimal sketch follows the list)
- `m_[model]_obs`- 1 record per observation with features as columns
- `m_[model]_preds` - 1 record per production inference (potentially many-to-one to `_obs`), differentiated by timestamp and model hash
- `m_[model]` - observations joined to "current" inference for downstream use
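As a rough sketch of what this pattern looks like in dbt SQL (the `yield` model name, column names, and the `ml.yield_predictions` source are hypothetical stand-ins, and the "current inference" logic here simply takes the most recent prediction per observation via a window function):

```sql
-- models/predict/m_yield_obs.sql
-- One record per observation, with features as columns.
select
    harvest_id as observation_id,
    crop_variety,
    days_since_seeding,
    avg_zone_temp_f
from {{ ref('stg_harvests') }}

-- models/predict/m_yield_preds.sql
-- One record per production inference; a given observation may be scored
-- many times, so rows are differentiated by prediction timestamp and model hash.
select
    observation_id,
    predicted_yield,
    model_hash,
    predicted_at
from {{ source('ml', 'yield_predictions') }}

-- models/predict/m_yield.sql
-- Observations joined to their "current" (most recent) inference for downstream use.
with ranked_preds as (
    select
        *,
        row_number() over (
            partition by observation_id
            order by predicted_at desc
        ) as recency_rank
    from {{ ref('m_yield_preds') }}
),

current_preds as (
    select * from ranked_preds where recency_rank = 1
)

select
    obs.*,
    preds.predicted_yield,
    preds.model_hash,
    preds.predicted_at
from {{ ref('m_yield_obs') }} as obs
left join current_preds as preds
    on obs.observation_id = preds.observation_id
```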
<iframe width="560" height="315" src="https://www.youtube.com/embed/dm0x9bNtO8s?si=x0EF3Rj-lJ_Pp4SF" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen></iframe>
<iframe src="https://docs.google.com/presentation/d/e/2PACX-1vT83_uBkgjoARToUImGhD1cbPZ7C9vPjLXd3LeEEKxU0Q17yQEiqjNCrFROGPmD9tANcZeAbF_eA0xG/embed?start=false&loop=false&delayms=3000" frameborder="0" width="960" height="569" allowfullscreen="true" mozallowfullscreen="true" webkitallowfullscreen="true"></iframe>