In May of 2019, I gave a talk at the dbt NYC Meetup about how we were using dbt as our ML feature engineering layer at Bowery Farming.
My main points were:
- schemas in a dbt project can be described by their purpose and their relationship to upstream and downstream processes
- feature engineering for ML projects is a pain point for many teams, and the common approaches all present tough trade-offs
  - first-party ML platforms: great if you've got the team for it
  - ML-as-a-service offerings: SaaS lock-in, platform limitations
  - deploy the notebook: limited re-usability, engineering overhead
- We use a pattern of 3 models in our `predict` schema for batch ML models (a minimal sketch follows the list)
- `m_[model]_obs`- 1 record per observation with features as columns
- `m_[model]_preds` - 1 record per production inference (potentially many-to-one to `_obs`), differentiated by timestamp and model hash
- `m_[model]` - observations joined to "current" inference for downstream use
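As a rough sketch of what this pattern looks like in dbt SQL (the `yield` model name, column names, and the `ml.yield_predictions` source are hypothetical stand-ins, and the "current inference" logic here simply takes the most recent prediction per observation via a window function):

```sql
-- models/predict/m_yield_obs.sql
-- One record per observation, with features as columns.
select
    harvest_id as observation_id,
    crop_variety,
    days_since_seeding,
    avg_zone_temp_f
from {{ ref('stg_harvests') }}

-- models/predict/m_yield_preds.sql
-- One record per production inference; a given observation may be scored
-- many times, so rows are differentiated by prediction timestamp and model hash.
select
    observation_id,
    predicted_yield,
    model_hash,
    predicted_at
from {{ source('ml', 'yield_predictions') }}

-- models/predict/m_yield.sql
-- Observations joined to their "current" (most recent) inference for downstream use.
with ranked_preds as (
    select
        *,
        row_number() over (
            partition by observation_id
            order by predicted_at desc
        ) as recency_rank
    from {{ ref('m_yield_preds') }}
),

current_preds as (
    select * from ranked_preds where recency_rank = 1
)

select
    obs.*,
    preds.predicted_yield,
    preds.model_hash,
    preds.predicted_at
from {{ ref('m_yield_obs') }} as obs
left join current_preds as preds
    on obs.observation_id = preds.observation_id
```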
<iframe width="560" height="315" src="https://www.youtube.com/embed/dm0x9bNtO8s?si=x0EF3Rj-lJ_Pp4SF" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen></iframe>
<iframe src="https://docs.google.com/presentation/d/e/2PACX-1vT83_uBkgjoARToUImGhD1cbPZ7C9vPjLXd3LeEEKxU0Q17yQEiqjNCrFROGPmD9tANcZeAbF_eA0xG/embed?start=false&loop=false&delayms=3000" frameborder="0" width="960" height="569" allowfullscreen="true" mozallowfullscreen="true" webkitallowfullscreen="true"></iframe>