In May of 2019, I gave a talk at the dbt NYC Meetup about how we were using dbt as our ML feature engineering layer at Bowery Farming.

My main points were:

  • schemas in a dbt project can be described by their purpose and their relationship to up- and down-stream processes
  • feature engineering for ML projects is a pain point for many teams, and each of the common options presents tough trade-offs
    • 1st party ML platforms: great if you’ve got the team for it
    • ML-as-a-service platforms: SaaS lock-in, platform limitations
    • Deploy the notebook: limited re-usability, engineering
  • We use a pattern of 3 models in our predict schema for batch ML models
    • m_[model]_obs - 1 record per observation with features as columns
    • m_[model]_preds - 1 record per production inference (potentially many-to-one to _obs), differentiated by timestamp and model hash
    • m_[model] - observations joined to “current” inference for downstream use
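
To make the three-model pattern concrete, here is a minimal sketch in Python using the standard-library sqlite3 module as a stand-in for the warehouse. It is not the dbt code from the talk. The model name `yield`, every column name, and the rule that the "current" inference is simply the most recent one by timestamp are all assumptions made for illustration.

```python
import sqlite3

# In-memory database standing in for the warehouse (assumption: the real
# models are dbt SQL models, not SQLite tables).
conn = sqlite3.connect(":memory:")

conn.executescript(
    """
    -- m_[model]_obs: 1 record per observation, features as columns.
    -- Column names here are hypothetical.
    CREATE TABLE m_yield_obs (
        obs_id    INTEGER PRIMARY KEY,
        feature_a REAL,
        feature_b REAL
    );

    -- m_[model]_preds: 1 record per production inference, potentially
    -- many per observation, differentiated by timestamp and model hash.
    CREATE TABLE m_yield_preds (
        obs_id       INTEGER,
        predicted_at TEXT,
        model_hash   TEXT,
        prediction   REAL
    );

    INSERT INTO m_yield_obs VALUES
        (1, 0.5, 10.0),
        (2, 0.7, 12.0),
        (3, 0.9, 14.0);

    INSERT INTO m_yield_preds VALUES
        (1, '2019-05-01T00:00:00', 'abc123', 3.1),
        (1, '2019-05-02T00:00:00', 'def456', 3.4),
        (2, '2019-05-01T00:00:00', 'abc123', 4.0);

    -- m_[model]: observations joined to the "current" inference.
    -- Assumption: "current" means the latest predicted_at per observation.
    CREATE VIEW m_yield AS
    SELECT
        o.obs_id,
        o.feature_a,
        o.feature_b,
        p.prediction,
        p.predicted_at,
        p.model_hash
    FROM m_yield_obs AS o
    LEFT JOIN m_yield_preds AS p
        ON p.obs_id = o.obs_id
       AND p.predicted_at = (
            SELECT MAX(p2.predicted_at)
            FROM m_yield_preds AS p2
            WHERE p2.obs_id = o.obs_id
       );
    """
)

# Downstream consumers read only the joined model.
rows = conn.execute(
    "SELECT obs_id, prediction, model_hash FROM m_yield ORDER BY obs_id"
).fetchall()

for row in rows:
    print(row)
```

The left join keeps observations that have not been scored yet (observation 3 here), so downstream users see every observation, with a null prediction where none exists. A different definition of "current", such as pinning to a specific model hash, would only change the join condition.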