An ML project, like any other project, requires the team to get together in the early stages and resolve the many open questions that determine its scope.
What is the problem being addressed and why is it seen as a problem that needs to be solved?
Very often, the problem we set out to solve stops being a problem once we receive context or clarity from another team or senior management. A very simple change in the process or product may also remove the need to solve it.
What does the potential solution look like?
This is a very good exercise to carry out at the start of a project. Putting together the building blocks you envision and walking through a mock simulation of each step in the process will reveal many unanswered questions. You will have far more pointed discussions with other teams once you have outlined the potential solution.
Detailed QA plan and model testing plan (mock data for various test cases)
What is the QA checklist for the ML model? What are some use cases, and what model behaviour is expected for each?
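One way to make such a checklist concrete is to write each use case down as mock input paired with the behaviour you expect. The sketch below is only an illustration: `QACase`, `toy_model`, `run_qa` and the transaction example are all hypothetical names and data, and `toy_model` stands in for whatever model your team actually builds.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class QACase:
    name: str                   # short description of the use case
    features: Dict[str, float]  # mock input for the model
    expected: str               # behaviour the team agreed the model should show


def toy_model(features: Dict[str, float]) -> str:
    """Hypothetical stand-in for a real model: sends large amounts for review."""
    return "review" if features["amount"] > 1000 else "approve"


def run_qa(predict: Callable[[Dict[str, float]], str],
           cases: List[QACase]) -> List[str]:
    """Return the names of the cases where the output differs from the expectation."""
    return [case.name for case in cases if predict(case.features) != case.expected]


# Mock data: one entry per use case on the QA checklist (illustrative values).
QA_CASES = [
    QACase("small purchase is approved", {"amount": 20.0}, "approve"),
    QACase("large purchase is sent for review", {"amount": 5000.0}, "review"),
    QACase("boundary amount is approved", {"amount": 1000.0}, "approve"),
]

failures = run_qa(toy_model, QA_CASES)
print("failed cases:", failures)
```

Keeping the cases as plain data means product or QA colleagues can add a use case without touching the model code.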
Metrics to measure performance
ML models undergo rigorous testing during training. But once deployed, what do you need to worry about? There is a large body of work on MLOps, but I strongly suggest you don't read any literature on this until you have identified a few areas that you will monitor once your model is in production.
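As one example of an area you might choose to monitor, consider whether the inputs your model sees in production still look like the training data. The sketch below is a deliberately simple check, not a recommendation of a specific method: `drift_score`, `needs_attention`, the threshold of 2.0 and the sample values are all assumptions made for illustration.

```python
from statistics import mean, stdev
from typing import Sequence


def drift_score(baseline: Sequence[float], live: Sequence[float]) -> float:
    """Shift of the live mean from the baseline mean, in baseline standard deviations."""
    spread = stdev(baseline)
    if spread == 0:
        raise ValueError("baseline has no variation; choose a different check")
    return abs(mean(live) - mean(baseline)) / spread


def needs_attention(baseline: Sequence[float], live: Sequence[float],
                    threshold: float = 2.0) -> bool:
    """Flag the feature when the shift exceeds the (assumed) threshold."""
    return drift_score(baseline, live) > threshold


# Illustrative values for a single feature: training data vs. two production windows.
training_values = [10, 12, 11, 13, 9]
quiet_week = [11, 12, 10]
odd_week = [20, 21, 19]

print("quiet week flagged:", needs_attention(training_values, quiet_week))
print("odd week flagged:", needs_attention(training_values, odd_week))
```

Whatever you end up monitoring, the useful part is deciding up front which signal you watch and what value makes someone take a look.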
Maintenance plan (on-call rotation, what are typical failure scenarios and run-books)
Your ML model needs the same care as any other critical service deployed to production. Always plan for on-call coverage (including weekends), list the possible failure modes, and create run-books that the on-call engineer can follow to rectify the situation and bring the service back up.
This cannot be stated enough: documentation is king. No amount of documentation is ever enough, but what matters most is documentation that helps keep the lights on. At the very least, you want run-books, API documents, change logs, deployment instructions, training dataset creation steps, ML model retraining procedures and so on.
Feel free to write why you chose ML algorithm 1 over ML algorithm 2, but remember, once your model is in production, the operational docs will be the ones that people will reach for first.
Tickets created and 4 to 6 sprints planned (2 weeks per sprint), then tweaked and updated as the sprints progress
Seeing into the future is hard, but steering the future towards a plan is relatively easier. Create sprint plans and tickets for a few sprints ahead, and assign the tickets to the right people from the start. Give your team a chance to think the sprints through and play them out mentally, so there is very little dissonance once the project kicks off.
Stand-up meeting and project meeting scheduled before the project starts
Set up the stand-up meetings and weekly meetings for the upcoming sprints. Make sure each meeting invite links to a document that is kept updated with meeting notes and action items. The on-call for that week runs the sprint meetings and updates the docs with discussion notes and any action items. You should have very few action items outside of the tickets (otherwise, create a ticket and groom your backlog).
The discussion shared above was part of many Q&A sessions Harsh Singhal conducted with Data teams at various companies and colleges.