Monday, August 3, 2009

Definition of Done and the Quest for the Potentially Shippable Product

One of the main contributions of Scrum to the software development community is the awareness it has created about obtaining a tangible, deliverable product at the end of each sprint. This tangible product, the “potentially shippable product” in Scrum terminology, is what governs all efforts from the team, the Scrum Master and the Product Owner. In theory, the potentially shippable product should be ready for shipping at the end of each sprint, provided you can find a client willing to buy a product that works with limited functionality.
Theory contradicts reality on many occasions, either because the team hasn’t implemented enough functionality or because what has been implemented has not been tested thoroughly enough to be considered a valuable product that can be delivered to clients. Furthermore, confusion within the team arises when there isn’t a clear and unified Definition of Done (DoD). For instance, for development “done” may mean that a user story has been implemented but not tested. The DoD needs to be understood by everybody in the team; moreover, the DoD should be extended to different levels to cover all the expectations and requirements of the team and other stakeholders. Not having a clear DoD eventually hurts the team’s chances of attaining a potentially shippable product; what is even worse, without a clear DoD the team will spend the sprint as if trying to hit a moving target.
This article will present and analyze different perceptions of the DoD in order to suggest a unified version. A word of caution, though: the DoD is particular to each project, so what we suggest here needs to be evolved before it can be applied to a specific project.
Different Types of DoD
Let’s start by saying that we have only one team, but with different people playing different roles in it. As a consequence of their work, engineers tend to have their own perception of how and when something can be considered done. Let’s take a look at the following definitions of done:
  • For development, some code is considered done when it has been written, compiled, debugged, unit tested, checked in to a common repository, integrated in a build, verified by a build verification tool and documented using an automatic documentation generator.
  • For automation, test cases are considered done when the corresponding automated testing scripts have been coded, unit tested and run against a build.
  • For quality engineering, done means that the user stories received on a specific build have been tested by running testing cycles, encountered bugs have been reported in a bug tracking tool, the fixes provided have been validated and the associated trackers have been closed. Consequently, test cycles include both manual and automated test case execution.
  • For info dev, done means that all the help files in the product, along with the printed manuals, are written in polished English or whatever the target language is.
  • For project managers, done means that the sprint has ended, the user stories have been completed (not necessarily accepted by QE) and the scheduled hours have been burned. Moreover, for managers done usually means that there’s something, anything, that can be presented to the stakeholders in a demo.
Now let’s analyze why those definitions, taken separately, won’t help to build the potentially shippable product:
  • The development DoD is not enough because code was produced but not tested, at least not from the user’s perspective. Crucial tests like functionality, usability, performance, scalability and stress are still pending. Without passing all those tests, at least to a certain level, the product is simply not good enough to be shipped. A big temptation is to consider that in the early stages of the product the code is not required to pass all tests, or that there’s no need to test it at all; this of course violates the very conception of a potentially shippable product.
  • The automation DoD is not good enough either, because it only covers code that has been written to test other code; no bugs have been discovered, validated, retested or closed. It’s very true that automated test cases save many man-hours of manual testing, but automation should only be used as a tool that helps quality engineers in their work, not as a milestone that, once reached, qualifies the product for release.
  • The info dev DoD also falls short because it only considers manuals and help files, not the product itself or the quality it might have. It’s not uncommon to see technically excellent products with a high degree of associated quality but with poor documentation. The opposite scenario is also common and even more problematic.
  • The project manager’s DoD is focused on metrics and high-level information. The biggest risk for project managers is to look at the wrong metrics and try to push a product out the door when it’s not complete or stable enough. Avoiding biased perceptions, which come from information collected from only one group of engineers in a team, is the key to telling whether the product is ready to go into production.
  • The quality engineering DoD is, in my opinion, the definition that contributes the most to the goal of having a potentially shippable product. The reason is simple: QE should be the last checkpoint for telling whether the implemented user stories are good enough to be considered part of a potentially shippable product. One consideration is that the documentation provided by info dev should also be reviewed for technical accuracy. Another is that automated test cases should also be checked for effectiveness and real contribution to bug detection and validation.
It is evident that we need a DoD that works for the whole team, helping it to remain focused in the right direction while serving as a true enabler for the potentially shippable product concept. But before we go into that, let’s explore the different levels of accomplishment.
Levels of accomplishment
By accomplishment we mean the milestones that the team will be reaching during the project’s execution. There are several levels of accomplishment, as described below:
  • Task level means that a task, whatever it was, has been completed. Tasks are usually performed by individuals in the team and still need to be cross-checked or validated. Tasks are purposely mentioned as the smallest chunk of work: they are what has been done by somebody in the team, and they have an identifiable output in the form of code written, test cases written, environments set up, test cases automated, documents prepared, etc.
  • User story level means that all tasks in a user story have been completed; when this occurs we also have fairly good chances of saying that the user story has been accepted. Although bugs might appear during testing, once they are fixed the user story can be marked as accepted and will go directly into the potentially shippable product.
  • Sprint level means that the time-boxed hours have been burned and, more importantly, that the planned user stories have been accepted. One very common mistake is to consider that a sprint was successful because the team was able to complete a certain amount of work by the sprint end date; this is true only to a certain degree. The problem arises when you compare the work that has been done with what can and should be included as part of the potentially shippable product.
  • Release level is achieved when the product finally meets all required criteria (technical, legal, marketing, financial, etc.) to be put out the door and made available to clients.
  • Product level means that the team has been able to put several releases out the door; by releases we don’t mean hot fixes and service packs. Hot fixes are symptomatic of poor quality, as they usually indicate that urgent fixes are required by clients who are reporting serious problems with the product. By the same token, releasing many service packs might be a good reason to suspect quality problems in the product, the processes and/or the team.

Unifying the DoD with the level of accomplishment
The ultimate DoD should be such that it really helps the team envision the true goal behind the sprint: building a potentially shippable product. In that sense, the quality engineering DoD seems to be the one that gets us closest to that goal, for the following reasons:
  1. It deals with user stories, quality and completeness.
  2. It also covers automation and info dev DoD; this assertion works under the assumption that QEs are the ones who execute automated testing suites and review written product documentation for technical accuracy.
  3. User stories accepted by QE are probably the best indicators for managers because they reflect the degree of accomplishment in a sprint and are directly correlated with functionality that can go right into the potentially shippable product.
Consequently, this QE DoD applied at the task and user story levels will have a positive impact on the sprint accomplishment level, which in turn will favorably impact the release and product accomplishment levels. A simple sketch of such a checklist follows.
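To make this concrete, here is a minimal sketch, in Python, of how a unified, QE-centred DoD could be expressed as an explicit checklist attached to each user story. The checklist items, the QE_DOD tuple and the UserStory class are illustrative assumptions, not part of any particular tool:

```python
# A minimal sketch of a QE-centred Definition of Done expressed as an explicit
# checklist per user story. All names and items are illustrative.
from dataclasses import dataclass, field

# Checklist items roughly follow the QE DoD described above.
QE_DOD = (
    "code implemented, unit tested and checked in",
    "build verification test passed",
    "manual and automated test cycles executed",
    "reported bugs fixed, validated and closed",
    "documentation reviewed for technical accuracy",
)

@dataclass
class UserStory:
    title: str
    done_items: set = field(default_factory=set)  # checklist items already satisfied

    def is_done(self) -> bool:
        # "Done" only when every item of the shared DoD has been ticked off
        return all(item in self.done_items for item in QE_DOD)

story = UserStory("Export report as PDF")
story.done_items.update(QE_DOD[:3])
print(story.is_done())  # False: bug validation and docs review are still pending
```

The point of the sketch is simply that the whole team shares one list; a story is not done just because one role has finished its part.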

Key Metrics for Measuring Sprint Status

One of the key questions in every project, in any engineering field, is how to measure progress. Traditionally, progress is measured by considering how much work has been done in how much time. This simple approach has been working, and in fact works very well, for intensive production systems like factories, where the degree of innovation is constrained by detailed design documents. More to the point, manufacturing tangible and simple things like screws is a repetitive task that can always be done at the same velocity. Consequently, once velocity is known, it is very easy to extrapolate and predict how long it would take to produce a given amount of screws. Even if we think of building very complex but tangible engineering masterpieces like submarines, we’ll find that the process has been well studied and described to the last detail in tons of pages of design documents.
Building commercial software, however, can’t be ruled by the same principles. The key differences are requirements that are not well defined and changing development conditions. For years, methodologists and gurus tried to provide a methodology that reduces complexity to a level that allows predictive planning; the results were null, and Agile popped up as an alternative that tries not to predict but to adapt to changes and react accordingly. However, Scrum practitioners have a new challenge in front of them: how to measure sprint progress. Even if you use Scrum, a project is a project, and stakeholders will always ask the typical question: are we still on schedule? This article will explore some indicators like burn down charts, the user story life cycle, estimated vs. actual hours and team velocity.
Burn down chart
The most basic chart for measuring status is the burn down chart; however, it can be very imprecise for the following reasons:
  1. Burned hours are not always related to productive tasks, e.g. hours allocated to spike activities, reading, researching, attending meetings, helping teammates, etc.
  2. Hours are not the same for development, quality engineering or automation. You can’t expect producing code to be the same as testing it or automating its testing. Further, a burn down chart can look good while you don’t know exactly which sub-team is burning more hours.
  3. Burning hours per se won’t get you close to completing user stories; a user story can’t be considered completed just because its planned hours have been consumed or even exceeded. User story acceptance criteria need to be defined instead.
  4. Burned hours need to be baselined against the initial estimation; a significant deviation from that baseline can be a symptom of user stories slipping their dates.
Some suggestions for improvements are presented below:
  1. Create a culture in your team of logging in your tracking tool (Xplanner, VersionOne, an Excel spreadsheet or something else) only the hours that are related to productive work in coding, testing or automation.
  2. Even though general success or failure is everybody’s responsibility, you need to distinguish the different actors that you have: developers, architects, quality engineers, automation guys, tech writers, etc. Not differentiating them will give you the false impression that everyone can burn hours at the same rate and thus create an illusory burn down chart. Separate burn down charts could be an alternative here.
  3. Burned hours as a metric need to be cross-referenced with user story completion; again, burning hours alone is not a good indicator.
  4. Poor initial estimation, bad re-estimations and the continuous addition of work are enemies of the schedule. It’s almost impossible to have every task in all user stories estimated at the beginning of the sprint, and too much estimation constrains agility. The mitigation is to monitor the burn down chart’s progress against the initial estimation and, when a significant gap is perceived, move work back to the release backlog or at least reprioritize it within the sprint; a small sketch of this gap check follows the list.
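The following sketch, with entirely made-up numbers, illustrates suggestions 2 and 4: keep separate burn down figures per sub-team and flag a significant gap against the ideal line derived from the initial estimation. The sub-team names, the threshold and the hours are assumptions for illustration only:

```python
# Sketch: per-sub-team burn down check against the ideal (initial-estimation) line.
# All numbers are invented; only the gap-check logic matters.
SPRINT_DAYS = 10
DAYS_ELAPSED = 5
GAP_THRESHOLD = 0.15  # flag when remaining work is 15% above the ideal line

# initial estimate and currently remaining productive hours, per sub-team
initial   = {"development": 130, "quality_engineering": 90, "automation": 45}
remaining = {"development":  98, "quality_engineering": 40, "automation": 20}

for team in initial:
    # ideal burn down: a straight line from the initial estimate down to zero
    ideal = initial[team] * (1 - DAYS_ELAPSED / SPRINT_DAYS)
    if remaining[team] > ideal * (1 + GAP_THRESHOLD):
        # per suggestion 4: consider moving work back to the release backlog
        print(f"{team}: {remaining[team]}h left vs {ideal:.0f}h ideal -> renegotiate scope")
    else:
        print(f"{team}: on track ({remaining[team]}h left vs {ideal:.0f}h ideal)")
```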
User story life cycle
There are a couple of good reasons for having a life cycle for sprint user stories:
  1. You need a visual indicator that can quickly show you the progress of a specific user story.
  2. You need visibility of who (development or quality engineering) currently has the user story on his/her plate. This in itself is a good indicator of a user story’s progress.
For the past few years I’ve been using Xplanner’s status field with great success to indicate the stage of the life cycle a user story is in. This field can be customized, but I’d recommend keeping the default tags:
  • Draft means that a user story has been created but no time estimations have been added yet; in other words, no tasks have been created for the user story.
  • Planned means that the user story already has tasks with time estimations. Besides time estimates, a user story is marked as planned when development and quality engineering have both agreed on the acceptance criteria that will be applied to decide whether the user story is accepted or not.
  • Implemented means that development has completed the implementation of the user story, and all code has been unit tested, checked in to the repository and has passed a build verification test. Quality engineers are free to pick up the code and start testing it.
  • Verified means that the user story failed to pass all the test cases executed by quality engineers; consequently it hasn’t met the acceptance criteria and bugs were reported. At this point the user story returns to development’s court and will be worked on there until all bugs are fixed. Once this happens, development will mark the user story as Implemented again and it will be retested.
  • Accepted means that all test cases passed and no bugs were found, or that the bugs reported were successfully validated.
Using this user story status field is not complicated, but like many things in Agile it requires discipline. The benefits for a team that fully adopts and applies this concept are great; in my experience, this indicator complements the burn down chart perfectly. The sketch below shows the life cycle as a simple set of transitions.
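As an illustration only (this is not Xplanner code), the life cycle above can be summarized as a small state machine; the transition table and the move helper are assumptions made for the sketch:

```python
# Sketch of the user story life cycle described above, with the allowed
# transitions between the default status tags.
ALLOWED_TRANSITIONS = {
    "Draft":       {"Planned"},               # tasks and time estimations added
    "Planned":     {"Implemented"},           # development finishes and checks in
    "Implemented": {"Verified", "Accepted"},  # QE either reports bugs or accepts
    "Verified":    {"Implemented"},           # bugs fixed, story goes back to QE
    "Accepted":    set(),                     # terminal: part of the shippable product
}

def move(current: str, new: str) -> str:
    """Return the new status, or raise if the transition is not in the life cycle."""
    if new not in ALLOWED_TRANSITIONS[current]:
        raise ValueError(f"Cannot move a story from {current} to {new}")
    return new

status = "Draft"
for step in ("Planned", "Implemented", "Verified", "Implemented", "Accepted"):
    status = move(status, step)
print(status)  # Accepted
```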
Estimated hours vs. actual hours
Much has been written about how to make good estimations; tons of books and dissertations have been prepared on the subject, but the truth is that there’s no magic formula that gives you accurate estimations in all situations. The fact of the matter is that maybe you don’t need accurate estimations, at least not for Scrum sprints. Some arguments in favor of this rationale are:
  1. The planning stage for a sprint in Scrum has to be, by definition, very short and very light; this contradicts the goal of having solid estimates that come from deep analysis.
  2. Scrum is an adaptive business framework; it doesn’t have much to do with predicting things to a great level of detail and precision.
  3. Input for numeric estimation models varies too much from sprint to sprint; sometimes this variation is so uncontrollable that the results are considerably offset in one direction or another.
  4. Estimating tasks for developers is quite different from doing the same for quality engineers, automation guys or tech writers. The very nature of each individual’s work contributes to increasing or decreasing accuracy. Empirical evidence collected in my projects indicates that quality engineers’ estimations tend to be more accurate than developers’; the reason seems to lie in the more repetitive tasks that quality engineers usually perform.
Consequently, I wouldn’t recommend making your team invest in becoming master estimators; use that time instead to mentor them in being reflective and learning from past sprints. The sketch below shows one simple way to look back at estimation accuracy by role.
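If you do want to look back at past sprints, one very small retrospective exercise is to compute the relative error between estimated and actual hours per role, which is how the observation in point 4 above can be checked. The numbers below are made up and the role names are just examples:

```python
# Sketch: average relative estimation error per role, from (invented) logged tasks.
logged_tasks = [
    # (role, estimated hours, actual hours)
    ("developer",        8, 13),
    ("developer",        5,  9),
    ("quality_engineer", 6,  7),
    ("quality_engineer", 4,  4),
]

errors = {}
for role, estimated, actual in logged_tasks:
    errors.setdefault(role, []).append(abs(actual - estimated) / estimated)

for role, errs in errors.items():
    print(f"{role}: average relative error {sum(errs) / len(errs):.0%}")
# developer: average relative error 71%
# quality_engineer: average relative error 8%
```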
Team velocity
Team velocity is an interesting indicator, but it has a great disadvantage: it’s based on the number of hours that the team burns and not on the degree of success (accepted user stories) that it achieves in a sprint; the sketch after the list below contrasts the two.
Moreover, team velocity has to be carefully analyzed because it can correspond to several things, not just developers producing code. For instance:
  1. Velocity will tend to increase when quality engineers have enough user stories to test; testing will eventually result in bugs being reported, and those bugs will need to be addressed by development, slowing down the work on other user stories. Even though velocity will increase, overall productivity will decrease or, at best, stay the same.
  2. Which team’s velocity are we talking about? Developers’ velocity can’t be compared to that of automation guys when they execute automated test suites. Furthermore, quality engineers doing manual testing are no match for automated testing scripts. If we try to compare velocity among these teams, we’ll see no correlation because of the implicit difference in their work.
  3. Developers writing code work at a different velocity from the one they show when they have to do unit testing or bug fixing. Again, velocity can be a misleading indicator because one might tend to think that the team has boosted productivity when, in reality, it has shifted its tasks.
  4. Team velocity shouldn’t be applied without considering individual velocity. Individual velocity can be a very good indicator for detecting blockages or poorly defined requirements; in such cases individual velocity will tend to zero. Team velocity sometimes masks these symptoms because it doesn’t offer enough granularity.
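The difference between burning hours and delivering accepted user stories, as well as the use of individual velocity to spot blockages, can be shown with a short sketch. The names, story points and hours below are invented for illustration:

```python
# Sketch: hour-based velocity vs. acceptance-based velocity, plus a naive
# individual-velocity check to spot blockages. All numbers are invented.
sprint = {
    "burned_hours": {"alice": 70, "bob": 68, "carol": 4},  # carol is likely blocked
    "stories": [
        {"points": 5, "status": "Accepted"},
        {"points": 3, "status": "Accepted"},
        {"points": 8, "status": "Verified"},     # bugs still open, not done
        {"points": 2, "status": "Implemented"},  # waiting for QE
    ],
}

hours_burned = sum(sprint["burned_hours"].values())
accepted_points = sum(s["points"] for s in sprint["stories"] if s["status"] == "Accepted")

print(f"hours burned: {hours_burned}")              # 142 -> the team looks busy
print(f"accepted story points: {accepted_points}")  # 8  -> the actual sprint success

# individual velocity tending to zero is a symptom of a blockage or unclear requirements
busiest = max(sprint["burned_hours"].values())
for person, hours in sprint["burned_hours"].items():
    if hours < 0.1 * busiest:
        print(f"{person}: almost no hours burned, check for blockages")
```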