Schools of Software Testing: A Debate with Rex Black

August 24th, 2015

Last year, Rex and I had a debate at STPCon on the legitimacy and value of the concept of “schools of software testing.” We recently obtained a recording of the debates and merged it with slides to create a video. You can find additional slides and notes here: Kaner’s STPCon Debate Slides.

Rex Black expressed a few gripes about the concept of “schools” and about the way this concept has been applied in our field.

Schools or Strategies?

Rex’s first point — I think the central point of his presentation — is to acknowledge that there are different approaches to software testing but to argue that we should think of these as differences in strategy rather than divisions into schools. In his view (as I understand it), different people have different preferred strategies for dealing with testing situations. Some people can shift among strategies, choosing the best one for a given situation. As Rex sees it, looking at our field as a collection of schools is divisive and counterproductive.

I think his idea of conflicting (or alternative) strategies is appealing — plausible, but incomplete in a fundamental way.

The problem is that it ignores the social dynamics of the field, which is exactly what we are trying to capture with the idea of “schools.”

People tend to cluster. They find other people whose views or whose personal styles are compatible with theirs. They learn more from people in their cluster, they pay more attention to them, listen more closely to their advice and criticisms. Sometimes people cluster around an intellectually coherent point of view and organize their thinking about their work in terms of that view. At that point, we have the beginnings of a school. It is not just a strategy; it is an approach that is supported by a strong peer group.

This basic kind of clustering is so common that we barely notice it. Sometimes it becomes more pronounced and several of the clusters become more broadly influential.

Fields tend to swing between extremes of high (apparent) cohesiveness to high fragmentation. The evolution takes time.

  • At the high-apparent-cohesiveness extreme, everyone agrees (or pretends to agree) with a dominant view. There is not much controversy. Progress is incremental and not very creative.
  • At the high-fragmentation extreme, people have stopped listening to each other. They squander their creativity on better ways to promote an approach that they see as The One True Way, and to insult or shout down anyone who disagrees with it. There isn’t much progress at this extreme either. People are too busy scoring points about the basics of the field (or the basics of their controversy).

Several authors identify these extremes and describe them as unproductive. Neither extreme promotes an attitude of paying constructive attention to other views and gaining insights from them or taking risks to develop a new approach. (See my slides and notes for references.)

Between the extremes, you have creative tension and a lot of research (or skill development) that tries to get to the factual questions: what works, what happens, what costs, what benefits, what else can be done?

The idea that fields often organize themselves into schools is not controversial. It’s not something special to software testing. You see it in education, business, psychology, physics (etc., etc.)

It’s also common for the members of the dominant school to see themselves as the entire field. They often see other groups that try to differentiate themselves from the main stream as self-promoting spinoffs, as advocates for minor variations from core views that “everyone” shares. One of the reasons that people will intentionally form and announce a school is to create a rallying point. They want a place where likeminded people can share views without being drowned out by the dominating majority, a platform for publishing and refining their set of ideas.

These rallying points are even more important when the field is engaging in political work. When I say “political,” I mean anything involving power and control. For example, standards committees are political. My experience with the IEEE software engineering standards committees is that I can become a member but my views will have no impact on the standard. The way that people who hold minority views gain more impact is to organize, so that many people together intentionally say the same things. This has an impact. For example, agile approaches (I think of contextual thinking and exploratory testing/development as agile approaches) are much more acceptable than they were 20 years ago. That is largely because of advocacy by many people, speaking together. Political work requires political action.

You can hear more about social dynamics in the debate.

Unfortunate Misbehavior

Rex’s central argument is that the characterization of the field in terms of conflicting schools is inaccurate and would be better replaced by a description of alternative strategies. Along with this argument comes a complaint, which I see as the emotional charge behind his argument. He complains about harsh statements from some people who call themselves Context-Driven and call themselves leaders of the Context-Driven School. I think he’s well-justified in feeling that some people are behaving badly and that they have treated him badly.

If you see yourself as a member of the Context-Driven School, let me suggest that as individuals, we get to choose how far we go down the path of divisiveness:

  • We can choose to compare a school of thought to a religion, but we don’t have to say that.
  • We can choose to say that anyone who isn’t a proponent of the school can’t understand what we have to say, but we don’t have to say that.
  • We can choose to say that everyone belongs to a school (even the people who insist they do not), but we don’t have to say that.

Statements like these are not factual and, to the best of my knowledge, they are not rooted in facts. They reflect choices about how people with differing views should interact.

I think some of the people who say things like this would market themselves more honestly (and in my view, tarnish the Context-Driven Testing brand less) if they would identify themselves as the Rapid Software Testing (TM) school. I would disagree with their approach and their tone, but I wouldn’t feel obliged to assert that such views are not context-driven (see for example, my posts Censure People for Disagreeing with Us? and Context-Driven Testing is Not a Religion and Contexts Differ: Recognizing the Difference between Wrong and WRONG. )

More details of my responses to Rex’s complaints are in the debate itself and in the notes I prepared before the debate at Kaner’s STPCon Debate Slides.

— Cem Kaner

Racial Profiling in Ferguson Missouri? A Note on Statistical Interpretation

August 21st, 2014

As I write this, there is serious community tension in Ferguson, Missouri over the shooting and killing of a unarmed black teenager by a white police officer, and over the response to that shooting by the local police force.

The narrative in much of the press is that this is yet another incident that illustrates a serious problem of racism in the United States, especially in our police. For example, many newspapers cite data from the Missouri 2013 Vehicle Stops Report. The overall report covers data in every county in Missouri and presents years of historical comparisons. The report specifically for Ferguson is here.

Here are some examples of press coverage that seem representative of what I’ve seen in many papers online:

“Last year, 86 percent of the cars stopped by Ferguson police officers were being driven by African-Americans, according to the state’s annual racial profiling report. Once pulled over in Ferguson, African-American drivers were twice as likely to be searched, according to the report.”

http://www.mcclatchydc.com/2014/08/19/237001_feds-could-go-several-ways-in.html?rh=1#storylink=cpy

“Last year, for the 11th time in the 14 years that data has been collected, the disparity index that measures potential racial profiling by law enforcement in the state got worse. Black Missourians were 66 percent more likely in 2013 to be stopped by police, and blacks and Hispanics were both more likely to be searched, even though the likelihood of finding contraband was higher among whites.”

http://www.stltoday.com/news/opinion/columns/the-platform/editorial-michael-brown-and-disparity-of-due-process/article_40bb2d0e-8619-534a-b629-093ebc79f0a6.html

Of course, there are some counter-examples. Some news reports (what I’ve seen on Fox, for example) seem to ignore these data completely and instead appear to me to present the events in terms of violent bad black people who deserve whatever violent treatment the police provide for them. There is nothing useful to learn about data evaluation from these reports, so I will ignore them for the rest of this note.

I should state my bias: My personal (nonexpert) impression is that the shooting was unjustified and that the St. Louis County police response has been inappropriate. I have no insight into the motivation of anyone involved.

However, if you look at the actual numbers from Ferguson, it is not clear to me that the conclusions of racial profiling, conclusions like the ones quoted above, that have appeared in every news source that I respect, are justified by the data.

The focus of this blog is on the teaching of software engineering topics, primarily software testing and measurement (and thus too, statistical analysis).

The data from Ferguson provide an interesting example for caution in the interpretation of such data.

First, some of the numbers that are consistent with the summaries. According to the Attorney General’s report

  • Ferguson’s population (age 16 and over) is 15,865, of whom 63% are black.
  • 4632 of 5384 vehicle stops (86%) were of blacks, a much higher percentage than the 63% of the population
  • 562 of the 611 searches (92%) were of blacks
  • 483 of the 521 arrests (93%) were of blacks
  • 12.13% of the blacks who were stopped were searched, compared to only 6.85% of the whites
  • 21.71% of the blacks who were searched had contraband (drugs, weapons, stolen property) compared to 34.04% of the whites.

These data appear to suggest two conclusions:

  1. Blacks are being stopped, searched and arrested at a higher rate than their representation in the population
  2. Many more searches of blacks than whites are unproductive, suggesting the police would find more contraband if they searched fewer blacks and more whites.

If Ferguson’s police are continuing to search blacks at a much higher frequency than whites, even though searched whites have contraband at a higher frequency than searched blacks, this appears to suggest a pattern that is racist and counterproductive (less protective of public safety).

That conclusion, I think, is the conclusion the newspapers are inviting us to draw.

Let’s look at some more data.

  • 66% (369/562) of the searches and 76% (369/483) of the arrests of blacks involved an outstanding warrant
  • 30% (14/47) of the searches and 39% (14/36) of the arrests of whites involved an outstanding warrant

Searches and arrests involving warrants don’t involve much exercising of judgment on the part of the officer who is searching or arresting someone.

  • A warrant is an order from a court to arrest someone. The officer is supposed to stop and arrest a person if there is a warrant out for them.
  • When a police officer arrests someone, they must search the person. Among the many important reasons for this rule is the safety of the officer: arresting someone and then not checking them carefully for weapons would be extremely unwise.

In a community of only 15,865 people, it would not be surprising for the local police to be aware of most of the people who have warrants outstanding against them or for these police to recognize those people on the street.

Because the police are supposed to arrest people who have warrants against them and supposed to search people they arrest, I don’t think we should count these numbers of stops, searches and arrests against the police.

If you look only at the stops that didn’t involve outstanding warrants,

  • 34% of the times that police searched a black person, and 24% of the times the police arrested a black person, the search did not involve an outstanding warrant.

In contrast

  • 70% of the times that the police searched a white person, and 61% of the time they arrested a white person, the search did not involve an outstanding warrant.

The conclusions that these numbers suggest to me are that:

  • the Ferguson police appear to have been making discretionary stops (stops in which they were exercising their own judgment, rather than executing a court order) of white people at almost twice the rate as for black people
  • the higher contraband-find rate for whites than blacks might be because a higher proportion of whites were searched on the basis of police suspicion of contraband, compared to a higher proportion of blacks being searched as part of an arrest that involved a warrant (past bad behavior, not currently suspicious behavior). Considered this way, the disparity (higher rate of contraband-finds for whites versus blacks) seems unsurprising and not at all suggestive of bad police work.

I don’t know what truth underlies these numbers. I think that, for me to interpret them with any confidence, I would have to do other studies, such as riding along with Ferguson police and learning how they decide who to stop and what post-stop behaviors trigger further investigation (such as searches or checks for outstanding warrants).

What does seem clear to me is that the first conclusion (racially-motivated differences in the police officers’ decisions to search people) is not supported by these data. That motivation might be present but—despite first appearances—these data do not seem to be evidence of it.

I wrote this note because it suggests two important lessons for students of statistics and research design:

  1. In many cases (as here), data may show statistically significant (large, probably consistent) differences. However, interpretation of those differences is almost always open to further investigation.
    • The numbers don’t tell you what they mean. Even the most statistically significant trends must be interpreted by people.
  2. In many cases (as here), the data support alternative interpretations.
    • Whenever possible, you should look at your data in many ways, to see if they tell you the same story. If they don’t, you need to investigate further, and maybe fix your model.
    • A few numbers in isolation tell you very little, often much less than you would initially imagine.
    • If you design your research (or your management) so that you will see only a few numbers at the end, you are designing tunnel vision into your work. You are creating your own context for bad interpretations and bad decisions.

BBST Domain Testing Pilot: A few seats still available (June 22-July 19)

May 11th, 2014

We’re putting the finishing touches on BBST Domain Testing before the online pilot starts June 22. We have a handful of seats still available. (Sorry, the class is now full and we have had to close registration CK — 5/23/2014)

Apply to participate here

If you are selected, we will require a $50 non-refundable deposit to defray the cost of hosting the class. You will also need The Domain Testing Workbook (either print or electronic format available at contextdrivenpress.com or Amazon) to use in the class.

You can read Chris Kenst’s report on his experience in January’s face-to-face pilot at http://www.testingcircus.com/?wpdmact=process&did=Mi5ob3RsaW5r

As with all the BBST courses, this course comes with lots of homework. You should expect to work 10-15 hours per week on the course.

If you have questions, please contact us at info@bbst.info.

 

On the Quality of Qualitative Measures

April 28th, 2014

On the Quality of Qualitative Measures

Cem Kaner, J.D., Ph.D. & Rebecca L. Fiedler, M.B.A., Ph.D.

This is an informal first draft of an article that will summarize some of the common guidance on the quality of qualitative measures.

  • The immediate application of this article is to Kaner’s courses on software metrics and software requirements analysis. Students would be well-advised to read this summary of my lectures carefully (yes, this stuff is probably on the exam).
  • The broader application is to the increasingly large group of software development practitioners who are considering using qualitative measures as a replacement for many of the traditional software metrics. For example, we see a lot of attention to qualitative methods in the agenda of the 2014 Conference of the Association for Software Testing. We won’t be able to make it to CAST this year, but perhaps these notes will provide some additional considerations for their discussions.

On Measurement

Managers have common and legitimate informational needs that skilled measurement can help with. They need information in order to (for example…)

  • Compare staff
  • Compare project teams
  • Calculate actual costs
  • Compare costs across projects or teams
  • Estimate future costs
  • Assess and compare quality across projects and teams
  • Compare processes
  • Identify patterns across projects and trends over time

Executives need these types of information, whether we know how to provide them or not.

Unfortunately, there are strong reasons to be concerned about the use of traditional metrics to answer questions like these. These are human performance measures. As such, they must be used with care or they will cause dysfunction (Austin, 1996). That has been a serious real-life problem (e.g. Hoffman 2000). The empirical basis supporting several of them has been substantially exaggerated (Bossavit, 2014). Many of the managers who use them know so little about mathematics that they don’t understand what their measurements mean and their primary uses are to placate management or intimidate staff. Many of the consultants who give talks and courses advocating metrics also seem to know little about mathematics or about measurement theory. They seem unable to distinguish strong questions from weak ones, unable to discuss the underlying validity of the measures they advocate, and so they seem reliant on appeals to authority, on the intimidating quality of published equations, and on the dismissal of the critic as a nitpicker or an apologist for undisciplined practices.

In sum, there are problems with the application of traditional metrics in our field. It is no surprise that people are looking for alternatives.

In a history of psychological research methods, Kurt Danziger (1994) discusses the distorting impact of quantification on psychological measurement. (See especially his Chapter 9, From quantification to methodolatry.) Researchers designed experiments that looked more narrowly at human behavior, ignoring (designing out of the research) those aspects of behavior or experience that they could not readily quantify and interpret in terms of statistical models.

“All quantitative data is based upon qualitative judgments.”
(Trochim, 2006 at http://www.socialresearchmethods.net/kb/datatype.php)

Qualitative methods might sometimes provide a richer description of a project or product that is less misleading, easier to understand, and more effective as a source of insight. However, there are problems with the application of qualitative approaches.

  • Qualitative reports are, at their core, subjective.
  • They are subject to bias at every level (how the data are gathered or selected, stored, analyzed, interpreted and reported). This is a challenge for every qualitative researcher, but it is especially significant in the hands of an untrained researcher.
  • They are based on selected data.
  • They aren’t very helpful for making comparisons or for providing quantitative estimates (like, how much will this cost?).
  • They are every bit as open to abuse as quantitative methods.
  • And it costs a lot of effort to do qualitative measurement well.

We are fans of measurement (qualitative or quantitative) when it is done well and we are unenthusiastic about measurement (qualitative or quantitative) when it is done badly or sold overenthusiastically to people who aren’t likely to understand what they’re buying.

Because this paper won’t propagandize qualitative measurement as the unquestioned embodiment of sweetness and light, some readers might misunderstand where we are coming from. So here is a little about our background.

  • As an undergraduate, Kaner studied mainly mathematics and philosophy. He also took two semesters of coursework with Kurt Danziger. We only recently read Danziger (1994) and realized how profoundly Danziger has influenced Kaner’s professional development and perspective. As a doctoral student in experimental psychology, Kaner did some essentially-qualitative research (Kaner et al., 1978) but most of his work was intensely statistical, applying measurement theory to human perception and performance (e.g. Kaner, 1983). He applied qualitative methods to client problems as a consultant in the 1990’s. His main stream of published critiques of traditional quantitative approaches started in 1999 (Kaner, 1999a, 1999b). He wrote explicitly about the field’s need to use qualitative measures in 2002. He started giving talks titled “Software Testing as a Social Science” in 2004, explicitly identifying most software engineering measures as human performance measures subject to the same types of challenges as we see in applied measurement in psychology and in organizational management.
  • Fiedler’s (2006, 2007) dissertation used Cultural-Historical Activity Theory (CHAT)–a framework for analyzing and organizing qualitative investigations–to examine portfolio management software in universities. Kaner & Fiedler started applying CHAT to scenario test design in 2007. We presented a qualitative-methods tutorial and a long paper with detailed pointers to the literature at CAST in 2009 and at STPCon in Spring 2013. We continue to use and teach these ideas and have been working for years on a book relating qualitative methods to the design of scenario tests.

We aren’t new to qualitative methods. This is not a shiny new fad for us. We are enthusiastic about increasing the visibility and use of these methods but we are keenly aware of the risk of over-promoting a new concept to the mainstream in ways that dilute the hard parts until all that remains are buzzwords and rituals. (For us, the analogies are Total Quality Management, Six Sigma, and Agile Development.)

Perhaps some notes on what makes qualitative measures “good” (and what doesn’t) might help slow that tide.

No, This is Not Qualitative

Maybe you have heard a recommendation to make project status reporting more qualitative. To do this, you create a dashboard with labels and faces. The labels identify an area or issue of concern, such as how buggy the software is. And instead of numbers, use colored faces because this is more meaningful. A red frowny-face says, There is trouble here. A yellow neutral-face says, Things seem OK, nothing particularly good or bad to report now. And a green smiley-face says, Things go well. You could add more differentiation by having a brighter red with a crying-face or a screaming-or-cursing-face and by having a brighter green with a happy-laughing face.

See, there are no numbers on this dashboard, so it is not quantitative, right?

Wrong.

The faces are ordered from bad to good. You can easily assign numerals to these, 1 for red-screaming-face through 5 for green-laughing-face, you can talk about the “average” (median) score across all the categories of information, you can even draw graphs of the change of confidence (or whatever you map to happyfacedness) from week to week across the project.

This might not be very good quantitative measurement but as qualitative measurement it is even worse. It uses numbers (symbols that are equivalent to 1 to 5) to show status without emphasizing the rich detail that should be available to explain and interpret the situation.

When you watch a consultant present this as qualitative reporting, send him away. Tell him not to come back until he actually learns something about qualitative measures.

OK, So What is Qualitative?

A qualitative description of a product or process is a detail-rich, multidimensional story (or collection of stories) about it. (Creswell, 2012; Denzin & Lincoln 2011; Patton, 2001).

For example, if you are describing the value of a product, you might present examples of cases in which it has been valuable to someone. The example wouldn’t simply say, “She found it valuable.” The example would include a description of what made it valuable, perhaps how the person used it, what she replaced with it, what made this one better than the last one, and what she actually accomplished with it. Other examples might cover different uses. Some examples might be of cases in which the product was not useful, with details about that. Taken together, the examples create an overall impression of a pattern – not just the bias of the data collector spinning the tale he or she wants to tell. For example, the pattern might be that most people who try to do THIS with the product are successful and happy with it, but most people who try to do THAT with it are not, and many people who try to use this tool after experience with this other one are likely to be confused in the following ways …

When you describe qualitatively, you are describing your perceptions, your conclusions, and your analysis. You back it up with examples that you choose, quotes that you choose, and data that you choose. Your work should be meticulously and systematically even-handed.  This work is very time-consuming.

Quantitative work is typically easier, less ambiguous, requires less-detailed knowledge of the product or project as a whole, and is therefore faster.

If you think your qualitative measurement methods are easier, faster and cheaper than the quantitative alternatives, you are probably not doing the qualitative work very well.

Quality of Qualitative

In quantitative measurement, questions about the value of a measure boil down to questions of validity and reliability.

A measurement is valid to the extent that it provides a trustworthy description of the attribute being measured. (Shadish, Cook & Campbell, 2001)

A measurement is reliable to the extent that repeating the same operations (measuring the same thing in the same ways) yields the same (or similar) results.

In qualitative work, the closest concept corresponding to validity is credibility (Guba & Lincoln, 1989). The essential question about the credibility of a report of yours is, Why should someone else trust your work? Here are examples of some of the types of considerations associated with credibility.

Examples of Credibility-Related Considerations

The first and most obvious consideration is whether you have the background (knowledge and skill) to be able to collect, interpret and explain this type of data.

Beyond that, several issues come up frequently in published discussions of credibility. (Our presentation is based primarily on Agostinho, 2005; Creswell, 2012; Erlandson et al., 1993; Finlay, 2006; and Guba & Lincoln, 1989.)

  • Did you collect the data in a reasonable way?
    • How much detail?: Students of ours work with qualitative document analysis tools, such as ATLAS.ti, Dedoose, and NVivo. These tools let you store large collections of documents (such as articles, slides, and interview transcripts), pictures, web pages, and videos (https://en.wikipedia.org/wiki/Computer_Assisted_Qualitative_Data_Analysis_Software). We are now teaching scenario testers to use the same types of tools. If you haven’t worked with one of these, imagine a concept-mapping tool that allows you to save all the relevant documents as sub-documents in the same document as the map and allows you to show the relationships among them not just with a two-dimensional concept map but with a multidimensional network, a set of linkages from any place in any document to any place in any other document.

    As you see relevant information in a source item, you can code it. Coding means applying meaningful tags to the item, so that you can see later what you were thinking now. For example, you might code parts of several documents as illustrating high or low productivity on a project. You can also add comments to these examples, explaining for later review what you think is noteworthy about them. You might also add a separate memo that describes your ideas about what factors are involved in productivity on this project, and another memo that discusses a different issue, such as notes on individual differences in productivity that seem to be confounding your evaluation of tool-caused differences. Later, you can review the materials by looking at all the notes you’ve made on productivity—all the annotated sources and all your comments.

    You have to find (or create) the source materials. For example, you might include all the specification-related documents associated with a product, all the test documentation, user manuals from each of your competitors, all of the bug reports on your product and whatever customer reports you can capture for other products, interviews with current users, including interviews with extremely satisfied users, users who abandoned the product and users who still work with the product but hate it. Toss in status reports, comments in the source code repository, emails, marketing blurbs, and screen shots. All these types of things are source materials for a qualitative project.

    You have to read and code the material. Often, you read and code with limited or unsophisticated understanding at first. Your analytical process (and your ongoing experience with the product) gives you more insight, which causes you to reread and recode material. The researcher typically works through this type of material in several passes, revising the coding structure and adding new types of comments (Patton, 2001). New information and insights can cause you to revise your analysis and change your conclusions.

    The final report gives a detailed summary of the results of this analysis.

    • Prolonged engagement: Did you spend enough time at the site of inquiry to learn the culture, to “overcome the effects of misinformation, distortion, or presented ‘fronts’, to establish rapport and build the trust necessary to overcome constructions, and to facilitate immersing oneself in and understanding the context’s culture”?
    • Persistent observation: Did you observe enough to focus on the key elements and to add depth? The distinction between prolonged engagement and persistent observation is the difference between having enough time to make the observations and using that time well.
    • Triangulation and convergence. Triangulation leads to credibility by using different or multiple sources of data (time, space, person), methods (observations, interviews, videotapes, photographs, documents), investigators (single or multiple), or theory (single versus multiple perspectives of analysis).” (Erlandson et al. 1993, p. 137-138). “The degree of convergence attained through triangulation suggests a standard for evaluating naturalistic studies. In other words, the greater the convergence attained through the triangulation of multiple data sources, methods, investigators, or theories, the greater the confidence in the observed findings. The convergence attained in this manner, however, never results in data reduction but in an expansion of meaning through overlapping, compatible constructions emanating from different vantage points.” (Erlandson et al. 1993, p. 139).
  • Are you summarizing the data fairly?
  • How are you managing your biases (people are often not conscious of the effects of their biases) as you select and organize your observations?
  • Are you prone to wishful thinking or to trying to please (or displease) people in power?
    • Peer debriefing: Did you discussion your ideas with one or more disinterested peers who gave constructively critical feedback, questioned your ideas, methods, motivation, and conclusions?
    • Disconfirming case analysis: Did you look for counter-examples? Did you revise your working hypotheses in light of experiences that were inconsistent with them?
    • Progressive subjectivity: As you observed situations or created and looked for data to assess models, how much did you pay attention to your own expectations? How much did you consider the expectations and observations of others. An observer who affords too much privilege to his or her own ideas is not paying attention.
    • Member checks: If you observed / measured / evaluated others, how much did you involve them in the process? How much influence did they have over the structures you would use to interpret the data (what you saw or heard or read) that you got from them? Do they believe you accurately and honestly represented their views and their experiences? Did you ask?

Transferability

The concerns that underlie transferability are much the same as for external validity (or generalization validity) of traditional metrics:

  • If you or someone else did a comparable study in a different setting, how likely is it that you would make the same observations (see similar situations, tradeoffs, examples, etc.)?
  • How well would your conclusions apply in a different setting?

When evaluating a research report, thorough description is often a key element. The reader doesn’t know what will happen when someone tries a similar study in the future, so they (and you) probably cannot authoritatively predict generalizability (whether people in other settings will see the same things). However, if you describe what you saw well enough, in enough detail and with enough attention to the context, then when someone does perform a potentially-comparable study somewhere else, they will probably be able to recognize whether they are seeing things that are similar to what you were seeing.

Over time, a sense of how general something is can build as multiple similar observations are recorded in different settings

Dependability

The concerns that underlie dependability are similar to those for internal validity of traditional metrics. The core question is whether your work is methodologically sound.

Qualitative work is more exploratory than quantitative (at least, more exploratory than quantitative work is traditionally described). You change what you do as you learn more or as you develop new questions. Therefore consistency of methodology is not an ultimate criterion in qualitative work, as it is for some quantitative work.

However, a reviewer can still ask how well (methodologically) you do your work. For example:

    • Do you have the necessary skills and are you applying them?
    • If you lack skills, are you getting help?
    • Do you keep track of what you’re doing and make your methodological changes deliberately and thoughtfully? Do you use a rigorous and systematic approach?

Many of the same ideas that we mentioned under credibility apply here too, such as prolonged engagement, persistent observation, effort to triangulate, disconfirming case analysis, peer debriefing, member checks and progressive subjectivity. These all describe how you do your work.

  • As issues of credibility, we are asking whether you and your work are worth paying attention to? Your attention to methodology and fairness reflect on your character and trustworthiness.
  • As issues of methodology, we are asking more about your skill than about your heart.

Confirmability

Confirmability is as close to reliability as qualitative methods get, but the qualitative approach does not rest as firmly on reliability. The quantitative measurement model is mechanistic. It assumes that under reasonably similar conditions, the same acts will yield the same results. Qualitative researchers are more willing to accept the idea that, given what they know (and don’t know) about the dynamics of what they are studying, under seemingly-similar circumstances, the same things might not happen next time.

We assess reliability by taking repeated measurements (do similar things and see what happens). We might assess confirmability as the ability to be confirmed rather than whether the observations were actually confirmed. From that perspective, if someone else works through your data:

  • Would they see the same things as you?
  • Would they generally agree that things you see as representative are representative and things that you see as idiosyncratic are idiosyncratic?
  • Would they be able to follow your analysis, find your records, understand your ways of classifying things and agree that you applied what you said you applied?
  • Does your report give your reader enough raw data for them to get a feeling for the confirmability of your work?

In Sum

Qualitative measurements tell a story (or a bunch of stories). The skilled qualitative researcher relies on transparency in methods and data to tell persuasive stories. Telling stories that can stand up to scrutiny over time takes enormous work. This work can have great value, but to do it, you have to find time, gain skill, and master some enabling technology. Shortchanging any of these areas can put your credibility at risk as decision-makers rely on your stories to make important decisions.

References

S. Agostinho, (2005, March). “Naturalistic inquiry in e-learning research“, International Journal of Qualitative Methods 4 (1).

R. D. Austin (1996). Measuring and Managing Performance in Organizations. Dorset House.

L. Bossavit (2014). The Leprechauns of Software Engineering: How folklore turns into fact and what to do about it. Leanpub.

J. Creswell (2012, 3rd ed.). Qualitative Inquiry and Research Design: Choosing Among Five Approaches. Sage Publications.

K. Danziger (1994). Constructing the Subject: Historical Origins of Psychological Research. Cambridge University Press.

N.K. Denzin & Y.S. Lincoln (2011, 4th ed.) The SAGE Handbook of Qualitative Research. Sage Publications.

D.A. Erlandson, E.L. Harris,B.L. Skipper & S.D. Allen, S. D. (1993). Doing Naturalistic Inquiry: A Guide to Methods. Sage Publications.

R. L. Fiedler (2006). “In transition”: An activity theoretical analysis examining electronic portfolio tools’ mediation of the preservice teacher’s authoring experience. Unpublished Ph.D. dissertation, University of Central Florida (Publication No. AAT 3212505).

R.L. Fiedler (2007). “Portfolio authorship as a networked activity“. Paper presented at the Society for Information Technology and Teacher Education.

R. L. Fiedler & C. Kaner (2009). “Putting the context in context-driven testing (an application of Cultural Historical Activity Theory)“. Conference of the Association for Software Testing. Colorado Springs, CO.

L. Finlay (2006). “‘Rigour’, ‘Ethical Integrity” or ‘Artistry”? Reflexively reviewing criteria for evaluating qualitative research.” British Journal of Occupational Therapy. 69 (7), 319-326.

E.G. Guba & Y.S. Lincoln (1989). Fourth Generation Evaluation. Sage Publications.

D. Hoffman (2000). “The darker side of metrics,” presented at Pacific Northwest Software Quality Conference, Portland, OR.

C. Kaner (1983). Auditory and visual synchronization performance over long and short intervals, Doctoral Dissertation: McMaster University.

C. Kaner (1999a). “Don’t use bug counts to measure testers.” Software Testing & Quality Engineering, May/June, 1999, p. 80.

C. Kaner (1999b). “Yes, but what are we measuring?” (Invited address) Pacific Northwest Software Quality Conference, Portland, OR.

C. Kaner (2002). “Measuring the effectiveness of software testers.”15th International Software Quality Conference (Quality Week), San Francisco, CA.

C. Kaner (2004). “Software testing as a social science.” IFIP Working Group 10.4 meeting on Software Dependability, Siena, Italy.

C. Kaner & R. L. Fiedler (2013). “Qualitative Methods for Test Design“. Software Test Professionals Conference (STPCon), San Diego, CA.

C. Kaner, B. Osborne, H. Anchel, M. Hammer & A.H. Black (1978). “How do fornix-fimbria lesions affect one-way active avoidance behavior?86th Annual Convention of the American Psychological Association, Toronto, Canada.

M.Q. Patton (2001, 3rd ed.). Qualitative Research & Evaluation Methods. Sage Publications.

W.R. Shadish, T.D. Cook & D.T. Campbell (2001, 2nd ed.). Experimental and Quasi-Experimental Designs for Generalized Causal Inference. Cengage.

W.M.K. Trochim (2006). Research Methods Knowledge Base.

Wikipedia (2014). https://en.wikipedia.org/wiki/Computer_Assisted_Qualitative_Data_Analysis_Software

 

Why propose an advanced certification in software testing?

March 26th, 2014

A couple of weeks ago, I posted A proposal for an advanced certification in software testing. There were plenty of comments, on the blog, on Twitter, and in private email to me.

I think the best way to respond to these is with a series of posts, each one focused on a different issue. This first one goes to the fundamental question, Why should we create such a thing?

I used to see certifications as irrelevant (and misleading)

For a long time, when people asked me whether they should get certified in software testing, I said no. I would say that, in my opinion, there is no value in the current certifications.

I know more good testers who are not certified than good ones who are certified. I feel as though I’ve met a whole lot of clueless fools who carry testing certifications.

Many of the exam-review courses teach to the exam and present an oversimplified and outdated view of the field. I think that, from a what-will-you-learn perspective, taking them is a waste of time and money.

It used to seem obvious to me that certification must be irrelevant to a tester’s career.

The market proved me wrong

Unfortunately, my predictions that the community would see the ISTQB/ASQ/QAI-type credential as irrelevant were proved wrong.

The fact that hundreds of thousands of people in our field have decided to get certified demonstrates, in and of itself, that the credential is widely perceived as relevant.

I think that a willingness to discover and publish that you were mistaken is one of the critical traits of a scientist. The history of science is the story of of a never-ending stream of ideas that were well supported at the time—but were proved wrong. They were replaced with better ideas that were more useful and better-supported—and proved wrong too.

It seems to me that I can’t be a great tester (or an adequate scientist) if I am reluctant to subject my beliefs and ideas to the same level of criticism that I apply to the work of others.

In retrospect, I realize that I misread the evolution of certification in 1990 through 2010.

  • The testing community had a growing core of people who had decided to do this work as a career. They believed they were committed to doing good work and that they were good at what they did.
  • The demand for testing services was exploding, with floods of new people who had little background, varying levels of commitment and increasingly inflated salary expectations.
  • Many of the people who saw themselves as professionals were getting tired of being characterized as unskilled, clueless bureaucrats by so many other people in the development community.
  • Many of the people involved in recruiting testers or setting their pay scales don’t know enough about testing to tell the good ones from incompetents who can spin persuasive resumes and interviews.
  • In this environment, even if you are a test manager with really good hiring instincts, you still have the challenge of justifying the salaries you want to pay to people who don’t understand your staff.

Certification was sold as a formal credential, something that demonstrates (at a minimum) that you are committed enough to the field to go through the hassle of getting certified. And as proof that you are at least familiar with the basics of the field and that you are good enough at precision reading to be able to pass a formal exam.

If there is no stronger credential in the field, it is easy to see this as better than nothing.

I think that some of the get-certified sales pitches goes far beyond than this, saying or implying that certification demonstrates that a person has genuine professional competence. I think that goes far beyond what any of these certifications could possibly attest to, but I think that’s the impression that is sometimes encouraged.

We can argue about the motivation and about the marketing. We can speculate endlessly about why someone would spend good money on exam-prep courses so they could get one or more of these certifications.

I think it is more useful to ask whether we can give them better value for their time and money.

One approach: Open Certification

My interest in creating a better alternative to the current certifications is not new. Back in 2006, Mike Kelly and I started hosting workshops to plan an “Open Certification”. The idea was to create a huge, open pool of multiple-choice questions and to examine candidates via a random stratified sample of questions from the pool. However, there were some insurmountable problems:

  • We were determined to not be tied to one proprietary body of knowledge. But consider this example: Suppose we are willing to accept six different widely-used definitions of “test case.” Which one is the right one for this exam? And what if the student encounters (and answers on the basis of) Definition 7? How do we say that one is wrong?
    • The obvious way to deal with this is to write the question to say “Famous Person 1’s definition of test case is …” but what do people have to do to prepare for such an exam? Do they have to memorize 6 different definitions and the names of the people we tie those definitions to? Almost no one could pass such an exam. An even if you could pass it, all the memorizing you would have to do in order to pass it would be an abuse of your time.
  • We were determined, back then, to do something extremely cheap or free. But the development and maintenance costs for the software and questions were going to be very high. Even if we could get volunteer labor to create the first drafts of the exam (and exam site), we would need to do a lot of sustaining engineering. People were going to have to be paid.
  • The exam would be free but with this complex a series of questions, how long would it be before training companies started selling exam prep courses? The cost of the exams is not the big cost factor in the other certifications. It is the cost of the training. Were we kidding ourselves about making a difference here?
  • Finally, there was the most difficult problem. Even if the exam was successful, it would still be a bunch of multiple-choice questions. Our approach to certification wouldn’t be offering any better evidence of deep knowledge or skill than the others.

I forget his exact words, but Mike laid out an important criterion early in the project. If we couldn’t be confident of developing something clearly better than the alternative we were replacing, we shouldn’t bother doing it. As we proceeded, it became clearer and clearer that we were creating something that might be cheaper, but that probably wasn’t better.

Eventually, we pulled the plug on Open Certification.

But that was not abandonment of the idea of a better certification. It was a recognition that we didn’t have a better idea, yet.

In parallel with the Open Certification project, I was transforming BBST from a purely academic course to a very student-challenging industrial course.

One of the really valuable outcomes of the Open Certification meetings was a “standard” for drafting challenging multiple-choice test questions. I applied this to the BBST courses, creating a suite of quiz questions that BBST’s graduates have come to know and love.

But we didn’t stop with multiple-choice. We used multiple-choice as a tutorial tool, not as the core examiner. BBST demanded a much higher level of knowledge and skill than I knew how to get from multiple-choice exams. I concluded that something along these lines was a better way to go.

Another alternative

Rather than trying to replace the ASQ/ISTQB/QAI approach,  I think we can build on it.

  • Let people get one of those credentials. Or let them get some other credential that is challenging but that approaches the field in less simplistic terms. Treat their credential-from-training as a baseline.
  • From here, let the tester present a portfolio of evidence that s/he can do more than just pass an exam or two—that s/he can actually do competent work in the field.

The person who can demonstrate both, mastery of basic training and a competent portfolio gets an advanced certification.

I think this gives us two important advances:

  • It breaks out of the ideological stranglehold that a few vendors have had on credentialing in our field.
  • It presents a richer view of the capabilities and contributions of the person who carries the credential.

This isn’t perfect, but it’s better. I think that has some value.

 

A proposal for an advanced certification in software testing

March 3rd, 2014

This is a draft of a proposal to create a more advanced, more credible credential (certification) in software testing.

The core idea is a certification based on a multidimensional collection of evidence of education, experience, skill and good character.

  • I think it is important to develop a credential that is useful and informative.
    • I think we damage the reputation of the field if we create a certification that requires only a shallow knowledge of software testing.
    • I think we damage the value of the certification if we exaggerate how much knowledge or skill is required to obtain it.
  • I think it is important to find a way to tolerate different approaches to software testing, and different approaches to training software testers. This proposal is not based on any one favored “body of knowledge” and it is not tied to any one ideology or group of vendors.

The idea presented here is imperfect—as are the other certifications in our field. It can be gamed—as can the others. Someone who is intent on gaining a credential via cheating and fraud can probably get away with it for a while—but the others have security risks too. This certification does not assure that the certified person is competent—neither do the others. The certification does not subject the certified person to formal professional accountability for their work—neither do the others—and even though certificate holders say that they will follow a code of ethics, we have no mechanism for assuring that they do or punishing them if they don’t—and neither do the others.

With all these we-don’t-do-thises and we-don’t-promise-thats, you might think I’m kidding about this being a real proposal. I’m not.

Even if we agree that this proposed certification lacks the kinds of powers that could be bestowed by law or magic, I think it can provide useful information and that it can create incentives that favor higher ethics in job-seeking and, eventually, professional practice. It is not perfect, but I think it is far better than what we have now.

The Proposal

This credential is based on a collection of several different types of evidence that, taken together, indicate that the certificate holder has the knowledge and skill needed to competently perform the usual services provided by a software tester.

Here are the types of evidence. As you read this, imagine that the Certification Body hosts a website that will permanently post a publicly-viewable dossier (a collection of files) for every person certified by that body. The dossier would include everything submitted by an applicant for certification, plus some additional material. Here’s what we’d find in the file.

Authorization by the Applicant

As part of the application, the applicant for Certification would grant the Certification Board permission to publish all of the following materials. The applicant would also sign a legal waiver that would shield the Board from all types of legal action by the applicant / Certified Tester arising out of publication of the materials described below. The waiver will also authorize the Board to exercise its judgment in assessing the application and will shield the Board from legal action by the applicant if the Board decides, in its unfettered discretion, to reject the applicant’s application or to later cancel the applicant’s Certification.

Education (Academic)

The Certified Tester should have at least a minimum level of formal education. The baseline that I imagine is a bachelor’s-level degree in a field relevant to software testing.

  • Some fields, such as software engineering, are obviously relevant to software testing. But what about others like accounting, mathematics, philosophy, physics, psychology, or technical writing? We would resolve this by requiring the applicant for certification to explain in writing how and why her or his education has proved to be relevant to her or his experiences as a tester and why it should be seen as relevant education for someone in the field.
  • The requirement for formal education should be waived if the applicant requests waiver and justifies the request on the basis of a sufficient mix of practical education and professional achievement.

Education (Practical)

The Certified Tester should have successfully completed a significant amount of practical training in software testing. Most of this training would typically be course-based, typically commercial training. Some academic courses in software testing would also qualify. A non-negotiable requirement is successful completion of at least some courses that are considered advanced. “Successful” completion means that the student completed an exam or capstone project that a student who doesn’t know the material would not pass.

  • There is an obvious accreditation issue here. Someone has to decide which courses are suitable and which are advanced.
  • I think that many different types of courses and different topics might be suitable as part of the practical training. For example, suppose we required 100 classroom-hours of training (1 training day = 6 classroom hours). Perhaps 60 of those hours could be in related fields (programming, software metrics, software-related law, project accounting, etc.) but a core would have to be explicitly focused on testing.
  • I think the advanced course hours (24 classroom hours?) would have to be explicitly advanced software testing courses.
  • There is no requirement that these courses come from any particular vendor or that they follow any particular software testing or software development ideology.

Examination

The Certified Tester should have successfully completed a proctored, advanced, examination in software testing.

  • This requirement anticipates competing exams offered by several different groups that endorse different approaches to software testing. Our field does not have agreement on one approach or even one vocabulary. The appearance of agreement that shows up in industry “standards” is illusory. As a matter of practice (I think, often good practice), the standards are routinely ignored by practitioners. Examinations that adopt or endorse these standards should be welcome but not mandatory.

Which exams are suitable and which are advanced?

There is an obvious accreditation issue here. Someone has to decide which exams are suitable and which are advanced.

I am inclined to tentatively define an advanced exam as one that requires as minimum prerequisites (a) successful completion of a specified prior exam and (b) additional education and experience. For example, ISTQB Foundations would not qualify but an ISTQB Advanced or Expert exam might. Similarly, BBST:Foundations would not qualify but BBST:Bug Advocacy might and BBST:Domain Testing definitely should.

An exam might be separate from a course or it might be a final exam in a sufficiently advanced course.

For an exam to be used by a Certified Tester, the organization that offers and grades the exam must provide the Certification Board with a copy of a sample exam. The organization must attest under penalty of perjury that they believe the sample is fairly representative of the scope and difficulty of the actual current exam. This sample will appear on the Certification Board’s website, and be accessible as a link from the Certified Tester’s dossier. (Thus, the dossier doesn’t show the Certified Tester’s actual exam but it does show an exam that is comparable to the actual one.)

What about the reliability and the validity of the exams?

Let me illustrate the problem with two contrasting examples:

  • I think it is fair to characterize ISTQB as an organization that is striving to create highly reliable exams. To achieve this, they are driven toward questions that have unambiguously correct answers. Even in sample essay questions I have seen for the Expert exam, the questions and the expected answers are well-grounded in a published, relatively short, body of knowledge. I think this is a reasonable and respectable approach to assessment and I think that exams written this way should be considered acceptable for this certification.
  • The BBST assessment philosophy emphasizes several other principles over reliability. We expect answers to be clearly written, tightly focused on the question that was asked, with a strong logical argument in favor of whatever position the examinee takes in her or his answer, that demonstrates relevant knowledge of the field. We expect a diversity of points of view. I think it gives the examiner greater insight into the creativity and depth of knowledge of the examinee. I think this is also a reasonable and respectable approach to assessment that we should also consider acceptable for this certification.

There is a tradeoff between these approaches. Approaches like ISTQB’s are focused on the reliability of the exam, especially on between-grader reliability. This is an important goal. The BBST exams are not focused on this. For certification purposes, we would expect to improve BBST reliability by using paired grading (two examiners) but this is imperfect. I would not expect the same level of reliability in BBST exams that ISTQB achieves. However, in my view of the assessment of cognitively complex skills, I believe the BBST approach achieves greater validity. Complicating the issue, there are problems in the measurement of both, reliability and validity, of education-related exams.

The difference here is not just a difference of examination style. I believe it reflects a difference in ideology.

Somehow, the Certification Board will have to find a way to accredit some exams as “sufficiently serious” tests of knowledge even though one is obviously more reliable than the other, one is obviously more tightly based on a published body of knowledge than the other, etc.

Somehow, the Certification Board will have to find a way to refuse to accredit some exams even though they have the superficial form of an exam. In general, I suspect that the Certification Board will cast a relatively broad net and that if groups like ASQ and QAI offer advanced exams, those exams will probably qualify. Similarly, I suspect that a final exam in a graduate-level university course that is an “advanced” software testing course (prerequisite being successful completion of an earlier graduate-level course in testing) would qualify.

Professional Achievement

Professional achievements include publications, honors (such as awards), and other things that indicate that the candidate did something at a professional level.

An applicant for certification does not have to include any professional achievements. However, if the applicant provides them, they will become part of the applicant’s dossier and will be publicly visible.

Some decisions will lie in the discretion of the Certification Board. For example, the Certification Board:

  • might or might not accept an applicant’s academic background as sufficiently relevant (or as sufficiently complete)
  • might or might not accept an applicant’s training-experience portfolio as sufficient or as containing enough courses that are sufficiently related to software testing

In such cases, the Certification Board will consider the applicant’s professional achievements as additional evidence of the applicant’s knowledge of the field.

References

The applicant will provide at least three letters of endorsement from other people who have stature in the field. These letters will be public, part of the Certified Tester’s dossier. An endorsement is a statement from a person that, in that person’s opinion, the applicant has the knowledge, skills and character needed to competently provide the services of a professional software tester. The letter should provide additional details that establish that the endorser knows the knowledge, skill and character of the applicant well enough to credibly make an endorsement.

  • A person of stature is someone who is experienced in the field and respected. For example, the person might be (this is not a complete list)
    • personally known to the Certification Board
    • a Certified Tester
    • a Senior Member or Distinguished Member or Fellow of ACM, ASQ, or IEEE
  • If one of the endorsers withdraws his or her endorsement, that withdrawal will be published in the Certified Tester’s dossier along with the original endorsement (now marked “withdrawn”) and the Certified Tester will be required to get a new endorser.
  • If one of the apparent endorsers contacts the Certification Board and asserts that s/he did not write an endorsement for an applicant and that s/he does not endorse the applicant, and if the apparent endorser provides credible proof of identify, that letter will be published in the Certified Tester’s dossier along with the original letter (now marked “disputed”).

Professional Experience

The applicant will provide a detailed description of his or her professional history that includes at least N years of relevant experience.

  • The applicant must attest that this description is true and not materially incomplete. It will be published as part of the dossier. Potential future employers will be able to check the claims made here against the claims made in the applicant’s application for work with them.
  • The descriptions of relevant positions will include descriptions of the applicant’s role(s) and responsibilities, including typical tasks s/he performed in that position
  • The applicant’s years of relevant experience and years of formal education will interact: Someone with more formal education that is relevant to the field will be able to become certified with less relevant experience (but never less than K years of experience).

Continuing Education

The candidate must engage in professional activities, including ongoing study, to keep the certification.

Code of Ethics

The candidates must agree to abide by a specific Code of Ethics, such as the ACM code. We should foresee this as a prelude to creating an enforcement structure in which a Certified Tester might be censured or certification might be publicly canceled for unethical conduct.

Administrative Issues

Somehow, we have to form a Certification Board. The Board will have to charge a fee for application because the website, the accrediting activities, evaluation of applications, marketing of the certification, etc., will cost money.

Benefits

This collection of material does not guarantee competence, but it does present a multidimensional view of the capability of an experienced person in the field. It speaks to a level of education and professional involvement and to the credibility of self-assertions made when someone applies for a job, submits a paper for publication, etc. I think that the public association of the endorser with the people s/he endorses will encourage most possible endorsers to think carefully about who they want to be permanently publicly identified with. I think the existence of the dossier will discourage exaggeration and fraud by the Certified Tester.

It is not perfect, but I think it will be useful, and better than what I think we have now.

This is not a certification of a baseline of competence in the way that certifications (licenses) work in fields like law, engineering, plumbing, and cosmetology. Those are regulated professions in which the certified person is subject to penalties and civil litigation for conduct that falls below baseline. Software engineering (including software testing) is not a regulated profession, there is no such cause of action in the courts as “software engineering malpractice,” and there are no established penalties for incompetence. There is broad disagreement in the field about whether such regulations should exist (for example, the Association for Computing Machinery strongly opposes the licensing of software engineers while the IEEE seems inclined to support it) and the creation of this certification does not address the desirability of such regulation.

The Current Goal: A Constructive Discussion

This article is a call for discussion. It is not yet a call for action, though I expect we’ll get there soon.

This article follows up an article I wrote last May about credentialing systems. I identified several types of credentials in use in our field and suggested four criteria for a better credential:

  • reasonably attainable (people could affort to get the credential, and reasonably smart people who worked hard could earn it),
  • credible (intellectually and professionally supported by senior people in the field who have earned good reputations),
  • scalable (it is feasible to build an infrastructure to provide the relevant training and assessment to many people), and
  • commercially viable (sufficient income to support instructors, maintainers of the courseware and associated documentation, assessors (such as graders of the students and evaluators of the courses), some level of marketing (because a credential that no one knows about isn’t worth much), and in the case of this group, money left over for profit. Note that many dimensions of “commercial viability” come into play even if there is absolutely no profit motive—-the effort has to support itself, somehow).

I think the proposal in this article sketches a system that would meet those criteria.

A more detailed draft of this proposal was reviewed at the 2014 Workshop on Teaching Software Testing. We did not debate alternative proposals or attempt to reach consensus. The ideas in this paper are not the product of WTST. Nor are they the responsibility of any participant at WTST. However, I am here acknowledging the feedback I got at that meeting and thanking the participants: Scott Allman, Janaka Balasooriya, Rex Black, Jennifer Brock, Reetika Datta, Casey Doran, Rebecca L. Fiedler, Scott Fuller, Keith Gallagher, Dan Gold, Douglas Hoffman, Nawwar Kabbani, Chris Kenst, Michael Larsen, Jacek Okrojek, Carol Oliver, Rob Sabourin, Mike Sowers, and Andy Tinkham. Payson Hall has also questioned the reasoning and offered useful suggestions.

To this point, we have been discussing whether these ideas are worthwhile in principle. That’s important and that discussion should continue.

We have not yet begun to tackle the governance and implementation issues raised by this proposal. It is probably time to start thinking about that.

  • I’m positively impressed by (what I know of) the governance model of ISTQB and wonder whether we should follow that model.
  • I would expect to be an active supporter/contributor to the governance of this project (for example an active member of the governing Board). However—just as I helped found AST but steadfastly refused to run for President of AST—I believe we can find a better choice than me for chief executive of the project.

Comments?

New Book: Foundations of Software Testing–A BBST Workbook

February 14th, 2014

New Book: Foundations of Software Testing—A BBST Workbook

Rebecca Fiedler and I just published our first book together, Foundations of Software Testing—A BBST Workbook.

Becky and I started working on the instructional design for the online version of the BBST (Black Box Software Testing) course in 2004. Since then,Foundations has gone through three major revisions. Bug Advocacy and Test Design have gone through two.

Our Workbooks mark our first major step toward the next generation of BBST™.

We are creating the new versions of BBST through Kaner, Fiedler & Associates, a training company that we formed to provide an income stream for ongoing evolution of these courses. BBST is a registered trademark of Kaner, Fiedler & Associates.

What’s in the Book

The Workbook includes slides, lecture transcripts, orientation activities and feedback, application activities, exam advice, and author reflections. Here are some some details:

All the course slides

Foundations has 304 slides. Some of these are out of date. We provide notes on these in the Author’s Reflections.

A transcript of the six lectures.

The transcripts are almost word-for-word the same as the spoken lecture. They actually reproduce the script that I wrote for the lecture. In a few cases, my scripts are a little longer than what actually made it past the video edits. We lay the transcript and the slides out together, side-by-side. In an 8.5×11 printed book, this is a great format for taking notes. Unfortunately, it doesn’t translate to Kindle well, so there is no Kindle edition of the book.

Four Orientation Activities

Orientation activities introduce students to a key challenge considered in a lecture. The student puzzles through the activity for 30 to 90 minutes, typically before watching the lecture, then sees how the lecture approaches this type of problem. The typical Foundations course has two to four of these.

The workbook presents the instructions for four activities, along with detailed feedback on them, based on past performance of students in the online and university courses.

I revised, rewrote or added (new) all of these activities for this Workbook. Because, in my opinion, the most important learning in BBST comes from what the students actually do in the class, the new Orientation and Application activities create a substantial revision to the course.

In my university courses, I practice continuous quality improvement, revising all of them every term in response to (a) my sense (and to what ever relevant data I have collected) about strengths and weaknesses that showed up in previous of the course or (b) ideas that have demonstrated their value in other courses and can be imported into this one. Most of the updates are grounded in a long series of revisions that I used and evaluated in my university-course version of BBST.

Two Application Activities

An application activity applies ideas or techniques presented in a lecture or developed over several lectures. The typical application activity calls for two to six hours of work, per student. The typical Foundations course has one to two of these.

One of these is revised from the public BBST, the other completely rewritten.

Advice on answering our essay-style exam questions

The advice runs 11 pages. I also provide a practice question and detailed feedback on the structure of the answer.

I think the advice is good for anyone taking the course, but it is particularly focused on university students who are preparing for an exam that will yield graded results (A, B, Pass-with-distinction, etc.). The commercial versions of BBST are typically pass-fail, so some of the fine details in this advice are beyond the needs of those readers. If you are a university student, I recommend this as a tighter and more polished presentation than the exam-preparation essay included in the public course.

Author reflections

My reflections present my sense of the strengths and weaknesses of the current course, the ways we are addressing those with the new activities, and some of the changes we see coming in the next generation of videos.

Because Foundations is written to introduce students to the fundamental challenges in software testing, some of my reflections add commentary on widely-debated issues in the field. Some of these might become focus points for the usual crowd to practice their Sturm und Drang on Twitter.

Who the Book is For

We want to support three groups with the BBST Workbooks:

  • Self-studiers. Many people watch the course videos on their own. The course videos for the current version of Foundations are available for free, along with the slides, the course readings, and the public-course versions of four activities and the study guide (list of essay questions for the exam). The Workbook updates the activities, and provides detailed feedback for all of the orientation activities, and provides several design notes on the orientation and application activities. If you are studying BBST on your own or with a few friends, we believe this provides much better support than the videos alone.
  • In-house trainers. If you are planning to teach BBST to staff at your company, the Workbook is an inexpensive textbook to support the students. The feedback for the activities provides a detailed survey of common issues and ideas raised in each activity. If your trainees submit their work to you for review, you might want to supplement these notes with comments that are specific to each student’s work. The comments in the workbook should cover most of the comments that you would otherwise repeat from student to student. The instructors’ reflections will, we hope, give you ideas about how to tailor the application activities (or replace them) to make them suitable for your company.
  • Students in instructor-led courses. The BBST Foundations in Software Testing Workbook is an affordably-priced (retail price $19.99) supplement to any instructor-led course. Students will appreciate the convenience of print versions of the course slides and lectures for ongoing reference. Instructors will appreciate the level of feedback provided to students in the workbook.

Buy the Book
Foundations of Software Testing: A BBST Workbook is available from:

Evolution of the BBST Courses

February 13th, 2014

With our first teaching of the new BBST:Domain Testing course (based on The Domain Testing Workbook) and our revision of BBST:Foundations with the Foundations of Software Testing workbook, Rebecca Fiedler and I have started to introduce the next generation of BBST. Recently, we’ve been getting requests for papers or interviews on where BBST came from and where it’s going.

  • This note is a short summary of the history of BBST. You can find many more details in the articles I’ve been posting to this blog over the last decade, as I tried to think through the course design’s strengths, weaknesses and potential.
  • My next post, and an upcoming article in Testing Circus, look at the road ahead.

What is BBST™ ?

BBST is a series of courses on Black Box Software Testing. The overall goal of the series is to improve the state of the practice in software testing by helping testers develop useful testing skills and deeper insights into the challenges of the field.

(Note: BBST is a registered trademark of Kaner, Fiedler & Associates.)

Today’s BBST Courses

Today, most people familiar with the BBST courses think of a four-week, fully online course. Rebecca Fiedler and I started working on the instructional design for the BBST online courses back in 2004, with funding from the National Science Foundation. The courses have gotten excellent reviews. We’ve taken Foundations through three major revisions. Bug Advocacy and Test Design have had two. We’re working on our next major update now. You’ll read more about that in my next post.

The typical instructor-led course is organized around six lectures (about six hours of talk, divided into one-hour parts), with a collection of activities. To successfully complete a typical instructor-led course, a student spends about 12-15 hours per week for 4 weeks (48-60 hours total). Most of the course time is spent on the activities:

  • Orientation activities introduce students to a key challenge considered in a lecture. The student puzzles through the activity for 30 to 90 minutes, typically before watching the lecture, then sees how the lecture approaches this type of problem. The typical Foundations course has two to four of these.
  • Application activities call for two to six hours of work. It applies ideas or techniques presented in a lecture or developed over several lectures. The typical Foundations course has one to two of these.
  • Multiple-choice quizzes help students identify gaps in their knowledge or understanding of key concepts in the course. These questions are tough because they are designed to be instructional or diagnostic (to teach you something, to deepen your knowledge of something, or to help you recognize that you don’t understand something) rather than to fairly grade you.
  • Various other discussions that help the students get to know each other better, chew on the course’s multiple-choice quiz questions, or consider other topics of current interest.
  • An essay-style final exam.

In the instructor-led course, students get feedback on the quality of their work from each other and, to a lesser or greater degree (depends on who’s teaching the course), they get feedback from the instructors. Students in our commercial courses (which we offer through Kaner, Fiedler & Associates) get a lot of feedback. Students in courses taught by unpaid volunteer instructors are more likely to get most of their feedback from the other students.

So, that’s today (in the online course).

However, BBST has actually been around for 20 years.

Background on the BBST Course Design

I started teaching BBST in 1994, with Hung Quoc Nguyen, for the American Society for Quality in Silicon Valley. This was the commercial version of the course (taught to people working as testers). Development of the course was significantly influenced by:

  • Detailed peer reviews of the live class and of the circulating slide decks. The reviews included detailed critiques from colleagues when I made significant course updates (I offered free beta-review classes to test the updates).
  • Co-teaching the material with colleagues. We would learn together by cross-teaching material, often challenging points in each other’s slides or lecture in front of the students. For example, I taught with James Bach, Elisabeth Hendrickson, Doug Hoffman and (for the metrics material) Pat Bond, a professor at Florida Tech.
  • Rational Software, which contracted with me to create a customized version of BBST to support testing under the Rational Unified Process. They criticized the course in detail over several pilot teachings, and allowed me to apply what I learned back to the original course.

In 1999, I decided that if I wanted to learn how to significantly improve the instructional value of the course, I was going to have to see how teachers help students learn complex topics and skills in university. My sense was, and is, that good university instruction goes much deeper and demands more from the students than most commercial training.

Florida Tech hired me in 2000 to teach software engineering and encouraged me to evolve BBST down two parallel tracks:

  • a core university course that would be challenging for our graduate students and a good resume-builder when our students looked for jobs
  • a stronger commercial course that demanded more from the students.

We correctly expected that the two tracks would continually inform each other. Getting feedback from practitioners would help us keep the academic stuff real and useful. Trying out instructional ideas in the classroom would give us ideas for redesigning the learning experience of commercial students.

By 2003, I realized that most of my students were doing most of their learning outside the classroom. They claimed to like my lectures, but they were learning from assignments and discussions that happened out of the classroom. In 2004, I decided to try taping the lectures. The students could watch these at home, while we did activities in the classroom that had previously been done out of class. This went well, and in 2005, I created a full set of course videos.

I used the 2005 videos in my own classes. I put a Creative Commons license on the videos and posted them, along with other supporting materials, on my lab’s website. Rebecca Fiedler and I also started giving talks to educators about our results, such as these two papers (Association for Educational & Communications Technology conference and the Sloan Conference on Asynchronous Learning Networks in 2005).

These days, what we were doing has a name (“flipping“) and the Open Courseware concept is old news. Back then, it was still hard to find examples of other people doing this. Even though many other people were experimenting with the same ideas, not many people were yet publishing and so we had to puzzle through the instructional ideas by reading way too much stuff and thinking way too hard about way too many conflicting opinions and results. We summarized our own design ideas in the 2005 presentations (cited above). A good sample of the literature we were reading appeared in our applications for funding to the National Science Foundation, such as the one that was funded (2007), which gave us the money to pay graduate students to help with evaluation and redesign of the course (yielding the current public version).

For readers interested in the “science” that informed our course design, I’m including an excerpt from the 2007 Grant Proposal at the end of this article.

The Collaboration with AST

While we were sketching the first BBST videos, we were also working to form AST (the Association for Software Testing). AST incorporated in 2004. Perhaps a year later, Rebecca and I decided that the academic version of online BBST could probably be adapted for working testers. The AST activists at that time were among my closest professional friends, so it was natural to bring this idea to them.

We began informally, of course. We started by posting a set of videos on a website, but people kept asking for instructor support—for a “real” class. By this point (late 2006), the Florida Tech course was maturing and I was confident (in retrospect, laughably overconfident) that I could translate what was working in a mixed online-plus-face-to-face university class to a fully online course for practitioners located all over the world. The result worked so badly that everyone dropped out (even the instructors).

We learned a lot from the details of the failure, looked more carefully at how other university instructors had redesigned traditional academic courses to make them effective for remote students who had full-time jobs and who probably hadn’t sat in an academic classroom for several years (so their academic skills were rusty). After a bunch more pilot-testing, I offered the first BBST:Foundations as a one-month class (essentially the modern structure) in October, 2007.

We offered BBST:Foundations through AST, adding BBST:Bug Advocacy in 2008, redoing BBST:Foundations (new slides, videos, etc.) in 2010, and adding BBST:Test Design in 2012.

AST was our learning lab for commercial courseware. Florida Tech’s testing courses, and my graduate research assistants at Florida Tech, were my learning lab for the academic courseware. I would try new ideas at Florida Tech and bring the ones that seemed promising into the AST courses as minor or major updates. All the while, I was publishing the courseware at my lab’s website, testingeducation.org, and encouraging other people to use the material in their courses.

We trained and supervised a crew of volunteer instructors for AST’s BBST, but other people were teaching the course (or parts of it) too. This included professors, commercial trainers, managers teaching their own staff how to test, etc. Becky created an instructor’s course (how to teach BBST), which we offered as an instructor-led course through AST but which we also offered as a free learning experience on the web (study it yourself at your own pace). In 2012, we published a 357-page Instructor’s Manual for BBST. We published the book as a Technical Report (a publication method available to university professors) so that we could supply it to the public for free.

Underlying much of the AST collaboration was a hope that we could create an open courseware community that would function like some of the successful open software communities.

  • In the open source software world, many of the volunteers who maintain and enhance open source software are able to charge people for support services. That is, the software (courseware) is free but if you want support, you have to pay for it. The support money creates an income stream that makes it possible for skilled people to spend time improving the software.
  • We hoped that we could create a similar type of structure for open source courseware (the BBST courses). You can see the thinking, for example, in a 2008 paper that Rebecca and I wrote with Scott Barber, Building a free courseware community around an online software testing curriculum.

It turns out that this is a very complex idea. It is probably too complex for a small professional society that handles most of its affairs pretty informally.

For now, Rebecca and I have formed Kaner, Fiedler & Associates to sustain BBST instead. That is, KFA sells commercial BBST training and the income stream makes it possible for us to make new-and-improved versions of BBST.

AST might also create its own project to maintain and enhance BBST. If so, we’ll probably see the evolution of contrasting designs for the next generations of the courses. We think we’d learn a lot from that dynamic and we hope that it happens.

An Excerpt from our 2007 Grant Proposal

This is from our application for NSF Award CCLI-0717613, Adaptation and Implementation of an Activity-Based Online or Hybrid Course in Software Testing. (When we acknowledge support from NSF, we are required to remind you that National Science Foundation does not endorse any opinions, findings, conclusions or recommendations that arose out of NSF-funded research.) The full application is available online but it is very concisely written, structured according to very specific NSF guidelines, and packed with points that address NSF-specific concerns. Here is the most relevant section of that 56-page document here, in terms of explaining our approach and literature review for the course’s instructional design.

3. Our Current Course (Black Box Software Testing—BBST)

We adopted the new teaching method in Spring 2005 after pilot work in 2004. Our new approach spends precious student contact hours on active learning experiences (more projects, seminars and labs) that involve real-world problems, communication skills, critical thinking, and instructor scaffolding [129, 136] without losing the instructional benefits of polished lectures. Central to a problem-based learning environment is that students focus on “becoming a practitioner, not simply learning about practice” [122, p. 3]

Anderson et al.’s [11] update to Bloom’s taxonomy [20] is two-dimensional, knowledge and cognitive processing.

  • On the Knowledge dimension, the levels are Factual Knowledge (such as the definition of a software testing technique), Conceptual Knowledge (such as the theoretical model that predicts that a given test technique is useful for finding certain kinds of bugs), Procedural Knowledge (how to apply the technique), and Metacognitive Knowledge (example: the tester decides to study new techniques on realizing that the ones s/he currently knows don’t apply well to the current situation.)
  • On the Cognitive Process dimension, the levels are Remembering (such as remembering the name of a software test technique that is described to you), Understanding (such as being able to describe a technique and compare it with another one), Applying (actually doing the technique), Analyzing (from a description of a case in which a test technique was used to find a bug, being able to strip away the irrelevant facts and describe what technique was used and how), Evaluating (such as determining whether a technique was applied well, and defending the answer), and Creating (such as designing a new type of test.).

For most of the material in these classes, we want students to be able to explain it (conceptual knowledge, remembering, understanding), apply it (procedural knowledge, application), explain why their application is a good illustration of how this technique or method should be applied (understanding, application, evaluation), and explain why they would use this technique instead of some other (analysis).

3.1 We organize classes around learning units that typically include:

  • Video lecture and lecture slides. Students watch lectures before coming to class. Lectures can convey the lecturer’s enthusiasm, which improves student satisfaction [158] and provide memorable examples to help students learn complex concepts, tasks, or cultural norms [47, 51, 115]. They are less effective for teaching behavioral skills, promoting higher-level thinking, or changing attitudes or values [19]. In terms of Bloom’s taxonomy [11, 20], lectures would be most appropriate for conveying factual and conceptual knowledge at the remembering and understanding levels. Our students need to learn the material at these levels, but as part of the process of learning how to analyze situations and problems, apply techniques, and evaluate their own work and the work of their peers. Stored lectures are common in distance learning programs [138]. Some students prefer live lectures [45, 121] but on average, students learn as well from video as live lecture [19, 139]. Students can replay videos [53] which can help students whose first language is not English. Web-based lecture segments supplement some computer science courses [34, 44]. Studio-taped, rehearsed lectures with synchronously presented slides (like ours) have been done before [29]. Many instructors tape live lectures, but Day and Foley [30-34] report their students prefer studio-produced lectures over recorded live lectures. We prefer studio-produced lectures because they have no unscripted interruptions and we can edit them to remove errors and digressions.
  • Application to a product under test. Each student joins an open source software project (such as Open Office or Firefox) and files work with the project (such as bug reports in the project’s bug database) that they can show and discuss during employment interviews. This helps make concepts “real” to students by situating them in the development of well-regarded products [118]. It facilitates transfer of knowledge and skills to the workplace, because students are doing the same tasks and facing the same problems they would face with commercial software [25]. As long as the assignments are not too far beyond the skill and knowledge level of the learner, authentic assignments yield positive effects on retention, motivation, and transfer [48, 52, 119, 153].
  • Classroom activities. We teach in a lab with one computer per student. Students work in groups. Activities are open book, open web. The teacher moves from group to group asking questions, giving feedback, or offering supplementary readings that relate to the direction taken by an individual group. Classroom activities vary. Students might apply ideas, practice skills, try out a test tool, explore ideas from lecture, or debate a question from the study guide. Students may present results to the class in the last 15 minutes of the 75-minute class. They often hand in work for (sympathetic) grading: we use activity grades to get attention [141] and give feedback, not for high-stakes assessment. We want students laughing together about their mistakes in activities, not mourning their grades [134].
  • Examples. These supplementary readings or videos illustrate application of a test technique to a shipping product. Worked examples can be powerful teaching tools [25], especially when motivated by real-life situations. They are fundamental for some learning styles [43]. Exemplars play an important role in the development and recollection of simple and complex concepts [23, 126, 146]. The lasting popularity of problem books, such as the Schaum’s Outline series and more complex texts like Sveshnikov [148] attests to the value of example-driven learning, at least for some learners. However, examples are not enough to carry a course. In our initial work under NSF Award EIA-0113539 ITR/SY+PE: Improving the Education of Software Testers, we expected to be able to bring testing students to mastery of some techniques through practice with a broad set of examples. Padmanabhan [113, 132] applied this to domain testing in her Master’s thesis project at Florida Tech, providing students with 18 classroom hours of instruction, including lecture, outlines of ways to solve problems, many practice exercises and exams. Students learned exactly what they were taught. They could solve new problems similar to those solved in class. However, in their final exam, we included a slightly more complicated problem that required them to apply their knowledge in a way that had been described in lecture but not specifically practiced. The students did the same things well, in almost exactly the same ways. However, they all failed to notice problems that should have been obvious to them but that only required a small stretch from their previous drills. This result was a primary motivator for us to redesign the testing course from a lecture course heavy with stories, examples and practice to more heavily emphasize more complex activities.
  • Assigned readings.
  • Assignments, which may come with grading rubrics. These are more complex tasks than in-class activities. Students typically work together over a two-week period.
  • Study guide questions. At the start of the course, we give students a list of 100 questions. All midterm and final exam questions come from this pool. We discuss use and grading of these questions in [60] and make that paper available to students. We encourage group study, especially comparison of competing drafts of answers. We even host study sessions in a café off campus (buying cappuccinos for whoever shows up). We encourage students to work through relevant questions in the guide at each new section of the class. These help self-regulated learners monitor their progress and understanding—and seek additional help as needed. They can focus their studying and appraise the depth and quality of their answers before they write a high-stakes exam. Our experience of our students is consistent with Taraban, Rynearson, & Kerr’s [149]—many students seem not to be very effective readers or studiers, nor very strategic in the way they spend their study time—as a result, they don’t do as well on exams as we believe they could. Our approach gives students time to prepare thoughtful, well-organized, peer-reviewed answers. In turn, this allows us to require thoughtful, well-organized answers on time-limited exams. This maps directly to one of our objectives (tightly focused technical writing). We can also give students complex questions that require time to carefully read and analyze, but that don’t discriminate against students whose first language is not English because these students have the questions well in advance and can seek guidance on the meaning of a question.

Excerpt from the Proposal’s references:

11. Anderson, L.W., Krathwohl, D.R., Airasian, P.W., Cruikshank, K.A., Mayer, R.A., Pintrich, P.R., Raths, J. and Wittrock, M.C. A Taxonomy for Learning, Teaching &  Assessing: A Revision of Bloom’s Taxonomy of Educational Objectives. Longman, New York, 2001.

19. Bligh, D.A. What’s the Use of Lectures? Jossey-Bass, San Francisco, 2000.

20. Bloom, B.S. (ed.), Taxonomy of Educational Objectives: Book 1 Cognitive Domain. Longman, New York, 1956

23. Brooks, L.R. Non-analytic concept formation and memory for instances. in Rosch, E. and Lloyd, B.B. eds. Cognition and categorization, Erlbaum, Hillsdale, NJ, 1978, 169-211.

25. Clark, R.C. and Mayer, R.E. e-Learning and the Science of Instruction. Jossey-Bass/Pfeiffer, San Francisco, CA, 2003.

29. Dannenberg, R.P. Just-In-Time Lectures, undated.

30. Day, J.A. and Foley, J. Enhancing the classroom learning experience with Web lectures: A quasi-experiment GVU Technical Report GVU-05-30, 2005.

31. Day, J.A. and Foley, J., Evaluating Web Lectures: A Case Study from HCI. in CHI ’06 (Extended Abstracts on Human Factors in Computing Systems), (Quebec, Canada, 2006), ACM Press, 195-200. Retrieved January 4, 2007, from http://www3.cc.gatech.edu/grads/d/Jason.Day/documents/er703-day.pdf

32. Day, J.A., Foley, J., Groeneweg, R. and Van Der Mast, C., Enhancing the classroom learning experience with Web lectures. inInternational Conference of Computers in Education, (Singapore, 2005), 638-641. Retrieved January 4, 2007, fromhttp://www3.cc.gatech.edu/grads/d/Jason.Day/documents/ICCE2005_Day_Short.pdf

33. Day, J.A., Foley, J., Groeneweg, R. and Van Der Mast, C. Enhancing the classroom learning experience with Web lecturesGVU Technical Report GVU-04-18, 2004.

34. Day, J.A. and Foley, J.D. Evaluating a web lecture intervention in a human-computer interaction course. IEEE Transactions on Education49 (4). 420-431. Retrieved December 31, 2006.

43. Felder, R.M. and Silverman, L.K. Learning and teaching styles in engineering education. Engineering Education78 (7). 674-681.

44. Fintan, C., Lecturelets: web based Java enabled lectures. in Proceedings of the 5th annual SIGCSE/SIGCUE ITiCSE Conference on Innovation and technology in computer science education, ( Helsinki, Finland, 2000), 5-8.

45. Firstman, A. A comparison of traditional and television lectures as a means of instruction in biology at a community college., ERIC, 1983.

47. Forsyth, D., R. The Professor’s Guide to Teaching: Psychological Principles and Practices. American Psychological Association, Washington, D.C., 2003.

48. Gagne, E.D., Yekovich, C.W. and Yekovich, F.R. The Cognitive Psychology of School Learning. HarperCollins, New York, 1994.

51. Hamer, L. A folkloristic approach to understanding teachers as storytellers. International Journal of Qualitative Studies in Education12 (4). 363-380, from http://ejournals.ebsco.com/direct.asp?ArticleID=NLAW20N8B16TQKHDEECM.

52. Haskell, R.E. Transfer of learning: Cognition, instruction, and reasoning. Academic Press, San Diego, 2001.

53. He, L., Gupta, A., White, S.A. and Grudin, J. Corporate Deployment of On-demand Video: Usage, Benefits, and Lessons, Microsoft Research, Redmond, WA, 1998, 12.

60. Kaner, C., Assessment in the software testing course. in Workshop on the Teaching of Software Testing (WTST), (Melbourne, FL, 2003), from http://kaner.com/pdfs/AssessmentTestingCourse.pdf

113. Kaner, C. and Padmanabhan, S., Practice and transfer of learning in the teaching of software testing. in Conference on Software Engineering Education & Training, (Dublin, 2007).

115. Kaufman, J.C. and Bristol, A.S. When Allport met Freud: Using anecdotes in the teaching of Psychology. Teaching of Psychology28 (1). 44-46.

118. Lave, J. and Wenger, E. Situated Learning: Legitimate Peripheral Participation. Cambridge University Press, Cambridge, England, 1991.

119. Lesh, R.A. and Lamon, S.J. (eds.). Assessment of authentic performance in school mathematics. AAAS Press, Washington, DC, 1992.

121. Maki, W.S. and Maki, R.H. Multimedia comprehension skill predicts differential outcomes of web-based and lecture courses.Journal of Experimental Psychology: Applied8 (2). 85-98.

122. MaKinster, J.G., Barab, S.A. and Keating, T.M. Design and implementation of an on-line professional development community: A project-based learning approach in a graduate seminar Electronic Journal of Science Education, 2001.

126. Medin, D. and Schaffer, M.M. Context theory of classification learning. Psychological Review85 (207-238)

129. National Panel Report. Greater Expectations: A New Vision for Learning as a Nation Goes to College, Association of American Colleges and Universities, Washington, D.C., 2002.

132. Padmanabhan, S. Domain Testing: Divide & Conquer Department of Computer Sciences, Florida Institute of Technology, Melbourne, FL, 2004.

134. Paris, S.G. Why learner-centered assessment is better than high-stakes testing. in Lambert, N.M. and McCombs, B.L. eds.How Students Learn: Reforming Schools Through Learner-Centered Education, American Psychological Association, Washington, DC, 1998.

136. Project Kaleidoscope. Project Kaleidoscope Report on Reports: Recommendations for Action in support of Undergraduate Science, Technology, Engineering, and Mathematics. Investing in Human Potential: Science and Engineering at the Crossroads, Washington, D.C., 2002. Retrieved January 16, 2006, from http://www.pkal.org/documents/RecommentdationsForActionInSupportOfSTEM.cfm.

138. Rossman, M.H. Successful online teaching using an asynchronous learner discussion forum. Journal of Asynchronous Learning Networks3 (2), from http://www.sloan-c.org/publications/jaln/v3n2/v3n2_rossman.asp.

139. Saba, F. Distance education theory, methodology, and epistemology: A pragmatic paradigm. in Moore, M.G. and Anderson, W.G. eds. Handbook of Distance Education, Lawrence Erlbaum Associates, Mahwah, New Jersey, 2003, 3-20.

141. Savery, J.R. and Duffy, T.M. Problem Based Learning: An Instructional Model and Its Constructivist Framework, Indiana University, Bloomington, IN, 2001.

146. Smith, D.J. Wanted: A New Psychology of Exemplars. Canadian Journal of Psychology59 (1). 47-55

148. Sveshnikov, A.A. Problems in probability theory, mathematical statistics and theory of random functions. Saunders, Philadelphia, 1968

149. Taraban, R., Rynearson, K. and Kerr, M. College students’ academic performance and self-reports of comprehension strategy use. Reading Psychology21 (4). 283-308.

153. Van Merrienboer, J.J.G. Training complex cognitive skills: A four-component instructional design model for technical training. Educational Technology Publications, Englewood Cliffs, NJ, 1997.

158. Williams, R.G. and Ware, J.E. An extended visit with Dr. Fox: Validity of student satisfaction with instruction ratings after repeated exposures to a lecturer. American Educational Research Journal14 (4). 449-457.

Thinking about the fraud against Target

January 22nd, 2014

I read an interesting article in the Wall Street Journal today: http://online.wsj.com/news/articles/SB10001424052702304027204579332990728181278?mod=%3C%25mst.param%28LINKMODPREFIX%29

Basically, the theory presented in the article is that there are these wonderful credit/debit cards with embedded chips that are much more secure than the current system. If only Target (and other retailers) had adopted these, we would have less fraud. Apparently, the fault lies with Target.

I imagine that the expected response to this article is “What were they thinking?” as the reader realizes that more-effective technology was at hand at what might have been a reasonable price.

I got to watch some of this play out in the late 1990’s. At the time, I was working as a technology-focused lawyer and one of the areas I worked on was electronic payment systems. I published a few papers on this. One available from my website appeared in 1998 in the Journal of Electronic Commerce, called “SPLAT! Requirements bugs on the information superhighway“, see http://kaner.com/pdfs/splat.pdf

The issues I wrote about in this (and related papers) involved the use of public-key encryption systems to guarantee identity. The same commercial-liability issues were coming up for chip cards, with the same rationale.

These systems offered the potential of significantly reducing fraud in consumer transactions. Fraud was seen as a big problem. With these savings of billions of dollars of losses, some credit card company representatives spoke of being able to noticeably lower their fees and interest rates. Who wouldn’t want that?

Unfortunately, some financial services firms (and some other folks) saw two opportunities here.

  1. They hated paying money to criminals committing fraud
  2. They hated guaranteeing every credit card transaction in the event of fraud—they wanted to put this risk back on the consumer but current legislation wouldn’t let them

The proposals to adopt encryption-based identification systems in commerce tied these together. The proposed laws would:

  1. authorize the use of encryption-based identification as equivalent to an ink signature
  2. treat the encryption-based identification as absolutely authoritative, so that if someone successfully impersonated you, you would bear all the loss. Current law sticks the financial-services firms with the risk of credit-card fraud losses because they design the system and decide how much security to build into it. The proposed new system would be an alternative to the consumer-protected credit-card system. It would flip the risk to the consumer.

I think legislation would have easily passed that provided incentives to adopt encryption-based identification. For example, the legislation could have created a “rebuttable presumption” — an instruction to a court to assume that a message encrypted with your key came from you and if you wanted to deny that, you would have to prove it.  This legislation would have reduced fraud, which would benefit everyone. (Well, everyone but the criminals…)

Unfortunately, the demand went further. Even if you could prove that you were the victim of identity theft that was in no way your fault, you would still be held accountable for the loss. 

The lawyers advocating for incentivizing encryption-based identification weren’t willing to separate the proposals. The result of their inflexibility was opposition to encryption-based payment-related identification systems (including chip cards). One dimension of the opposition was technical–the security of the payment systems was almost certainly less (and therefore the risk of fraud that was created by the system and not by negligence of the consumer was greater) than the most enthusiastic proponents imagined. Another dimension was irritation with what was perceived as greed and unwillingness to compromise.

Back then, I saw this play out because I was helping a committee write the Uniform Electronic Transactions Act (UETA). This eventually passed in most states and was then federalized under the name ESIGN. ESIGN now governs electronic payments in the United States. The multi-year drafting process that yielded UETA/ESIGN offered a unique opportunity to write incentives for stronger identification systems into the laws governing electronic payments. Instead, we chose to write legislation that accepted a status quo that involved too much fraud, with prospects of much worse fraud to come. I was one of the people who successfully encouraged the UETA drafting committee to take this less-secure route because there was no politically-feasible path to what seemed like the obvious compromise.

Our economy has benefited enormously from legislation that lets you buy something by clicking “I agree”, without having to sign a physical piece of paper with a physical ink-pen. We could have done this better. Instead, we accepted the predictable future outcome that the United States would continue to use insecure payment systems, that would result in ongoing fraud, like the latest attacks on Target, Neiman Marcus, and (apparently, according to recent reports) at least six other national retailers.

On the design of advanced courses in software testing

January 19th, 2014

This year’s Workshop on Teaching Software Testing (WTST 2014) is on teaching advanced courses in software testing. During the workshop, I expect we will compare notes on how we design/evaluate advanced courses in testing and how we recognize people who have completed advanced training.

This post is an overview of one of the two presentations I am planning for WTST.

This presentation will consider the design of the courses. The actual presentation will rely heavily on examples, mainly from BBST (Foundations, Bug Advocacy, Test Design), from our new Domain Testing course, and from some of my non-testing courses, especially statistics and metrics. The slides that go with these notes will appear at the WTST site in late January or early February.

In the education community, a discussion like this would come as part of a discussion of curriculum design. That discussion would look more broadly at the context of the curriculum decisions, often considering several historical, political, socioeconomic, and psychological issues. My discussion is more narrowly focused on the selection of materials, assessment methods and teaching-style tradeoffs in a specialized course in a technical field. The broader issues come into play, but I find it more personally useful to think along six narrower dimensions:

  • content
  • institutional considerations
  • skill development
  • instructional style
  • expectations of student performance
  • credentialing

Content

In terms of the definition of “advanced”, I think the primary agreement in the instructional community is that there is no agreement about the substance of advanced courses. A course can be called advanced if it builds on other courses. Under this institutional definition, the ordering of topics and skills (introductory to advanced) determines what is advanced, but that ordering is often determined by preference or politics rather than by principle.

I am NOT just talking here about fields whose curricula involve a lot of controversy. Let me give an example. I am currently teaching Applied Statistics (Computer Science 2410). This is parallel in prerequisites and difficulty to the Math department’s course on Mathematical Statistics (MTH 2400). When I started teaching this, I made several assumptions about what my students would know, based on years of experience with the (1970’s to 1990’s) Canadian curriculum. I assumed incorrectly that students would learn very early about the axioms underlying algebra—this was often taught as Math 100 (1st course in the university curriculum). Here, it seems common to find that material in 3rd year. I also assumed incorrectly that my students would be very experienced in the basics of proving theorems. Again mistaken, and to my shock, many CS students will graduate, having taken several required math courses, with minimal skills in formal logic or theorem proof. I’m uncomfortable with these choices (in the “somebody moved my cheese” sense of the word “uncomfortable”)—it doesn’t feel right, but I am confident that these students studied other topics instead, topics that I would consider 3rd-year or 4th-year. Even in math, curriculum design is fluid and topics that some of us consider foundational, others consider advanced.

In a field like ours (testing) that is far more encumbered with controversy, there is a strong argument for humility when discussing what is “foundational” and what is “advanced”.

Institutional Considerations

In my experience, one of the challenges in teaching advanced topics is that many students will sign up who lack basic knowledge and skills, or who expect to use this course as an opportunity to relitigate what they learned in their basic course(s). This is a problem in commercial and university courses, but in my experience, it is much easier to manage in a university because of the strength and visibility of the institutional support.

To make space for advanced courses, institutions that designate a courses as advanced are likely to

  • state and enforce prerequisites (courses that must be taken, or knowledge/skill that must be demonstrated before the student can enrol in the advanced course)
  • accept transfer credit (a course can be designated as equivalent to one of the institution’s courses and serve as a prerequisite for the advanced course)

The designation sets expectations. Typically, this gives instructors room to:

  1. limit class time spent rehashing foundational material
  2. address topics that go beyond the foundational material (whatever material this institution has designated as foundational)
  3. tell students who do not know the foundational material (or who cannot apply it to the content of the advanced course) that it is their responsibility to catch up to the rest of the class, not the course’s responsibility to slow down for them
  4. demand an increased level of individual performance from the students (not just work products on harder topics, but better work products that the student produces with less handholding from the instructor)

Note clearly that in an institution like a university, the decisions about what is foundational, what is advanced, and what prerequisites are required for a particular course are made by groups of instructors, not by the administrators of the institution. This is an idealized model–it is common for institutional administrators to push back, encouraging instructors to minimize the number of prerequisites they demand for any particular course and encouraging instructors to take a broader view of equivalence when evaluating transfer credits. But at its core, the administration adopts structures that support the four benefits that I listed above (and probably others). I think this is the essence of what we mean by “protecting the standards” of the institution.

Skill Development

I think of a skill as a type of knowledge that you can apply (you use it, rather than describe it) and your application (your peformance) improves with deliberate practice.

Students don’t just learn content in courses. They learn how to learn, how to investigate and find/create new ideas or knowledge on their own, how to find and understand the technical material of their field, how to critically evaluate ideas and data, how to communicate what they know, how to work with other students, and so on. Every course teaches some of these to some degree. Some courses are focused on these learning skills.

Competent performance in a professional field involves skills that go beyond the learning skills. For example, skills we must often apply in software testing include:

  • many test design techniques (domain testing, specification-based testing, etc.). Testers get better with these through a combination of theoretical instruction, practice, and critical feedback
  • many operational tasks (setting up test systems, running tests, noting what happened)
  • many advanced communication skills (writing that combines technical, persuasive and financial considerations)

Taxonomies like Bloom’s make the distinction between memorizable knowledge and application (which I’m describing as skill here). Some courses, and some exams, are primarily about memorizable knowledge and some are primarily about application.

In general, in my own teaching, I think of courses that focus on memorizable knowledge as survey courses (broad and shallow). I think of survey courses as foundational rather than advanced.

Most survey courses involve some application. The student learns to apply some of the content. In many cases, the student can’t understand the content without learning to apply it at least to simple cases. (In our field, I think domain testing–boundary and equivalence class analysis–is like this.) It seems to me that courses run on a continuum, how much emphasis on learning things you can remember and describe versus learning ways to apply knowledge more effectively. I think of a course that is primarily a survey course as a survey course, even if it includes some application.

Instructional Style

Lecture courses are probably the easiest to design and the easiest to sell. Commercial and university students seem to prefer courses that involve a high proportion of live lecture.

Lectures are effective for introducing students to a field. They introduce vocabulary (not that students remember much of it–they forget most of what they learn in lecture). They convey attitudes and introduce students to the culture of the field. They can give students the sense that this material is approachable and worth studying. And they entertain.

Lectures are poor vehicles for application of the material (there’s little space for students to try things out, get feedback and try them again).

In my experience, they are usually also poor vehicles for critical thinking (evaluating the material). Some lecturers develop a style that demands critical thinking from the students (think of law schools) but I think this requires very strong cultural support. Students understand, in law school, that they will flunk out if they come to class unprepared and are unwilling or unable to present and defend ideas quickly, in response to questions that might come from a professor at any time. Lawyers view the ability to analyze, articulate and defend in real time as a core skill in their field and so this approach to teaching is considered appropriate. In other fields that don’t prioritize oral argumentation so highly, a professor who relied on this teaching style and demanded high performance from every student, would be treated as unusual and perhaps inappropriate.

As students progress from basic to advanced, the core experiences they need to support further progress also change, from lecture to activities that require them to do more–more applications to increasingly complex tasks, more critical evaluation of what they are doing, what others are doing, and what they are being told to do or to accept as correct or wise. Fewer things are correct. More are better-for-these-reasons or better-for-these-purposes.

Expectations of Student Performance

More advanced courses demand that students take more responsibility for the quality of their work:

  • The students expect, and tolerate, less specific instructions. If they don’t understand the instructions, the students understand that it is their responsibility to ask for clarification or to do other research to fill in the blanks.
  • The students don’t expect (or know they are not likely to get) worked examples that they can model their answers from or rubrics (step-by-step evaluation guides) that they can use to structure their answers. These are examples of scaffolding, instructional support structures to help junior students accomplish new things. They are like the training wheels on bicycles. Eventually, students have to learn to ride without them. Not just how to ride down this street for these three blocks, but how to ride anywhere without them. Losing the scaffolding is painful for many students and some students protest emphatically that it is unfair to take these away. I think the trend in many universities has been to provide more scaffolding for longer. This cuts back on student appeals and seems to please accreditors (university evaluators) but I think this delays students’ maturation in their field (and generally in their education).

One of the puzzles of commercial instruction is how to assess student performance. We often think of assessment in terms of passing or failing a course. However, assessment is more broadly important, for giving a student feedback on how well she knows the material or how well she does a task. There has been so much emphasis on high-stakes assessment (you pass or you fail) in academic instruction that many students don’t understand the concept of formative assessment (assessment primarly done to give the student feedback in order to help the student learn). This is a big issue in university instruction too, but my experience is that commercial students are more likely to be upset and offended when they are given tough tasks and told they didn’t perform well on them. My experience is that they will make more vehement demands for training wheels in the name of fairness, without being willing to accept the idea that they will learn more from harder and less-well-specified tasks.

Things are not so well specified at work. More advanced instruction prepares students more effectively for the uncertainties and demands of real life. I believe that preparation involves putting students into uncertain and demanding situations, helping them accept this as normal, and helping them learn to cope with situations like these more effectively.

Credentialing

Several groups offer credentials in our field. I wrote recently about credentialing in software testing at http://kaner.com/?p=317. My thoughts on that will come in a separate note to WTST participants, and a separate presentation.