Scientific Evidence: Home Page
There have been recurring proposals to determine the accuracy of (alleged) black-box truth detectors -- e.g., polygraph operators, psychologists' predictions, handwriting experts -- by subjecting such truth detectors to performance tests. (When the black box involves the performance of a human being, it is usually said that the human black box should be subjected to double-blind proficiency tests.) See, e.g., Jennifer Mnookin, "Of Black Boxes, Instruments, and Experts: Testing the Validity of Forensic Science," 5 Episteme 343 (2008); D. Michael Risinger et al., "The Daubert/Kumho Implications of Observer Effects in Forensic Science: Hidden Problems of Expectation and Suggestion," 90 California Law Review 1 (2002). Jennifer Mnookin's abstract states: "This paper argues that judges assessing the scientific validity and the legal admissibility of forensic science techniques ought to privilege testing over explanation. Their evaluation of reliability should be more concerned with whether the technique has been adequately validated by appropriate empirical testing than with whether the expert can offer an adequate description of the methods she uses, or satisfactorily explain her methodology or the theory from which her claims derive."
Such proposals are refreshing: We should be skeptical of the claims of gang experts, polygraph operators, and the like. But the proposals for proficiency or performance testing do raise some conceptual and, perhaps, some practical difficulties. Consider the following correspondence:
Dear [YYYY],
Now to the question of testing the proficiency of black boxes. I have in mind the problem of measuring the proficiency of "experience-based" experts, soft sciences, inarticulate putative experts, oracular expertise, and the like.
Genuinely black -- entirely impenetrable, wholly inexplicable -- information processing boxes ("evidence evaluation boxes"?) should not work well.
Consider, first, a black box whose job it is to determine if marbles were or were not made by X.
One day, a Monday, an experimenter goes to the X marble factory and purchases thousands and thousands of marbles made by X. She goes elsewhere to purchase thousands of marbles not made by X, puts all of those marbles into her box, and observes what happens. What happens is that 98% of the time the box emits "X" when the marble was in fact made by X and 95% of the time it emits "not-X" when the marble was not made by X. (Everyone agrees those error rates are hunky-dory.)
The experimenter examines the marbles again and discovers that marbles made by X have little x's imprinted on them and those not made by X have no such x's. She suspects that the box does a pretty dandy -- but not perfect -- job of detecting x's and their absence.
A week later the experimenter repeats the experiment. She gets almost identical results. (The "almost" worries her, but not overly so. She figures that perhaps she counted the results inaccurately once or twice or that, because of an electrical surge, the box didn't function as intended.)
The following week the experimenter runs the experiment again. But she does so on Tuesday because on Monday she feels rotten. When she does so, an astonishing thing happens: if a marble was not made by X, the box still reports "not-X" 95% of the time, but if a marble was made by X, the box utters "not-X" 98% of the time.
The experimenter's first reaction is, "No matter. If I can specify the circumstances under which the machine utters 'not-X' with a specified frequency when a marble was made by X as well as the circumstances when the box almost always utters 'X' when a marble was made by X, the box will, if inadvertently (so to speak), have an excellent effective hit rate."
The experimenter decides to run the experiment again. Next week she again feels rotten on Monday, her usual day for running experiments. She again decides to run the experiment the following day. She gets the same results she got in Week 3.
The experimenter has a flash: Could it be that the day of the week the experiment is run is the reason for the difference? She thinks, "That should not be. But maybe."
The experimenter runs the experiment again the following Tuesday. She again gets the results she got in Week 3 and Week 4.
She looks at the marbles again. She notices that each of the marbles made by X on Mondays is imprinted with a tiny m. Aha, she says to herself, when a marble has an "x" and an "m", the box utters "X" 98% of the time and when a marble has an "x" but not an "m", the machine utters "not-X" 98% of the time.
The following week she runs the experiment again. But, having learned her lesson, she decides to run it on Wednesday. Good thing too. This time she notices that the box utters "X" almost exactly 50% of the time when marbles were made by X (and "not-X" almost exactly 50% of the time [for marbles made by X]). She is astonished. But she quickly recovers her composure. She thinks, "Aha, plainly it matters not just whether the day is Monday or not Monday; the specific day of the week -- Tuesday, Wednesday, etc. -- matters. And if I'm right, this 50-50 is also potentially a pretty darned good probabilistic indicator of the manufacturer of a marble." (The box continues to utter "not-X" almost exactly 95% of the time for marbles that are not made by X.) She looks at the marbles from X she got on Wednesday and she also looks again at the marbles she got from X on Tuesday. She discovers that each marble made by X on Wednesday is imprinted with a small w as well as with a small x and each marble made by X on Tuesday is imprinted with a small t as well as a small x.
She suspects that for some reason for marbles made by X the box's rate of accurately reporting a hit depends on the day company X manufactures a marble. Sure enough, marbles made on Thursday have "th" imprinted on them and marbles made on Friday have "f" imprinted on them (as well as x). [The company makes no marbles on weekends.] For several weeks she runs these marble experiments and she determines the rates of reported hits (for marbles made by X) are again almost exactly the same for marbles manufactured on any particular day of the week (and substantially different from the box's reported hit rate for marbles not made by X).
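One way to make the experimenter's "pretty darned good probabilistic indicator" remark concrete is to compute likelihood ratios from the rates reported so far. What follows is only a sketch: the three hit rates and the 5% false-alarm rate are taken from the story, while the function name and the decision to summarize the box by likelihood ratios are my own assumptions, not something the experimenter is said to have done.

```python
# Rates taken from the parable: P(box says "X" | marble made by X), by day mark.
HIT_RATE_BY_MARK = {"m": 0.98, "t": 0.02, "w": 0.50}

# P(box says "X" | marble not made by X). The box says "not-X" 95% of the time.
FALSE_ALARM_RATE = 0.05


def likelihood_ratios(hit_rate, false_alarm_rate=FALSE_ALARM_RATE):
    """Return (LR of an "X" report, LR of a "not-X" report).

    Each ratio compares how probable the report is when the marble was
    made by X with how probable it is when the marble was not made by X.
    (Hypothetical helper, introduced only for illustration.)
    """
    lr_x = hit_rate / false_alarm_rate
    lr_not_x = (1.0 - hit_rate) / (1.0 - false_alarm_rate)
    return lr_x, lr_not_x


if __name__ == "__main__":
    for mark, rate in HIT_RATE_BY_MARK.items():
        lr_x, lr_not_x = likelihood_ratios(rate)
        print(f"{mark}: LR('X') = {lr_x:.2f}, LR('not-X') = {lr_not_x:.2f}")
```

On these assumptions an "X" report on a Wednesday marble is about ten times more probable if X made the marble than if X did not, which is the sense in which a 50-50 box can still be informative. The calculation presupposes that the day mark is known and that the rates stay put, which is exactly what the rest of the story calls into question.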
She folds up her lab, goes to the Bahamas for a vacation, and publishes her results.
Out of curiosity, a year later she re-runs her experiment one week. She is distressed to find that the box now seems always to make random choices of saying "X" and "not-X" for marbles made by X. This time, instead of saying, "No matter. A truly random number generator might be quite diagnostic of whether a marble was made by X," she thinks, "Well, gee, perhaps X changed its method of manufacture." But, on a hunch, she instead opens the box and peers into its innards. What she sees changes everything. She sees:
Option 1. She sees that the box has a program that says, "For marbles with 'x' on them, report 'X' 98% of the time when you see 'm', report 'X' 2% of the time when you see 't', report 'X' 50% of the time when you see 'w', report 'X' 33% of the time when you see 'th' ... but do this only for 52 weeks. After that choose at random between saying 'X' and 'not-X'."
Option 2. She sees that the box has a program that says, "For marbles with 'x' on them, report 'X' 98% of the time when you see 'm', report 'X' 2% of the time when you see 't', report 'X' 50% of the time when you see 'w', report 'X' 33% of the time when you see 'th' ... but do this only for 52 weeks. After that choose at random between saying 'X' and 'not-X' for 52 weeks. After that revert to your original instructions."
Option 3. She sees that the box has a program that says, "For marbles with 'x' on them, report 'X' 98% of the time when you see 'm', report 'X' 2% of the time when you see 't', report 'X' 50% of the time when you see 'w', report 'X' 33% of the time when you see 'th' ... Do this for 365 days/year. If you can't follow these instructions on any given day, thereafter choose at random between saying 'X' and 'not-X'." [Perhaps the programmer forgot about leap years.]
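The programs described in these options are simple enough to sketch. The following is a minimal simulation of Option 1 only, under assumptions of my own: weeks are counted from zero, "choose at random" is read as a fair coin, and the function names are hypothetical. The rate for 'f' is left out because the story does not give it.

```python
import random

# Per-mark rates quoted in Option 1. The rate for 'f' is not stated, so it is omitted.
HIT_RATES = {"m": 0.98, "t": 0.02, "w": 0.50, "th": 0.33}

# The program follows its schedule for 52 weeks (weeks 0..51 here, an assumption).
SCHEDULE_WEEKS = 52


def box_says_x(mark, week, rng):
    """Option 1: follow the per-mark schedule for 52 weeks, then flip a fair coin."""
    if week < SCHEDULE_WEEKS:
        return rng.random() < HIT_RATES[mark]
    return rng.random() < 0.5  # assumed reading of "choose at random"


def observed_rate(mark, week, trials, rng):
    """Fraction of 'X' reports for marbles made by X that carry the given mark."""
    hits = sum(box_says_x(mark, week, rng) for _ in range(trials))
    return hits / trials


if __name__ == "__main__":
    rng = random.Random(0)  # seeded so that a run can be repeated
    for week in (0, 51, 52, 60):
        rate = observed_rate("m", week, 10_000, rng)
        print(f"week {week}: observed rate of 'X' for Monday marbles = {rate:.3f}")
```

An experimenter who tests only during the first 52 weeks sees a stable rate for Monday marbles and nothing in her data that hints at the switch. That is the difficulty in miniature: the test results are accurate for the period tested and silent about the period that follows.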
Moral 1 of this semi-finished story [we could take the presence or absence of x out of the picture and I would do so if I had more time]: If the experimenter knows (or suspects) nothing about the workings of the box, she should not be surprised by any changes or variations in the reporting behavior of the box. The box, after all, could (by hypothesis) be programmed to run in any way whatever.
But the second possible moral of my semi-finished story is, I think, the more important and interesting one: Black boxes in many instances are not completely black. We do not understand perfectly how supposedly black boxes (e.g., fingerprint examiners) work but we have hunches or some guesses and those intuitions (or semi-perfect understandings) allow us to diminish in our minds the probability that some changes in circumstances will change the box's hit rate.
But since our knowledge of the workings of a black box is imperfect, we're often not sure of our hunches, how widely our hunches about a box should range, how widely our hunches will be shared or should be shared by other experimenters or observers, etc. We're in a pickle.
But -- nonetheless -- when we think to ourselves, "That's silly. That change in circumstances can't have any effect," we will, if given the opportunity, usually adhere to hunches such as this.
The problem of imperfect but genuine knowledge -- genuine but imperfect knowledge -- is a pervasive problem for inference and epistemology. Note: It is true that humans get by, often reasonably well, with imperfect knowledge. But this observation leaves the key problem unanswered: How do we, with our imperfect knowledge, grade the performance of black boxes that we worry may not work as well as advertised?
Dear Peter,
Isn’t your very interesting parable grue/bleen in a black box? [See Nelson Goodman, Fact, Fiction, and Forecast (2d ed., 1965).]
[XXXXXX]
[XXXXX],
By George, I think you're correct. But, gee whiz, do you (and I) have to be so concise? I rather enjoyed rambling on until I figured out, more or less, what I was trying to say.
Peter
Dear Peter and YYY (and ZZZ too):
After my rather flip grue/bleen observation, I rolled up my sleeves and have spent a bit of time reflecting on the issues raised by Peter’s parable. It seems to me that in most situations requiring, or at least suggesting the desirability of, black box testing, we are not dealing with the box simpliciter, but with a box coupled with claims about the box by some other human. These claims generally identify inputs and claimed outputs, and can be usefully divided into performance claims and explanatory claims. Performance claims describe the asserted outputs given identified inputs. Explanatory claims attempt to account for why the claimed performance happens. Explanatory claims may seem facially plausible, like (in some ways) the claims of handwriting experts that if you apply their standard analytic method to a static trace, the distinctive signal that marks the writer can be discerned by those experienced in the craft, or they may seem implausible, like Reich’s claims about the orgone box (a literal box, though I don’t think it was black). The plausibility of the explanatory claims might serve a purpose in allocating scarce resources under some circumstances (test the plausible claims first), but sometimes testing of the performance claims is demanded by other circumstances, like the fact that the performance claims have become widely accepted because lots of people have found the explanatory claims persuasive (often, to my mind, irrationally) and this is having other effects (people are being convicted, money is changing hands in large quantities, etc.). Black box testing is directed at verifying or falsifying (or qualifying) the performance claims. It treats the medium of performance (even if it is a human instead of a drug-sniffing dog) as an instrument to be tested as an instrument. And it is important to remember that even if the performance claims turn out to be right, this does not by itself validate the explanatory claims.
Chiropractors may be effective at relieving low back pain, but not because the theory of subluxations is a very well warranted explanatory theory (to say the least). When we finally penetrate the box with a more detailed set of studies and develop a reasonably well warranted explanatory theory, it may look nothing like the theory propounded by the practitioners. This has often been the case with claims from folk medicine, for instance.
So to conclude, I think we often start with a pretty opaque box, but also with a set of performance claims that are perfectly amenable to testing, however we account for what is going on in the box at some time in the perhaps distant future.
Best
[XXXXX]
Dear [XXXX] (& All),
I agree with your [XXX's] comment in large part. In my original statement of my hypothetical problem (which, by the way, is in part ill-posed because of my introduction of varying rules for the reporting of hits) I did not make clear that I was considering the "proficiency" testing of a truly black box, a black box whose internal workings are unknown, and not the testing of a black box whose ostensible operating system only is unknown. I do agree of course with XXX that the ostensible operating system of a box can differ radically from its real operating system. However, I would still argue (and XXX seems to agree with this) that it is hard (and in a sense theoretically impossible) to test the "proficiency" of a box if one has no inkling at all of what its (actual) operating system might be. (My thesis here is not much different from the familiar critique of the inference, "I've lived this long. So I'll live forever.")
My real "trouble" is that I also reject the claims of folk (e.g., Judea Pearl, Richard Wright [the law teacher who wrote a law review article about this]) who sometimes seem to argue that inference without correct knowledge of causes is impossible. Human beings are indeed creatures who have only partial knowledge [including, e.g., of causes] and yet they manage to draw accurate inferences "reasonably often."
As I see it, if we are to devise a general recipe to improve the capacity of human beings to improve the quality of their inferences, we have to understand why cognitively imperfect creatures manage to draw accurate inferences -- and arrive at new insights -- as often as they do. This thesis, as abstract and grandiose as it might sound, ultimately does have a bearing on the specific question of how a society should test, or measure, the proficiency of ostensible experts in matters such as handwriting identification, fingerprint identification, and "syndromes" (for whatever purpose they may be offered [and how does one measure the accuracy of a matter -- syndrome -- that is ostensibly offered only to provide an "explanation" {there is much junk in the usual way that syndrome experts use "explanation"}]).
I leave open the possibility that -- to use my own language -- we will never "devise a general recipe to improve the capacity of human beings to improve the quality of [human] inferences." At present I'm agnostic on this question.
Peter
Consider also my ruminations about the difficulty of determining whether or not some fMRI-based lie detector or similar device works:

And then, of course, it is possible (is it possible?) there are interactions among different parts of the brain (which I will assume, for the sake of convenience, have four levels [though it is practically certain that each part of the brain has more than four levels of "existence"]):
The interaction hypothesized in Figure 2 assumes that the processes at each level are not "deterministic" within each level. However, the logic of Figure 2 does not preclude the possibility that all the processes in Figure 1, taken together, are "deterministic" (even if only probabilistically so).
