Scientific Evidence: Home Page


Performance (or Proficiency) Tests of Black-Box Truth Detectors

There have been recurring proposals to determine the accuracy of (alleged) black-box truth detectors -- e.g., polygraph operators, psychologists' predictions, handwriting experts -- by subjecting such truth detectors to performance tests. (When the black box involves the performance of a human being, it is usually said that the human black box should be subjected to double-blind proficiency tests.) See, e.g., Jennifer Mnookin, Of Black Boxes, Instruments, and Experts: Testing the Validity of Forensic Science, 5 Episteme 343 (2008); D. Michael Risinger et al., "The Daubert/Kumho Implications of Observer Effects in Forensic Science: Hidden Problems of Expectation and Suggestion," 90 California Law Review 1 (2002). Jennifer Mnookin's abstract states: "This paper argues that judges assessing the scientific validity and the legal admissibility of forensic science techniques ought to privilege testing over explanation. Their evaluation of reliability should be more concerned with whether the technique has been adequately validated by appropriate empirical testing than with whether the expert can offer an adequate description of the methods she uses, or satisfactorily explain her methodology or the theory from which her claims derive."

Such proposals are refreshing: We should be skeptical of the claims of gang experts, polygraph operators, and the like. But the proposals for proficiency or performance testing do raise some conceptual and, perhaps, some practical difficulties. Consider the following correspondence:


(My) Message 1

Dear [YYYY],

Now to the question of testing the proficiency of black boxes. I have in mind the problem of measuring the proficiency of "experience-based" experts, soft sciences, inarticulate putative experts, oracular expertise, and the like.

Genuinely black -- entirely impenetrable, wholly inexplicable -- information processing boxes ("evidence evaluation boxes"?) should not work well.

Consider, first, a black box whose job it is to determine if marbles were or were not made by X.

One day, a Monday, an experimenter goes to X's marble factory and purchases thousands and thousands of marbles made by X. She goes elsewhere to purchase thousands of marbles not made by X, puts all of those marbles into her box, and observes what happens. What happens is that 98% of the time the box emits "X" when the marble was in fact made by X and 95% of the time it emits "not-X" when the marble was not made by X. (Everyone agrees those error rates are hunky-dory.)

The experimenter examines the marbles again and discovers that marbles made by X have little x's imprinted on them and those not made by X have no such x's. She suspects that the box does a pretty dandy -- but not perfect -- job of detecting x's and their absence.

A week later the experimenter repeats the experiment. She gets almost identical results. (The "almost" worries her, but not overly so. She figures that perhaps she counted the results inaccurately once or twice and, because of an electrical surge, the box didn't function as intended.)

The following week the experimenter runs the experiment again. But she does so on Tuesday because on Monday she feels rotten. When she does so, an astonishing thing happens: if a marble was not made by X, the box still reports "not-X" 95% of the time, but if a marble was made by X, the box utters "not-X" 98% of the time.

The experimenter's first reaction is, "No matter. If I can specify the circumstances under which the machine utters 'not-X' with a specified frequency when a marble was made by X as well as the circumstances when the box almost always utters 'X' when a marble was made by X, the box will, if inadvertently (so to speak), have an excellent effective hit rate."

The experimenter decides to run the experiment again. Next week she again feels rotten on Monday, her usual day for running experiments. She again decides to run the experiment the following day. She gets the same results she got in Week 3.

The experimenter has a flash: Could it be that the day of the week the experiment is run is the reason for the difference? She thinks, "That should not be. But maybe."

The experimenter runs the experiment again the following Tuesday. She again gets the results she got in Week 3 and Week 4.

She looks at the marbles again. She notices that each of the marbles made by X on Mondays is imprinted with a tiny m. Aha, she says to herself, when a marble has an "x" and an "m", the box utters "X" 98% of the time and when a marble has an "x" but not an "m", the machine utters "not-X" 98% of the time.

The following week she runs the experiment again. But, having learned her lesson, she decides to run it on Wednesday. Good thing too. This time she notices that the box utters "X" almost exactly 50% of the time when marbles were made by X (and "not-X" almost exactly 50% of the time [for marbles made by X]). She is astonished. But she quickly recovers her composure. She thinks, "Aha, plainly it matters not just whether the day is Monday or not Monday; the specific day of the week -- Tuesday, Wednesday, etc. -- matters. And if I'm right, this 50-50 is also potentially a pretty darned good probabilistic indicator of the manufacturer of a marble." (The box continues to utter "not-X" almost exactly 95% of the time for marbles that are not made by X.) She looks at the marbles from X she got on Wednesday and she also looks again at the marbles she got from X on Tuesday. She discovers that each marble made by X on Wednesday is imprinted with a small w as well as with a small x and each marble made by X on Tuesday is imprinted with a small t as well as a small x.
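The experimenter's thought that a 50-50 report rate can still be diagnostic can be checked with a little arithmetic. The following Python sketch computes likelihood ratios from the report rates stated in the parable. The function name, and the choice of the likelihood ratio as the measure of diagnostic value, are assumptions made for illustration; the comparison is between a marble made by X on the given day and a marble not made by X.

```python
def likelihood_ratios(p_says_x_given_x, p_says_x_given_not_x):
    """Return (LR when the box says "X", LR when the box says "not-X").

    Each ratio is P(report | made by X) / P(report | not made by X).
    A ratio far from 1 means the report shifts the odds a lot.
    """
    lr_says_x = p_says_x_given_x / p_says_x_given_not_x
    lr_says_not_x = (1 - p_says_x_given_x) / (1 - p_says_x_given_not_x)
    return lr_says_x, lr_says_not_x


# From the parable: the box says "not-X" 95% of the time for marbles not
# made by X, so it says "X" for such marbles 5% of the time.
P_SAYS_X_GIVEN_NOT_X = 0.05

monday = likelihood_ratios(0.98, P_SAYS_X_GIVEN_NOT_X)     # Monday marbles
wednesday = likelihood_ratios(0.50, P_SAYS_X_GIVEN_NOT_X)  # Wednesday marbles

print("Monday:   ", [round(v, 2) for v in monday])
print("Wednesday:", [round(v, 2) for v in wednesday])
```

On these figures a report of "X" multiplies the odds that the marble was made by X by roughly 19.6 for Monday marbles and by roughly 10 for Wednesday marbles, which is the sense in which the 50-50 rate remains informative. The same function can be applied to the rates for any other day.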

She suspects that for some reason for marbles made by X the box's rate of accurately reporting a hit depends on the day company X manufactures a marble. Sure enough, marbles made on Thursday have "th" imprinted on them and marbles made on Friday have "f" imprinted on them (as well as x). [The company makes no marbles on weekends.] For several weeks she runs these marble experiments and she determines the rates of reported hits (for marbles made by X) are again almost exactly the same for marbles manufactured on any particular day of the week (and substantially different from the box's reported hit rate for marbles not made by X).

She folds up her lab, goes to the Bahamas for a vacation, and publishes her results.

Out of curiosity, a year later she re-runs her experiment one week. She is distressed to find that the box now seems always to make random choices of saying "X" and "not-X" for marbles made by X. This time, instead of saying, "No matter. A truly random number generator might be quite diagnostic of whether a marble was made by X," she thinks, "Well, gee, perhaps X changed its method of manufacture." But, on a hunch, she instead opens the box and peers into its innards. What she sees changes everything. She sees:

Option 1. She sees that the box has a program that says, "For marbles with 'x' on them, report 'X' 98% of the time when you see 'm', report 'X' 2% of the time when you see 't', report 'X' 50% of the time when you see 'w', report 'X' 33% of the time when you see 'th' ... but do this only for 52 weeks. After that choose at random between saying 'X' and 'not-X'."

Option 2. She sees that the box has a program that says, "For marbles with 'x' on them, report 'X' 98% of the time when you see 'm', report 'X' 2% of the time when you see 't', report 'X' 50% of the time when you see 'w', report 'X' 33% of the time when you see 'th' ... but do this only for 52 weeks. After that choose at random between saying 'X' and 'not-X' for 52 weeks. After that revert to your original instructions."

Option 3. She sees that the box has a program that says, "For marbles with 'x' on them, report 'X' 98% of the time when you see 'm', report 'X' 2% of the time when you see 't', report 'X' 50% of the time when you see 'w', report 'X' 33% of the time when you see 'th' ... Do this for 365 days/year. If you can't follow these instructions on any given day, thereafter choose at random between saying 'X' and 'not-X'." [Perhaps the programmer forgot about leap years.]
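To make the hidden programs concrete, here is a minimal Python sketch of Options 1 and 2. Several things in it are assumptions rather than part of the story: the function names, counting weeks from 0, reading "choose at random" as a 50-50 choice, and reading Option 2's "revert to your original instructions" as alternating 52-week blocks. The Friday rate is elided in the options above, so the sketch leaves it out rather than guessing, and Option 3 is not sketched.

```python
import random

# Rates at which the box reports "X" for marbles made by X, keyed by the
# day mark on the marble, as stated in the options above. The Friday rate
# is elided in the text ("..."), so it is omitted here, not guessed.
RATE_SAYS_X = {"m": 0.98, "t": 0.02, "w": 0.50, "th": 0.33}


def p_says_x(day_mark, week, option):
    """Probability that the box says "X" for an X marble (weeks count from 0)."""
    if option == 1:
        # Follow the schedule for 52 weeks, then choose at random forever.
        scheduled = week < 52
    elif option == 2:
        # Assumption: "revert to your original instructions" is read as
        # alternating blocks of 52 scheduled weeks and 52 random weeks.
        scheduled = (week // 52) % 2 == 0
    else:
        raise ValueError("only Options 1 and 2 are sketched here")
    # Assumption: "choose at random" means a 50-50 choice.
    return RATE_SAYS_X[day_mark] if scheduled else 0.5


def observed_rate(day_mark, week, option, trials=10_000, seed=0):
    """Simulate what an outside experimenter would tally in a given week."""
    rng = random.Random(seed)
    p = p_says_x(day_mark, week, option)
    hits = sum(rng.random() < p for _ in range(trials))
    return hits / trials


# Monday marbles, tested in the first year and again a year later.
print("Option 1, week 10:", observed_rate("m", 10, option=1))
print("Option 1, week 60:", observed_rate("m", 60, option=1))
```

An experimenter who sees only the tallies from `observed_rate` during the first 52 weeks has nothing in her data that distinguishes Option 1 from Option 2; the two programs differ only in weeks she has not yet observed.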

Moral 1 of this semi-finished story [we could take the presence or absence of x out of the picture and I would do so if I had more time]: If the experimenter knows (or suspects) nothing about the workings of the box, she should not be surprised by any changes or variations in the reporting behavior of the box. The box, after all, could (by hypothesis) be programmed to run in any way whatever.

But the second possible moral of my semi-finished story is, I think, the more important and interesting one: Black boxes in many instances are not completely black. We do not understand perfectly how supposedly black boxes (e.g., fingerprint examiners) work but we have hunches or some guesses and those intuitions (or semi-perfect understandings) allow us to diminish in our minds the probability that some changes in circumstances will change the box's hit rate.

But since our knowledge of the workings of a black box is imperfect, we're often not sure of our hunches, how widely our hunches about a box should range, how widely our hunches will be shared or should be shared by other experimenters or observers, etc. We're in a pickle.

  • But -- nonetheless -- when we think to ourselves, "That's silly. That change in circumstances can't have any effect," we will, if given the opportunity, usually adhere to hunches such as this.

  • &&&

    The problem of imperfect but genuine knowledge -- genuine but imperfect knowledge -- is a pervasive problem for inference and epistemology. Note: It is true that humans get by, often reasonably well, with imperfect knowledge. But this observation leaves the key problem unanswered: How do we, with our imperfect knowledge, grade the performance of black boxes that we worry may not work as well as advertised?

    &&&

    N.B. We darned well ought to test people such as handwriting experts and fingerprint examiners and collect statistics about their proficiency and all that. But we -- someone other than people who are "invested" in the thesis of the genuineness of handwriting expertise -- should insist on measuring such success and failure rates. (Of course, we then again face the problem of figuring out how to measure such failure rates: How can we ignorant or semi-ignorant people do that properly? But this problem recurs -- in principle -- in every attempt to measure the success or failure of an information processing or evidence evaluation machine or creature.)

    Message 2

    Dear Peter,

    Isn’t your very interesting parable grue/bleen in a black box? [See Nelson Goodman, Fact, Fiction, and Forecast (2d ed., 1965).]

    [XXXXXX]


    Message 3

    [XXXXX],

    By George, I think you're correct. But, gee whiz, do you (and I) have to be so concise? I rather enjoyed rambling on until I figured out, more or less, what I was trying to say.

    Peter


    Message 4

    Dear Peter and YYY (and ZZZ too):

    After my rather flip grue/bleen observation, I rolled up my sleeves and have spent a bit of time reflecting on the issues raised by Peter’s parable.  It seems to me that in most situations requiring, or at least suggesting the desirability of, black box testing, we are not dealing with the box simpliciter, but with a box coupled with claims about the box by some other human.  These claims generally identify inputs and claimed outputs, and can be usefully divided into performance claims and explanatory claims.  Performance claims describe the asserted outputs given identified inputs.  Explanatory claims attempt to account for why the claimed performance happens.  Explanatory claims may seem facially plausible, like (in some ways) the claims of handwriting experts that if you apply their standard analytic method to a static trace, the distinctive signal that marks the writer can be discerned by those experienced in the craft, or they may seem implausible, like Reich’s claims about the orgone box (a literal box, though I don’t think it was black).  The plausibility of the explanatory claims might serve a purpose in allocating scarce resources under some circumstances (test the plausible claims first), but sometimes testing of the performance claims is demanded by other circumstances, like the fact that the performance claims have become widely accepted because lots of people have found the explanatory claims persuasive (often, to my mind, irrationally) and this is having other effects (people are being convicted, money is changing hands in large quantities, etc).  Black box testing is directed at verifying or falsifying (or qualifying) the performance claims.  It treats the medium of performance (even if it is a human instead of a drug-sniffing dog) as an instrument to be tested as an instrument.  And it is important to remember that even if the performance claims turn out to be right, this does not by itself validate the explanatory claims.  Chiropractors may be effective at relieving low back pain, but not because the theory of subluxations is a very well warranted explanatory theory (to say the least).  When we finally penetrate the box with a more detailed set of studies and develop a reasonably well warranted explanatory theory, it may look nothing like the theory propounded by the practitioners.  This has often been the case with claims from folk medicine, for instance.

    So to conclude, I think we often start with a pretty opaque box, but also with a set of performance claims that are perfectly amenable to testing, however we account for what is going on in the box at some time in the perhaps distant future.

    Best

    [XXXXX]


    Message 5

    Dear [XXXX] (& All),

    I agree with your [XXX's] comment in large part. In my original statement of my hypothetical problem (which, by the way, is in part ill-posed because of my introduction of varying rules for the reporting of hits) I did not make clear that I was considering the "proficiency" testing of a truly black box, a black box whose internal workings are unknown, and not the testing of a black box whose ostensible operating system only is unknown. I do agree of course with XXX that the ostensible operating system of a box can differ radically from its real operating system. However, I would still argue (and XXX seems to agree with this) that it is hard (and in a sense theoretically impossible) to test the "proficiency" of a box if one has no inkling at all of what its (actual) operating system might be. (My thesis here is not much different from the familiar critique of the inference, "I've lived this long. So I'll live forever.")

    My real "trouble" is that I also reject the claims of folk (e.g., Judea Pearl, Richard Wright [the law teacher who wrote a law review article about this]) who sometimes seem to argue that inference without correct knowledge of causes is impossible. Human beings are indeed creatures who have only partial knowledge [including, e.g., of causes] and yet they manage to draw accurate inferences "reasonably often."

    As I see it, if we are to devise a general recipe to improve the capacity of human beings to improve the quality of their inferences, we have to understand why cognitively imperfect creatures manage to draw accurate inferences -- and arrive at new insights -- as often as they do. This thesis, as abstract and grandiose as it might sound, ultimately does have a bearing on the specific question of how a society should test, or measure, the proficiency of ostensible experts in matters such as handwriting identification, fingerprint identification, and "syndromes" (for whatever purpose they may be offered [and how does one measure the accuracy of a matter -- syndrome -- that is ostensibly offered only to provide an "explanation"? {there is much junk in the usual way that syndrome experts use "explanation"}]).

    I leave open the possibility that -- to use my own language -- we will never "devise a general recipe to improve the capacity of human beings to improve the quality of [human] inferences." At present I'm agnostic on this question.

    Peter 


    Consider also my ruminations about the difficulty of determining whether or not some fMRI-based lie detector or similar device works:

    Friday, October 16, 2009

    Brain Science: A Meditation on Mechanical Lie Detection

    Drawing Inferences about "Deception" from Observed Events in the Brain:
    Of fMRI and Similar Purported Tools for Observing or Inferring States of the Human Mind and Heart

    Suppose that the levels (Level 1, etc.) shown in Figure 1 below are levels of the physical or material structure of the human brain.

    Suppose that the small italicized letters below (f, g, etc.) represent events at a particular physical level of the brain; e.g., f may represent an electrical signal at a synapse.

    Suppose that events of type f are observed at Level 1 and observations done to date show that the pattern of f events at Level 1 occurs (or: has occurred, has been observed to occur) when deception occurs (or: has occurred, has been observed to occur, has been thought to occur).

    What can one infer from those observations?

     
    Figure 1

    Suppose further that events at Level 1 in Figure 1 are determined by events at Level 2.

    Does it matter for inferences about deception that events at Level 2 have not been observed in relationship to instances of deception?

    Possibly.

    It is logically possible that more than one pattern of events at Level 2 can produce the pattern of events at Level 1 but that only one pattern of events at Level 2 is related to observed instances of deception.
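A toy calculation may make this possibility concrete. The Python sketch below supposes, purely for illustration, three Level 2 patterns with made-up counts, two of which produce the same Level 1 pattern f and only one of which is associated with deception. The pattern names, the counts, and the assumption that each Level 2 pattern either does or does not go with deception are all hypothetical.

```python
def p_deception_given_f(level2_counts, produces_f, deceptive):
    """P(deception | Level 1 pattern f is observed) in a toy model.

    level2_counts: how often each Level 2 pattern occurs (hypothetical counts)
    produces_f:    Level 2 patterns that give rise to Level 1 pattern f
    deceptive:     Level 2 patterns associated with deception
    """
    total_f = sum(level2_counts[p] for p in produces_f)
    deceptive_f = sum(level2_counts[p] for p in produces_f & deceptive)
    return deceptive_f / total_f


# Hypothetical numbers: patterns "a" and "b" both produce f at Level 1,
# but only "a" goes with deception.
counts = {"a": 20, "b": 30, "c": 50}
p = p_deception_given_f(counts, produces_f={"a", "b"}, deceptive={"a"})
print(p)  # 20 / (20 + 30) = 0.4
```

In this toy case an observer who sees only Level 1 cannot tell "a" from "b", so observing f supports deception only to the degree that "a" is the more common source of f. If past observations happened to include only "a" cases, f would have looked like a perfect indicator.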

    But the same relationship may hold between (i) events at any level and events at a lower level and (ii) events at any lower level and instances of deception. Thus, while only some patterns of events at Level 2 are associated with deception, the probative force of those patterns can in their turn fall victim to the (possible) fact that only one pattern of events at Level 3 is associated with instances of deception. These relationships may be repeated down to the "bottom," which in the above table are quantum processes and events. If so, invariant connections between events and instances of deception cannot be established unless and until quantum-level events have been observed.

    But it is possible that the structure of the workings of the brain is more complicated (and, perhaps, also less bottom-up driven than is the case with Figure 1). It is possible -- is it possible? -- there are interactions between different Levels of the brain, interactions that affect the pattern of events at each Level of the brain. For example:




    Figure 2
  • The interaction hypothesized in Figure 2 assumes that the processes at each level are not "deterministic" within each level. However, the logic of Figure 2 does not preclude the possibility that all the processes in Figure 1, taken together, are "deterministic" (even if only probabilistically so).
  • And then, of course, it is possible (is it possible?) there are interactions among different parts of the brain (which I will assume, for the sake of convenience, have four levels [though it is practically certain that each part of the brain has more than four levels of "existence"]):




    Figure 3

    If the sort of interaction shown in Figure 3 happens, inferences drawn from any pattern of events at Level 1 of Figure 1 cannot be drawn with certainty or, probably, even with near-certainty. (However, it does not necessarily follow that we learn nothing from observing events at Level 1 in Figure 1. Whether that's the case or not depends -- on many things.)

     

     

     


     
