History of free-response research
The term "free-response" was coined by Egan in 1961 in connection with studies involving the detection of brief audio tones against a noise background. The tones could occur at any instant within an active listening interval (e.g., while an indicator light was on), and the listener's task was to press a button whenever a tone was perceived. The number of responses per active interval could therefore be zero or more and was unpredictable a priori. The listener did not know how many true tones (signals), if any, would occur in the active interval or when they might occur. With two-dimensional space replacing time, and apart from the causality effect, the acoustic study mirrors a common task in medical imaging, namely finding lesions in patient images. In interpreting an image for possible breast cancer the mammographer does not know a priori how many lesions are present, if any, or where they are located. The image is searched for findings, i.e., regions that appear suspicious for cancer. If the level of suspicion of a particular finding exceeds the clinical threshold, the mammographer reports it. Conceptually, a radiology report consists of the locations of the regions that exceeded the threshold and the corresponding levels of suspicion. This type of information defines the free-response paradigm.
The importance of the free-response paradigm for radiology applications was first recognized by Bunch et al. Their classic paper describes several problems that arise when the ROC method is applied to a localization task. A well-known one is the ambiguity when a false positive and a false negative occur on the same image. The two mistakes effectively "cancel" each other, and in ROC analysis the image is scored as a true positive. In other words, the radiologist was right (he detected that the image was abnormal) but for the wrong reason (he reported the incorrect location). The clinical consequences of the two canceling mistakes can be serious. Bunch et al. introduced the concept of the free-response receiver operating characteristic (FROC) curve, defined as a plot of lesion localization fraction (LLF) vs. mean number of false positives per image, and demonstrated it in a free-response study using x-ray images of small beads superimposed on a uniform phantom. They also developed theoretical relations between the FROC curve and the ROC curve, implied by assigning the rating of the highest rated mark as the "ROC equivalent" rating of the image.
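The FROC operating points and the "ROC equivalent" rating described above can be illustrated with a minimal sketch. The data layout (each image as a list of `(rating, hits_lesion)` pairs), the function names, and the choice of negative infinity as the rating of an unmarked image are assumptions made for illustration; they are not taken from Bunch et al. The sketch also assumes each lesion is marked at most once.

```python
def froc_points(images, n_lesions, thresholds):
    """Return (mean false positives per image, LLF) for each threshold.

    images: list of images; each image is a list of (rating, hits_lesion)
            pairs, where hits_lesion is True if the mark localizes a lesion.
    n_lesions: total number of lesions in the image set.
    Assumption: each lesion is marked at most once.
    """
    n_images = len(images)
    points = []
    for t in thresholds:
        # Marks rated at or above the threshold count as reported.
        lesion_hits = sum(1 for marks in images
                          for rating, hit in marks if hit and rating >= t)
        false_pos = sum(1 for marks in images
                        for rating, hit in marks if not hit and rating >= t)
        points.append((false_pos / n_images, lesion_hits / n_lesions))
    return points


def roc_equivalent_rating(marks, no_mark_rating=float("-inf")):
    """Rating of the highest rated mark on an image.

    Assumption: an image with no marks gets no_mark_rating, i.e. it is
    treated as less suspicious than any marked image.
    """
    return max((rating for rating, _ in marks), default=no_mark_rating)


if __name__ == "__main__":
    # Three hypothetical images containing two lesions in total.
    images = [[(3, True), (1, False)], [(2, False)], []]
    print(froc_points(images, n_lesions=2, thresholds=[3, 2, 1]))
    print([roc_equivalent_rating(m) for m in images])
```

Lowering the threshold moves the operating point up and to the right along the FROC curve, because more marks of both kinds are counted as reported.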
In 1986 my colleagues and I at the University of Alabama at Birmingham reported the first clinical FROC study, which compared conventional chest radiography to a prototype digital chest imaging system by Picker International. The method was applied in a second study to evaluate a dual-energy chest imaging system by the same manufacturer. Since no method was then available for fitting free-response data, the analysis was necessarily crude by present-day standards and involved interpolating between neighboring data points to allow comparison of LLF at a common number of false positives per image. Around 1989 I developed two methods (13, 14), with accompanying software, for fitting free-response data: FROCFIT and alternative free-response operating characteristic (AFROC) analysis. These gave reasonable fits to human observer data. In 1996 Swensson described a model for the LROC paradigm that also predicted FROC curves. The FROCFIT, AFROC and LROC models are identical; the approaches differ only in their estimation methods. While all of these methods gave good fits to human observer data, they failed to fit CAD data in the low-confidence region. CAD algorithms yield many marks, and the ratings are finely spaced floating-point numbers, so a detailed FROC curve (often referred to as the "raw" curve) with closely spaced operating points is available. With human observers it is rarely possible to get more than 4 operating points. Deviation of the fit from the actual operating points is therefore easier to see with CAD data than with human observer data.
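The interpolation step used in those early studies can be sketched as follows. Straight-line interpolation between the two neighboring operating points is an assumption, since the exact scheme is not specified here, and the function name and data layout are hypothetical.

```python
def llf_at_common_fp(points, fp_target):
    """Estimate LLF at fp_target false positives per image.

    points: observed FROC operating points as (fp_per_image, llf) pairs.
    Assumption: linear interpolation between the two neighboring points;
    no extrapolation beyond the observed range is attempted.
    """
    pts = sorted(points)
    if not pts[0][0] <= fp_target <= pts[-1][0]:
        raise ValueError("fp_target lies outside the observed operating points")
    if len(pts) == 1:
        return pts[0][1]
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        if x0 <= fp_target <= x1:
            if x1 == x0:
                return y1
            return y0 + (y1 - y0) * (fp_target - x0) / (x1 - x0)


if __name__ == "__main__":
    # Hypothetical operating points for two imaging systems.
    system_a = [(0.2, 0.40), (0.6, 0.60), (1.0, 0.70)]
    system_b = [(0.3, 0.50), (0.9, 0.80)]
    # Compare LLF at a common value of false positives per image.
    print(llf_at_common_fp(system_a, 0.5), llf_at_common_fp(system_b, 0.5))
```

Comparing at a single abscissa uses only the two neighboring points of each curve, which is one reason the approach is crude compared with fitting the full set of operating points.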
All of the models described so far assumed independence of the ratings of multiple marks on the same image. This assumption drew justifiable criticism that has discouraged usage of these methods. It was believed that staying within the well-established framework of ROC methodology was preferable even when it meant ignoring the location information. Many intrinsically free-response studies have been analyzed by the ROC method. CAD evaluation at the designer level has been a notable exception to the general reluctance to use the free-response paradigm. In CAD work FROC curves have been the primary method of summarizing performance. The CAD designer has access to the locations and corresponding levels of suspicion (often termed malignancy index). This type of detailed information is termed designer-level CAD data. In mammography CAD systems the average number of marks per image is about 10 and ROC methodology is clearly inadequate. However, due to the lack of suitable analytical tools most developers have had to resort to reporting the detection fraction and the corresponding number of false positives per image, i.e., a particular operating point on the FROC curve, similar to that used in the earliest applications of the free-response method.
The importance of assessing mammography CAD algorithms has spurred research in free-response methodology. Another driving factor is the importance of evaluating CAD performance in low-dose CT screening for lung cancer, where an algorithm may identify anywhere from 4 to 20 suspicious regions per patient. In any case, the pace of research in this area has picked up considerably in recent years. In 2002 the initial detection and candidate analysis (IDCA) method was proposed for fitting CAD-generated FROC curves, thereby formalizing an ad hoc procedure that was being used by algorithm designers. IDCA was the first model to predict that the FROC curve may not extend to large numbers of false positives per image and that not all lesions are necessarily detected, even at the lowest confidence level, which corresponds to marking all suspicious regions. The significance of these predictions was not recognized until quite recently because they implied a model of image interpretation that is very non-intuitive to most imaging physicists but quite familiar to vision psychologists. That model is explicated in the Kundel-Nodine model of how radiologists search images. According to this model radiologists do not perform an exhaustive search of the image. Rather, suspicious regions are identified during a holistic phase, and in a subsequent cognitive phase decisions are made at the regions identified during the holistic phase. A more general implementation of the Kundel-Nodine model, termed the search model (SM), has been described recently. Like IDCA, it is a parametric model and predicts FROC curves that do not extend infinitely to the right. Several other approaches have also been proposed recently.
An issue with much of the earlier work has been the independence assumption. Resampling techniques such as the bootstrap and the jackknife, which do not assume independence, have long been available but were not applied to free-response data until quite recently. Modern free-response data analysis uses resampling and is quite robust to data clustering; therefore the original objection to free-response studies is no longer valid. In 2004 the jackknife analysis of free-response data (JAFROC) was proposed for human observer data, and JAFROC is being increasingly used in human observer free-response studies. Recently a bootstrap-based non-parametric method has been described for CAD evaluation.