How to conduct a free-response study
General
It is important to realize that while the free-response method is closer to the clinical paradigm, it is not the "real thing". Like ROC, LROC and the ROI methods it is a laboratory study. There is an assumption that laboratory studies correlate with performance in the clinic. Having committed to a laboratory study it is important to conduct the study in as optimal a manner as is possible. This is not to suggest that free-response (or for that matter ROC) methods are not applicable to clinical studies. Most often the interest is in a relative comparison of two modalities. It is hard to imagine a scenario where a laboratory study would give a different and statistically significant ordering than a real clinical study. I have not seen any data that shows this type of inconsistency.
Truth (gold standard)
For normal images radiological truth can be retrospectively established by follow-up mammograms interpreted as normal. Malignant lesions are proven by biopsy. The truth-panel consisting of expert mammographers should locate the known malignancies on the mammograms using all available clinical information. They should indicate the lesion boundary by outlining it. This is preferable to simply indicating the center, as the latter may not be well defined, especially for larger lesions. In general the truth panel should not participate in the actual study. It is possible that other regions on an abnormal image that were not biopsy-proven could be malignant, and then the question arises whether to score marks in these regions as LLs or NLs. It is possible that once a woman is referred for a diagnostic workup all lesions are identified in which case this is a non-issue. The truth-panel should be consulted on the likelihood of this happening.
The definition of a closeness or proximity criterion should be agreed upon in consultation with the truth panel prior to commencement of the study. The proximity criterion could be an acceptance radius or a percentage overlap of the outlined region and the true lesion. It is good practice to analyze the data with different choices of proximity criteria (say half and double the agreed upon criterion) to determine if the conclusions are sensitive to this choice.
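The acceptance-radius form of the proximity criterion, and the half/double sensitivity check, can be sketched as follows. This is only an illustration: the function names, the tuple layout of marks and lesions, and the example coordinates are assumptions, and it does not handle the case of several marks falling on the same lesion.

```python
import math

def score_marks(marks, lesions, radius):
    """Label each mark LL or NL using an acceptance radius.

    marks:   list of (x, y, rating) tuples for one image (assumed layout)
    lesions: list of (x, y) lesion centers supplied by the truth panel
    radius:  acceptance radius, in the same units as the coordinates
    """
    scored = []
    for x, y, rating in marks:
        # Distance from this mark to the nearest lesion center;
        # infinite when the image has no lesions (a normal image).
        nearest = min(
            (math.hypot(x - lx, y - ly) for lx, ly in lesions),
            default=math.inf,
        )
        scored.append(("LL" if nearest <= radius else "NL", rating))
    return scored

def counts_by_radius(marks, lesions, agreed_radius):
    """Count (LLs, NLs) at half, one and two times the agreed radius."""
    results = {}
    for factor in (0.5, 1.0, 2.0):
        labels = [lab for lab, _ in score_marks(marks, lesions, agreed_radius * factor)]
        results[factor] = (labels.count("LL"), labels.count("NL"))
    return results

# Hypothetical image with two lesions and three marks.
example_marks = [(10, 10, 4), (30, 30, 3), (80, 80, 2)]
example_lesions = [(12, 10), (38, 30)]
print(counts_by_radius(example_marks, example_lesions, agreed_radius=5.0))
```

If the LL and NL counts change appreciably between the three radii, as they do for the second mark in this made-up example, the conclusions may be sensitive to the choice of criterion.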
A concern is controlling the level of visibility of the lesions. If the lesions are too easily visible the radiologists may not generate appreciable numbers of NLs and the data set cannot be analyzed. One way to increase the level of difficulty is to use images from a previous screening where the lesion may be retrospectively visible. Also one could include difficult normal images in the case set. In either situation the case set may not be representative of the population one might wish to study, but this may be acceptable in the context of a laboratory study, especially for modality comparison studies where one is interested in differences and not in absolute values.
The radiologist should be told the general characteristics of the lesions, e.g., the range of lesion sizes, contrasts, average number of lesions per image, etc. They should not be forced to make a minimum number of marks. It is quite possible that some images will have no marks. Conversely, the investigator should not limit the number of marks the radiologist can make, but large numbers of marks per image and / or multiple marks in the same vicinity are unusual for mammography and probably indicate misunderstanding of the task. This problem is more likely to occur in CAD where the algorithm may generate multiple marks in the same vicinity. These are usually resolved by the algorithm designer using a clustering step prior to analysis.
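As an illustration of the kind of clustering step mentioned above for CAD marks, the sketch below merges marks that fall within a chosen distance of a higher-rated mark and keeps only the higher-rated one. The greedy rule, the merge distance and the function name are assumptions made for this example; actual CAD designers may use quite different clustering methods.

```python
import math

def cluster_marks(marks, merge_distance):
    """Greedy clustering: keep the highest-rated mark in each vicinity.

    marks: list of (x, y, rating) tuples for one image (assumed layout)
    merge_distance: marks closer than this to an already kept mark are dropped
    """
    kept = []
    # Visit marks from highest to lowest rating so each cluster is
    # represented by its most suspicious mark.
    for x, y, rating in sorted(marks, key=lambda m: m[2], reverse=True):
        if all(math.hypot(x - kx, y - ky) >= merge_distance for kx, ky, _ in kept):
            kept.append((x, y, rating))
    return kept

# Hypothetical CAD output: three marks near (10, 10) and one elsewhere.
cad_marks = [(10, 10, 0.9), (11, 10, 0.5), (50, 50, 0.7), (10.5, 11, 0.8)]
print(cluster_marks(cad_marks, merge_distance=3.0))
```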
Phantom or simulation studies are a convenient way of controlling truth and level of difficulty. The number of lesions per image should not be too large, especially for large lesions. A rule of thumb is one large mass per abnormal image, or 1 to 3 small masses per image. While statistical power does increase (modestly) with more lesions per image, having too many lesions detracts from the realism of the simulation study. Attention should be paid to the quality of the simulation (do the simulated masses resemble real masses?) and the locations where the lesions are superimposed (not all locations are equally likely to harbor lesions). If a normal image with a superimposed lesion is used as a simulated abnormal image, the same image without the superimposed lesion must not be used as a normal image, as this would violate the independence assumption. Likewise, images from other views of an abnormal breast or the contralateral breast must not be used as normal images, even though no lesions are visible on these views to the expert panel. If the backgrounds are also simulated (e.g., power-law noise) there is no justification, in my opinion, for using radiologists to interpret such images. There is nothing in their training background that specially qualifies them to read such images and using them in this mode is wasteful of clinical resources. Anybody with good eyesight and the ability to follow directions can, with sufficient training, be used to read simulated-background, simulated-lesion images.
In practice the mammographer interprets 4 views of the same patient (two per breast). Such multi-view studies are more difficult to score. If a lesion is marked in a view but not in the other view, does it count as a LL, or if it is marked in both views do they count as two LLs? The former approach is used in case-based scoring and the latter in view-based scoring. The free-response method can be used in either situation. Possible correlations of the multiple marks in the 4-views are properly handled by modern FROC resampling methods to be described later, so long as the unit of resampling is the case, not the individual views.
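The point about the unit of resampling can be illustrated with a plain bootstrap over cases. This is a generic sketch, not the specific FROC resampling method referred to above; the function names, the case identifiers and the dictionary layout are all assumptions. What matters is that whole cases are drawn, so all marks from the 4 views of a case enter or leave a bootstrap sample together.

```python
import random

def bootstrap_cases(case_ids, n_boot, seed=0):
    """Yield bootstrap samples of case identifiers, drawn with replacement."""
    rng = random.Random(seed)
    n = len(case_ids)
    for _ in range(n_boot):
        yield [rng.choice(case_ids) for _ in range(n)]

def marks_for_sample(sample, marks_by_case):
    """Collect the marks of each resampled case, keeping its views together."""
    return [marks_by_case.get(case_id, []) for case_id in sample]

# Hypothetical data: each case maps to its (view, rating) marks.
marks_by_case = {
    "case01": [("LCC", "4B"), ("LMLO", "4B")],  # same lesion seen in two views
    "case02": [],                               # unmarked case
    "case03": [("RCC", "3")],
}
for sample in bootstrap_cases(list(marks_by_case), n_boot=3, seed=1):
    print(sample, marks_for_sample(sample, marks_by_case))
```

Resampling individual views instead would treat the two marks of case01 as independent observations, which they are not.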
Localization accuracy
An important aspect of a free-response study is localization accuracy. A certain amount of inaccuracy is unavoidable with human observers but if the spread is too large it may be difficult to score the marks unambiguously. For digital images the location should be indicated directly on the image with a mouse click. Anatomical statements such as "upper outer quadrant", "peri areolar" etc., are too ambiguous. Having the radiologist mark a hard-copy schematic (as they often do in the clinic) and subsequently transferring the location to the digital image can lead to high inaccuracy. The free-response method gives larger statistical power than ROC because it gives credit for true detection and penalizes marks far from lesions: with increasing marking inaccuracy this distinction gets increasingly blurred. With sufficiently large spread the free-response dataset will effectively degenerate into a ROC dataset, where any mark on an abnormal image would count as having "detected" the lesion, and any advantage of conducting a free-response study will have been lost. For films one can use an acrylic overlay properly aligned to the image using anatomic landmarks; the radiologist marks the location(s) on the overlay and writes the rating next to each mark. For small lesions the radiologist can generally indicate the center fairly accurately, but for larger ones they should be encouraged to outline the suspected lesion. The centroid of the outlined region is regarded as the location of the mark. Radiologists have different perceptions about what constitutes the lesion boundary, and the centroid method should result in higher inter- and intra-reader agreement than a single mark. A mark is scored as a LL or a NL according to its proximity to the nearest lesion center.
For digital images the interface should be kept simple. Unless the purpose of the study dictates otherwise, common display functions such as window/level, zoom and pan and other available tools should be enabled. When the radiologist clicks on a region a cross should be overlaid on the image and the rating recorded via a pop-up window (or slider for continuous ratings). The numerical value should be overlaid next to the mark. This is to help the radiologist keep track of the marks and ratings and not inadvertently mark a region multiple times. All overlay information should be capable of being toggled off or on (SHOW MARKS, HIDE MARKS) as otherwise it will interfere with the interpretation. The location information (x and y for projection images and x, y and slice number for tomographic images) should be recorded. When the radiologist has concluded interpreting they should click a DONE button allowing them to review their marks before clicking on NEXT CASE. A TRACE LESION function should be provided.
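When a lesion is traced rather than clicked, the mark location is taken as the centroid of the outline. A minimal sketch of that computation follows, assuming the traced outline is stored as an ordered list of (x, y) vertices of a simple closed polygon; the function name and the fallback for degenerate traces are assumptions of this example.

```python
def outline_centroid(vertices):
    """Area centroid of a traced outline given as ordered (x, y) vertices.

    Uses the standard polygon (shoelace) centroid formula. If the trace
    encloses no area (e.g. a straight stroke), the mean of the vertices
    is returned instead.
    """
    n = len(vertices)
    twice_area = 0.0
    cx = cy = 0.0
    for i in range(n):
        x0, y0 = vertices[i]
        x1, y1 = vertices[(i + 1) % n]  # wrap around to close the outline
        cross = x0 * y1 - x1 * y0
        twice_area += cross
        cx += (x0 + x1) * cross
        cy += (y0 + y1) * cross
    if twice_area == 0:
        return (sum(x for x, _ in vertices) / n, sum(y for _, y in vertices) / n)
    # Centroid = sum / (6 * area) = sum / (3 * twice_area)
    return (cx / (3 * twice_area), cy / (3 * twice_area))

# Hypothetical rectangular trace, 4 units wide and 2 units high.
print(outline_centroid([(0, 0), (4, 0), (4, 2), (0, 2)]))  # (2.0, 1.0)
```

The returned point can then be stored as the x and y of the mark (together with the slice number for tomographic images) and scored like any clicked mark.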
Ratings scale
For human observers particular attention needs to be paid to the rating scale. The idea is to obtain operating points that adequately sample (or "straddle") the curve as otherwise the method may be unreliable (this limitation applies to all methods of measuring observer performance). This is not a problem with CAD data, where the ratings are closely spaced and there are many more marks than with humans, both of which result in better sampling of the underlying curve. In screening mammography the radiologist assigns a Breast Imaging Reporting and Data System (BIRADS) rating as described below:
0: Need Additional Imaging Evaluation and/or Prior Mammograms for Comparison
1: Negative
2: Benign Finding(s)
3: Probably Benign Finding – Initial Short-Interval Follow-Up Suggested
4: Suspicious Abnormality – Biopsy Should Be Considered
Optional subdivisions:
4A: Finding needing intervention with a low suspicion for malignancy
4B: Lesions with an intermediate suspicion of malignancy
4C: Findings of moderate concern, but not classic for malignancy
5: Highly Suggestive of Malignancy – Appropriate Action Should Be Taken
The screening mammographer dictates what the finding(s) are, their size and location, and what other tests need to be done for the recall.
A zero or a rating of 3 or more (infrequently some 2's) results in the patient being recalled, i.e., referred for further workup (often causing mental anguish for the patient). A BIRADS 1 or 2 results in the patient being told to come for her next routine screening examination (this is what most patients want to hear).
In the context of a laboratory study, where the mammographer knows there will be no clinical consequence to their ratings, it is legitimate to ask them to assign a BIRADS rating (>2) to the locations that they considered suspicious enough to recall the patient, i.e., the patients who would have been assigned the "0" rating in real life.
The mammography BIRADS scale supports the following unidirectional rating scale: 1, 3, 4A, 4B, 4C and 5.
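To see how this ordered scale yields operating points, the sketch below maps the labels to ranks and applies each label in turn as a reporting threshold. It assumes the usual FROC definitions: NLF is the number of NLs at or above the threshold divided by the total number of images, and LLF is the number of LLs at or above the threshold divided by the total number of lesions. The function name and the example ratings are made up for illustration.

```python
# Unidirectional scale from the text, lowest to highest suspicion.
BIRADS_ORDER = ["1", "3", "4A", "4B", "4C", "5"]
RANK = {label: i for i, label in enumerate(BIRADS_ORDER)}

def operating_points(nl_ratings, ll_ratings, n_images, n_lesions):
    """Return (threshold, NLF, LLF) tuples, strictest threshold first.

    nl_ratings: BIRADS labels of all marks scored as NLs
    ll_ratings: BIRADS labels of all marks scored as LLs
    """
    points = []
    # A rating of "1" means no mark was made, so thresholds start at "3".
    for label in reversed(BIRADS_ORDER[1:]):
        t = RANK[label]
        nlf = sum(RANK[r] >= t for r in nl_ratings) / n_images
        llf = sum(RANK[r] >= t for r in ll_ratings) / n_lesions
        points.append((label, nlf, llf))
    return points

# Hypothetical reader: 10 images, 8 lesions in total.
nls = ["3", "3", "4A", "5"]
lls = ["5", "4C", "4B", "3", "3"]
for point in operating_points(nls, lls, n_images=10, n_lesions=8):
    print(point)
```

A reader who uses only one or two of the labels produces points that coincide, which is the poor sampling of the curve that the rating scale is meant to avoid.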
Preliminary FROC curves and feedback
It is important to ascertain that the radiologists are familiar with the tasks of the observer study and the user interface. One or more training sessions should be conducted using data from a separate training set to familiarize the radiologists with the task, and to provide feedback to both the radiologists and the investigators so that proper FROC data can be collected in the actual observer study. Preliminary FROC curves should be constructed from the training set data. See for example Fig. 6 in (Bunch et al. 1978) - e-mail me for a pdf of this file. If all points fall on the y-axis the radiologist needs to be told that their FP rate is zero, which is good, but they are missing a significant fraction of lesions, e.g., 40%, that were visible to the expert panel. This knowledge, and the fact that this is a laboratory study with no clinical consequences, may induce the radiologist to be more aggressive in reporting lesions. If a radiologist operates along the vertical axis and detects all the lesions, by definition the radiologist is perfect. This implies the task is too easy and the case set needs to be modified accordingly (more difficult lesions and more ambiguous normal images). As a rule of thumb, the radiologists should be able to detect between 60% and 80% of the lesions at their most lax criterion. An unduly difficult task may discourage active participation in the study. Using the rating of the highest rated mark on each image as its "ROC" rating, the FPF should be greater than about 40%. In terms of the BIRADS scale above, about 40% of negative cases should contain a mark with a BIRADS rating of 3 or above. If the FPF is too low, adequate numbers of FPs are not being made, and the figures of merit become unreliable, leading to loss of measurement precision.
Sample protocol
It is essential to have a written protocol duly acknowledged by all participants prior to commencement of the study. An example follows.
Purpose of the study
The purpose of the study is to compare digital mammography with digital tomosynthesis of the breast for detection of malignant masses ranging in diameter from 2 mm to 1 cm.
The task
In each modality you will be interpreting 50 normal images, and 50 abnormal images with 1 to 2 malignant masses per image (mean 1.3). Your aim is to attempt to find all the malignant masses. You should not report masses that are definitely benign. Indicate the locations of ("mark") all regions that are suspicious for malignancy. The total number of marks will equal the total number of suspicious regions, which need not equal the maximum number of masses (two). The analysis gives credit for masses that are marked and rated high, and penalizes false positives that are rated high. Your score will only be used in de-identified form and will not be revealed to others. The marks are to be made using the cursor and the rating is to be made on a pop-up menu (or slider) using any one of the scales described in the following table (insert the BIRADS table shown above).
The center of a suspicious region needs to be marked with reasonable accuracy as it affects the analysis. If you are uncertain about the center location, perhaps because the lesion is diffuse or irregular in shape, trace the outline of the region using the TRACE function. If the same lesion is visible in both views of the breast, mark both of them but give them the same malignancy rating. For the tomosynthesis images mark the lesion on the slice that best visualizes it. The marks can be toggled on and off to reduce distraction (SHOW MARKS, HIDE MARKS buttons). When you have completed interpretation of the case click the DONE button. A summary of all your data for that image will be shown. At this point you may click either the Modify button, to correct data entry errors or accidental clicks, or the Next Case button.
We recognize that the data collection is more complex than other observer studies that you may have participated in. Prior to the study we will meet to discuss the study to make sure it is clear to you. A training session with approximately 30 cases will be conducted to clarify the procedures, familiarize you with the user interface and to ascertain that the rating scale is being properly used.
P. C. Bunch, J. F. Hamilton, G. K. Sanderson and A. H. Simmons, "A Free-Response Approach to the Measurement and Characterization of Radiographic-Observer Performance," J. Appl. Photogr. Eng. 4 (4), 166-171 (1978).