DevChakraborty.com

How to conduct a free-response study


Because the free-response method is less familiar, it is appropriate to describe the practical aspects of conducting such studies. The example chosen is screening mammography with 4 views per case (patient), where the task is to detect malignant lesions.

General

It is important to realize that while the free-response method is closer to the clinical paradigm, it is not the "real thing". Like the ROC, LROC and ROI methods, it is a laboratory study. There is an assumption that laboratory studies correlate with performance in the clinic. Having committed to a laboratory study, it is important to conduct the study in as optimal a manner as possible. This is not to suggest that free-response methods are inapplicable to clinical studies; rather, optimal power may not be achievable under actual clinical reading conditions. This means that greater resources, i.e., more readers and cases, are generally needed in a clinical study than in a laboratory study.

 

Truth (gold standard)

For normal images radiological truth can be retrospectively established by follow-up mammograms interpreted as normal.  Malignant lesions are proven by biopsy. The truth-panel consisting of expert mammographers should locate the known malignancies on the mammograms using all available clinical information. They should indicate the lesion boundary by outlining it. This is preferable to simply indicating the center, as the latter may not be well defined, especially for larger lesions.  In general the truth panel should not participate in the actual study.  It is possible that other regions on an abnormal image that were not biopsy-proven could be malignant, and then the question arises whether to score marks in these regions as LLs or NLs.  It is possible that once a woman is referred for a diagnostic workup all lesions are identified in which case this is a non-issue. The truth-panel should be consulted on the likelihood of this happening.

The definition of a closeness or proximity criterion should be agreed upon in consultation with the truth panel prior to commencement of the study.  The proximity criterion could be an acceptance radius or a percentage overlap of the outlined region and the true lesion.  It is good practice to analyze the data with different choices of proximity criteria (say half and double the agreed upon criterion) to determine if the conclusions are sensitive to this choice.
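The acceptance-radius version of the proximity criterion, together with the half and double sensitivity check, can be sketched as follows. This is a minimal illustration and not a prescribed implementation: the function names and the (x, y) tuple layout are assumptions, and for simplicity every mark within the radius of a lesion is labeled an LL, so several marks near one lesion are not collapsed here.

```python
import math

def score_marks(marks, lesion_centers, radius):
    """Classify each mark as 'LL' or 'NL' using an acceptance radius.

    marks: list of (x, y) mark locations on one image.
    lesion_centers: list of (x, y) true lesion centers on the same image.
    radius: acceptance radius, in the same units as the coordinates.
    """
    labels = []
    for mx, my in marks:
        # Distance from this mark to the nearest true lesion center.
        nearest = min(
            (math.hypot(mx - lx, my - ly) for lx, ly in lesion_centers),
            default=math.inf,  # normal image: no lesions, so every mark is an NL
        )
        labels.append("LL" if nearest <= radius else "NL")
    return labels

def sensitivity_to_radius(marks, lesion_centers, agreed_radius):
    """Re-score with half, the agreed, and double the acceptance radius."""
    return {
        factor: score_marks(marks, lesion_centers, factor * agreed_radius)
        for factor in (0.5, 1.0, 2.0)
    }
```

If the LL and NL labels, and hence the conclusions of the study, change materially between the three radii, the choice of criterion deserves further discussion with the truth panel.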

A concern is controlling the level of visibility of the lesions.  If the lesions are too easily visible the radiologists may not generate appreciable numbers of NLs and the data set cannot be analyzed.  One way to increase the level of difficulty is to use images from a previous screening where the lesion may be retrospectively visible.  Also one could include difficult normal images in the case set.  In either situation the case set may not be representative of the population one might wish to study, but this may be acceptable in the context of a laboratory study, especially for modality comparison studies where one is interested in differences and not in absolute values.

The radiologist should be told the general characteristics of the lesions, e.g., the range of lesion sizes, contrasts, average number of lesions per image, etc. They should not be forced to make a minimum number of marks; it is quite possible that some images will have no marks. Conversely, the investigator should not limit the number of marks the radiologist can make, but large numbers of marks per image and/or multiple marks in the same vicinity are unusual for mammography and probably indicate a misunderstanding of the task. This problem is more likely to occur in CAD, where the algorithm may generate multiple marks in the same vicinity. These are usually resolved by the algorithm designer using a clustering step prior to analysis.
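As an illustration of what a clustering step might look like, the sketch below merges marks that fall within a chosen distance of each other and keeps the highest-rated mark of each group. It is a simple greedy merge (in the spirit of non-maximum suppression), chosen here for brevity; it is not the method of any particular CAD designer, and the name cluster_marks, the (x, y, rating) layout and the merge_distance parameter are assumptions.

```python
import math

def cluster_marks(marks, merge_distance):
    """Greedily merge marks that lie closer together than merge_distance.

    marks: list of (x, y, rating) tuples from one image.
    Returns one representative per cluster: its highest-rated mark.
    """
    kept = []
    # Visit marks from highest to lowest rating so that each cluster is
    # represented by its top-rated mark.
    for x, y, rating in sorted(marks, key=lambda m: m[2], reverse=True):
        far_from_all_kept = all(
            math.hypot(x - kx, y - ky) >= merge_distance for kx, ky, _ in kept
        )
        if far_from_all_kept:
            kept.append((x, y, rating))
    return kept
```

For example, with merge_distance set to 5, two marks one unit apart are reduced to the higher-rated of the two, while a distant third mark is left untouched.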

Phantom or simulation studies are a convenient way of controlling truth and level of difficulty. The number of lesions per image should not be too large, especially for large lesions. A rule of thumb is one large mass per abnormal image, or 1 to 3 small masses per image. While statistical power does increase (modestly) with more lesions per image, having too many lesions detracts from the realism of the simulation study. Attention should be paid to the quality of the simulation (do the simulated masses resemble real masses?) and to the locations where the lesions are superimposed (not all locations are equally likely to harbor lesions). If a normal image with a superimposed lesion is used as a simulated abnormal image, the same image without the superimposed lesion must not be used as a normal image, as this would violate the independence assumption. Likewise, images from other views of an abnormal breast, or of the contralateral breast, must not be used as normal images, even though no lesions are visible on these views to the expert panel. If the backgrounds are also simulated (e.g., power-law noise) there is no justification, in my opinion, for using radiologists to interpret such images. Nothing in their training specially qualifies them to read such images, and using them in this mode is wasteful of clinical resources. Anybody with good eyesight and the ability to follow directions can, with sufficient training, be used to read simulated-background, simulated-lesion images.

In practice the mammographer interprets 4 views of the same patient (two per breast). Such multi-view studies are more difficult to score. If a lesion is marked in one view but not in the other, does it count as an LL? Or, if it is marked in both views, do the marks count as two LLs? The former approach is used in case-based scoring and the latter in view-based scoring. The free-response method can be used in either situation. Possible correlations of the multiple marks in the 4 views are properly handled by modern FROC resampling methods, to be described later, so long as the unit of resampling is the case, not the individual view.
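The point about the unit of resampling can be illustrated with a plain bootstrap, used here only as a stand-in for the FROC resampling methods referred to above. In this sketch, which assumes a hypothetical dictionary that maps each case ID to the marks from all of its views, whole cases are drawn with replacement so that the views of a case always stay together.

```python
import random

def bootstrap_cases(case_ids, rng):
    """Draw one bootstrap sample of case IDs, with replacement."""
    return [rng.choice(case_ids) for _ in case_ids]

def bootstrap_statistic(marks_by_case, statistic, n_boot=200, seed=0):
    """Bootstrap a statistic, resampling whole cases rather than views.

    marks_by_case: dict mapping case ID -> marks pooled over all views of
        that case (hypothetical layout).
    statistic: function taking a list of per-case mark lists and returning
        a number.
    """
    rng = random.Random(seed)
    case_ids = list(marks_by_case)
    values = []
    for _ in range(n_boot):
        sample = bootstrap_cases(case_ids, rng)
        # All views of a sampled case travel together, which preserves the
        # within-case correlation between marks.
        values.append(statistic([marks_by_case[c] for c in sample]))
    return values
```

Resampling individual views instead would treat correlated views of one patient as independent observations and would tend to understate the variability of the statistic.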

Localization accuracy

An important aspect of a free-response study is localization accuracy. A certain amount of inaccuracy is unavoidable with human observers, but if the spread is too large it may be difficult to score the marks unambiguously. For digital images the location should be indicated directly on the image with a mouse click. Anatomical statements such as "upper outer quadrant", "periareolar", etc., are too ambiguous. Having the radiologist mark a hard-copy schematic (as they often do in the clinic) and subsequently transferring the location to the digital image can lead to large inaccuracy. The free-response method gives larger statistical power than ROC because it gives credit for true detections and penalizes marks far from lesions; with increasing marking inaccuracy this distinction becomes increasingly blurred. With sufficiently large spread the free-response dataset will effectively degenerate into an ROC dataset, where any mark on an abnormal image would count as having "detected" the lesion, and any advantage of conducting a free-response study will have been lost. For films one can use an acrylic overlay, properly aligned to the image using anatomic landmarks, on which the radiologist marks the location(s) and writes the rating next to each mark. For small lesions the radiologist can generally indicate the center fairly accurately, but for larger ones they should be encouraged to outline the suspected lesion. The centroid of the outlined region is regarded as the location of the mark. Radiologists have different perceptions about what constitutes the lesion boundary, and the centroid method should result in higher inter- and intra-reader agreement than a single mark. A mark is scored as an LL or an NL according to its proximity to the nearest lesion center.
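One way to reduce a traced outline to a single mark location is the area centroid of the polygon formed by the traced points, computed with the standard shoelace formula. The sketch below assumes the outline is a simple (non-self-intersecting) closed polygon given as a list of (x, y) vertices; whether the area centroid or, say, the mean of the vertices is used is a design choice that the text above does not settle.

```python
def polygon_centroid(vertices):
    """Area centroid of a simple closed polygon given as [(x, y), ...]."""
    area2 = 0.0  # twice the signed area of the polygon
    cx = cy = 0.0
    n = len(vertices)
    for i in range(n):
        x0, y0 = vertices[i]
        x1, y1 = vertices[(i + 1) % n]  # wrap around to close the outline
        cross = x0 * y1 - x1 * y0
        area2 += cross
        cx += (x0 + x1) * cross
        cy += (y0 + y1) * cross
    if area2 == 0:
        raise ValueError("outline is degenerate (zero area)")
    # Centroid = sum / (6 * area) = sum / (3 * area2).
    return cx / (3.0 * area2), cy / (3.0 * area2)
```

For a rectangular outline with corners (0, 0), (4, 0), (4, 2) and (0, 2), this returns the center of the rectangle, (2, 1).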

For digital images the interface should be kept simple.  Unless the purpose of the study dictates otherwise common display functions such as window/level, zoom and pan and other available tools should be enabled.  When the radiologist clicks on a region a cross should be overlaid on the image and the rating recorded via a pop-up window (or slider for continuous ratings).  The numerical value should be overlaid next to the mark.  This is to help the radiologist keep track of the marks and ratings and not inadvertently mark a region multiple times. All overlay information should be capable of being toggled off or on (SHOW MARKS, HIDE MARKS) as otherwise they will interfere with the interpretation.  The location information (x and y for projection images and x, y and slice number for tomographic images) should be recorded.  When the radiologist has concluded interpreting they should click a DONE button allowing them to review their marks before clicking on NEXT CASE.  A TRACE LESION function should be provided.

Ratings scale

For human observers particular attention needs to be paid to the rating scale. The idea is to obtain operating points that adequately sample (or "straddle") the curve, as otherwise the methods described in this chapter may be unreliable. This is not a problem with CAD data, where the ratings are closely spaced and there are many more marks than with humans, both of which result in better sampling of the underlying curve. In screening mammography the radiologist assigns a Breast Imaging Reporting and Data System (BIRADS) rating 0 through 6 with the following meanings (Eberl et al. 2006):

0: incomplete: additional imaging evaluation needed
1: negative
2: definitely benign finding
3: probably benign finding
4: suspicious abnormality
5: highly suggestive for malignancy
6: known biopsy proven malignancy, treatment pending

For ratings 2 through 6 the location of the finding is indicated. Assuming the task is finding malignant lesions, the BIRADS-2 rating (benign finding) is irrelevant. The BIRADS-6 rating is also not relevant to a laboratory study. If no marks are made on the image, it is assumed that there is "nothing to report" and the BIRADS rating is 1. In a laboratory study the BIRADS-0 rating is irrelevant. Therefore the available ratings are 3, 4 and 5, which would give three operating points. It is important to get intermediate ratings. This can be done by using ratings such as 3.5 and 4.5, as described below. The intermediate ratings are intended to collect confidence level information on a finer scale, and the radiologist may be willing to provide this, keeping in mind that this is a laboratory study with no clinical consequences. The radiologists do not have to use the same rating scale. Some may be quite willing to provide quasi-continuous ratings (e.g., 1 through 100). The rating scale can be tailored to the radiologist.

Preliminary FROC curves and feedback

It is important to ascertain that the radiologists are familiar with the tasks of the observer study and the user interface. One or more training sessions should be conducted using data from a separate training set to familiarize the radiologists with the task, and to provide feedback to both the radiologists and the investigators so that proper FROC data can be collected in the actual observer study. Preliminary FROC curves should be constructed from the training set data. See for example Fig. 6 in (Bunch et al. 1978). If all points fall on the y-axis the radiologist needs to be told that their NLF is zero, which is good, but that they are missing a significant fraction of lesions, e.g., 40%, that were visible to the expert panel. This knowledge, and the fact that this is a laboratory study with no clinical consequences, may induce the radiologist to be more aggressive in reporting lesions. If a radiologist operates along the vertical axis and detects all the lesions, by definition the radiologist is perfect. This implies the task is too easy and the case set needs to be modified accordingly. As a rule of thumb, the radiologists should be able to detect between 60% and 80% of the lesions at their most lax criterion. An unduly difficult task may discourage active participation in the study. Using the ROC-equivalent rating, the FPF should be greater than about 40%: with the scales in Table 1, about 40% or more of negative cases should contain a mark with a BI-RADS rating of 3 or above. If FPF is too low, too few NLs are being made, and the figures of merit become unreliable, leading to loss of measurement precision.
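The checks described above can be computed from the training-set data along the following lines. This is a minimal sketch under stated assumptions: NLF is taken as the number of NLs per image and LLF as the fraction of lesions localized, both counted at or above each rating threshold, and the ROC-equivalent rating of a case is assumed to be the rating of its highest-rated mark. The function names and data layout are hypothetical.

```python
def froc_points(nl_ratings, ll_ratings, n_images, n_lesions, thresholds):
    """Return (NLF, LLF) operating points, one per rating threshold.

    nl_ratings, ll_ratings: ratings of all NL and LL marks, pooled over images.
    NLF = NLs per image; LLF = fraction of lesions localized.
    """
    points = []
    for t in sorted(thresholds, reverse=True):  # strictest criterion first
        nlf = sum(r >= t for r in nl_ratings) / n_images
        llf = sum(r >= t for r in ll_ratings) / n_lesions
        points.append((nlf, llf))
    return points

def roc_equivalent_fpf(marks_on_normal_cases, threshold):
    """Fraction of normal cases whose highest-rated mark is >= threshold.

    marks_on_normal_cases: one list of mark ratings per normal case; an
        empty list means the case was left unmarked.
    """
    flagged = sum(
        1 for ratings in marks_on_normal_cases
        if ratings and max(ratings) >= threshold
    )
    return flagged / len(marks_on_normal_cases)
```

The last point returned by froc_points corresponds to the most lax criterion; its LLF can be compared with the 60% to 80% rule of thumb, and roc_equivalent_fpf evaluated at the lowest rating can be compared with the 40% guideline.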

 

Sample protocol

It is essential to have a written protocol duly acknowledged by all participants prior to commencement of the study. An example follows.

Purpose of the study

The purpose of the study is to compare digital mammography with digital tomosynthesis of the breast for detection of malignant masses ranging in diameter from 2 mm to 1 cm.

The task

In each modality you will be interpreting 50 normal images, and 50 abnormal images with 1 to 2 malignant masses per image (mean 1.3).  Your aim is to attempt to find all the malignant masses. You should not report masses that are definitely benign.  Indicate the locations of ("mark") all regions that are suspicious for malignancy. The total number of marks will equal the total number of suspicious regions, which need not equal the maximum number of masses (two).  The analysis gives credit for masses that are marked and rated high, and penalizes false positives that are rated high. Your score will only be used in de-identified form and will not be revealed to others. The marks are to be made using the cursor and the rating is to be made on a pop-up menu (or slider) using any one of the scales described in the following table (which scale you will use will be mutually agreed upon during the training sessions).  Images for which you do not mark any suspicious lesions will correspond to either BIRADS-1 or BIRADS-2.

Table 1: Suggested ratings scales for a screening mammography free-response study.

Slider scale (% probability of malignancy) | Integer scale | Fractional scale | Meaning                            | Approximate BIRADS rating
------------------------------------------ | ------------- | ---------------- | ---------------------------------- | -------------------------
< 2                                        | 1             | 3.0              | possibly suspicious for malignancy | BIRADS 3
3-25                                       | 2             | 3.5              | probably suspicious for malignancy |
26-50                                      | 3             | 4.0              | suspicious for malignancy          | BIRADS 4
51-95                                      | 4             | 4.5              | quite suspicious for malignancy    |
95-100                                     | 5             | 5.0              | highly suspicious for malignancy   | BIRADS 5

The center of a suspicious region needs to be marked with reasonable accuracy as it affects the analysis. If you are uncertain about the center location, perhaps because the lesion is diffuse or irregular in shape, trace the outline of the region using the TRACE function. If the same lesion is visible in both views of the breast, mark both of them but give them the same malignancy rating. For the tomosynthesis images, mark the lesion on the slice that best visualizes it. The marks can be toggled on and off to reduce distraction (SHOW MARKS, HIDE MARKS buttons). When you have completed interpretation of the case, click the DONE button. A summary of all your data for that image will be shown. At this point you may click either the Modify button, to correct data entry errors or accidental clicks, or the Next Case button.

We recognize that the data collection is more complex than other observer studies that you may have participated in.  Prior to the study we will meet to discuss the study to make sure it is clear to you.  A training session with approximately 30 cases will be conducted to clarify the procedures, familiarize you with the user interface and to ascertain that the rating scale is being properly used.

P. C. Bunch, J. F. Hamilton, G. K. Sanderson and A. H. Simmons, "A Free-Response Approach to the Measurement and Characterization of Radiographic-Observer Performance," J. Appl. Photogr. Eng. 4 (4), 166-171 (1978).
M. M. Eberl, C. H. Fox, S. B. Edge, C. A. Carter and M. C. Mahoney, "BI-RADS Classification for Management of Abnormal Mammograms," Journal of the American Board of Family Medicine 19, 161-164 (2006).



Dev P. Chakraborty, Ph. D. | 2103 Noble Ct, Murrysville PA 15668 | ©2005 DevChakraborty.com