DevChakraborty.com

Why FROC? Why not ROC?


Overview

The basic reason is increased measurement precision: because FROC scoring uses the location information that ROC scoring discards, it allows more precise measurement of inter-modality differences. This can be seen from the following analogy. (Note that if localization is not a factor in the clinical task, e.g., deciding whether an already found lesion is malignant, then ROC methods are quite appropriate.)

Observer performance viewed as grading a "test" or "exam"

A useful analogy can be made between observer performance and a student taking a test.  The point of the analogy is that the more accurately one scores the test, the better one can estimate the student's ability and distinguish between superior and poor students.  Consider a hypothetical multiple-choice test in which each question consists of many statements, each of which could be true or false.  The student's task is to "check off" (mark) the true statements and assign a number (rating) to each marked statement.  The number reflects the student's confidence that the statement is true: for example, 4 = highly confident, 3 = quite confident, 2 = somewhat confident, 1 = low (but non-zero) confidence.  Statements that are obviously untrue are ignored by the student, since putting a zero next to them would be redundant.  Each question contains zero or more true statements, and their number is unknown to the student.  If a particular question has no true statements, the "ideal" student would ignore all of the statements and not assign even a 1 to any of them.  If a question has two true statements, the ideal student would assign a 4 to each of them and ignore the rest.

We have just described a free-response task.  In interpreting an image for possible breast cancer there could be zero or more malignant lesions.  The mammographer does not know a priori how many lesions are present or where they could be located, and therefore each image is searched for regions that appear suspicious for cancer.  If the level of suspicion of a particular region exceeds the threshold for clinical reporting, the mammographer reports it.  For example, a report could be dictated as "this patient has a very suspicious lesion, the lesion is here and the patient should be seen immediately for a diagnostic workup"; or "this patient has two suspicious regions, one here and one here and both are minimally suspicious and the patient needs short term follow-up and needs to be seen in 6 months"; or "this patient appears completely normal and needs to be seen next year as part of her regular screening regimen".

In a free-response study the level of suspicion is recorded as a rating (e.g., 4 = highly suspicious for cancer, etc., similar to the BI-RADS reporting scale), and the unit of data is a mark-rating pair, where the mark refers to the physical location of a suspicious region and the rating is the degree of suspicion.  By adopting a suitable "nearness" criterion the investigator classifies each mark as a lesion localization (LL) or a non-lesion localization (NL).  An LL occurs when the mark is sufficiently close to a true lesion; otherwise the mark is an NL.

In the test analogy, LL corresponds to marking a true statement and NL to marking a false statement.  The questions correspond to cases and the student corresponds to a radiologist.  The analog of imaging modalities could be different teaching methods, and the analog of modality comparison could be determining the optimal teaching method.

[We make a distinction between suspicious regions and lesions: radiologists do not report lesions; rather, they report suspicious regions that appear to be lesions.  We reserve the term "lesions" for true lesions, established independently.  Also, we avoid using the terms true positive, false positive or detection in the FROC context, as they are widely used in studies where no location information is collected.]
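The LL/NL classification can be illustrated with a minimal Python sketch. It assumes a simple distance-based nearness criterion: the function name `classify_marks`, the coordinate layout and the `radius` value are all hypothetical choices made for illustration, and the sketch does not address what to do when several marks fall near the same lesion.

```python
import math

def classify_marks(marks, lesions, radius):
    """Label each (x, y, rating) mark as 'LL' or 'NL'.

    A mark is an LL if it lies within `radius` of any true lesion center,
    otherwise it is an NL. The distance rule is an assumed nearness
    criterion; real studies choose their own.
    """
    labelled = []
    for x, y, rating in marks:
        near = any(math.hypot(x - lx, y - ly) <= radius for lx, ly in lesions)
        labelled.append(("LL" if near else "NL", rating))
    return labelled

# Hypothetical image with one true lesion and two marks.
lesions = [(10.0, 10.0)]
marks = [(11.0, 9.0, 4), (40.0, 25.0, 2)]
labelled = classify_marks(marks, lesions, radius=5.0)
print(labelled)  # [('LL', 4), ('NL', 2)]
```

The first mark falls close to the lesion and becomes an LL rated 4; the second is far from any lesion and becomes an NL rated 2.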

Returning to the test example, how does one score the test?  The test data consists of a variable number (0, 1, 2, ...) of marked statements per question, with ratings 1 through 4 attached to them.  One way of analyzing the data, admittedly crude, would be to separate the questions into two groups: those with at least one correct statement (abnormal) and those with no correct statements (normal).  This is what is done in ROC scoring, where each question is scored by its highest-rated mark.  Consider a question from the abnormal group in which a statement, which could be an incorrect statement, is rated 3 and all other statements are rated ≤ 3 or unmarked.  ROC analysis scores this as a true positive at level 3.  Likewise, if on a question from the normal group a statement is rated 2 and all other statements are rated ≤ 2 or unmarked, it is scored as a false positive at level 2.  Questions with no marks are assigned a default "0" rating.  ROC operating points are calculated in the usual manner by cumulating counts in each rating bin.  This leads to 4 non-trivial operating points (cumulating no questions yields the origin, and including the "0" rated questions yields the upper right-hand corner of the ROC plot).
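The ROC scoring just described can be sketched in a few lines of Python. The function names `roc_rating` and `operating_points` and the sample questions are made up for illustration; this is a sketch of the cumulation rule, not a full ROC analysis.

```python
def roc_rating(mark_ratings):
    """ROC rating of a question: its highest mark rating, or 0 if unmarked."""
    return max(mark_ratings, default=0)

def operating_points(normal_ratings, abnormal_ratings, levels=(4, 3, 2, 1)):
    """Cumulate counts at each rating level to get (FPF, TPF) pairs."""
    points = []
    for t in levels:
        fpf = sum(r >= t for r in normal_ratings) / len(normal_ratings)
        tpf = sum(r >= t for r in abnormal_ratings) / len(abnormal_ratings)
        points.append((fpf, tpf))
    return points

# Hypothetical data: the ratings of the marks made on each question.
normal_questions = [[], [1], [2, 1], []]
abnormal_questions = [[4, 2], [3], [], [1, 1]]

normal = [roc_rating(q) for q in normal_questions]      # [0, 1, 2, 0]
abnormal = [roc_rating(q) for q in abnormal_questions]  # [4, 3, 0, 1]
points = operating_points(normal, abnormal)
print(points)  # [(0.0, 0.25), (0.0, 0.5), (0.25, 0.5), (0.5, 0.75)]
```

Note that `roc_rating` never looks at whether a mark was on a true or a false statement, which is exactly the information loss discussed in the text.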

The ROC curve is the plot of true positive fraction (TPF) vs. false positive fraction (FPF).  ROC analysis measures the ability of the mammographer to separate normal from abnormal images.  The area under the ROC curve (AUC) is the probability of correct classification in a paired comparison, i.e., the probability that a randomly chosen abnormal image is rated higher than a randomly chosen normal image.  If one repeats the study with the same mammographer and the same set of patients, but with images from a different modality, say breast CT instead of conventional screen-film mammography, and AUC increases by a statistically significant amount, then one can conclude that breast CT is superior.  Sampling considerations determine the number of readers and cases needed to detect an improvement of a specified amount (the effect size) with adequate statistical power; this is a very important quantity, as it determines the cost of conducting the study.
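As an illustration of the AUC, the following Python sketch computes the empirical (trapezoidal) area using the Wilcoxon/Mann-Whitney statistic: the fraction of normal-abnormal pairs in which the abnormal case is rated higher, with ties counted as one half. The function name `empirical_auc` and the ratings are hypothetical, and fitted (parametric) ROC curves would generally give a somewhat different value.

```python
def empirical_auc(normal_ratings, abnormal_ratings):
    """Empirical AUC via the Wilcoxon/Mann-Whitney statistic.

    Each (abnormal, normal) pair scores 1 if the abnormal rating is higher,
    0.5 if the two ratings tie, and 0 otherwise.
    """
    total = 0.0
    for a in abnormal_ratings:
        for n in normal_ratings:
            if a > n:
                total += 1.0
            elif a == n:
                total += 0.5
    return total / (len(normal_ratings) * len(abnormal_ratings))

# Hypothetical highest-rating scores for 4 normal and 4 abnormal cases.
normal = [0, 1, 2, 0]
abnormal = [4, 3, 0, 1]
auc = empirical_auc(normal, abnormal)
print(auc)  # 0.71875
```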

No statistician would score the hypothetical test in the manner just described: there is simply too much loss of information.  The fact that any marked statement on the abnormal set is counted as a true positive means that a student who marks a false statement at level 3 and another who marks a true statement at level 3 would be indistinguishable.  Likewise, a student who marked two true statements at level 3 would be indistinguishable from one who marked only one.  In the imaging context, a radiologist who missed a lesion and marked a normal region is not equivalent to one who marked the lesion and did not mark a normal region.  The clinical consequence of the two canceling mistakes (a false negative and a false positive) made by the first radiologist is serious.  The fact that the two are scored the same implies suboptimal ability to detect differences between readers and/or modalities, i.e., low statistical power.
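The loss of information can be made concrete with a small Python sketch. Three hypothetical students answer the same abnormal question; the names `student_a`, `student_b`, `student_c` and the helper functions are invented for illustration. Each mark is stored as an (is_true, rating) pair.

```python
def roc_rating(marks):
    """Question-level ROC score: the highest rating, ignoring mark correctness."""
    return max((rating for _, rating in marks), default=0)

def tally(marks):
    """Statement-level tally: (marks on true statements, marks on false ones)."""
    n_true = sum(1 for is_true, _ in marks if is_true)
    n_false = sum(1 for is_true, _ in marks if not is_true)
    return n_true, n_false

student_a = [(False, 3)]            # marked a false statement at level 3
student_b = [(True, 3)]             # marked a true statement at level 3
student_c = [(True, 3), (True, 3)]  # marked two true statements at level 3

roc_scores = [roc_rating(s) for s in (student_a, student_b, student_c)]
tallies = [tally(s) for s in (student_a, student_b, student_c)]
print(roc_scores)  # [3, 3, 3]
print(tallies)     # [(0, 1), (1, 0), (2, 0)]
```

All three students receive the same question-level score, while the statement-level tallies tell them apart.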

The localization ROC or LROC paradigm was born out of the need to do "something" about the location information.  In LROC the observer marks and rates the single most suspicious region in the image.  Each image generates one, and exactly one, mark-rating pair (this is why LROC data, like ROC data, is well-structured).  On normal images the rating is used to calculate FPF.  On abnormal images, if the mark is close to the lesion the image is scored as a correct localization; otherwise it is scored as an incorrect localization.  The LROC curve is the plot of the probability of correct localization (PCL) vs. FPF.
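A minimal Python sketch of LROC scoring follows, assuming the correct/incorrect localization decision has already been made for each abnormal image. The function name `lroc_points` and the data are hypothetical; since every image must receive a mark, the ratings here run from 1 to 4 with no "0" bin.

```python
def lroc_points(normal_ratings, abnormal_results, levels=(4, 3, 2, 1)):
    """Return (FPF, PCL) pairs, one per rating level.

    normal_ratings: one rating per normal image.
    abnormal_results: one (rating, correctly_localized) pair per abnormal image.
    """
    points = []
    for t in levels:
        fpf = sum(r >= t for r in normal_ratings) / len(normal_ratings)
        # Only correctly localized marks at or above the level count toward PCL.
        pcl = sum(1 for r, ok in abnormal_results if ok and r >= t) / len(abnormal_results)
        points.append((fpf, pcl))
    return points

# Hypothetical data: 4 normal and 4 abnormal images.
normal_ratings = [1, 2, 1, 3]
abnormal_results = [(4, True), (3, False), (2, True), (1, True)]
points = lroc_points(normal_ratings, abnormal_results)
print(points)  # [(0.0, 0.25), (0.25, 0.25), (0.5, 0.5), (1.0, 0.75)]
```

Because of the one incorrect localization, PCL ends at 0.75 rather than 1 even when every image is counted.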

Coming back to the test example, the data as collected cannot be analyzed by the LROC method.  This is because there could be questions where the student does not assign any rating, i.e., ignores all the statements, because they are all patently false.  In the LROC method the student must indicate the truest of the patently false statements and rate it and only then can the data be analyzed by the method.  This is an unnatural task that has created problems in clinical LROC data collection.  Nevertheless, LROC is widely used in nuclear medicine applications.

The statistician analyzing the hypothetical test would probably not use the LROC method either.  While it includes more information than ROC, it still ignores a lot of information, and in some cases it forces the student into an unnatural task.  One can make up examples demonstrating scoring inconsistencies when using the LROC method, but we will not belabor the point.

In FROC analysis one analyzes all the mark-rating data.  The student who assigns a 4 rating to a true statement (LL) is scored differently from one who assigns a 4 rating to a false statement (NL).  The scoring is at the level of the statements, not the questions.  If the mammographer misses a lesion and marks a non-lesion, the mark is scored as an NL; if the lesion is marked and the non-lesion is ignored, the mark is scored as an LL, clearly distinguishing between the two cases.  Images with no marks also provide valuable information: we will show later that they influence one of the search model parameters.  As one may suspect, analysis of this type of data is challenging because the data is not well-structured: the number of data units on any question is a priori unknown.  In 1961 Egan et al. stated "the method of free response is particularly difficult to analyze simply because a trial is not defined…".  It is not a stretch to say that solving the FROC problem has been one of the major challenges in imaging science.
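Mark-level scoring can be sketched in Python as follows. The sketch cumulates LL and NL counts at each rating level and, as an assumption not spelled out in the text above, normalizes them in the conventional FROC manner: LL counts by the total number of lesions and NL counts by the total number of images. The function name `froc_points` and the cases are hypothetical, and this is only the bookkeeping step, not a statistical analysis of FROC data.

```python
def froc_points(cases, levels=(4, 3, 2, 1)):
    """Return (NL fraction, LL fraction) pairs, one per rating level.

    cases: list of dicts with 'n_lesions' and 'marks', where each mark is a
    (label, rating) pair with label 'LL' or 'NL'.
    Assumed normalization: NLs per image and LLs per lesion.
    """
    n_images = len(cases)
    n_lesions = sum(c["n_lesions"] for c in cases)
    points = []
    for t in levels:
        nl = sum(1 for c in cases for lab, r in c["marks"] if lab == "NL" and r >= t)
        ll = sum(1 for c in cases for lab, r in c["marks"] if lab == "LL" and r >= t)
        points.append((nl / n_images, ll / n_lesions))
    return points

# Hypothetical data set: 4 images containing 3 lesions in total.
cases = [
    {"n_lesions": 0, "marks": []},                      # unmarked normal image
    {"n_lesions": 0, "marks": [("NL", 2)]},
    {"n_lesions": 1, "marks": [("NL", 3)]},             # lesion missed, non-lesion marked
    {"n_lesions": 2, "marks": [("LL", 4), ("LL", 1), ("NL", 1)]},
]
points = froc_points(cases)
for nlf, llf in points:
    print(round(nlf, 3), round(llf, 3))
```

The third case, a missed lesion plus a marked non-lesion, adds to the NL count without adding to the LL count, so it is scored differently from a case in which the lesion itself was marked.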


 



Dev P. Chakraborty, Ph. D. | 2103 Noble Ct, Murrysville PA 15668 | ©2005 DevChakraborty.com