
For this discussion, a conservative test is one that classifies a sample as positive only when it is very sure (very few false positives). A sloppy test is one that is very quick to classify samples as positive (lots of false positives).

Metrics that quantify how positive cases are classified (sum to 1):
 True Positive Rate (Sensitivity/Hit Rate/Recall):
 $\hat{P}/P \simeq P(\mbox{positive classification} \mid \mbox{positive sample})$
 Evaluates how good we are at spotting positive cases.
 A conservative test would have low sensitivity but a very sloppy test that indicates a positive result for all samples would have a sensitivity of 1.
 False Negative Rate (Miss Rate/Type II Error):
 $\bar{P}/P \simeq P(\mbox{negative classification} \mid \mbox{positive sample})$
 A conservative test would have a high false negative rate and a sloppy test would have a low one.
 Since these two metrics sum to 1, they carry the same information and reward the same kind of test (a conservative test does poorly on both and a sloppy test does well). This is likely why most people only talk about sensitivity.
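The two positive-case rates above can be computed directly from the counts of true positives and false negatives. A minimal sketch (the counts here are made up for illustration):

```python
# Sensitivity (TPR) and miss rate (FNR) from positive-case counts:
# tp = positive samples classified positive, fn = positive samples
# classified negative. The counts below are hypothetical.
def positive_case_rates(tp, fn):
    p = tp + fn  # total positive samples
    return tp / p, fn / p  # (sensitivity, miss rate)

tpr, fnr = positive_case_rates(tp=80, fn=20)
assert abs(tpr + fnr - 1.0) < 1e-12  # the two rates sum to 1
print(tpr, fnr)  # 0.8 0.2
```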

Metrics that quantify how negative cases are classified (sum to 1):
 False Positive Rate (False Alarm Rate/FallOut/Type I Error):
 $\hat{N}/N \simeq P(\mbox{positive classification} \mid \mbox{negative sample})$
 Indicates the proportion of negative samples that are classified as positive.
 Conservative tests do well here because they only classify a sample as positive when they are very sure. A sloppy test will score poorly since it lumps many negative cases in with its positive classifications.
 True Negative Rate (Specificity):
 $\bar{N}/N = (N-\hat{N})/N \simeq P(\mbox{negative classification} \mid \mbox{negative sample})$
 Conservative tests do well here since they mistake very few negative cases for positive. Sloppy tests do poorly.
 Both of these metrics reward a conservative test.
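The negative-case rates mirror the positive-case ones. A short sketch, again with hypothetical counts, showing the extreme case of the sloppy everything-is-positive test:

```python
# False positive rate (fall-out) and specificity from negative-case
# counts: fp = negatives classified positive, tn = negatives classified
# negative. The counts are hypothetical.
def negative_case_rates(fp, tn):
    n = fp + tn  # total negative samples
    return fp / n, tn / n  # (false positive rate, specificity)

# A sloppy test that labels every sample positive has fp = N, tn = 0:
fpr, tnr = negative_case_rates(fp=100, tn=0)
print(fpr, tnr)  # 1.0 0.0
```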

The classic tradeoff made by ROC curves is true positive rate (sensitivity) versus false positive rate (false alarm rate) to strike the right balance between conservative and sloppy testing.
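An ROC curve is traced out by sweeping a decision threshold over the test's scores and recording (false positive rate, true positive rate) at each setting. A self-contained sketch with made-up scores and labels:

```python
# Sketch of an ROC curve: each threshold yields one (FPR, TPR) point.
# Low thresholds are sloppy (high TPR but high FPR); high thresholds
# are conservative (low FPR but low TPR). Data is made up.
scores = [0.1, 0.3, 0.35, 0.4, 0.6, 0.7, 0.8, 0.9]
labels = [0,   0,   1,    0,   1,   0,   1,   1]  # 1 = positive sample

def roc_points(scores, labels):
    p = sum(labels)          # total positive samples
    n = len(labels) - p      # total negative samples
    points = []
    for t in sorted(set(scores)) + [1.1]:  # 1.1 = "classify nothing positive"
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        points.append((fp / n, tp / p))
    return points

for fpr, tpr in roc_points(scores, labels):
    print(f"FPR={fpr:.2f}  TPR={tpr:.2f}")
```

The lowest threshold classifies everything as positive (FPR = TPR = 1), the highest classifies nothing as positive (FPR = TPR = 0), and the points in between trace the tradeoff.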

There are a variety of other metrics that attempt to quantify overall test performance:
 Accuracy:
 $(\hat{P}+\bar{N})/(N+P) \simeq P(\mbox{correct classification})$
 Rewards both sensitivity (large $\hat{P}$) and specificity (small $\hat{N}$ or large $\bar{N}$).
 Usually a poor metric for rare events, since even a very insensitive test can achieve high accuracy: when positives are rare, $\bar{N}$ dominates the numerator.
 Confidence/Precision/Positive Predictive Value:
 $\hat{P}/(\hat{P}+\hat{N}) \simeq P(\mbox{positive sample} \mid \mbox{positive classification})$
 A conservative test with a large number of false negatives could have very high confidence.
 False Discovery Rate:
 $\hat{N}/(\hat{P}+\hat{N}) \simeq P(\mbox{negative sample} \mid \mbox{positive classification})$
 False Omission Rate:
 $\bar{P}/(\bar{P}+\bar{N}) \simeq P(\mbox{positive sample} \mid \mbox{negative classification})$
 Negative Predictive Value:
 $\bar{N}/(\bar{P}+\bar{N}) \simeq P(\mbox{negative sample} \mid \mbox{negative classification})$
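All of these overall metrics come from a single confusion matrix. A sketch using the same quantities as above (true positives, false negatives, false positives, true negatives), with hypothetical counts:

```python
# Overall test metrics from one confusion matrix. Each complementary
# pair (precision/FDR, FOR/NPV) sums to 1. Counts are hypothetical.
def summary_metrics(tp, fn, fp, tn):
    return {
        "accuracy": (tp + tn) / (tp + fn + fp + tn),
        "precision": tp / (tp + fp),              # positive predictive value
        "false_discovery_rate": fp / (tp + fp),
        "false_omission_rate": fn / (fn + tn),
        "negative_predictive_value": tn / (fn + tn),
    }

m = summary_metrics(tp=30, fn=10, fp=5, tn=955)
assert abs(m["precision"] + m["false_discovery_rate"] - 1.0) < 1e-12
assert abs(m["false_omission_rate"] + m["negative_predictive_value"] - 1.0) < 1e-12
```

Note how the rare-event pitfall shows up in these numbers: accuracy is 0.985 even though a quarter of the true positives were missed.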

Detecting chronic shelter use:
 Accuracy is a poor metric since only 3% (clustering) or 4.8% (DI definition) of DI clients are chronic.
 Confidence is important. If a test indicates a client is chronic, we want to make sure that they actually are.
 Sensitivity is also important since we want to ensure we’re catching everyone who needs help.
 We can use false discovery rate to estimate how many folks are getting help who may not need it.
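A quick worked example of why accuracy fails here, using the 4.8% prevalence from the DI definition above (the cohort size is made up):

```python
# At 4.8% prevalence, a useless test that classifies every client as
# "not chronic" still scores ~95% accuracy while catching no one.
clients = 10_000
chronic = int(clients * 0.048)      # 480 positive cases
not_chronic = clients - chronic     # 9520 negative cases

# "Always negative" test: tp = 0, fn = chronic, fp = 0, tn = not_chronic
accuracy = (0 + not_chronic) / clients
sensitivity = 0 / chronic
print(accuracy)     # 0.952
print(sensitivity)  # 0.0
```

This is why confidence and sensitivity, not accuracy, are the metrics to watch for this problem.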