Evaluation without Ground Truth

Kathryn Laskey


A perennial problem in evaluating fusion methods is identifying ground-truth data against which to measure accuracy. The problem is particularly acute for high-level fusion methods, where some relations of interest may have no definitive ground truth at all. For example, we may be interested in whether an individual "is affiliated with" or "sympathizes with" a given organization. These are fuzzy propositions that may not have a definite truth value. Even when a definite truth value exists, it may be unobservable, or observation may be prohibitively expensive or time-consuming.
One surrogate for ground-truth data is judgmental labeling by experts. Such labels are typically also very expensive to generate, and they can be noisy: reasonable experts often disagree on the correct label for a given item. Data sets generated in this way are often too small to support statistically valid results.
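One common way to quantify how much two experts disagree is Cohen's kappa, which corrects raw agreement for the agreement expected by chance. The following is a minimal sketch, not a recommended procedure; the label names and the two label lists are made up for illustration.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two annotators on the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Fraction of items on which the two annotators gave the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected if each annotator labeled independently,
    # using their own label frequencies.
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1.0 - expected)

# Hypothetical labels from two experts on the same eight items.
expert_1 = ["affiliated", "affiliated", "not", "not",
            "affiliated", "not", "not", "affiliated"]
expert_2 = ["affiliated", "not", "not", "not",
            "affiliated", "not", "affiliated", "affiliated"]

kappa = cohens_kappa(expert_1, expert_2)
print(f"kappa = {kappa:.2f}")
```

Here the experts agree on 6 of 8 items, but half of that agreement would be expected by chance, so kappa is well below the raw agreement rate. A low kappa suggests that the expert labels themselves are too noisy to serve as a stand-in for ground truth without further adjudication.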
Another approach to obtaining a surrogate for ground truth is simulation. For low-level fusion, simulation is fairly straightforward. A key issue for high-level fusion is simulating rich semantics, including various types of interacting entities, and noise in observations. Ideal capabilities of a modeling and simulation (M&S) environment to support high-level fusion are discussed in [1]. A challenge with simulation is to ensure realism: if the data for evaluating fusion algorithms are generated by a simulation, then the algorithms are graded on how well they reproduce the simulated scenarios from the simulated data. This translates into effective evaluation for the real world only to the extent that the simulation is faithful to the world in relevant ways.
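To make the realism caveat concrete, here is a toy sketch of simulation-based evaluation. Everything in it is an assumption made for illustration: a single binary proposition per case, five independent sensors with a 20% error rate, and majority voting as the fusion rule. It is far simpler than the rich semantics that high-level fusion requires.

```python
import random

def simulate_case(rng, n_sensors=5, error_rate=0.2):
    """Draw a hidden truth value and noisy sensor reports about it."""
    truth = rng.random() < 0.5
    # Each sensor independently reports the truth, or flips it with
    # probability error_rate.
    reports = [truth if rng.random() >= error_rate else (not truth)
               for _ in range(n_sensors)]
    return truth, reports

def majority_vote(reports):
    """Toy fusion rule: True if more than half of the reports are True."""
    return sum(reports) * 2 > len(reports)

def evaluate(fuse, n_cases=2000, seed=0, **sim_kwargs):
    """Score a fusion rule against the simulator's own ground truth."""
    rng = random.Random(seed)
    correct = 0
    for _ in range(n_cases):
        truth, reports = simulate_case(rng, **sim_kwargs)
        correct += fuse(reports) == truth
    return correct / n_cases

accuracy_fused = evaluate(majority_vote)
accuracy_single = evaluate(lambda reports: reports[0])  # baseline: one sensor
print(f"fused: {accuracy_fused:.3f}, single sensor: {accuracy_single:.3f}")
```

The scores measure only how well each rule inverts the simulator's noise model. If real sensors have correlated errors, for instance, majority voting would look better here than it would perform in the field.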
An interesting approach to generating evaluation data is to use gaming technologies, whose visualizations and game-engine behavior generators can help improve the realism of simulated high-level agent behaviors. For example, [2] describes a proof-of-principle demonstration of the use of gaming technologies and platforms for designing and evaluating fusion algorithms. The approach of [3] is to define a standardized API for exchanging information between M&S applications and gaming platforms.
Borrowing ideas from semi-supervised learning [4, 5] as practiced by the machine learning community may be helpful. Semi-supervised learning leverages a small number of labeled examples by combining them with a larger corpus of unlabeled examples.  Experience has shown that proper use of unlabeled examples can improve performance of a learner.
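As an illustration of the idea, the sketch below uses self-training, one of the simplest semi-supervised techniques: fit a classifier on the labeled examples, pseudo-label the unlabeled examples it is most confident about, and refit. The nearest-centroid classifier, the one-dimensional data, and the batch size are all assumptions chosen to keep the example small. They are not taken from [4, 5].

```python
def centroids(points, labels):
    """Mean of the points assigned to each label."""
    sums, counts = {}, {}
    for x, y in zip(points, labels):
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}

def predict(cents, x):
    """Return (label, margin) for x; assumes at least two classes.

    margin is how much closer x is to the nearest centroid than to the
    second nearest, used here as a crude confidence score.
    """
    ranked = sorted(cents, key=lambda y: abs(x - cents[y]))
    best, runner_up = ranked[0], ranked[1]
    margin = abs(x - cents[runner_up]) - abs(x - cents[best])
    return best, margin

def self_train(labeled_x, labeled_y, unlabeled_x, batch=2):
    """Repeatedly pseudo-label the most confident unlabeled points and refit."""
    xs, ys = list(labeled_x), list(labeled_y)
    pool = list(unlabeled_x)
    while pool:
        cents = centroids(xs, ys)
        scored = sorted(pool, key=lambda x: predict(cents, x)[1], reverse=True)
        for x in scored[:batch]:
            xs.append(x)
            ys.append(predict(cents, x)[0])
            pool.remove(x)
    return centroids(xs, ys)

# Two labeled examples and eight unlabeled ones (made-up data).
labeled_x, labeled_y = [2.5, 9.5], ["a", "b"]
unlabeled_x = [0.0, 0.5, 1.0, 1.5, 6.5, 7.0, 7.5, 8.0]

initial = centroids(labeled_x, labeled_y)
final = self_train(labeled_x, labeled_y, unlabeled_x)
print("labeled only:   ", initial, "->", predict(initial, 5.0)[0])
print("with unlabeled: ", final, "->", predict(final, 5.0)[0])
```

With only the two labeled points, the decision boundary sits at their midpoint. After self-training, the centroids move toward the centers of the two clusters in the unlabeled data and the boundary shifts with them, so a point at 5.0 changes class. Whether this kind of shift helps depends on whether the clusters really correspond to the classes, which is the sense in which unlabeled data must be used properly.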
The working group will need to tackle problems in which “truthed” data are rare or non-existent. We need evaluation techniques that can make use of whatever form of “truthing” is available – definitive measurement, expert judgment, simulation – and leverage it to best advantage. When resources for “truthing” are scarce, we should also devise principled means of prioritizing the labeling process (e.g., using value-of-information concepts to determine the cases for which obtaining ground truth is most important).
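One simple reading of the value-of-information idea can be sketched as follows. Suppose a fusion system outputs a probability that a proposition is true for each case, and a decision-maker will either act on it or ignore it. The expected value of perfect information (EVPI) for a case is the expected loss that learning the truth would avoid, and cases can be ranked by it. The two-action decision model, the cost values, the labeling cost, and the case identifiers below are all hypothetical.

```python
def evpi(p_true, cost_false_alarm=1.0, cost_miss=5.0):
    """Expected value of perfect information for one binary case.

    Assumes a correct decision costs nothing, so the expected loss with
    perfect information is zero and EVPI equals the expected loss of the
    best action under the current belief p_true.
    """
    loss_if_act = (1.0 - p_true) * cost_false_alarm
    loss_if_ignore = p_true * cost_miss
    return min(loss_if_act, loss_if_ignore)

def prioritize(cases, budget, label_cost=0.1):
    """Pick up to `budget` case ids whose EVPI exceeds the cost of labeling."""
    ranked = sorted(cases, key=lambda c: evpi(c[1]), reverse=True)
    return [cid for cid, p in ranked if evpi(p) > label_cost][:budget]

# Hypothetical (case id, fused probability) pairs.
cases = [("case-A", 0.01), ("case-B", 0.15), ("case-C", 0.50),
         ("case-D", 0.95), ("case-E", 0.99)]

to_label = prioritize(cases, budget=2)
print(to_label)
```

Because misses are assumed to cost more than false alarms, the case at probability 0.15 ranks above the case at 0.50: it lies closest to the point where the preferred action changes. This differs from ranking by uncertainty alone, and a fuller treatment would also account for how each new label improves the evaluation itself.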
1. C. Pizzo, G. Powell, C. Brown III, and J. May, Modeling and Simulation Support for Answering Commanders' Priority Intelligence Requirements, 10th International Command and Control Research and Technology Symposium, June 2005. 
2. L. Lewis, T. Adamson, M. Studley, M. Faucon, Yanbin Guo, and C. Melanson, Using gaming technologies and platforms to experiment with fusion algorithms for situation management, in IEEE Military Communications Conference (MILCOM 2009), 2009, pp. 1-6.
3. K. Doris, M. Larkin, M. Zieniesicz, and R. Szymanski. Applying Gaming Technology to Military Visualization - Games Where You Only Live Once!, Simulation Technology Magazine, Vol. 8, Issue 1, April 28 2005.
4. Zhu, X. Semi-supervised learning literature survey, TR
5. Zhu, X., Goldberg, A. Introduction to Semi-Supervised Learning. Synthesis Lectures on Artificial Intelligence and Machine Learning, 3, 1-130. Morgan & Claypool Publishers, 2009.