IEEE VAST Challenge data sets


The six scenarios and datasets referenced below each provide a substantial dataset that represents a level 2 fusion problem. The simplest scenarios are from 2006 and 2007. Beginning in 2008, the range and complexity of the scenarios increased, and the challenges were divided into mini-challenges and grand challenges. Data sets included unstructured text, structured data, images, and video data.

A quick link to each is here: VAST 2006    VAST 2007    VAST 2008    VAST 2009    VAST 2010    VAST 2011
The 2012  basic challenge is at this link:

More detail on each is provided below.


The scenarios and datasets below are from the Visual Analytics Science and Technology (VAST) challenge, which is a participation category of the IEEE VAST Symposia. The challenge aims to advance visual analytics tools by providing benchmark data sets, and to establish a forum for advancing visual analytics evaluation methods. The objective of the forum is to speed the transfer of visual analytics technologies from research labs to commercial products, and to increase the availability of evaluation techniques.

Scenario and Dataset Descriptions (by year)

1. VAST 2006  (Grinstein, G., O’Connell, T., Plaisant, C., Scholtz, J., Whiting, M.,  IEEE VAST 2006 Contest, The tale of Alderwood, (2006))

SCENARIO: (Go Here for Scenario Details) In January 2003, the FBI is investigating possible political misbehavior in the fictitious mid-sized vacation town of Alderwood, located on the banks of the Alderwood River in south-central Washington State. Alderwood is suffering from a loss of tourism due to the early 2000s economic crash. In addition, agriculture is adversely affected by the discovery of bovine spongiform encephalopathy (BSE, also known as ‘mad cow disease’), resulting in a beef export embargo. Yet there is a sudden influx of young, talented men and women relocating to Alderwood, with claims that there are some connections to the local government. The sources and reasons are not immediately obvious.

OBJECTIVE: Identify and describe what is happening in Alderwood.

DATASET OVERVIEW (All datasets are synthetic) – Go Here For VAST 2006 Dataset  The dataset has the following types of data. File types are shown in parentheses.

  • Unstructured Text (TXT):
    • 1182 news stories from the Alderwood Daily News (TXT)
    • Four other newspaper or webpage items (DOC)
  • Photographs: Six photos (JPEG) with a separate file with annotations (TXT)
  • Maps: Two maps of Alderwood and vicinity (JPEG)
  • Structured Files:
    • Alderwood voter registration datafile - 41,723 entries (XLS)
    • Conference room phone log - 100 entries (XLS)
  • Two files with investigative background information (TXT)

In addition, a second set of files is available in which the 1182 news story files have been preprocessed through an entity extraction routine (the MITRE ALEMBIC tool). Unfortunately, the link describing ALEMBIC is broken.

2. VAST 2007

SCENARIO (Go Here for Scenario Details) : In the Fall of 2004, you are the analyst for an unnamed agency investigating some unexpected activities concerning wildlife law enforcement, endangered species issues, and ecoterrorism. 

OBJECTIVE: Determine what is occurring, based on the data.  

DATASET OVERVIEW (All datasets are synthetic) – Go Here For VAST 2007 Dataset  The dataset has the following types of data. File types are shown in parentheses.

  • Unstructured Text (TXT):
    • 1455 news stories (TXT)
    • Background information documents (7 PDF, 4 DOC)
  • Photographs: Ten photos (JPEG), no annotations
  • Blog File with 12 embedded images
  • Structured Files:
    • Wildlife Import Permits - 2,036 entries (XLS)
    • Tropical Fish Importers information - 179 entries (XLS)


3. VAST 2008  Beginning in 2008, the VAST challenges changed from a single challenge to a set of mini-challenges, which were then combined into a grand challenge.

SCENARIO AND OBJECTIVES (Go Here for Scenario Details): The fictional Caribbean island nation of Isla Del Sueño is experiencing a new religious movement, the Paraiso movement, which is causing controversy and political unrest. You’ve been asked to investigate certain aspects of this movement.

  • The first is to examine the edits page for the movement’s Wikipedia page. From the edit record, determine both the social relationships of the editors and whether the movement is violent or not.
  • The second is that the government of Isla Del Sueño begins to crack down on the movement, resulting in a three-year mass migration by boat from Isla Del Sueño to the United States. Examine Coast Guard intercept records to characterize the choice of landing sites and their evolution over the three years, the geographical patterns of Coast Guard interdiction over the three years, and the successful landing rate over the time period.
  • The third is to determine the social network structure of the movement’s leader, based on cell phone records.
  • The fourth involves a bombing at a Florida Department of Health building, which is blamed on members of the group who migrated to Florida. Using RFID tag data from employee and visitor badges worn in the building, determine the sequence of events and the persons possibly involved in the bombing.

Overall – integrate all data to determine the social network of the Paraiso movement at the end of the time period, which names can be associated with individual activities, the geographical range of the Paraiso movement and how it changes over time, and how the major beliefs of the Paraiso movement affect its activities.
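As an illustration of the "successful landing rate" question in the second task, the following is a minimal sketch only. It assumes each intercept record has already been reduced to a (year, outcome) pair; the year index, the outcome labels `landed` and `interdicted`, and the sample rows are hypothetical and are not taken from the challenge XML.

```python
from collections import defaultdict


def landing_rate_by_year(records):
    """Return {year: fraction of attempts that landed}.

    `records` is an iterable of (year, outcome) pairs, where outcome is
    the hypothetical label "landed" or "interdicted".
    """
    landed = defaultdict(int)
    total = defaultdict(int)
    for year, outcome in records:
        total[year] += 1
        if outcome == "landed":
            landed[year] += 1
    return {year: landed[year] / total[year] for year in sorted(total)}


# Made-up sample rows (year index 1..3), not taken from the challenge data.
sample = [
    (1, "interdicted"), (1, "landed"),
    (2, "landed"), (2, "landed"), (2, "interdicted"), (2, "interdicted"),
    (3, "landed"), (3, "landed"), (3, "landed"), (3, "interdicted"),
]
print(landing_rate_by_year(sample))
```

Grouping by year keeps the rate comparable across the three-year period; the same grouping could be applied per landing site to look at geographic patterns.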

DATASET OVERVIEW (All datasets are synthetic) – Go Here For VAST 2008 Dataset  The dataset has the following types of data. File types are shown in parentheses.

  • A fictitious Wikipedia page and its associated edits page (DOC)
  • A set of structured data (XML) that describes Coast Guard intercepts of illegal immigration at sea
  • A set of structured data (CSV) consisting of cell phone calling records
    • Cell phone records for 400 cell phone numbers (9,834 calls) over a 10-day period (CSV)
  • A set of structured data (TXT) that identifies which people had which RFID badges, and the movement of those badges throughout a building
  • Map Data
    • Cell tower map (PPT)
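One way to start on the cell phone records is to treat each call as an edge between two phone identifiers and rank identifiers by their number of distinct contacts. The sketch below is only an illustration of that idea: the column names `from_id`, `to_id`, and `duration_s` and the sample rows are assumptions, not the actual layout of the challenge CSV.

```python
import csv
import io
from collections import defaultdict


def build_call_graph(csv_text):
    """Build an undirected contact graph: {phone_id: set of phone_ids}."""
    graph = defaultdict(set)
    for row in csv.DictReader(io.StringIO(csv_text)):
        caller, callee = row["from_id"], row["to_id"]  # hypothetical columns
        graph[caller].add(callee)
        graph[callee].add(caller)
    return graph


def top_contacts(graph, n=3):
    """Return the n identifiers with the most distinct contacts."""
    # Ties are broken by identifier so the result is deterministic.
    return sorted(graph, key=lambda node: (-len(graph[node]), node))[:n]


# Made-up sample rows, not taken from the challenge data.
SAMPLE = """from_id,to_id,duration_s
1,2,120
1,3,45
1,4,300
2,3,60
5,1,15
"""

graph = build_call_graph(SAMPLE)
print(top_contacts(graph, 2))
```

Counting distinct contacts is only a first pass; call timing, duration, and cell tower location would all be needed for a fuller picture of the network.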

4. VAST 2009

SCENARIO AND OBJECTIVES (Go Here for Scenario Details): An embassy employee in the country of Flovania is suspected of sending data to an outside criminal organization. Determine who the employee is, using movement data gathered from badge tracking, system network logs, social networking site data, and video surveillance camera data.

OBJECTIVE: Determine the scenario.  Who are the major players in the scenario and what are their relationships? 

DATASET OVERVIEW (All datasets are synthetic)  Go Here For VAST 2009 Dataset:

  • Structured files
    • One month’s building badge-based movement records
    • One month’s computer network traffic for each employee’s work computer
    • Social network data for a Twitter-like social networking and micro-blogging tool, in two files: one describing entities (each is either a user name, which is a Flitter nickname and not the person’s real name, or a city or a country) and one containing links
  • Map of Flovania, its major cities, and information about neighboring countries and their major cities
  • 11 hours of video surveillance results
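The badge and network records can be cross-referenced in many ways. The sketch below shows one possible check, not the challenge's intended solution: it flags network events attributed to an employee's computer at times when badge data places that employee away from the desk. The record shapes, the employee identifiers, and the use of integer minutes for time are all assumptions made for illustration.

```python
def traffic_while_away(away_intervals, network_events):
    """Return network events that fall inside an 'away' interval.

    away_intervals: {employee_id: [(start, end), ...]} in minutes,
        derived (hypothetically) from badge records.
    network_events: [(employee_id, time), ...] in minutes, one per
        connection from that employee's work computer.
    """
    flagged = []
    for employee, time in network_events:
        intervals = away_intervals.get(employee, [])
        if any(start <= time < end for start, end in intervals):
            flagged.append((employee, time))
    return flagged


# Made-up sample values, not taken from the challenge data.
away = {"emp10": [(600, 660)], "emp11": [(700, 720)]}
events = [("emp10", 590), ("emp10", 615), ("emp11", 730), ("emp12", 610)]
print(traffic_while_away(away, events))
```

A flagged event is only a lead to examine further; for example, it might reflect scheduled background traffic rather than someone else using the machine.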

5. VAST 2010

SCENARIO:  The scenario is divided into three mini-challenges, with a “grand challenge” to combine the results of the mini-challenges. Mini-Challenge 1 is about illegal arms deals that involve several countries. Mini-Challenge 2 is about a pandemic outbreak of a virus across several cities in the world. Mini-Challenge 3 is to investigate the source of a virus strain taken from a victim of the pandemic. For the Grand Challenge, determine any possible linkage between the illegal arms dealing and the pandemic outbreak.

OBJECTIVES:

  1. Briefly describe your hypothesized linkage between the arms dealing activity and the pandemic outbreak.
  2. Given the hospital and death records, characterize the spread of the disease, and determine any anomalies across countries.
  3. Given some genetic data on a strain of the virus, determine the country of origin of the virus, and its mutations and resistances to treatment.
  4. Some countries with arms dealers identified in MC 1 did not suffer pandemic outbreaks in MC 2. Provide a hypothesis as to why some countries that may have been involved with arms dealers did not suffer an outbreak.

DATASET OVERVIEW (All datasets are synthetic)  Go Here For VAST 2010 Dataset:

  • Mini-challenge 1 dataset:  Five unstructured text files (DOC)
    • 27 government intelligence reports
    • 17 email intercept reports
    • 31 phone intercept reports
    • 21 newspaper articles
    • 7 blog articles
  • Mini-challenge 2 dataset: 20 structured data files (CSV), containing summary medical and death reports from 10 countries.
  • Mini-challenge 3 dataset: 3 structured data files (TXT), giving genetic sequence data on several strains of a virus, and a report file with virulence and drug resistance characteristics of certain virus strains
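For the Mini-challenge 2 records, a simple first step toward characterizing the spread is to count records per country per day and compare when each country peaks. The following is a minimal sketch under assumed inputs: each record is reduced to a (country, day) pair, with days as integer offsets. The country labels and sample rows are hypothetical and do not reflect the actual CSV layout.

```python
from collections import Counter, defaultdict


def daily_counts(records):
    """Return {country: Counter({day: number of records})}."""
    counts = defaultdict(Counter)
    for country, day in records:
        counts[country][day] += 1
    return counts


def peak_day(counts):
    """Return {country: day with the most records}, earliest day on ties."""
    return {
        country: max(sorted(days), key=lambda d: days[d])
        for country, days in counts.items()
    }


# Made-up sample rows, not taken from the challenge data.
sample_records = [
    ("A", 1), ("A", 2), ("A", 2),
    ("B", 3), ("B", 3), ("B", 4),
]
counts = daily_counts(sample_records)
print(peak_day(counts))
```

A country whose peak arrives much earlier or later than the others, or whose counts never rise, would be a candidate anomaly to examine.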

6. VAST 2011

SCENARIO:  This scenario takes place in Vastopolis, a major metropolitan area with a population of approximately two million residents. The scenario is divided into three mini-challenges, with a grand challenge to integrate the results. Mini-Challenge 1 is to characterize an epidemic spread in Vastopolis. Mini-Challenge 2 looks at security issues in the computer networking operations at a freight company operating in Vastopolis (the All Freight Corporation). Mini-Challenge 3 is an investigation into terrorist activity in the Vastopolis metropolitan area. In the Grand Challenge, you are charged with investigating the cause of the epidemic and determining any link to possible terrorist activity.


OBJECTIVES:

  1. Are any terrorist activities related to the current epidemic?
  2. Describe the series of events, planned or otherwise, that led to the current epidemic.

DATASET OVERVIEW (All datasets are synthetic) Go Here For VAST 2011 Dataset

For Mini-Challenge 1, three structured databases (XLS) and a map (JPEG).

For Mini-Challenge 2, five sets of computer system logs (structured text) across three days:

  1. Firewall log
  2. Intrusion Detection System (IDS) logs
  3. Nessus Network Vulnerability Scan Report
  4. Operating System Security Event Log
  5. Packet Capture (PCAP) Log

For Mini-Challenge 3, 4474 unstructured text files.



CG&A Journal paper about the VAST 2007 contest
CG&A Special Issue on Visual Analytics Evaluation (published in April 2009)