Evaluating Causal Discovery Algorithms with Software Systems

To obtain realistic data for evaluating causal discovery algorithms, we designed and executed factorial experiments on three large-scale software systems: PostgreSQL, the Java Development Kit (JDK), and web server infrastructure. Each domain is characterized by three classes of variables: covariates, treatments, and outcomes. Under the factorial design, outcomes were measured for every combination of subject and treatment setting, so each dataset contains many records for the same subject. We also performed multiple trials of each factorial experiment, which permits a variety of data transformations.
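
The factorial layout described above can be sketched as follows. This is a minimal illustration, not the code used to run the experiments: the subject names, the 0/1 coding of the treatments, and the number of trials are assumptions made for the example.

```python
from itertools import product

def factorial_design(subjects, treatment_levels, n_trials):
    """Yield one run per (subject, treatment combination, trial)."""
    names = list(treatment_levels)
    for subject in subjects:
        # Cartesian product of the levels of every treatment.
        for combo in product(*(treatment_levels[name] for name in names)):
            for trial in range(n_trials):
                yield {"subject": subject, "trial": trial, **dict(zip(names, combo))}

# Hypothetical inputs: two subjects, three binary treatments, two trials.
subjects = ["repo_a", "repo_b"]
treatment_levels = {"debug": [0, 1], "obfuscate": [0, 1], "parallelgc": [0, 1]}

runs = list(factorial_design(subjects, treatment_levels, n_trials=2))
print(len(runs))  # 2 subjects x 8 combinations x 2 trials = 32
```

Each subject appears once per treatment combination and trial, which is why a dataset built this way holds many records per subject.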

PostgreSQL

We collected a sample of StackOverflow's database and gathered a sample of user-authored queries from its query explorer. We executed each query against our sample of the database, varying the database configuration and monitoring query execution.

11,252 subjects
3 treatments
7 outcomes
10 covariates

JDK

We downloaded a sample of Maven-enabled Java projects from GitHub. We compiled and ran the unit tests of each project, varying JDK options and monitoring runtime behavior.

473 subjects
3 treatments
5 outcomes
5 covariates

Networking

We identified a pool of websites using a small web crawl. We then executed several web requests against each site, varying request parameters and monitoring output.

2,599 subjects
3 treatments
5 outcomes
1 covariate

PostgreSQL: Download (212M)
JDK: Download (3.2M)
Networking: Download (78M)

PostgreSQL Details

Field Name	Category	Description
url	subject identifier	Identifies the source of the query
trial	trial identifier
index_level	treatment	Indicates the level of indexing employed. 0: no indexing; 1: indexing on primary keys and foreign keys only; 2: indexing on primary keys, foreign keys, and other commonly referenced fields
page_cost	treatment	Indicates the estimated disk access cost provided to the Postgres query planner. 0 corresponds to the smallest disk access cost, 3 corresponds to the largest disk access cost.
memory_level	treatment	The amount of working memory provided to Postgres, in increasing increments from 0 to 2
local_written_blocks	outcome	The number of blocks written to temporary tables and indices
temp_written_blocks	outcome	The number of blocks written to short-term working memory
shared_hit_blocks	outcome	Number of regular table blocks hit in the cache
temp_read_blocks	outcome	Number of blocks read from short-term working memory
local_read_blocks	outcome	Number of blocks read from temporary tables and indices
runtime	outcome	Runtime of the query in milliseconds
shared_read_blocks	outcome	Number of blocks read from regular tables and indices
rows	covariate	The number of rows produced by the query
creation_year	covariate	Year the query was created on the StackOverflow site.
num_ref_tables	covariate	Number of tables referenced by the query
num_joins	covariate	Number of joins employed by the query
num_group_by	covariate	Number of “group by” clauses employed by the query
queries_by_user	covariate	The number of other queries written by the author of this query
length_chars	covariate	Length of the query in characters
total_ref_rows	covariate	Total number of rows of all tables referenced in the query
local_hit_blocks	covariate	Number of temporary table and temporary index blocks hit in the cache
favorite_count	covariate	Number of times another StackOverflow user has marked this query as a favorite
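
Because each query is run under every treatment combination and in several trials, one simple transformation is to average an outcome over trials and then compare two levels of a treatment within the same query. The sketch below assumes the records have been loaded as a list of dictionaries keyed by the field names above; the file format is not specified here, and the rows and runtimes shown are made up for illustration.

```python
from collections import defaultdict
from statistics import mean

TREATMENTS = ("index_level", "page_cost", "memory_level")

def mean_over_trials(rows, outcome):
    """Average an outcome over trials within each (url, treatment combination) cell."""
    cells = defaultdict(list)
    for row in rows:
        key = (row["url"],) + tuple(row[t] for t in TREATMENTS)
        cells[key].append(row[outcome])
    return {key: mean(values) for key, values in cells.items()}

def paired_contrast(cell_means, treatment, low, high):
    """Within-subject difference (high minus low), other treatments held fixed."""
    pos = 1 + TREATMENTS.index(treatment)  # position 0 of the key is the url
    diffs = {}
    for key, value in cell_means.items():
        if key[pos] != high:
            continue
        partner = key[:pos] + (low,) + key[pos + 1:]
        if partner in cell_means:
            diffs[key[:pos] + key[pos + 1:]] = value - cell_means[partner]
    return diffs

# Made-up rows for a single hypothetical query "q1".
rows = [
    {"url": "q1", "trial": 0, "index_level": 0, "page_cost": 0, "memory_level": 0, "runtime": 120.0},
    {"url": "q1", "trial": 1, "index_level": 0, "page_cost": 0, "memory_level": 0, "runtime": 100.0},
    {"url": "q1", "trial": 0, "index_level": 2, "page_cost": 0, "memory_level": 0, "runtime": 40.0},
    {"url": "q1", "trial": 1, "index_level": 2, "page_cost": 0, "memory_level": 0, "runtime": 50.0},
]

cell_means = mean_over_trials(rows, "runtime")
effects = paired_contrast(cell_means, "index_level", low=0, high=2)
```

The keys of `effects` are (url, page_cost, memory_level) tuples, so each value is the change in mean runtime attributable to moving from no indexing to full indexing for one query under one setting of the other two treatments.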

JDK Details

Field Name	Category	Description
repo_name	subject identifier	Name of the GitHub repository containing the experimentation code
trial	trial identifier
debug	treatment	Indicates whether debug symbols were requested during compilation
obfuscate	treatment	Indicates whether a code obfuscator was run on the final JAR file
parallelgc	treatment	Indicates whether parallel garbage collection was employed during execution (instead of serial garbage collection)
num_bytecode_ops	outcome	Number of bytecode instructions in the compiled code
total_unit_test_time	outcome	Number of seconds required to execute unit tests
allocated_bytes	outcome	Number of bytes allocated during execution of unit tests
jar_file_size_bytes	outcome	Size of JAR file after compilation (and possibly obfuscation)
compile_time_ms	outcome	Number of milliseconds to compile the source
source_ncss	covariate	Number of non-comment source statements in the source code
test_classes	covariate	Number of Java classes in the unit test source
test_functions	covariate	Number of functions in the unit test source
test_ncss	covariate	Number of non-comment source statements in the test source
test_javadocs	covariate	Number of JavaDoc comments in the test source

Networking Details

Field Name	Category	Description
url	subject identifier	Web address requested
trial	trial identifier
mobile_user_agent	treatment	Indicates if the site was requested with a mobile user agent
proxy	treatment	Indicates if the site was requested through a proxy server
compression	treatment	Indicates whether the request advertises support for compression
html_attrs	outcome	Number of HTML attributes in the response
html_tags	outcome	Number of HTML tags in the response
elapsed	outcome	Elapsed time between request and response in seconds
decompressed_content_length	outcome	Length of the response content, in bytes, after decompression
raw_content_length	outcome	Length of the response content, in bytes, before decompression
server.class	covariate	Category of web server issuing the response
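
The two content-length outcomes can be combined into a derived variable. The sketch below computes a compression ratio per record; the ratio is an illustrative transformation rather than a field of the dataset, and the example record is made up.

```python
def compression_ratio(row):
    """Return raw bytes divided by decompressed bytes, or None for an empty body."""
    decompressed = row["decompressed_content_length"]
    if decompressed == 0:
        return None  # ratio is undefined when the response has no content
    return row["raw_content_length"] / decompressed

# Made-up record: 2,500 bytes before decompression, 10,000 bytes after.
example = {"raw_content_length": 2500, "decompressed_content_length": 10000}
ratio = compression_ratio(example)  # 0.25
```

A ratio near 1 suggests the response was not compressed, while smaller values indicate stronger compression.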
