{"id":67,"date":"2020-01-27T20:58:09","date_gmt":"2020-01-27T20:58:09","guid":{"rendered":"http:\/\/groups.cs.umass.edu\/kdl\/?page_id=67"},"modified":"2020-01-27T21:03:49","modified_gmt":"2020-01-27T21:03:49","slug":"causal-eval-data","status":"publish","type":"page","link":"https:\/\/groups.cs.umass.edu\/kdl\/causal-eval-data\/","title":{"rendered":"Evaluating Causal Discovery Algorithms with Software Systems"},"content":{"rendered":"<div class=\"container\">\n<hr>\n<p>To obtain realistic data for evaluating causal discovery algorithms, we designed and executed factorial experiments on three large-scale software systems: PostgreSQL, the Java Development Kit, and web server infrastructure. Each domain is characterized by three classes of variables: covariates, treatments, and outcomes. Under the factorial experiment design, outcomes were measured for every combination of subjects and treatments. This yields a dataset with many records for the same subject. To support a variety of data transformations, we performed multiple trials of each factorial experiment.<\/p>\n<div class=\"row\">\n<div class=\"col-md-4\">\n<h3>PostgreSQL<\/h3>\n<p>We collected a sample of <a href=\"http:\/\/stackoverflow.com\/\">StackOverflow<\/a>&#8217;s database, and gathered a sample of user-authored queries from their <a href=\"https:\/\/data.stackexchange.com\/stackoverflow\/query\/new\">query explorer<\/a>. We executed each query against our sample of the database, varying database configuration and monitoring query execution.<\/p>\n<p>11,252 subjects<br \/>\n3 treatments<br \/>\n7 outcomes<br \/>\n10 covariates<\/p>\n<\/div>\n<div class=\"col-md-4\">\n<h3>JDK<\/h3>\n<p>We downloaded a sample of <a href=\"https:\/\/maven.apache.org\/\">Maven<\/a>-enabled Java projects from <a href=\"https:\/\/github.com\/\">GitHub<\/a>. 
We compiled and ran the unit tests of each project, varying JDK options and monitoring runtime behavior.<\/p>\n<p>473 subjects<br \/>\n3 treatments<br \/>\n5 outcomes<br \/>\n5 covariates<\/p>\n<\/div>\n<div class=\"col-md-4\">\n<h3>Networking<\/h3>\n<p>We identified a pool of websites using a small web crawl. We then executed several web requests against each site, varying request parameters and monitoring output.<\/p>\n<p>2,599 subjects<br \/>\n3 treatments<br \/>\n5 outcomes<br \/>\n1 covariate<\/p>\n<\/div>\n<\/div>\n<div class=\"row\">\n<div class=\"col-md-4\">\n                    <a href=\"https:\/\/drive.google.com\/open?id=132ZXzPCQkPF94H83JI9VBxSmlncWnR1M\" class=\"btn btn-info\">Download (212M)<\/a><br \/>\n<a href=\"#postgres-details\" class=\"btn btn-default\">Details<br \/>\n<span class=\"glyphicon glyphicon-chevron-right\" aria-hidden=\"true\"><\/span><br \/>\n<\/a><\/div>\n<div class=\"col-md-4\">\n                    <a href=\"https:\/\/drive.google.com\/open?id=1qgGSzx7uB_9GLtqTITTr2jWxyrtNWftp\" class=\"btn btn-info\">Download (3.2M)<\/a><br \/>\n<a href=\"#jdk-details\" class=\"btn btn-default\">Details<br \/>\n<span class=\"glyphicon glyphicon-chevron-right\" aria-hidden=\"true\"><\/span><br \/>\n<\/a><\/div>\n<div class=\"col-md-4\">\n                    <a href=\"https:\/\/drive.google.com\/open?id=1UDksvZyEUe9LBZ5NXRnkWzQXeW2TS77c\" class=\"btn btn-info\">Download (78M)<\/a><br \/>\n<a href=\"#networking-details\" class=\"btn btn-default\">Details<br \/>\n<span class=\"glyphicon glyphicon-chevron-right\" aria-hidden=\"true\"><\/span><br \/>\n<\/a><\/div>\n<\/div>\n<hr>\n<div class=\"row\" id=\"postgres-details\">\n<h3>PostgreSQL Details<\/h3>\n<table class=\"table table-striped\">\n<thead>\n<tr>\n<th>Field Name<\/th>\n<th>Category<\/th>\n<th>Description<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>url<\/td>\n<td>subject identifier<\/td>\n<td>Identifies the source of the query<\/td>\n<\/tr>\n<tr>\n<td>trial<\/td>\n<td>trial 
identifier<\/td>\n<td><\/td>\n<\/tr>\n<tr>\n<td>index_level<\/td>\n<td>treatment<\/td>\n<td>Indicates the level of indexing employed.<br \/>\n0: No indexing<br \/>\n1: Indexing on primary keys and foreign keys only<br \/>\n2: Indexing on primary keys, foreign keys, and other commonly-referenced fields<\/td>\n<\/tr>\n<tr>\n<td>page_cost<\/td>\n<td>treatment<\/td>\n<td>Indicates the estimated disk access cost provided to the Postgres query planner. 0 corresponds to the smallest disk access cost, 3 corresponds to the largest disk access cost.<\/td>\n<\/tr>\n<tr>\n<td>memory_level<\/td>\n<td>treatment<\/td>\n<td>The amount of working memory provided to Postgres, in increasing increments from 0 to 2<\/td>\n<\/tr>\n<tr>\n<td>local_written_blocks<\/td>\n<td>outcome<\/td>\n<td>The number of blocks written to temporary tables and indices<\/td>\n<\/tr>\n<tr>\n<td>temp_written_blocks<\/td>\n<td>outcome<\/td>\n<td>The number of blocks written to short-term working memory<\/td>\n<\/tr>\n<tr>\n<td>shared_hit_blocks<\/td>\n<td>outcome<\/td>\n<td>Number of regular table blocks hit in the cache<\/td>\n<\/tr>\n<tr>\n<td>temp_read_blocks<\/td>\n<td>outcome<\/td>\n<td>Number of blocks read from temporary tables and indices<\/td>\n<\/tr>\n<tr>\n<td>local_read_blocks<\/td>\n<td>outcome<\/td>\n<td>Number of blocks read from temporary tables and indices<\/td>\n<\/tr>\n<tr>\n<td>runtime<\/td>\n<td>outcome<\/td>\n<td>Runtime of the query in milliseconds<\/td>\n<\/tr>\n<tr>\n<td>shared_read_blocks<\/td>\n<td>outcome<\/td>\n<td>Number of blocks read from regular tables and indices<\/td>\n<\/tr>\n<tr>\n<td>rows<\/td>\n<td>covariate<\/td>\n<td>The number of rows produced by the query<\/td>\n<\/tr>\n<tr>\n<td>creation_year<\/td>\n<td>covariate<\/td>\n<td>Year the query was created on the StackOverflow site.<\/td>\n<\/tr>\n<tr>\n<td>num_ref_tables<\/td>\n<td>covariate<\/td>\n<td>Number of tables referenced by the query<\/td>\n<\/tr>\n<tr>\n<td>num_joins<\/td>\n<td>covariate<\/td>\n<td>Number of joins 
employed by the query<\/td>\n<\/tr>\n<tr>\n<td>num_group_by<\/td>\n<td>covariate<\/td>\n<td>Number of &#8220;group by&#8221; clauses employed by the query<\/td>\n<\/tr>\n<tr>\n<td>queries_by_user<\/td>\n<td>covariate<\/td>\n<td>The number of other queries written by the author of this query<\/td>\n<\/tr>\n<tr>\n<td>length_chars<\/td>\n<td>covariate<\/td>\n<td>Length of the query in characters<\/td>\n<\/tr>\n<tr>\n<td>total_ref_rows<\/td>\n<td>covariate<\/td>\n<td>Total number of rows of all tables referenced in the query<\/td>\n<\/tr>\n<tr>\n<td>local_hit_blocks<\/td>\n<td>covariate<\/td>\n<td>Number of temporary table and temporary index blocks hit in the cache<\/td>\n<\/tr>\n<tr>\n<td>favorite_count<\/td>\n<td>covariate<\/td>\n<td>Number of times another StackOverflow user has marked this query as a favorite<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<\/div>\n<hr>\n<div class=\"row\" id=\"jdk-details\">\n<h3>JDK Details<\/h3>\n<table class=\"table table-striped\">\n<thead>\n<tr>\n<th>Field Name<\/th>\n<th>Category<\/th>\n<th>Description<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>repo_name<\/td>\n<td>subject identifier<\/td>\n<td>Name of the GitHub repository containing the experimentation code<\/td>\n<\/tr>\n<tr>\n<td>trial<\/td>\n<td>trial identifier<\/td>\n<td><\/td>\n<\/tr>\n<tr>\n<td>debug<\/td>\n<td>treatment<\/td>\n<td>Indicates whether debug symbols were requested during compilation<\/td>\n<\/tr>\n<tr>\n<td>obfuscate<\/td>\n<td>treatment<\/td>\n<td>Indicates whether a code obfuscator was run on the final JAR file<\/td>\n<\/tr>\n<tr>\n<td>parallelgc<\/td>\n<td>treatment<\/td>\n<td>Indicates whether parallel garbage collection was employed during execution (instead of serial garbage collection)<\/td>\n<\/tr>\n<tr>\n<td>num_bytecode_ops<\/td>\n<td>outcome<\/td>\n<td>Number of bytecode instructions in the compiled code<\/td>\n<\/tr>\n<tr>\n<td>total_unit_test_time<\/td>\n<td>outcome<\/td>\n<td>Number of seconds required to execute unit 
tests<\/td>\n<\/tr>\n<tr>\n<td>allocated_bytes<\/td>\n<td>outcome<\/td>\n<td>Number of bytes allocated during execution of unit tests<\/td>\n<\/tr>\n<tr>\n<td>jar_file_size_bytes<\/td>\n<td>outcome<\/td>\n<td>Size of JAR file after compilation (and possibly obfuscation)<\/td>\n<\/tr>\n<tr>\n<td>compile_time_ms<\/td>\n<td>outcome<\/td>\n<td>Number of milliseconds to compile the source<\/td>\n<\/tr>\n<tr>\n<td>source_ncss<\/td>\n<td>covariate<\/td>\n<td>Number of non-comment source statements in the source code<\/td>\n<\/tr>\n<tr>\n<td>test_classes<\/td>\n<td>covariate<\/td>\n<td>Number of Java classes in the unit test source<\/td>\n<\/tr>\n<tr>\n<td>test_functions<\/td>\n<td>covariate<\/td>\n<td>Number of functions in the unit test source<\/td>\n<\/tr>\n<tr>\n<td>test_ncss<\/td>\n<td>covariate<\/td>\n<td>Number of non-comment source statements in the test source<\/td>\n<\/tr>\n<tr>\n<td>test_javadocs<\/td>\n<td>covariate<\/td>\n<td>Number of JavaDoc comments in the test source<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<\/div>\n<hr>\n<div class=\"row\" id=\"networking-details\">\n<h3>Networking Details<\/h3>\n<table class=\"table table-striped\">\n<thead>\n<tr>\n<th>Field Name<\/th>\n<th>Category<\/th>\n<th>Description<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>url<\/td>\n<td>subject identifier<\/td>\n<td>Web address requested<\/td>\n<\/tr>\n<tr>\n<td>trial<\/td>\n<td>trial identifier<\/td>\n<td><\/td>\n<\/tr>\n<tr>\n<td>mobile_user_agent<\/td>\n<td>treatment<\/td>\n<td>Indicates whether the site was requested with a mobile user agent<\/td>\n<\/tr>\n<tr>\n<td>proxy<\/td>\n<td>treatment<\/td>\n<td>Indicates whether the site was requested through a proxy server<\/td>\n<\/tr>\n<tr>\n<td>compression<\/td>\n<td>treatment<\/td>\n<td>Indicates whether the request declared support for compression<\/td>\n<\/tr>\n<tr>\n<td>html_attrs<\/td>\n<td>outcome<\/td>\n<td>Number of HTML attributes in the response<\/td>\n<\/tr>\n<tr>\n<td>html_tags<\/td>\n<td>outcome<\/td>\n<td>Number of HTML tags 
in the response<\/td>\n<\/tr>\n<tr>\n<td>elapsed<\/td>\n<td>outcome<\/td>\n<td>Elapsed time between request and response in seconds<\/td>\n<\/tr>\n<tr>\n<td>decompressed_content_length<\/td>\n<td>outcome<\/td>\n<td>Length of the response content, in bytes, after decompression<\/td>\n<\/tr>\n<tr>\n<td>raw_content_length<\/td>\n<td>outcome<\/td>\n<td>Length of the response content, in bytes, before decompression<\/td>\n<\/tr>\n<tr>\n<td>server.class<\/td>\n<td>covariate<\/td>\n<td>Category of web server issuing the response<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<\/div>\n<\/div>\n","protected":false},"excerpt":{"rendered":"<p>To obtain realistic data for evaluating causal discovery algorithms, we designed and executed factorial experiments on three large-scale software systems: PostgreSQL, the Java Development Kit, and web server infrastructure. Each domain is characterized by three classes of variables: covariates, treatments, and outcomes. Under the factorial experiment design, outcomes were measured for every combination of subjects &hellip; <a href=\"https:\/\/groups.cs.umass.edu\/kdl\/causal-eval-data\/\" class=\"more-link\">Continue reading<span class=\"screen-reader-text\"> &#8220;Evaluating Causal Discovery Algorithms with Software 
Systems&#8221;<\/span><\/a><\/p>\n","protected":false},"author":24,"featured_media":0,"parent":0,"menu_order":0,"comment_status":"closed","ping_status":"closed","template":"","meta":{"footnotes":""},"class_list":["post-67","page","type-page","status-publish","hentry","hfeed"],"_links":{"self":[{"href":"https:\/\/groups.cs.umass.edu\/kdl\/wp-json\/wp\/v2\/pages\/67","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/groups.cs.umass.edu\/kdl\/wp-json\/wp\/v2\/pages"}],"about":[{"href":"https:\/\/groups.cs.umass.edu\/kdl\/wp-json\/wp\/v2\/types\/page"}],"author":[{"embeddable":true,"href":"https:\/\/groups.cs.umass.edu\/kdl\/wp-json\/wp\/v2\/users\/24"}],"replies":[{"embeddable":true,"href":"https:\/\/groups.cs.umass.edu\/kdl\/wp-json\/wp\/v2\/comments?post=67"}],"version-history":[{"count":3,"href":"https:\/\/groups.cs.umass.edu\/kdl\/wp-json\/wp\/v2\/pages\/67\/revisions"}],"predecessor-version":[{"id":71,"href":"https:\/\/groups.cs.umass.edu\/kdl\/wp-json\/wp\/v2\/pages\/67\/revisions\/71"}],"wp:attachment":[{"href":"https:\/\/groups.cs.umass.edu\/kdl\/wp-json\/wp\/v2\/media?parent=67"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}