Publications Search – Knowledge Discovery Lab

Lisa Friedland, David Jensen, Michael Lavine

Copy or Coincidence? A Model for Detecting Social Influence and Duplication Events Proceedings Article

In: Proceedings of the 30th International Conference on Machine Learning, ICML 2013, Atlanta, GA, USA, 16-21 June 2013, pp. 1175–1183, JMLR.org, 2013.

Marc Maier, Katerina Marazopoulou, David Arbour, David Jensen

Flattening network data for causal discovery: What could go wrong? Proceedings Article

In: Workshop on Information in Networks, 2013.

@inproceedings{maier2013flattening,

title = {Flattening network data for causal discovery: What could go wrong?},

author = {Marc Maier and Katerina Marazopoulou and David Arbour and David Jensen},

url = {https://www.semanticscholar.org/paper/Flattening-network-data-for-causal-discovery-%3A-What-Maier-Marazopoulou/c327100636c022c259f5e1bf2d7fcbbd0b048935},

year  = {2013},

date = {2013-01-01},

booktitle = {Workshop on Information in Networks},

volume = {64},

abstract = {Methods for learning causal dependencies from observational data have been the focus of decades of work in social science, statistics, machine learning, and philosophy [9, 10, 11]. Much of the theoretical and practical work on causal discovery has focused on propositional representations. Propositional models effectively represent individual directed causal dependencies (e.g., path analysis, Bayesian networks) or conditional distributions of some outcome variable (e.g., linear regression, decision trees). However, propositional representations are limited to modeling independent and identically distributed (IID) data of a single entity type. Many real-world systems involve heterogeneous, interacting entities with probabilistic dependencies that cross the boundaries of those entities (i.e., non-IID data with multiple entity types and relationships). These systems produce network, or relational, data, and they are of paramount interest to researchers and practitioners across a wide range of disciplines. To model such data, researchers in statistics and computer science have devised more expressive classes of directed graphical models, such as probabilistic relational models (PRMs) [2] and directed acyclic probabilistic entity-relationship (DAPER) models [4]. Despite the assumptions embedded in propositional models, a common practice is to flatten, or propositionalize, relational data and use existing algorithms [5] (see Figure 1, focusing on algorithms that learn causal graphical models). While there are statistical concerns, this process is generally innocuous if the task is to model statistical associations for predictive inference. In contrast, to learn causal structure, estimate causal effects, or support inference over interventions, the effects of flattening inherently relational data can be particularly deleterious.
In this paper, we identify four classes of potential issues that can occur with a propositionalization strategy as opposed to embracing a more expressive representation that would not succumb to these problems. We also present empirical results comparing the effectiveness of two theoretically sound and complete algorithms that learn causal structure: PC—a widely used constraint-based, propositional algorithm for causal discovery [11], and RCD—a recently developed constraint-based algorithm that reasons over a relational representation [6].},

keywords = {},

pubstate = {published},

tppubtype = {inproceedings}

}

Marc Maier, Katerina Marazopoulou, David Jensen

Reasoning about Independence in Probabilistic Models of Relational Data Miscellaneous

2013.

Tags: Causal Modeling

Matthew Rattigan

Leveraging Relational Representations for Causal Discovery PhD Thesis

2012, ISBN: 9781267786821, (AAI3545976).

@phdthesis{10.5555/2520420,

title = {Leveraging Relational Representations for Causal Discovery},

author = {Matthew Rattigan},

isbn = {9781267786821},

year  = {2012},

date = {2012-01-01},

publisher = {University of Massachusetts Amherst},

abstract = {This thesis represents a synthesis of relational learning and causal discovery, two subjects at the frontier of machine learning research. Relational learning investigates algorithms for constructing statistical models of data drawn from multiple types of interrelated entities, and causal discovery investigates algorithms for constructing causal models from observational data. My work demonstrates that there exists a natural, methodological synergy between these two areas of study, and that despite the sometimes onerous nature of each, their combination (perhaps counterintuitively) can provide advances in the state of the art for both. Traditionally, propositional (or "flat") data representations have dominated the statistical sciences. These representations assume that data consist of independent and identically distributed (iid) entities which can be represented by a single data table. More recently, data scientists have increasingly focused on "relational" data sets that consist of interrelated, heterogeneous entities. However, relational learning and causal discovery are rarely combined. Relational representations are wholly absent from the literature where causality is discussed explicitly. Instead, the literature on causality that uses the framework of graphical models assumes that data are independent and identically distributed. This unexplored topical intersection represents an opportunity for advancement: by combining relational learning with causal reasoning, we can provide insight into the challenges found in each subject area. By adopting a causal viewpoint, we can clarify the mechanisms that produce previously identified pathologies in relational learning. Analogously, we can utilize relational data to establish and strengthen causal claims in ways that are impossible using only propositional representations.},

note = {AAI3545976},

keywords = {},

pubstate = {published},

tppubtype = {phdthesis}

}

Huseyin Oktay, A Soner Balkir, Ian Foster, David Jensen

Distance estimation for very large networks using mapreduce and network structure indices Proceedings Article

In: Workshop on Information Networks, 2011.

Marc Maier, Matthew Rattigan, David Jensen

Indexing Network Structure with Shortest-Path Trees Journal Article

In: ACM Trans. Knowl. Discov. Data, vol. 5, no. 3, 2011, ISSN: 1556-4681.

Tags: Navigation and Routing in Networks

Phillip B Kirlin, David Jensen

Probabilistic Modeling of Hierarchical Music Analysis. Proceedings Article

In: Proceedings of the 12th International Society for Music Information Retrieval Conference, ISMIR, pp. 393–398, 2011.

Matthew Rattigan, Marc Maier, David Jensen

Relational blocking for causal discovery Proceedings Article

In: Proceedings of the Twenty-Fifth AAAI Conference on Artificial Intelligence, 2011.

Tags: Causal Modeling

Michael Hay, Gerome Miklau, David Jensen

Analyzing private network data Journal Article

In: Privacy-aware knowledge discovery: Novel applications and new techniques, pp. 459–498, 2010.

Huseyin Oktay, Brian Taylor, David Jensen

Causal discovery in social media using quasi-experimental designs Proceedings Article

In: Proceedings of the 3rd Workshop on Social Network Mining and Analysis, SNAKDD, pp. 1–9, 2010.

Tags: Causal Modeling

Marc Maier, Brian Taylor, Huseyin Oktay, David Jensen

Learning causal models of relational domains Proceedings Article

In: Proceedings of the Twenty-Fourth AAAI Conference on Artificial Intelligence, AAAI 2010, Atlanta, Georgia, USA, July 11-15, 2010, 2010.

Tags: Causal Modeling

Matthew Rattigan, David Jensen

Leveraging d-separation for relational data sets Proceedings Article

In: ICDM 2010, The 10th IEEE International Conference on Data Mining, pp. 989–994, IEEE 2010.

Michael Hay, Gerome Miklau, David Jensen, Don Towsley, Chao Li

Resisting structural re-identification in anonymized social networks Journal Article

In: The VLDB Journal, vol. 19, no. 6, pp. 797–823, 2010.

Brian Delaney, Andrew Fast, W Campbell, C Weinstein, David Jensen

The application of statistical relational learning to a database of criminal and terrorist activity Proceedings Article

In: Proceedings of the 2010 SIAM International Conference on Data Mining, pp. 409–417, Society for Industrial and Applied Mathematics 2010.

Michael Hay, Chao Li, Gerome Miklau, David Jensen

Accurate Estimation of the Degree Distribution of Private Networks Proceedings Article

In: ICDM 2009, The Ninth IEEE International Conference on Data Mining, Miami, Florida, USA, 6-9 December 2009, pp. 169–178, IEEE Computer Society, 2009.

Tags: Privacy and Networks

Andrew Fast, David Jensen

Constraint relaxation for learning the structure of Bayesian networks Technical Report

Tech Report 09-18, University of Massachusetts Amherst, Computer Science~… 2009.

Jennifer Neville, David Jensen

A bias/variance decomposition for models using collective inference Journal Article

In: Machine Learning, vol. 73, no. 1, pp. 87–106, 2008.

David Jensen, Andrew Fast, Brian Taylor, Marc Maier

Automatic identification of quasi-experimental designs for discovering causal knowledge Proceedings Article

In: Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Las Vegas, Nevada, USA, August 24-27, 2008, pp. 372–380, 2008.

@inproceedings{jensen2008automatic,

title = {Automatic identification of quasi-experimental designs for discovering causal knowledge},

author = {David Jensen and Andrew Fast and Brian Taylor and Marc Maier},

url = {https://doi.org/10.1145/1401890.1401938},

year  = {2008},

date = {2008-01-01},

booktitle = {Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Las Vegas, Nevada, USA, August 24-27, 2008},

pages = {372--380},

abstract = {Researchers in the social and behavioral sciences routinely rely on quasi-experimental designs to discover knowledge from large databases. Quasi-experimental designs (QEDs) exploit fortuitous circumstances in non-experimental data to identify situations (sometimes called "natural experiments") that provide the equivalent of experimental control and randomization. QEDs allow researchers in domains as diverse as sociology, medicine, and marketing to draw reliable inferences about causal dependencies from non-experimental data. Unfortunately, identifying and exploiting QEDs has remained a painstaking manual activity, requiring researchers to scour available databases and apply substantial knowledge of statistics. However, recent advances in the expressiveness of databases, and increases in their size and complexity, provide the necessary conditions to automatically identify QEDs. In this paper, we describe the first system to discover knowledge by applying quasi-experimental designs that were identified automatically. We demonstrate that QEDs can be identified in a traditional database schema and that such identification requires only a small number of extensions to that schema, knowledge about quasi-experimental design encoded in first-order logic, and a theorem-proving engine. We describe several key innovations necessary to enable this system, including methods for automatically constructing appropriate experimental units and for creating aggregate variables on those units. We show that applying the resulting designs can identify important causal dependencies in real domains, and we provide examples from academic publishing, movie making and marketing, and peer-production systems. Finally, we discuss the integration of QEDs with other approaches to causal discovery, including joint modeling and directed experimentation.},

keywords = {},

pubstate = {published},

tppubtype = {inproceedings}

}

David Jensen, Andrew Fast, Brian Taylor, Marc Maier

Automatic Identification of Quasi-Experimental Designs for Discovering Causal Knowledge Proceedings Article

In: Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 372–380, Association for Computing Machinery, Las Vegas, Nevada, USA, 2008, ISBN: 9781605581934.

Tags: causal discovery, Causal Modeling, quasi-experimental design

@inproceedings{10.1145/1401890.1401938,

title = {Automatic Identification of Quasi-Experimental Designs for Discovering Causal Knowledge},

author = {David Jensen and Andrew Fast and Brian Taylor and Marc Maier},

url = {https://doi.org/10.1145/1401890.1401938},

doi = {10.1145/1401890.1401938},

isbn = {9781605581934},

year  = {2008},

date = {2008-01-01},

booktitle = {Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining},

pages = {372--380},

publisher = {Association for Computing Machinery},

address = {Las Vegas, Nevada, USA},

series = {KDD '08},

abstract = {Researchers in the social and behavioral sciences routinely rely on quasi-experimental designs to discover knowledge from large databases. Quasi-experimental designs (QEDs) exploit fortuitous circumstances in non-experimental data to identify situations (sometimes called "natural experiments") that provide the equivalent of experimental control and randomization. QEDs allow researchers in domains as diverse as sociology, medicine, and marketing to draw reliable inferences about causal dependencies from non-experimental data. Unfortunately, identifying and exploiting QEDs has remained a painstaking manual activity, requiring researchers to scour available databases and apply substantial knowledge of statistics. However, recent advances in the expressiveness of databases, and increases in their size and complexity, provide the necessary conditions to automatically identify QEDs. In this paper, we describe the first system to discover knowledge by applying quasi-experimental designs that were identified automatically. We demonstrate that QEDs can be identified in a traditional database schema and that such identification requires only a small number of extensions to that schema, knowledge about quasi-experimental design encoded in first-order logic, and a theorem-proving engine. We describe several key innovations necessary to enable this system, including methods for automatically constructing appropriate experimental units and for creating aggregate variables on those units. We show that applying the resulting designs can identify important causal dependencies in real domains, and we provide examples from academic publishing, movie making and marketing, and peer-production systems. Finally, we discuss the integration of QEDs with other approaches to causal discovery, including joint modeling and directed experimentation.},

keywords = {causal discovery, Causal Modeling, quasi-experimental design},

pubstate = {published},

tppubtype = {inproceedings}

}

Andrew Fast, Michael Hay, David Jensen

Improving accuracy of constraint-based structure learning Technical Report

Technical report 08-48, University of Massachusetts Amherst, Computer~… 2008.

@techreport{fast2008improving,

title = {Improving accuracy of constraint-based structure learning},

author = {Andrew Fast and Michael Hay and David Jensen},

url = {https://www.researchgate.net/profile/David-Jensen-10/publication/228854891_Improving_Accuracy_of_Constraint-Based_Structure_Learning/links/09e41510892d741c18000000/Improving-Accuracy-of-Constraint-Based-Structure-Learning.pdf},

year  = {2008},

date = {2008-01-01},

institution = {Technical report 08-48, University of Massachusetts Amherst, Computer~…},

abstract = {Hybrid algorithms for learning the structure of Bayesian networks combine techniques from both the constraint-based and search-and-score paradigms of structure learning. One class of hybrid approaches uses a constraint-based algorithm to learn an undirected skeleton identifying edges that should appear in the final network. This skeleton is used to constrain the model space considered by a search-and-score algorithm to orient the edges and produce a final model structure. At small sample sizes, models learned using this hybrid approach do not achieve likelihood as high as models learned by unconstrained search. Low performance is a result of errors made by the skeleton identification algorithm, particularly false negative errors, which lead to an over-constrained search space. These errors are often attributed to “noisy” hypothesis tests that are run during skeleton identification. However, at least three specific sources of error have been identified in the literature: unsuitable hypothesis tests, low-power hypothesis tests, and unexplained d-separation. No previous work has considered these sources of error in combination. We determine the relative importance of each source individually and in combination. We identify that low-power tests are the primary source of false negative errors, and show that these errors can be corrected by a novel application of statistical power analysis. The result is a new hybrid algorithm for learning the structure of Bayesian networks which produces models with equivalent likelihood to models produced by unconstrained greedy search, using only a fraction of the time.},

keywords = {},

pubstate = {published},

tppubtype = {techreport}

}
