Resources

The TREC Deep Learning Track 2020 – Quick Start

This is a quick start guide for the document ranking task in the TREC Deep Learning (TREC-DL) benchmark. If you are new to TREC-DL, this repository makes it convenient to download all the required datasets and to train and evaluate a relatively efficient deep neural baseline on this benchmark, under both the rerank and the fullrank settings.

The TREC DL Track Quick Start is available through this GitHub repository: https://github.com/bmitra-msft/TREC-Deep-Learning-Quick-Start
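The rerank and fullrank settings mentioned above differ only in where the candidate documents come from. A minimal sketch of the contrast, using a made-up term-overlap scorer and a tiny in-memory collection (not the benchmark's actual code or data):

```python
# Toy illustration of the two TREC-DL document ranking settings.
# The scoring function, documents, and query are invented for illustration.

def score(query, doc):
    # Hypothetical relevance score: term overlap between query and document.
    q_terms = set(query.lower().split())
    d_terms = set(doc.lower().split())
    return len(q_terms & d_terms)

collection = {
    "d1": "deep learning for document ranking",
    "d2": "cooking recipes for beginners",
    "d3": "neural ranking models for search",
}

def fullrank(query, collection, k=2):
    # Fullrank: score every document in the whole collection.
    ranked = sorted(collection, key=lambda d: score(query, collection[d]),
                    reverse=True)
    return ranked[:k]

def rerank(query, collection, candidates, k=2):
    # Rerank: only reorder a provided candidate list
    # (e.g., the top documents from a first-stage retriever).
    ranked = sorted(candidates, key=lambda d: score(query, collection[d]),
                    reverse=True)
    return ranked[:k]

full_results = fullrank("neural document ranking", collection)
rerank_results = rerank("neural document ranking", collection, ["d2", "d3"])
```

Note that the reranker can never surface a document outside its candidate list, which is exactly what the fullrank setting removes.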

The Quick Start is based on an efficient yet effective model called Conformer-Kernel with Query Term Independence; for more information, please refer to the accompanying article.

MIMICS: Microsoft’s Mixed-Initiative Conversational Search Data

Developing mixed-initiative conversational search systems is a desired yet challenging task. Asking clarifying questions has been recognized as one of the necessary steps toward designing mixed-initiative conversational systems. To speed up research progress in this area, we built and released MIMICS, a collection of search clarification datasets for real search queries sampled from the Bing query logs. Each clarification in MIMICS consists of a clarifying question and up to five candidate answers.

MIMICS contains three datasets:

  • MIMICS-Click includes over 400k unique queries, their associated clarification panes, and the corresponding aggregated user interaction signals (i.e., clicks).
  • MIMICS-ClickExplore is an exploration dataset that includes aggregated user interaction signals for over 60k unique queries, each with multiple clarification panes.
  • MIMICS-Manual includes over 2k unique real search queries. Each query-clarification pair in this dataset has been manually labeled by at least three trained annotators. It contains graded quality labels for the clarifying question, the candidate answer set, and the landing result page for each candidate answer.
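As an illustration of how the aggregated click signals can be used, here is a hedged sketch that ranks a query's clarification panes by clickthrough rate. The field names ("query", "question", "impressions", "clicks") are assumptions for this toy example, not the official column names of the released files:

```python
# Toy aggregation over MIMICS-style click records. The field names are
# illustrative assumptions, not the official schema of the released data.

records = [
    {"query": "dieting",
     "question": "What do you want to know about dieting?",
     "impressions": 100, "clicks": 30},
    {"query": "dieting",
     "question": "Which diet are you interested in?",
     "impressions": 100, "clicks": 55},
]

def clickthrough_rates(records):
    # Rank a query's clarification panes by observed clickthrough rate.
    return sorted(
        ((r["question"], r["clicks"] / r["impressions"]) for r in records),
        key=lambda pair: pair[1],
        reverse=True,
    )

best_question, best_ctr = clickthrough_rates(records)[0]
```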

MIMICS is accessible through this GitHub repository: https://github.com/microsoft/mimics

If you find MIMICS useful, please cite the accompanying article.

Macaw: An Extensible Conversational Information Seeking Platform

Conversational information seeking (CIS) has been recognized as a major emerging research area in information retrieval. Such research requires data and tools to allow the implementation and study of conversational systems. Macaw is an open-source framework with a modular architecture for CIS research. Macaw supports multi-turn, multi-modal, and mixed-initiative interactions for tasks such as document retrieval, question answering, recommendation, and structured data exploration. Its modular design encourages the study of new CIS algorithms, which can be evaluated in batch mode. It can also integrate with a user interface, which allows user studies and data collection in an interactive mode, where the back end can be fully algorithmic or a Wizard of Oz setup.

Macaw could be of interest to researchers and practitioners working on information retrieval, natural language processing, and dialogue systems.

Macaw is open-sourced under the MIT License, and is accessible through this GitHub repository: https://github.com/microsoft/macaw

If you find Macaw useful, please cite the accompanying article.

Movie-Search-ML20: A Known-Item Movie Search Dataset Linked to MovieLens 20M

This dataset was created based on questions asked by real users on Yahoo! Answers, a community question answering website. We adopt the data collected by Hagen et al. (2015) from Yahoo! Answers and focus on the questions in the “movies” category. This is a known-item search task with long, descriptive questions, and each question has a single relevant movie. We manually mapped each question to its relevant movie ID in the MovieLens 20M dataset, which allows researchers to have both user-item interactions and query-item relevance signals on the same item set.

The dataset is publicly available for research purposes; please cite the accompanying paper.


Note that this dataset is based on Hagen et al.’s data. If you use this dataset, we encourage you to refer to their work as well.

ANTIQUE: A Non-Factoid Question Answering Dataset

ANTIQUE is an open-domain non-factoid question answering benchmark, collected from diverse categories of Yahoo! Answers. The main characteristics of this dataset include:

  • it solely focuses on non-factoid questions.
  • it contains relevance judgments for multiple candidate answers per question.
  • the relevance judgments were collected through crowdsourcing.
  • it contains over 2.6k questions with over 34k relevance annotations, making ANTIQUE a suitable collection for training complex machine learning QA models, e.g., neural nets.
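The relevance annotations are distributed in the standard TREC qrels layout (query ID, iteration field, answer ID, relevance grade), so they can be parsed with a few lines of code. A sketch, with made-up identifiers, assuming that whitespace-separated four-field format:

```python
# Parse TREC-style qrels lines: "query_id iteration doc_id relevance".
# The IDs and grades below are invented for illustration.

qrels_lines = [
    "2146313 Q0 2146313_0 4",
    "2146313 Q0 2146313_1 2",
]

def parse_qrels(lines):
    # Build {query_id: {doc_id: relevance_grade}} from qrels lines.
    judgments = {}
    for line in lines:
        qid, _iteration, did, rel = line.split()
        judgments.setdefault(qid, {})[did] = int(rel)
    return judgments

judgments = parse_qrels(qrels_lines)
```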

The dataset is publicly available for research purposes; please cite the accompanying paper.

Qulac: A Dataset for Offline Evaluation of Asking Clarifying Questions

Asking clarifying questions in information retrieval is a technique for identifying the user intent behind a submitted query. Clarifying questions are particularly useful in information seeking systems with limited-bandwidth interfaces, such as small screens or speech-only interfaces. Qulac was constructed based on the Web search queries used in TREC Web Track 2009-2012 (the ClueWeb09-Category B collection). The clarifying questions and their answers for different facets of each query were collected through crowdsourcing. The constructed data consists of over 10k question-answer pairs on ~200 topics, enabling researchers to evaluate methods for selecting clarifying questions offline. This dataset is a result of a joint effort by researchers from the Università della Svizzera italiana (USI), Lugano, Switzerland, and the University of Massachusetts, Amherst, MA, USA. The Qulac dataset is publicly available for research purposes; please cite the accompanying paper.
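The offline-evaluation idea can be sketched as follows: because every facet of a topic comes with pre-collected answers to the clarifying questions, a system can ask any of those questions and simply look up the simulated user's answer, with no live user in the loop. The data layout below is a toy illustration, not Qulac's actual file format:

```python
# Toy illustration of offline evaluation with pre-collected answers.
# Topics, facets, questions, and answers are invented; the real Qulac
# files use their own identifiers and schema.

qulac_like = {
    ("topic-1", "facet-a"): {
        "do you want reviews?": "no, I want prices",
        "are you looking to buy?": "yes, the cheapest option",
    },
}

def simulate_user(topic, facet, question, data):
    # Offline evaluation: the "user's" answer to a selected clarifying
    # question is simply looked up in the pre-collected data.
    return data[(topic, facet)].get(question)

answer = simulate_user("topic-1", "facet-a", "are you looking to buy?",
                       qulac_like)
```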

Telegram (and Twitter) News Data

Telegram is a popular and fast-growing instant messaging service. TelegramNews is a collection of news posts published by a set of popular news agencies through Telegram. This dataset also contains the tweets produced by the same news agencies in the same time period. TelegramNews is publicly available for research purposes. It has been used for news popularity detection and analysis, and it can potentially be used for various other tasks, such as news summarization.

The dataset contains the Telegram posts published by a set of news agencies from their starting dates until October 8, 2017: BBC, BBC Persian, CNN, Press TV, Reuters World, The Guardian, and Washington Post. The dataset is publicly available for research purposes; please cite the accompanying paper.

Standalone Neural Ranking Model (SNRM)

SNRM is a neural framework for end-to-end document retrieval. It learns a sparse representation for both queries and documents and builds an inverted index over the learned representations. At query time, it retrieves documents directly from the whole collection using the learned inverted index. An open-source implementation of SNRM is publicly available; please refer to the accompanying paper for details.
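The indexing idea can be sketched with toy sparse vectors: each nonzero latent dimension plays the role of a "term", so documents are posted under their nonzero dimensions and the query-document dot product is accumulated over the query's nonzero dimensions alone. This is an illustrative sketch with invented vectors, not the released implementation:

```python
from collections import defaultdict

# Toy sparse representations: {latent_dimension: weight}. The vectors
# here are invented; SNRM learns them with a neural model.
doc_vectors = {
    "d1": {0: 0.9, 3: 0.2},
    "d2": {1: 0.7},
    "d3": {0: 0.4, 1: 0.5},
}

def build_inverted_index(doc_vectors):
    # Post each document under every dimension where its weight is nonzero.
    index = defaultdict(list)
    for doc_id, vec in doc_vectors.items():
        for dim, weight in vec.items():
            index[dim].append((doc_id, weight))
    return index

def retrieve(query_vector, index):
    # Dot product accumulated only over the query's nonzero dimensions,
    # so the whole collection is searched via the inverted index.
    scores = defaultdict(float)
    for dim, q_weight in query_vector.items():
        for doc_id, d_weight in index.get(dim, []):
            scores[doc_id] += q_weight * d_weight
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

index = build_inverted_index(doc_vectors)
results = retrieve({0: 1.0, 1: 0.5}, index)
```

The sparser the learned representations, the shorter the posting lists, which is what makes retrieval from the whole collection tractable.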

ISTAS: An In Situ Dataset for Target Apps Selection

ISTAS is an in situ dataset for target apps selection as part of a unified mobile search system (also see the UniMobile dataset below). In contrast to UniMobile, ISTAS contains more realistic queries with associated contextual information captured from mobile sensors and logs of background processes, and it includes over 5,000 queries. This dataset is a result of a joint effort by researchers from the Università della Svizzera italiana (USI), Lugano, Switzerland, and the University of Massachusetts, Amherst, MA, USA. The ISTAS dataset is publicly available for research purposes; please cite the accompanying paper.

UniMobile: A Collection of Cross-App Mobile Search Queries

As a first step toward developing a unified search framework for mobile devices, the task of target apps selection has been defined. To train and evaluate models for this task, a dataset with over 5,000 queries was built using crowdsourcing. This dataset is a result of a joint effort by researchers from the Università della Svizzera italiana (USI), Lugano, Switzerland, and the University of Massachusetts, Amherst, MA, USA. The UniMobile dataset is publicly available for research purposes; please cite the accompanying paper.

Citation Worthiness Dataset

Does this sentence need a citation? To train and evaluate models that address this question, we constructed a citation worthiness dataset from the articles of the ACL Anthology Reference Corpus (ARC). We use the SEPIC corpus, which includes sentence-level segmentation of 10,921 articles from ACL ARC 1.0, up to February 2007. The sentence splitter and chunker of Apache OpenNLP 1.5.3, the Stanford tokenizer and POS tagger, and MaltParser were used for preprocessing. More information is provided in the accompanying paper, and the dataset is publicly available for research purposes.
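As a toy illustration of the task framing (sentence in, binary citation-worthiness label out), here is a cue-phrase heuristic; the cue list is invented for this sketch and is far weaker than the trained models studied in the paper:

```python
# Toy citation-worthiness heuristic. The cue phrases are invented for
# illustration; the actual paper trains models on richer features.
CUES = ("et al", "previous work", "has been shown", "according to")

def needs_citation(sentence):
    # Flag a sentence as citation-worthy if it contains any cue phrase.
    s = sentence.lower()
    return any(cue in s for cue in CUES)

examples = [
    "Previous work has addressed this task with CRFs.",
    "We now describe our experimental setup.",
]
labels = [needs_citation(s) for s in examples]
```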

Million Playlist Dataset

The ACM Recommender Systems Challenge 2018 focuses on a novel task in the field of recommender systems and information retrieval: automatic playlist continuation. RecSys Challenge 2018 is organized by Spotify, the University of Massachusetts Amherst, and Johannes Kepler University Linz. For this challenge, Spotify released a dataset containing one million playlists generated by Spotify users. Please visit http://www.recsyschallenge.com/2018/ for more information, and please cite the accompanying paper if you use the dataset.
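A minimal sketch of the playlist continuation task using track co-occurrence counts; the playlists below are invented, and the released dataset stores one million playlists as JSON with much richer metadata:

```python
from collections import Counter

# Invented playlists; the real dataset is one million JSON playlists.
playlists = [
    ["track_a", "track_b", "track_c"],
    ["track_a", "track_b", "track_d"],
    ["track_b", "track_c"],
]

def continuation_candidates(seed_tracks, playlists, k=2):
    # Count tracks that co-occur with the seed tracks across playlists
    # and are not already in the seed: a simple co-occurrence baseline.
    seed = set(seed_tracks)
    counts = Counter()
    for pl in playlists:
        if seed & set(pl):
            counts.update(t for t in pl if t not in seed)
    return [t for t, _ in counts.most_common(k)]

candidates = continuation_candidates(["track_a"], playlists)
```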

Tweet Rating Dataset

This dataset contains tweets from users about items in four popular and diverse web applications: IMDb (movies), YouTube (video clips), Pandora (music), and Goodreads (books). It contains ~500K tweets from ~20K users about ~230K items. The dataset is freely available for research purposes; please cite the accompanying paper.

Wikipedia English-Persian Parallel Corpus

This parallel corpus was automatically extracted from English and Persian Wikipedia articles. We extensively evaluated the corpus, showing its high quality compared to existing English-Persian parallel corpora. The dataset is freely available for research purposes; please cite the accompanying paper.