Here you’ll find an organized list of interesting, high-quality datasets for machine learning research. We welcome your contributions for curating this list! You can find other lists of such datasets on Wikipedia, for example.
ImageNet: The de facto image dataset for new algorithms. Many image API companies have labels from their REST interfaces that are suspiciously close to the 1,000-category WordNet hierarchy from ImageNet.
LSUN: Scene understanding with many ancillary tasks (room layout estimation, saliency prediction, etc.) and an associated competition.
MS COCO: Generic image understanding / captioning, with an associated competition.
COIL-20: Different objects imaged at every angle in a 360° rotation.
COIL-100: Different objects imaged at every angle in a 360° rotation.
Google’s Open Images: A collection of 9 million URLs to images “that have been annotated with labels spanning over 6,000 categories” under Creative Commons.
Arcade Universe: An artificial dataset generator with images containing arcade game sprites such as Tetris pentomino/tetromino objects. This generator is based on O. Breleux's bugland dataset generator.
A collection of datasets inspired by the ideas from BabyAISchool.
Labeled Faces in the Wild: 13,000 cropped facial regions (detected using Viola-Jones) that have been labeled with a name identifier. A subset of the people present have two images in the dataset; it's quite common for people to train facial matching systems here.
UMD Faces: Annotated dataset of 367,920 faces of 8,501 subjects.
CASIA WebFace: Facial dataset of 453,453 images over 10,575 identities after face detection. Requires some filtering for quality.
MS-Celeb-1M: 1 million images of celebrities from around the world. Requires some filtering for best results on deep networks.
Olivetti: A few images of several different people.
20 Newsgroups: Classification task, mapping word occurrences to newsgroup ID. One of the classic datasets for text classification, usually useful as a benchmark for either pure classification or as a validation of any IR / indexing algorithm.
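A word-occurrence-to-class task like this can be sketched with a tiny multinomial Naive Bayes classifier over word counts. This is a minimal pure-Python illustration with add-one smoothing; the toy documents in the usage example below are invented, not taken from the real dataset.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs):
    """docs: iterable of (text, label) pairs. Returns word counts per class
    and document counts per class."""
    word_counts = defaultdict(Counter)
    class_counts = Counter()
    for text, label in docs:
        class_counts[label] += 1
        word_counts[label].update(text.lower().split())
    return word_counts, class_counts

def predict_nb(word_counts, class_counts, text):
    """Pick the class maximizing log prior + smoothed log likelihood."""
    vocab = {w for counts in word_counts.values() for w in counts}
    total_docs = sum(class_counts.values())
    best_label, best_score = None, float("-inf")
    for label in class_counts:
        score = math.log(class_counts[label] / total_docs)
        # add-one smoothing over the shared vocabulary
        denom = sum(word_counts[label].values()) + len(vocab)
        for w in text.lower().split():
            score += math.log((word_counts[label][w] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label
```

On the real dataset one would map each post's body to its newsgroup ID in exactly this way, after tokenization and stop-word handling.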
Reuters News dataset: (Older) purely classification-based dataset with text from the newswire. Commonly used in tutorials.
Penn Treebank: Used for next word prediction or next character prediction.
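Next-word prediction on a corpus like the Penn Treebank can be sketched, at its very simplest, as a bigram frequency model; real baselines use smoothed n-grams or neural language models, so this is illustration only.

```python
from collections import Counter, defaultdict

def train_bigrams(tokens):
    """Count, for each word, which words follow it and how often."""
    successors = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        successors[prev][nxt] += 1
    return successors

def predict_next(successors, word):
    """Most frequent successor of `word`, or None if the word was never seen
    in a non-final position."""
    if word not in successors:
        return None
    return successors[word].most_common(1)[0][0]
```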
UCI’s Spambase: (Older) classic spam email dataset from the famous UCI Machine Learning Repository. Due to details of how the dataset was curated, this can be an interesting baseline for learning personalized spam filtering.
Broadcast News: Large text dataset, classically used for next word prediction.
Text Classification Datasets: From Zhang et al., 2015. An extensive set of eight datasets for text classification. These are the benchmark for new text classification baselines. Sample sizes range from 120K to 3.6M, and the tasks range from binary to 14-class problems. Datasets from DBPedia, Amazon, Yelp, Yahoo! and AG.
WikiText: A large language modeling corpus from quality Wikipedia articles, curated by Salesforce MetaMind.
SQuAD: The Stanford Question Answering Dataset — broadly useful question answering and reading comprehension dataset, where every answer to a question is posed as a segment of text.
Billion Words dataset: A large general-purpose language modeling dataset. Often used to train distributed word representations such as word2vec.
Common Crawl: Petabyte-scale crawl of the web; most frequently used for learning word embeddings. Available for free from Amazon S3. Can also be useful as a network dataset, since it's a crawl of the WWW.
Google Books Ngrams: Successive words from Google Books. Offers a simple method to explore when a word first entered wide usage.
Quora Question Pairs: First dataset release from Quora, containing duplicate / semantic similarity labels.
CMU Q/A Dataset: Manually-generated factoid question/answer pairs with difficulty ratings from Wikipedia articles.
Maluuba goal-oriented dialogue: Procedural conversational dataset where the dialogue aims at accomplishing a task or taking a decision. Often used for work on chatbots.
bAbi: Synthetic reading comprehension and question answering datasets from Facebook AI Research (FAIR).
The Children’s Book Test: Baseline of (Question + context, Answer) pairs extracted from Children’s books available through Project Gutenberg. Useful for question-answering (reading comprehension) and factoid look-up.
Movielens: Movie ratings dataset from the Movielens website, in various sizes ranging from demo to mid-size.
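A rating dataset of this shape, i.e. (user, item, rating) triples, supports even the simplest recommenders. The sketch below recommends the highest-mean-rated items a user has not yet rated; the triples in the usage example are invented, and a real system would use collaborative filtering rather than raw item means.

```python
from collections import defaultdict

def item_means(ratings):
    """Mean rating per item from (user, item, rating) triples."""
    totals = defaultdict(lambda: [0.0, 0])
    for _user, item, rating in ratings:
        totals[item][0] += rating
        totals[item][1] += 1
    return {item: s / n for item, (s, n) in totals.items()}

def recommend(ratings, user, k=1):
    """Top-k unseen items for `user`, ranked by mean rating."""
    seen = {item for u, item, _r in ratings if u == user}
    candidates = [(mean, item) for item, mean in item_means(ratings).items()
                  if item not in seen]
    return [item for _mean, item in sorted(candidates, reverse=True)[:k]]
```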
Million Song Dataset: Large, metadata-rich, open source dataset on Kaggle that can be good for people experimenting with hybrid recommendation systems.
Last.fm: Music recommendation dataset with access to underlying social network and other metadata that can be useful for hybrid systems.
Book-Crossing dataset: From the Book-Crossing community. Contains 278,858 users providing 1,149,780 ratings on 271,379 books.
Jester: 4.1 million continuous ratings (-10.00 to +10.00) of 100 jokes from 73,421 users.
Netflix Prize: Netflix released an anonymized version of their movie rating dataset; it consists of 100 million ratings from 480,000 users who have rated between 1 and all of the 17,770 movies. The first major Kaggle-style data challenge. Only available unofficially, as privacy issues arose.
Amazon Co-Purchasing: Amazon Reviews crawled data from “the users who bought this also bought…” section of Amazon, as well as Amazon review data for related products. Good for experimenting with recommendation systems in networks.
2000 HUB5 English: English-only speech data used most recently in the Deep Speech paper from Baidu.
LibriSpeech: Audio books data set of text and speech. Nearly 500 hours of clean speech of various audio books read by multiple speakers, organized by chapters of the book containing both the text and the speech.
VoxForge: Clean speech dataset of accented English. Useful for cases in which you expect to need robustness to different accents or intonations.
CHiME: Noisy speech recognition challenge dataset. Contains real, simulated, and clean voice recordings: the real data consists of nearly 9,000 recordings of 4 speakers across 4 noisy locations, the simulated data is generated by combining multiple environments with speech utterances, and the clean data consists of non-noisy recordings.
TED-LIUM: Audio transcription of TED talks. 1,495 TED talk audio recordings along with full-text transcriptions of those recordings.
Thanks to deeplearning.net and Luke de Oliveira for many of these links and dataset descriptions. Any suggestions of open data sets we should include for the Deeplearning4j community are welcome! via: http://sidgan.me/technical/2016/01/09/Exploring-Datasets
Exploring Image Captioning Datasets
PASCAL SENTENCE DATASET
PASCAL stands for Pattern Analysis, Statistical Modelling and Computational Learning. The associated VOC challenge has three main tasks: classification, detection, and segmentation.
The dataset has 20 classes: aeroplane, bicycle, bird, boat, bottle, bus, car, cat, chair, cow, dining table, dog, horse, motorbike, person, potted plant, sheep, sofa, train, and TV/monitor.
No quality filter was applied in selecting this dataset; the complete dataset was downloaded directly from Flickr. Because of this lack of filtering, the images contain complex scenes, scale variation, differing viewpoints of objects, and unnatural lighting. The training set consists of 10,103 images with 23,374 objects, so there are approximately 500 training objects per category. Each object is completely segmented along with its bounding box. The standard evaluation method of average precision per class is used, with train/test and validation splits.
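Average precision per class can be computed, in its simplest non-interpolated form, from a confidence-ranked list of relevance flags; note that the official VOC protocol historically used an 11-point interpolated variant, so this is a simplified sketch.

```python
def average_precision(ranked_relevance):
    """ranked_relevance: 0/1 flags ordered by descending classifier confidence.
    Returns the mean of precision@k taken at the positions of the positives."""
    hits, precisions = 0, []
    for k, rel in enumerate(ranked_relevance, start=1):
        if rel:
            hits += 1
            precisions.append(hits / k)
    return sum(precisions) / len(precisions) if precisions else 0.0
```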
Some challenges include:
Action Classification Taster Challenge
Given a bounding box, predict whether the person in the bounding box is performing a given action.
Action classes include calling, playing an instrument, reading, riding a bike, riding a horse, running, taking a photo, using a computer, walking, jumping, and an "other" class.
Person Layout Taster Challenge
Given the bounding box of a person, predict the spatial positions of the head, hands and feet. This encourages research on image interpretation.
FLICKR 8K DATASET
Publicly available dataset: http://nlp.cs.illinois.edu/HockenmaierGroup/8k-pictures.html
The Flickr 8K dataset includes images obtained from the Flickr website.
The University of Illinois at Urbana-Champaign hosts the sole official link to this dataset.
The images do not contain any famous people or places, so the entire image can be learned from all the different objects in it.
Image search is based on associating the query text with the tags of the image that help to identify it. Identification of images then becomes a multi-label classification problem of associating images with individual words or tags. It is a much harder problem to automatically associate images with complete and novel descriptions, such as captions.
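Framing tag association as multi-label classification starts by encoding each image's tag set as a binary indicator vector over the tag vocabulary, one target bit per tag. A minimal sketch:

```python
def binarize_tags(tagged_images, vocab=None):
    """tagged_images: list of tag sets, one per image.
    Returns the (sorted) tag vocabulary and one 0/1 indicator
    vector per image, suitable as multi-label training targets."""
    if vocab is None:
        vocab = sorted({tag for tags in tagged_images for tag in tags})
    vectors = [[1 if tag in tags else 0 for tag in vocab]
               for tags in tagged_images]
    return vocab, vectors
```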
Multiple captions are collected for each image because there is great variance in the captions that can be written to describe a single image. This also helps capture the dynamic nature of images. An image contains multiple objects, but a caption usually includes the main subject and only one or two of the secondary subjects.
MS COCO DATASET
This dataset does not focus on iconic views that contain just one view of a single object; objects in the background, partially occluded, and amidst clutter are also present, and these are important for image retrieval tasks. Iconic images contain a single subject that is clearly defined and occupies most of the space within the image frame; the object is complete or at least easily recognizable. Detailed spatial understanding of the object layout is a core component of scene analysis. An object's spatial location can be defined coarsely using a bounding box or with precise pixel-level segmentations. Image datasets like ground-truth stereo and optical-flow datasets promote tracking the movement of an object from one frame to another.
Three core problems in scene understanding that are presented in the MS COCO paper are:
Detecting non-iconic views or non-canonical perspectives
Contextual reasoning between objects
2D localization of objects
It has instance-level segmentation, which means that every instance of every object category in the dataset is labelled and fully segmented.
Images are retrieved using queries for pairs of objects in conjunction, as well as scene-based queries.
Each image is first labelled as containing (or not containing) a particular object category, using hierarchical labelling.
Individual instances are then labeled, verified, and finally segmented.
91 common object categories, 82 of which have more than 5,000 labelled instances.
2,500,000 instances in 328,000 images
Fewer categories but more instances per category, which enables better learning and makes this a richer dataset; it is also harder, so maximum scores are lower than on the PASCAL dataset.
The number of labeled instances per image helps in learning contextual information.
Image datasets can be addressed to one of three kinds of problems:
Image classification
Binary labels that indicate whether an image belongs to a category or not.
Object detection
Detect whether an object is present and, if present, which class of objects it belongs to.
After detecting an object, localize its position within the image.
It is difficult to detect objects that are not in their natural surroundings.
Semantic scene labeling
Each pixel of an image is to be labeled as belonging to a category
This enables the labeling of objects for which instances are hard to define
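Per-pixel labeling is usually scored with pixel accuracy and per-class intersection-over-union; both can be sketched over small label grids as follows (a minimal illustration, not an official evaluation script):

```python
def pixel_accuracy(pred, truth):
    """Fraction of pixels whose predicted label matches the ground truth.
    pred and truth are equal-sized 2-D grids (lists of lists) of labels."""
    total = correct = 0
    for pred_row, true_row in zip(pred, truth):
        for p, t in zip(pred_row, true_row):
            total += 1
            correct += (p == t)
    return correct / total

def class_iou(pred, truth, cls):
    """Intersection-over-union of the pixels assigned to class `cls`."""
    inter = union = 0
    for pred_row, true_row in zip(pred, truth):
        for p, t in zip(pred_row, true_row):
            inter += (p == cls and t == cls)
            union += (p == cls or t == cls)
    return inter / union if union else 0.0
```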
DAQUAR DATASET
DAQUAR is a dataset for question answering (with natural language sentences) based on real-world images, which include indoor scenes.
Vision and Language
A set of images and questions about their content is presented.
DAQUAR questions contain 1,088 nouns, while answers contain 803 nouns along with other parts of speech.
Common sense knowledge
The search space can be restricted by understanding non-visual cues from the questions.
Questions are asked based on spatial concepts within different frames of reference.
Settling on the perfect metric is difficult because:
The metric must understand natural language in order to gauge answers.
Individually judging every answer is a difficult task, so an automatic approximation is used.
The gradual category membership of human perception is variable and brings ambiguity.
Multiple interpretations of questions are possible, leading to many correct answers.
Equivalence classes among the answers are accounted for by assigning similar scores to members of the same class.
The WUPS score is an automatic metric that quantifies the performance of the holistic architecture.
It provides a soft generalization of accuracy that takes the ambiguities of different concepts into account using a set-membership measure.
It offers two generalizations: over many human answers, the candidate answer takes the maximum score across them, and a high score is granted to an answer if it is similar to a human answer.
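Assuming the metric described here is the WUPS score of Malinowski and Fritz (2014), it can be sketched with a pluggable word-similarity function (in practice WordNet's Wu-Palmer similarity); the 0.1 down-weighting of similarities below the threshold follows that paper.

```python
def wups(answer, truth, sim, threshold=0.9):
    """WUPS score between a predicted answer (a set of words) and a ground-truth
    answer set. sim(a, b) is a word similarity in [0, 1]; similarities below
    `threshold` are down-weighted by 0.1, and the two directional products
    are combined with min."""
    def scaled_max(word, others):
        best = max(sim(word, other) for other in others)
        return best if best >= threshold else 0.1 * best

    def side(xs, ys):
        product = 1.0
        for x in xs:
            product *= scaled_max(x, ys)
        return product

    return min(side(answer, truth), side(truth, answer))
```

With exact-match similarity this reduces to plain accuracy on single-word answers; with a graded similarity it assigns partial credit to near-misses, which is the soft generalization of accuracy.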