Open Data for Deep Learning

https://deeplearning4j.org/opendata#open-data-for-deep-learning

Here you’ll find an organized list of interesting, high-quality datasets for machine learning research. We welcome your contributions to this list! You can find other lists of such datasets on Wikipedia, for example.

Recent Additions

Natural-Image Datasets

  • MNIST: handwritten digits: The most commonly used sanity check. Dataset of 28x28, centered, B&W handwritten digits. It is an easy task: just because something works on MNIST doesn’t mean it works in general (a loading sketch follows this list).
  • CIFAR10 / CIFAR100: 32x32 color images with 10 / 100 categories. Not commonly used anymore, though once again, can be an interesting sanity check.
  • Caltech 101: Pictures of objects belonging to 101 categories.
  • Caltech 256: Pictures of objects belonging to 256 categories.
  • STL-10 dataset: An image recognition dataset for developing unsupervised feature learning, deep learning, and self-taught learning algorithms. Like CIFAR-10 with some modifications.
  • The Street View House Numbers (SVHN): House numbers from Google Street View. Think of this as recurrent MNIST in the wild.
  • NORB: Binocular images of toy figurines under various illumination and pose.
  • Pascal VOC: Generic image segmentation / classification; not terribly useful for building real-world image annotation, but great for baselines.
  • Labelme: A large dataset of annotated images.
  • ImageNet: The de-facto image dataset for new algorithms. Many image API companies have labels from their REST interfaces that are suspiciously close to the 1,000-category WordNet hierarchy from ImageNet.
  • LSUN: Scene understanding with many ancillary tasks (room layout estimation, saliency prediction, etc.) and an associated competition.
  • MS COCO: Generic image understanding / captioning, with an associated competition.
  • COIL-20: Different objects imaged at every angle in a 360-degree rotation.
  • COIL-100: Different objects imaged at every angle in a 360-degree rotation.
  • Google’s Open Images: A collection of 9 million URLs to images “that have been annotated with labels spanning over 6,000 categories” under Creative Commons.
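As a quick illustration of how little setup the MNIST sanity check mentioned above requires, here is a minimal loading sketch using the Keras datasets module (one of several common entry points; it assumes TensorFlow, which bundles Keras, is installed):

```python
# Minimal sketch: loading MNIST as a sanity-check dataset via Keras.
# Assumes TensorFlow (which bundles Keras) is installed.
from tensorflow.keras.datasets import mnist

(x_train, y_train), (x_test, y_test) = mnist.load_data()

print(x_train.shape)  # (60000, 28, 28) grayscale images
print(y_train.shape)  # (60000,) digit labels 0-9
print(x_test.shape)   # (10000, 28, 28)
```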

Geospatial data

  • OpenStreetMap: Vector data for the entire planet under a free license. It contains (an older version of) the US Census Bureau’s data.
  • Landsat8: Satellite shots of the entire Earth surface, updated every several weeks.
  • NEXRAD: Doppler radar scans of atmospheric conditions in the US.

Artificial Datasets

Facial Datasets

  • Labelled Faces in the Wild: 13,000 cropped facial regions (using Viola-Jones) that have been labeled with a name identifier. A subset of the people present have two images in the dataset; it’s quite common for people to train facial matching systems here.
  • UMD Faces: Annotated dataset of 367,920 faces of 8,501 subjects.
  • CASIA WebFace: Facial dataset of 453,453 images over 10,575 identities after face detection. Requires some filtering for quality.
  • MS-Celeb-1M: 1 million images of celebrities from around the world. Requires some filtering for best results on deep networks.
  • Olivetti: A few images of several different people.
  • Multi-Pie: The CMU Multi-PIE Face Database
  • Face-in-Action
  • JACFEE: Japanese and Caucasian Facial Expressions of Emotion
  • FERET: The Facial Recognition Technology Database
  • mmifacedb: MMI Facial Expression Database
  • IndianFaceDatabase
  • The Yale Face Database and The Yale Face Database B.

Video Datasets

  • Youtube-8M: A large and diverse labeled video dataset for video understanding research.

Text Datasets

  • 20 newsgroups: Classification task, mapping word occurrences to newsgroup ID. One of the classic datasets for text classification, usually useful as a benchmark for either pure classification or as a validation of any IR / indexing algorithm (a baseline sketch follows this list).
  • Reuters News dataset: (Older) purely classification-based dataset with text from the newswire. Commonly used in tutorials.
  • Penn Treebank: Used for next word prediction or next character prediction.
  • UCI’s Spambase: (Older) classic spam email dataset from the famous UCI Machine Learning Repository. Due to details of how the dataset was curated, this can be an interesting baseline for learning personalized spam filtering.
  • Broadcast News: Large text dataset, classically used for next word prediction.
  • Text Classification Datasets: From Zhang et al., 2015: an extensive set of eight datasets for text classification, serving as benchmarks for new text classification baselines. Sample sizes range from 120K to 3.6M, covering binary to 14-class problems. Datasets from DBPedia, Amazon, Yelp, Yahoo! and AG.
  • WikiText: A large language modeling corpus from quality Wikipedia articles, curated by Salesforce MetaMind.
  • SQuAD: The Stanford Question Answering Dataset — broadly useful question answering and reading comprehension dataset, where every answer to a question is posed as a segment of text.
  • Billion Words dataset: A large general-purpose language modeling dataset. Often used to train distributed word representations such as word2vec.
  • Common Crawl: Petabyte-scale crawl of the web, most frequently used for learning word embeddings. Available for free from Amazon S3. Can also be useful as a network dataset since it is a crawl of the WWW.
  • Google Books Ngrams: Successive words from Google books. Offers a simple method to explore when a word first entered wide usage.
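As an example of the classification benchmark role mentioned for 20 newsgroups above, a rough bag-of-words baseline can be put together with scikit-learn (a sketch, not a tuned benchmark):

```python
# Minimal sketch: bag-of-words baseline on 20 newsgroups with scikit-learn.
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

train = fetch_20newsgroups(subset="train")
test = fetch_20newsgroups(subset="test")

vectorizer = TfidfVectorizer()            # map word occurrences to TF-IDF features
X_train = vectorizer.fit_transform(train.data)
X_test = vectorizer.transform(test.data)

clf = LogisticRegression(max_iter=1000)   # simple linear classifier
clf.fit(X_train, train.target)            # targets are newsgroup IDs
print(clf.score(X_test, test.target))     # held-out accuracy
```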

Question answering

  • Maluuba News QA Dataset: 120K Q&A pairs on CNN news articles.
  • Quora Question Pairs: first dataset release from Quora containing duplicate / semantic similarity labels.
  • CMU Q/A Dataset: Manually-generated factoid question/answer pairs with difficulty ratings from Wikipedia articles.
  • Maluuba goal-oriented dialogue: Procedural conversational dataset where the dialogue aims at accomplishing a task or taking a decision. Often used to work on chat bots.
  • bAbi: Synthetic reading comprehension and question answering datasets from Facebook AI Research (FAIR).
  • The Children’s Book Test: Baseline of (Question + context, Answer) pairs extracted from Children’s books available through Project Gutenberg. Useful for question-answering (reading comprehension) and factoid look-up.

Sentiment

  • Multidomain sentiment analysis dataset: An older, academic dataset.
  • IMDB: An older, relatively small dataset for binary sentiment classification. Fallen out of favor for benchmarks in the literature in lieu of larger datasets.
  • Stanford Sentiment Treebank: Standard sentiment dataset with fine-grained sentiment annotations at every node of each sentence’s parse tree.

Recommendation and ranking systems

  • Movielens: Movie ratings dataset from the Movielens website, in various sizes ranging from demo to mid-size.
  • Million Song Dataset: Large, metadata-rich, open source dataset on Kaggle that can be good for people experimenting with hybrid recommendation systems.
  • Last.fm: Music recommendation dataset with access to underlying social network and other metadata that can be useful for hybrid systems.
  • Book-Crossing dataset: From the Book-Crossing community. Contains 278,858 users providing 1,149,780 ratings about 271,379 books.
  • Jester: 4.1 million continuous ratings (-10.00 to +10.00) of 100 jokes from 73,421 users.
  • Netflix Prize: Netflix released an anonymized version of their movie rating dataset; it consists of 100 million ratings, done by 480,000 users who have rated between 1 and all of the 17,770 movies. First major Kaggle-style data challenge. Only available unofficially, as privacy issues arose.

Networks and Graphs

  • Amazon Co-Purchasing: Amazon Reviews crawled data from “the users who bought this also bought…” section of Amazon, as well as Amazon review data for related products. Good for experimenting with recommendation systems in networks.
  • Friendster Social Network Dataset: Before their pivot as a gaming website, Friendster released anonymized data in the form of friends lists for 103,750,348 users.

Speech Datasets

  • 2000 HUB5 English: English-only speech data used most recently in the Deep Speech paper from Baidu.
  • LibriSpeech: Audio books data set of text and speech. Nearly 500 hours of clean speech of various audio books read by multiple speakers, organized by chapters of the book containing both the text and the speech.
  • VoxForge: Clean speech dataset of accented English. Useful for instances in which you expect to need robustness to different accents or intonations.
  • TIMIT: English-only speech recognition dataset.
  • CHIME: Noisy speech recognition challenge dataset. Contains real, simulated, and clean voice recordings: the real portion consists of actual recordings of 4 speakers in nearly 9,000 recordings over 4 noisy locations; the simulated portion is generated by combining multiple environments with speech utterances; and the clean portion consists of non-noisy recordings.
  • TED-LIUM: Audio transcriptions of TED talks. Audio recordings of 1,495 TED talks along with full text transcriptions of those recordings.

Symbolic Music Datasets

Miscellaneous Datasets

Health & Biology Data

Government & statistics data

Thanks to deeplearning.net and Luke de Oliveira for many of these links and dataset descriptions. Any suggestions of open data sets we should include for the Deeplearning4j community are welcome!


via: http://sidgan.me/technical/2016/01/09/Exploring-Datasets



PASCAL SENTENCE DATASET

  • Link: http://vision.cs.uiuc.edu/pascal-sentences/
PASCAL stands for Pattern Analysis, Statistical Modelling and Computational Learning. It has 3 tasks:
  • Image Classification
  • Object Detection
  • Object Segmentation
The dataset has 20 classes, including aeroplane, bicycle, boat, bottle, bus, car, cat, chair, cow, dining table, dog, horse, motorbike, person, potted plant, sheep, train, TV.
No quality filter was applied when selecting this dataset; the complete dataset was downloaded directly from Flickr. Because of this lack of filtering, the images contain complex scenes, varying scales, different object viewpoints, and unnatural lighting. The training set consists of 10,103 images with 23,374 objects, giving approximately 500 training objects per category. Each object is annotated with a bounding box and a complete segmentation. The standard evaluation method is average precision per class, computed over train/test and validation splits.
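The per-class average precision evaluation can be approximated with scikit-learn's precision-recall AP, as in the sketch below. Note that the official VOC protocol uses an interpolated AP, so treat this as an illustration rather than the exact benchmark code; the class names, relevance labels, and scores are made up.

```python
# Minimal sketch: per-class average precision, averaged into a mean AP.
# The relevance labels and confidence scores below are invented for illustration.
import numpy as np
from sklearn.metrics import average_precision_score

# y_true[c][i] = 1 if example i contains class c, else 0
y_true = {
    "aeroplane": np.array([1, 0, 1, 1, 0]),
    "bicycle":   np.array([0, 1, 0, 0, 1]),
}
# y_score[c][i] = classifier confidence for class c on example i
y_score = {
    "aeroplane": np.array([0.9, 0.2, 0.6, 0.4, 0.1]),
    "bicycle":   np.array([0.1, 0.8, 0.3, 0.2, 0.7]),
}

ap_per_class = {c: average_precision_score(y_true[c], y_score[c]) for c in y_true}
mean_ap = sum(ap_per_class.values()) / len(ap_per_class)
print(ap_per_class, mean_ap)
```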
Some challenges include:
  • Action Classification Taster Challenge
    • Given a bounding box, predict whether the person in the bounding box is performing a given action or not.
    • Action classes include calling, playing an instrument, reading, riding a bike, riding a horse, running, taking a photo, using a computer, walking, jumping, and an “other” class.
  • Person Layout Taster Challenge
    • Given the bounding box of a person, predict the spatial positions of the head, hands and feet. This encourages research on detailed image interpretation.

FLICKR 8K

  • Publicly available dataset: http://nlp.cs.illinois.edu/HockenmaierGroup/8k-pictures.html
  • The Flickr 8K dataset includes images obtained from the Flickr website.
  • The University of Illinois at Urbana-Champaign hosts the sole link to this dataset.
  • The images do not contain any famous people or places, so the entire image can be learned from the different objects it contains.

FLICKR 30K

  • This is an extension to the Flickr 8K.
  • Link: http://shannon.cs.illinois.edu/DenotationGraph/
Image search is based on associating the query text with the tags that help identify an image. Identification of images then becomes a multi-label classification problem of associating images with individual words or tags. It is a much harder problem to automatically associate images with complete and novel descriptions such as captions.
Multiple captions are collected for each image because there is a great deal of variance in the captions that can be written to describe a single image. Although an image may contain many objects, a caption usually mentions only the main subject and one or two secondary subjects.
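As a small illustration of the multi-label framing described above, per-image tag sets can be converted into a binary indicator matrix; the sketch below uses scikit-learn, and the tag sets are invented:

```python
# Minimal sketch: turning per-image tag sets into multi-label targets.
# The tag lists are made up for illustration.
from sklearn.preprocessing import MultiLabelBinarizer

image_tags = [
    {"dog", "grass", "ball"},
    {"person", "bicycle"},
    {"dog", "person"},
]

mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(image_tags)   # shape: (n_images, n_distinct_tags)
print(mlb.classes_)                 # tag vocabulary
print(Y)                            # binary indicator matrix, one row per image
```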

MS COCO

  • Link: http://mscoco.org/
This dataset does not focus on iconic views that contain just one view of a single object; objects in the background, partially occluded objects, and objects amidst clutter are also present, which is important for image retrieval tasks. Iconic images contain a single, clearly defined subject that occupies most of the image frame; the object is complete or at least easily recognizable. Detailed spatial understanding of the object layout is a core component of scene analysis. An object's spatial location can be defined coarsely using a bounding box or with precise pixel-level segmentations. Image datasets such as ground-truth stereo and optical flow datasets promote tracking the movement of an object from one frame to another.
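Both kinds of spatial annotation (coarse bounding boxes and precise pixel-level masks) can be read with the pycocotools API, as in the sketch below; the annotation file path is a hypothetical local copy:

```python
# Minimal sketch: reading COCO bounding boxes and segmentation masks.
# Assumes pycocotools is installed; the annotation file path is hypothetical.
from pycocotools.coco import COCO

coco = COCO("annotations/instances_val2014.json")

cat_ids = coco.getCatIds(catNms=["dog"])            # category id(s) for "dog"
img_ids = coco.getImgIds(catIds=cat_ids)            # images containing a dog
ann_ids = coco.getAnnIds(imgIds=img_ids[0], catIds=cat_ids)
anns = coco.loadAnns(ann_ids)

for ann in anns:
    bbox = ann["bbox"]          # coarse location: [x, y, width, height]
    mask = coco.annToMask(ann)  # precise location: binary per-pixel mask
    print(bbox, mask.shape)
```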
Three core problems in scene understanding that are presented in this paper are:
  • Detecting non-iconic views or non-canonical perspectives
  • Contextual reasoning between objects
  • 2D localization of objects
It has instance-level segmentation, which means that every instance of every object category in the dataset is labelled and fully segmented.
Image features:
  • Images were retrieved using queries for pairs of objects in conjunction, as well as scene-based queries.
  • Images are labeled as containing a particular object category using hierarchical labelling.
  • Individual instances are labeled, verified, and finally segmented.
Image counts:
  • 91 common object categories, with 82 having more than 5,000 labelled instances.
  • 2,500,000 instances in 328,000 images.
  • Fewer categories but more instances per category, which enables better learning and makes this a richer dataset; top scores on it are lower than on the PASCAL dataset.
  • The number of labeled instances per image helps in learning contextual information.
Datasets are generally aimed at one of three kinds of problems (a sketch contrasting their label formats follows this list):
  • Image classification
    • Binary labels that indicate if an image belongs to a category or not.
  • Object detection
    • Detect whether an object is present and, if so, which class of objects it belongs to.
    • After detecting an object, localize its position within the image
    • It is difficult to detect objects that are not in their natural surroundings
  • Semantic scene labeling
    • Each pixel of an image is to be labeled as belonging to a category
    • This enables the labeling of objects for which instances are hard to define
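To make the distinction concrete, the sketch below contrasts the label structures the three problem types typically use; the shapes, class IDs, and boxes are made up:

```python
# Minimal sketch: label structures for the three problem types listed above.
# All shapes and values are invented for illustration.
import numpy as np

n_classes = 5
height, width = 480, 640

# Image classification: one binary indicator per category for the whole image.
classification_labels = np.zeros(n_classes, dtype=np.int64)
classification_labels[2] = 1                      # the image contains category 2

# Object detection: a class id plus a bounding box per object instance.
detection_labels = [
    {"category": 2, "bbox": [40, 60, 120, 200]},  # [x, y, width, height]
    {"category": 4, "bbox": [300, 100, 80, 90]},
]

# Semantic scene labeling: a category id for every pixel.
segmentation_labels = np.zeros((height, width), dtype=np.int64)
segmentation_labels[60:260, 40:160] = 2           # pixels belonging to category 2
```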

DAQUAR Dataset

It is a dataset for question answering with natural-language sentences, based on real-world images (indoor scenes).

Vision and Language

  • A set of images and questions about their content is presented.
  • DAQUAR questions contain 1,088 nouns, while answers contain 803 nouns along with other parts of speech.

Common sense knowledge

The search space can be restricted by understanding non-visual cues in the questions.

Question Answering

Questions are asked about spatial concepts within different frames of reference.

Performance Evaluation

Settling on the perfect metric is difficult because:
  • Automation
    • The metric must understand natural language and try to gauge answers.
    • Judging every answer individually is a difficult task, so an automatic approximation is used.
  • Ambiguity
    • The gradual category membership in human perception is variable and introduces ambiguity.
    • Multiple interpretations of questions are possible, leading to many correct answers.
  • Coverage
    • The equivalence class among the answers is considered by assigning similar scores to members of the same class.

Metrics

  • WUPS SCORE
    • It is an automatic metric that quantifies the performance of the holistic architecture.
    • It provides a soft generalization of accuracy that takes the ambiguities between different concepts into account using a set-membership measure (a minimal sketch follows this list).
    • It offers two generalizations:
      • Interpretation metric
        • Among many human answers, the maximum similarity score is taken, so a predicted answer receives a high score if it is similar to at least one human answer.
      • Extension
        • Uses vector-based representations of the answers.
        • Coverage issues are less pronounced.
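The sketch below computes a WUPS-style score with NLTK's WordNet Wu-Palmer similarity. It follows the common form of the metric (a product over answer words of the best word-level similarity, taken in both directions), but the exact aggregation and thresholding used in the DAQUAR evaluation may differ, and the example answers are made up:

```python
# Minimal sketch of a WUPS-style score using WordNet Wu-Palmer similarity.
# Assumes NLTK and its WordNet corpus are installed; example answers are invented.
from nltk.corpus import wordnet as wn

def word_similarity(a, b):
    """Best Wu-Palmer similarity between any synsets of the two words."""
    scores = [s1.wup_similarity(s2) or 0.0
              for s1 in wn.synsets(a)
              for s2 in wn.synsets(b)]
    return max(scores, default=0.0)

def wups(predicted, ground_truth):
    """Soft set-membership score between two answer word sets."""
    def direction(src, dst):
        prod = 1.0
        for w in src:
            prod *= max(word_similarity(w, v) for v in dst)
        return prod
    return min(direction(predicted, ground_truth),
               direction(ground_truth, predicted))

# One question: the model answers "sofa", the human annotator wrote "couch".
print(wups({"sofa"}, {"couch"}))   # high score (1.0 here), unlike exact string match
```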
