
COCO

This image dataset is suitable for object detection and segmentation. It provides five annotation types, for Object Detection, Keypoint Detection, Stuff Segmentation, Panoptic Segmentation and Image Captioning, all explained in detail in the data format section of the dataset page (http://cocodataset.org/#format-data).

Here is some information regarding the latest version of this dataset:

  • Number of images in the dataset: 330,000, of which more than 200,000 are labeled (split roughly in half between training and validation+test)

  • Number of classes: 80 object categories, 91 stuff categories

  • Image resolution: 640×480

More details and download links can be found on the dataset and challenge pages http://cocodataset.org/#home and http://cocodataset.org/#overview.
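
As an illustration, here is a minimal Python sketch that browses the object detection annotations with the pycocotools package (pip install pycocotools); the annotation file path assumes the standard layout of the extracted 2017 annotations archive:

    from pycocotools.coco import COCO

    # Load the instances annotation index.
    coco = COCO("annotations/instances_val2017.json")

    # Find all images containing at least one "person" annotation.
    cat_ids = coco.getCatIds(catNms=["person"])
    img_ids = coco.getImgIds(catIds=cat_ids)

    # Load the boxes/segmentation masks for the first such image.
    ann_ids = coco.getAnnIds(imgIds=img_ids[0], catIds=cat_ids, iscrowd=None)
    anns = coco.loadAnns(ann_ids)
    print(len(img_ids), "images,", len(anns), "person annotations in the first")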

If you use this dataset:

Please make sure to read the Terms of Use available at http://cocodataset.org/#termsofuse.

Please make sure to cite the paper:

T. Lin, M. Maire, S. Belongie, L. Bourdev, R. Girshick, J. Hays, P. Perona, D. Ramanan, C. Zitnick, P. Dollár, Microsoft COCO: Common Objects in Context. European Conference on Computer Vision (ECCV), 2014.

keywords: Vision, Image, Object Detection, Segmentation

SALICON

This image dataset, which is also a mouse-tracking dataset, was created from a subset of images of the parent dataset MS COCO 2014 (available at http://cocodataset.org/#home), with an additional annotation type, “fixation”. The visual attention data were collected using mouse-tracking methods. The related research aimed to answer questions about human visual attention and decision making; in the paper cited below, the authors evaluated their mouse-tracking method by comparing its results with eye tracking.

Here is some information regarding the latest version of this dataset:

  • Number of images in the dataset: 20,000 (10,000 images for the training set, 5,000 for validation and 5,000 for the test set)

  • Number of classes: 80

  • Image resolution: 640×480

More details and download links can be found on the dataset page http://salicon.net/ and the SALICON challenge 2017 page http://salicon.net/challenge-2017/.

You might also be interested in the SALICON API Python package available on GitHub: https://github.com/NUS-VIP/salicon-api.
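
As a rough illustration of how point-wise fixation annotations are commonly turned into a continuous saliency map, here is a toy sketch; the fixation coordinates and blur width are made-up placeholders, not actual SALICON data or parameters:

    import numpy as np
    from scipy.ndimage import gaussian_filter

    height, width = 480, 640                         # SALICON images are 640x480
    fixations = [(120, 300), (125, 310), (400, 50)]  # placeholder (row, col) points

    # Accumulate fixation counts on an empty map, then blur and normalize.
    saliency = np.zeros((height, width))
    for r, c in fixations:
        saliency[r, c] += 1.0
    saliency = gaussian_filter(saliency, sigma=19)   # Gaussian smoothing
    saliency /= saliency.max()                       # scale to [0, 1]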

If you use this dataset:

Please make sure to read the Terms of Use available at http://salicon.net/challenge-2017/.

Please make sure to cite the paper:

M. Jiang, S. Huang, J. Duan, Q. Zhao, SALICON: Saliency in Context. CVPR 2015.

keywords: Vision, Image, Classification, Scene, Saliency Analysis

SUN

This dataset, provided by Princeton University, contains thousands of color images for scene recognition. The images include environmental scenes, places and objects. To create the dataset, the WordNet English dictionary was used to find nouns that complete the sentences “I am in a place” or “Let’s go to the place”, and the samples were categorized manually. The number of images per category varies, with a minimum of 100 images per category in the SUN397 version.

Different versions of the dataset are available. Here is some information about the SUN397 version:

  • Number of images in the dataset: 108,754

  • Number of classes: 397 (Abbey, Access_road, etc.)

Here is some information regarding the latest version of this dataset:

  • Number of images in the dataset: 131,067

  • Number of classes: 908 scene categories and 3819 object categories

More details and download links are available on the dataset pages https://vision.princeton.edu/projects/2010/SUN/ and https://groups.csail.mit.edu/vision/SUN/. Recommended training and testing splits are also available on those pages.
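
If you work in Python, recent torchvision releases include a SUN397 loader; a minimal sketch (note that download=True fetches an archive of several tens of gigabytes):

    from torchvision import datasets, transforms

    # Downloads and extracts the SUN397 archive into "data" on first use.
    sun = datasets.SUN397(root="data", transform=transforms.ToTensor(),
                          download=True)
    image, label = sun[0]
    print(image.shape, sun.classes[label])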

If you use this dataset, make sure to cite these two papers:

J. Xiao, J. Hays, K. Ehinger, A. Oliva, and A. Torralba, SUN Database: Large-scale Scene Recognition from Abbey to Zoo, IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2010.

J. Xiao, K. A. Ehinger, J. Hays, A. Torralba, and A. Oliva, SUN Database: Exploring a Large Collection of Scene Categories, International Journal of Computer Vision (IJCV), 2014.

keywords: Vision, Image, Classification, Scene, Object Detection

LSUN

This dataset contains millions of color images of scenes and objects, making it far bigger than the ImageNet dataset. The labels were produced by human annotators working in conjunction with several different image classification models. The dataset builds on the parent datasets Pascal VOC 2012 and 10 Million Images for 10 Scene Categories.

Here is some information regarding the LSUN dataset:

  • Number of images in the dataset: More than 59 million and still growing

  • Number of classes: 10 scene categories and 20 object categories

  • Scene categories: bedroom, bridge, church_outdoor, classroom, conference_room, dining_room, kitchen, living_room, restaurant, tower

  • Object categories: airplane, bicycle, bird, boat, bottle, bus, car, cat, chair, cow, dining_table, dog, horse, motorbike, person, potted_plant, sheep, sofa, train, tv-monitor

The dataset can be downloaded either from GitHub (https://github.com/fyu/lsun) or via the per-category lists at http://tigress-web.princeton.edu/~fy/lsun/public/release/. More details are available on the dataset page http://www.yf.io/p/lsun.
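
The downloads are LMDB databases rather than folders of image files; one way to read them from Python is torchvision's LSUN wrapper (it requires the lmdb package, and the layout below assumes the default directory names, e.g. bedroom_train_lmdb):

    from torchvision import datasets, transforms

    # "lsun" is the directory holding the extracted *_lmdb folders.
    lsun = datasets.LSUN(root="lsun", classes=["bedroom_train"],
                         transform=transforms.ToTensor())
    image, label = lsun[0]
    print(len(lsun), image.shape)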

If you use this dataset, make sure to cite the paper:

Fisher Yu, Ari Seff, Yinda Zhang, Shuran Song, Thomas Funkhouser and Jianxiong Xiao, LSUN: Construction of a Large-scale Image Dataset using Deep Learning with Humans in the Loop. CoRR, abs/1506.03365, 2015.

keywords: Vision, Image, Classification, Scene, Object Detection

DeepFashion

This dataset contains images of clothing items; each image is labeled with one of 50 categories and annotated with attributes (out of a set of 1,000), a bounding box and clothing landmarks in different poses. Four benchmark datasets have been developed from DeepFashion: Attribute Prediction, Consumer-to-shop Clothes Retrieval, In-shop Clothes Retrieval and Landmark Detection. Only Attribute Prediction is available without a password request; for the others you need to request a password to unzip the data files, and access is granted after signing an agreement. All of these datasets are available for academic research only, and any commercial use is prohibited. More details about the datasets and download instructions can be found at http://mmlab.ie.cuhk.edu.hk/projects/DeepFashion.html.

The Attribute Prediction dataset, which contains 289,222 images, can be downloaded from http://mmlab.ie.cuhk.edu.hk/projects/DeepFashion/AttributePrediction.html.

The In-shop Clothes Retrieval dataset, which covers 7,982 clothing items, can be downloaded from http://mmlab.ie.cuhk.edu.hk/projects/DeepFashion/InShopRetrieval.html.

The Consumer-to-shop Clothes Retrieval dataset, which covers 33,881 clothing items, can be downloaded from http://mmlab.ie.cuhk.edu.hk/projects/DeepFashion/Consumer2ShopRetrieval.html.

Finally, the Fashion Landmark Detection dataset, which contains 123,016 images, can be downloaded from http://mmlab.ie.cuhk.edu.hk/projects/DeepFashion/LandmarkDetection.html.

Here is some information regarding the DeepFashion dataset:

  • Number of images in the dataset: More than 800,000

  • Number of classes: 50 categories
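
As a sketch of what working with the Attribute Prediction annotations might look like, the following parses a category label list; the file name and its count-line/header-line/rows layout are assumptions about the distributed files, so adjust them to the files you actually receive:

    # Hypothetical layout: first line = entry count, second = column header,
    # then "<image path> <category id>" rows.
    def load_category_labels(path="Anno/list_category_img.txt"):
        labels = {}
        with open(path) as f:
            n = int(f.readline())
            f.readline()  # skip the header line
            for line in f:
                image_name, category = line.split()
                labels[image_name] = int(category)
        assert len(labels) == n, "count line should match the number of rows"
        return labels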

If you use this dataset:

Make sure to follow the Terms of Use according to the Agreement about the datasets and use the data for academic research purposes only.

Make sure to cite the papers:

Z. Liu, P. Luo, S. Qiu, X. Wang, X. Tang, DeepFashion: Powering Robust Clothes Recognition and Retrieval with Rich Annotations, Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.

Z. Liu, S. Yan, P. Luo, X. Wang, X. Tang, Fashion Landmark Detection In The Wild, European Conference on Computer Vision (ECCV), 2016.

Keywords: Vision, Image, Classification, Fashion, Clothes Recognition, Clothes Detection

Fashion MNIST

This dataset contains grayscale images of clothing items, generated by Zalando (https://jobs.zalando.com/tech/). It was created as a drop-in replacement for the original MNIST dataset for benchmarking machine learning algorithms: very high classification accuracies on MNIST are easy to achieve even with classical machine learning algorithms, and MNIST is arguably overused. Fashion MNIST therefore shares the same image size, training and test set sizes, and number of classes with the original MNIST.

Here is some information regarding the Fashion MNIST dataset:

  • Number of images in the dataset: 70,000 (60,000 images for the training set and 10,000 images for the test set)

  • Image size: 28×28

  • Number of classes: 10 (T-shirt/top, Trouser, Pullover, Dress, Coat, Sandal, Shirt, Sneaker, Bag, Ankle boot)

Four data files, containing the training set images, training set labels, test set images and test set labels, are available for download from https://github.com/zalandoresearch/fashion-mnist. Instead of downloading the files individually, you can clone the GitHub repository at the same address. More details and loading commands can be found in the repository. You might also want to take a look at the Kaggle page https://www.kaggle.com/zalando-research/fashionmnist/home.
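
If you use Keras, its built-in loader fetches the same four files automatically; a minimal sketch (assuming TensorFlow is installed):

    from tensorflow.keras.datasets import fashion_mnist

    # Downloads the four files on first use and returns NumPy arrays.
    (x_train, y_train), (x_test, y_test) = fashion_mnist.load_data()
    print(x_train.shape, x_test.shape)  # (60000, 28, 28) (10000, 28, 28)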

If you use this dataset, make sure to cite the paper:

Han Xiao, Kashif Rasul, Roland Vollgraf, Fashion-MNIST: a Novel Image Dataset for Benchmarking Machine Learning Algorithms, arXiv:1708.07747, 2017.

keywords: Vision, Image, Classification, Fashion, Clothes Recognition, Clothes Detection

MNIST

This dataset contains grayscale images of handwritten digits. Half of the training set and half of the test set were collected from Census Bureau employees, and the other half of each from high school students. The dataset is a subset of images from two parent datasets, NIST’s Special Database 3 and Special Database 1.

Here is some information regarding the MNIST dataset:

  • Number of images in the dataset: 70,000 (60,000 images for the training set: 30,000 from NIST’s Special Database 3 and 30,000 from NIST’s Special Database 1. 10,000 images for the test set: 5000 from Special Database 3 and 5000 from Special Database 1)

  • Image size: 28×28

  • Number of classes: 10 (0, 1, 2, 3, 4, 5, 6, 7, 8, 9)

Four data files, containing the training set images, training set labels, test set images and test set labels, are available for download from http://yann.lecun.com/exdb/mnist/. Please note that these files are not in a standard image format; you are expected to write a short piece of code to read them, as in the sketch below. The details of the file format are available at the same address.
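
Here is one such short reader, following the IDX format description on that page (a big-endian header with a type code and dimension sizes, then raw unsigned bytes); it assumes the downloaded .gz files have been decompressed first:

    import struct
    import numpy as np

    def read_idx(path):
        with open(path, "rb") as f:
            # Header: two zero bytes, a type code (0x08 = unsigned byte),
            # the number of dimensions, then one 4-byte size per dimension.
            _zeros, _dtype, ndim = struct.unpack(">HBB", f.read(4))
            shape = struct.unpack(">" + "I" * ndim, f.read(4 * ndim))
            return np.frombuffer(f.read(), dtype=np.uint8).reshape(shape)

    images = read_idx("train-images-idx3-ubyte")  # shape (60000, 28, 28)
    labels = read_idx("train-labels-idx1-ubyte")  # shape (60000,)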

keywords: Vision, Image, Classification, Handwritten Digits

CIFAR-10 & CIFAR-100

These two datasets consist of labeled images drawn from a parent dataset called the Tiny Images Dataset (available at http://horatio.cs.nyu.edu/mit/tiny/data/index.html).

CIFAR-10:

  • Number of images in the dataset: 60,000 (50,000 images for training divided into 5 batches and 10,000 images for test in one batch)

  • Image size: 32×32

  • Number of classes: 10 (airplane, automobile, bird, cat, deer, dog, frog, horse, ship, truck)

Three versions of this dataset are available, one each for Python, Matlab and C, and they can be downloaded from https://www.cs.toronto.edu/~kriz/cifar.html.
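
For the Python version, each batch file is a pickled dictionary. Here is a minimal loading sketch based on the unpickling recipe given on the download page; the batch path assumes the archive has been extracted into the working directory:

    import pickle
    import numpy as np

    def load_batch(path):
        with open(path, "rb") as f:
            batch = pickle.load(f, encoding="bytes")
        # Each row of b"data" is 3072 uint8 values: 1024 red, then green, then blue.
        data = batch[b"data"].reshape(-1, 3, 32, 32)
        labels = np.array(batch[b"labels"])  # CIFAR-100 uses b"fine_labels" instead
        return data, labels

    data, labels = load_batch("cifar-10-batches-py/data_batch_1")
    print(data.shape, labels.shape)  # (10000, 3, 32, 32) (10000,)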

If you use this dataset, make sure to cite the following tech report.

Alex Krizhevsky, Learning multiple layers of features from tiny images, 2009.

CIFAR-100:

  • Number of images in the dataset: 60,000 (50,000 training images, 500 per class, and 10,000 test images, 100 per class)

  • Image size: 32×32

  • Number of classes: 100, listed in detail below (copied from http://www.cs.toronto.edu/~kriz/cifar.html)

The 100 classes are grouped into 20 superclasses; the superclass determines the “coarse” label, while the “fine” label refers to the class the image belongs to.

Superclass and Classes:

aquatic mammals: beaver, dolphin, otter, seal, whale

fish: aquarium fish, flatfish, ray, shark, trout

flowers: orchids, poppies, roses, sunflowers, tulips

food containers: bottles, bowls, cans, cups, plates

fruit and vegetables: apples, mushrooms, oranges, pears, sweet peppers

household electrical devices: clock, computer keyboard, lamp, telephone, television

household furniture: bed, chair, couch, table, wardrobe

insects: bee, beetle, butterfly, caterpillar, cockroach

large carnivores: bear, leopard, lion, tiger, wolf

large man-made outdoor things: bridge, castle, house, road, skyscraper

large natural outdoor scenes: cloud, forest, mountain, plain, sea

large omnivores and herbivores: camel, cattle, chimpanzee, elephant, kangaroo

medium-sized mammals: fox, porcupine, possum, raccoon, skunk

non-insect invertebrates: crab, lobster, snail, spider, worm

people: baby, boy, girl, man, woman

reptiles: crocodile, dinosaur, lizard, snake, turtle

small mammals: hamster, mouse, rabbit, shrew, squirrel

trees: maple, oak, palm, pine, willow

vehicles 1: bicycle, bus, motorcycle, pickup truck, train

vehicles 2: lawn-mower, rocket, streetcar, tank, tractor

As with CIFAR-10, three versions (for Python, Matlab and C) can be downloaded from https://www.cs.toronto.edu/~kriz/cifar.html, and the same tech report (Krizhevsky, 2009) should be cited if you use the dataset.

Keywords: Vision, Image, Classification, Natural Images

ImageNet

This dataset contains images organized according to the WordNet hierarchy (WordNet 3.0), in which every node is illustrated by up to thousands of images. Each concept in WordNet is described by a synonym set (synset), a group of words and phrases. ImageNet aims to have 1,000 images per synset on average.

Because the images are subject to copyright and ImageNet does not own them, ImageNet itself provides only thumbnails and URLs of the images. The original image data can be provided only to students and researchers under certain conditions. Image data is also available for download (for educational purposes only) through the visual recognition challenges, after registration.

The following statistics are from http://image-net.org/about-stats:

  • Number of non-empty synsets: 21,841

  • Number of images: 14,197,122

  • Number of images with bounding box annotations: 1,034,908

  • Number of synsets with SIFT features: 1000

  • Number of images with SIFT features: 1.2 million

  • Number of categories: 22,000 with 500-1000 images per category

The ImageNet URLs can be downloaded from the following link:

http://image-net.org/download-imageurls

To access the WordNet hierarchy and the WordNet documentation, please refer to the following link:

http://image-net.org/download-API
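
To get a feel for how synsets relate to ImageNet IDs, here is a small illustrative sketch that uses NLTK's WordNet corpus rather than the ImageNet API itself (pip install nltk, then nltk.download('wordnet') once); an ImageNet synset ID is the WordNet part-of-speech letter followed by the zero-padded 8-digit offset:

    from nltk.corpus import wordnet as wn

    synset = wn.synsets("dog")[0]      # first noun synset for "dog"
    print(synset.lemma_names())        # words and phrases describing the concept
    print("n%08d" % synset.offset())   # e.g. n02084071, the matching ImageNet ID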

Complete information about the dataset and a contact link for downloads can be found at http://image-net.org/about-overview and http://image-net.org/.

Information about the ImageNet Object Localization Challenge is available on Kaggle:

https://www.kaggle.com/c/imagenet-object-localization-challenge

keywords: Vision, Image, Classification, Natural Images

UCF50 & UCF101

These two datasets contain realistic action recognition videos collected from YouTube, with large variations in motion, pose, scale and recording conditions. The video files are organized into groups that share features, for example the same person appearing in the videos, similar viewpoints, similar backgrounds, etc.

UCF50

Here is some information regarding this dataset:

  • Number of Categories: 50, listed below (copied from the original page)

  • Number of Groups: 25 (more than 4 clips for every action in each group)

Categories: “Baseball Pitch, Basketball Shooting, Bench Press, Biking, Billiards Shot, Breaststroke, Clean and Jerk, Diving, Drumming, Fencing, Golf Swing, Playing Guitar, High Jump, Horse Race, Horse Riding, Hula Hoop, Javelin Throw, Juggling Balls, Jump Rope, Jumping Jack, Kayaking, Lunges, Military Parade, Mixing Batter, Nun chucks, Playing Piano, Pizza Tossing, Pole Vault, Pommel Horse, Pull Ups, Punch, Push Ups, Rock Climbing Indoor, Rope Climbing, Rowing, Salsa Spins, Skate Boarding, Skiing, Skijet, Soccer Juggling, Swing, Playing Tabla, TaiChi, Tennis Swing, Throw Discus, Trampoline Jumping, Playing Violin, Volleyball Spiking, Walking with a dog, and Yo Yo”.

More details about this dataset and download links can be found at http://crcv.ucf.edu/data/UCF50.php.

If you use this dataset, make sure to cite the paper:

K. K. Reddy, M. Shah, Recognizing 50 Human Action Categories of Web Videos, Machine Vision and Applications Journal (MVAP), 2012.

UCF101

Here is some information regarding this dataset:

  • Number of Video Clips: 13320

  • Number of Categories: 101, listed below (copied from the original page)

  • Number of Groups: 25 (4-7 clips for every action in each group)

Categories: “Apply Eye Makeup, Apply Lipstick, Archery, Baby Crawling, Balance Beam, Band Marching, Baseball Pitch, Basketball Shooting, Basketball Dunk, Bench Press, Biking, Billiards Shot, Blow Dry Hair, Blowing Candles, Body Weight Squats, Bowling, Boxing Punching Bag, Boxing Speed Bag, Breaststroke, Brushing Teeth, Clean and Jerk, Cliff Diving, Cricket Bowling, Cricket Shot, Cutting In Kitchen, Diving, Drumming, Fencing, Field Hockey Penalty, Floor Gymnastics, Frisbee Catch, Front Crawl, Golf Swing, Haircut, Hammer Throw, Hammering, Handstand Pushups, Handstand Walking, Head Massage, High Jump, Horse Race, Horse Riding, Hula Hoop, Ice Dancing, Javelin Throw, Juggling Balls, Jump Rope, Jumping Jack, Kayaking, Knitting, Long Jump, Lunges, Military Parade, Mixing Batter, Mopping Floor, Nun chucks, Parallel Bars, Pizza Tossing, Playing Guitar, Playing Piano, Playing Tabla, Playing Violin, Playing Cello, Playing Daf, Playing Dhol, Playing Flute, Playing Sitar, Pole Vault, Pommel Horse, Pull Ups, Punch, Push Ups, Rafting, Rock Climbing Indoor, Rope Climbing, Rowing, Salsa Spins, Shaving Beard, Shotput, Skate Boarding, Skiing, Skijet, Sky Diving, Soccer Juggling, Soccer Penalty, Still Rings, Sumo Wrestling, Surfing, Swing, Table Tennis Shot, Tai Chi, Tennis Swing, Throw Discus, Trampoline Jumping, Typing, Uneven Bars, Volleyball Spiking, Walking with a dog, Wall Pushups, Writing On Board, Yo Yo”.

More details about this dataset and download links can be found at http://crcv.ucf.edu/data/UCF101.php.
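
Once clips are downloaded and extracted, stepping through a video's frames in Python is straightforward with OpenCV (pip install opencv-python); the clip name below is a placeholder following the dataset's v_<Action>_g<group>_c<clip>.avi naming convention:

    import cv2

    cap = cv2.VideoCapture("v_Biking_g01_c01.avi")  # placeholder clip name
    frames = []
    while True:
        ok, frame = cap.read()    # ok becomes False at the end of the clip
        if not ok:
            break
        frames.append(frame)      # each frame is a BGR uint8 array
    cap.release()
    print(len(frames), "frames read")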

If you use this dataset, please refer to the technical report:

K. Soomro, A. Roshan Zamir, M. Shah, UCF101: A Dataset of 101 Human Action Classes From Videos in The Wild, CRCV-TR-12-01, 2012.

Keywords: Vision, Action Recognition, Time Series, Video, Youtube, Sports, Human Interaction, Body Motion