Biology and Health Datasets

Chest X-Ray Images Pneumonia

Chest X-Ray Images (Pneumonia)

This dataset contains chest X-Ray images of patients with Pneumonia alongside X-Ray images of normal cases. For more information, please refer to https://www.kaggle.com/paultimothymooney/chest-xray-pneumonia/home. The data files for training, testing, and validation can be downloaded separately from Kaggle: https://www.kaggle.com/paultimothymooney/chest-xray-pneumonia.

Here is some information regarding this dataset:

  • Number of images in the dataset: 5863 images (5216 images for training, 624 images for test and 16 images for validation)

  • Number of classes: 2 (Normal or Pneumonia)

  • Image resolution: varies from sample to sample.
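The split and class counts listed above can be checked against a local copy of the data. The following is a minimal sketch, assuming the files are unpacked as `<root>/<split>/<class>/<image>` (for example `chest_xray/train/NORMAL/...`); that layout, the folder names, and the list of file suffixes are assumptions, not something stated on this page.

```python
from collections import Counter
from pathlib import Path

# Assumption: images use one of these suffixes.
IMAGE_SUFFIXES = {".jpeg", ".jpg", ".png"}


def count_images(root):
    """Count image files under root, keyed by (split, class_name).

    Assumes the hypothetical layout <root>/<split>/<class_name>/<image>.
    """
    counts = Counter()
    for path in Path(root).glob("*/*/*"):
        if path.is_file() and path.suffix.lower() in IMAGE_SUFFIXES:
            split, class_name = path.parts[-3], path.parts[-2]
            counts[(split, class_name)] += 1
    return counts


def split_totals(counts):
    """Sum the per-class counts into one total per split."""
    totals = Counter()
    for (split, _class_name), n in counts.items():
        totals[split] += n
    return totals


if __name__ == "__main__":
    # "chest_xray" is a placeholder path for the downloaded data.
    counts = count_images("chest_xray")
    for split, total in sorted(split_totals(counts).items()):
        print(split, total)
```

Since the resolution varies between samples, any training pipeline would also need a resizing step before batching; that step is left out here.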

If you use this dataset:

Please make sure to read the License carefully; it is available at https://creativecommons.org/licenses/by/4.0/.

Please make sure to cite the paper:

D. S. Kermany, M. Goldbaum, W. Cai, et al. Identifying Medical Diagnoses and Treatable Diseases by Image-Based Deep Learning. Cell, 2018.

keywords: Vision, Image, Biology and Health, X-Ray, Classification

HAM10000

HAM10000:

This dataset contains 10015 dermatoscopic images of pigmented lesions from patients, in 7 diagnostic categories. For more than half of the subjects, the diagnosis was confirmed through histopathology; for the rest of the patients, it was confirmed through follow-up examinations, expert consensus, or in-vivo confocal microscopy. More information about the dataset, the diagnostic categories, the features, and the patient conditions, along with the download links, can be found on either Harvard Dataverse https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/DBW86T or Kaggle https://www.kaggle.com/kmader/skin-cancer-mnist-ham10000/home. This dataset is for non-commercial use only.

Here is some information regarding the dataset:

  • Number of images: 10015 dermatoscopic images

  • Number of categories: 7 diagnostic categories of pigmented lesions

If you use this dataset:

Make sure to read the Terms of Use carefully. They are available on the same page and must be accepted before downloading the data files. This dataset is for non-commercial use only.

Make sure to cite the dataset:

Tschandl, Philipp, 2018, The HAM10000 dataset, a large collection of multi-source dermatoscopic images of common pigmented skin lesions, https://doi.org/10.7910/DVN/DBW86T, Harvard Dataverse, V1, UNF:6:IQTf5Cb+3EzwZ95U5r0hnQ== [fileUNF]

keywords: Vision, Image, Biology and Health, Classification

CBIS-DDSM

CBIS-DDSM: Curated Breast Imaging Subset of DDSM:

This dataset contains screening mammography images and is a subset of the DDSM dataset (Digital Database for Screening Mammography, http://marathon.csee.usf.edu/Mammography/Database.html). CBIS-DDSM contains images of cases in three categories (normal, benign, and malignant). The dataset also includes ROI segmentations, bounding boxes, and pathologic diagnoses for the training data. It can be downloaded from the Data Access section at https://wiki.cancerimagingarchive.net/display/Public/CBIS-DDSM#97542eefbc8e4234a95231cbcd86cb1d.

Here is some information regarding this dataset:

  • Number of images in the dataset: 10,239

  • Number of subjects: 6671

  • Total Images Size in GB: 163.6

If you use this dataset:

Make sure to cite these papers:

R. S. Lee, F. Gimenez, A. Hoogi, D. Rubin. Curated Breast Imaging Subset of DDSM. The Cancer Imaging Archive, 2016.

R. S. Lee, F. Gimenez, A. Hoogi, K. K. Miyake, M. Gorovoy, D. L. Rubin. A Curated Mammography Data set for Use in Computer-aided Detection and Diagnosis Research. Scientific Data Volume 4, Article number: 170177, 2017.

K. Clark, B. Vendt, K. Smith, J. Freymann, J. Kirby, P. Koppel, S. Moore, S. Phillips, D. Maffitt, M. Pringle, L. Tarbox, F. Prior. The Cancer Imaging Archive (TCIA): Maintaining and Operating a Public Information Repository. Journal of Digital Imaging, Volume 26, 2013.

Make sure to follow the Policy and Terms of Use available on https://creativecommons.org/licenses/by/3.0/ and https://wiki.cancerimagingarchive.net/display/Public/Data+Usage+Policies+and+Restrictions.

keywords: Vision, Image, Biology and Health, CT, Classification, Cancer

NLST: National Lung Screening Trial

NLST: National Lung Screening Trial:

This dataset contains images from the screening tests of patients suffering from lung cancer, collected during a controlled clinical trial. The patients were randomly divided into two groups, receiving either low-dose helical CT screening or single-view chest radiography, and were followed up for about 6.5 years. The dataset is not public, and a research proposal is required to gain access to it and download it. For more information on the research details, or to request access to the dataset, please refer to https://wiki.cancerimagingarchive.net/display/NLST/National+Lung+Screening+Trial#4c242d6186bf4aff949bb62cb2ab60da or https://biometry.nci.nih.gov/cdas/learn/nlst/images/. Additionally, a detailed description of the dataset participants, CT screening and abnormalities, X-Ray screening and abnormalities, diagnostic procedures, treatment, cause of death, and much other useful information about the dataset is available at https://biometry.nci.nih.gov/cdas/datasets/nlst/.

Here is some information regarding this dataset:

  • Number of images in the dataset: 21,082,502

  • Number of subjects: 26,254

  • Total Images Size in TB: 11.3

If you use this dataset:

Make sure to provide proper citations according to the Citations & Data Usage Policy available on the same page provided above.

Make sure to follow the Policy and Terms of Use, even after receiving access to the dataset for your own research purposes: https://wiki.cancerimagingarchive.net/display/Public/Data+Usage+Policies+and+Restrictions.

keywords: Vision, Image, Biology and Health, CT, Classification, Cancer

Human Protein Atlas Image

Human Protein Atlas Image:

This dataset contains images of proteins in human cells, available from the Human Protein Atlas Image Classification Competition on Kaggle or from The Human Protein Atlas page https://www.proteinatlas.org/cell. The dataset may be used for the Kaggle Competition, for research and education, and for other non-commercial purposes. Please refer to the competition rules on Kaggle for more information about the Terms of Use and the rules regarding the dataset: https://www.kaggle.com/c/human-protein-atlas-image-classification/rules.

Here is some information regarding this dataset:

  • Number of classes: 28 categories as integers from 0 to 27, each referring to a human protein.

  • Available separate datafiles for training and testing with three resolutions: 512×512 PNG, 2048×2048 TIFF, 3072×3072 TIFF
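Integer class identifiers from 0 to 27 are usually converted to a fixed-length vector before training. The sketch below assumes that each image may carry several labels and that they arrive as one space-separated string such as `"16 0"`; both the multi-label setting and that string format are assumptions about the competition files, and `multi_hot` is a hypothetical helper name.

```python
NUM_CLASSES = 28  # categories are the integers 0..27


def multi_hot(target, num_classes=NUM_CLASSES):
    """Turn a space-separated label string into a 0/1 vector.

    Assumption: `target` looks like "16 0", one integer per label.
    """
    vector = [0] * num_classes
    for token in target.split():
        label = int(token)
        if not 0 <= label < num_classes:
            raise ValueError(f"label {label} is outside 0..{num_classes - 1}")
        vector[label] = 1
    return vector


if __name__ == "__main__":
    encoded = multi_hot("16 0")
    # Print the positions that are set in the encoded vector.
    print([i for i, bit in enumerate(encoded) if bit])
```

A 0/1 vector of this kind pairs naturally with a per-class sigmoid output; if each image turns out to have exactly one label, a plain integer target is enough.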

If you use this dataset:

Make sure to use the dataset for non-commercial purposes only.

keywords: Vision, Image, Biology and Health, Classification, Protein, Cell, Object Detection

WESAD: Wearable Stress and Affect Detection

WESAD: Wearable Stress and Affect Detection:

This dataset contains physiological and motion data of 15 subjects, collected using a wrist-worn and a chest-worn device. The chest-worn device records ECG, Electrodermal Activity, Electromyogram, Respiration, Body Temperature, and Three-axis Acceleration; the wrist-worn device records Blood Volume Pulse, Electrodermal Activity, Body Temperature, and Three-axis Acceleration. More details about the dataset and the download links can be found at https://archive.ics.uci.edu/ml/datasets/WESAD+%28Wearable+Stress+and+Affect+Detection%29.

Here is some information regarding the dataset:

  • Number of Instances: 63,000,000

  • Number of Attributes: 12

  • Number of Subjects: 15

If you use this dataset:

Make sure to use the data for academic research and non-commercial purposes only.

Make sure to cite the paper:

P. Schmidt, A. Reiss, R. Duerichen, C. Marberger, K. Van Laerhoven, Introducing WESAD, a Multimodal Dataset for Wearable Stress and Affect Detection, International Conference on Multimodal Interaction (ICMI), 2018.

Keywords: Biology and Health, Classification, Regression, Stress Detection, Motion, Time Series

Indian Liver Patient Records

Indian Liver Patient Records:

This dataset contains records on the liver condition of individuals, divided into two categories: liver patients and non-liver patients. The dataset was collected to provide a benchmark for prediction algorithms that help in diagnosing liver diseases. More information about the dataset and the download links can be found on Kaggle https://www.kaggle.com/uciml/indian-liver-patient-records/home or on the UCI ML Repository https://archive.ics.uci.edu/ml/datasets/ILPD+(Indian+Liver+Patient+Dataset).

  • Number of patients: 583 (441 male and 142 female)

  • Number of categories: 2 (liver patients: 416 and non-liver patients 167)

  • Number of attributes: 10 (including Age, Gender, etc.)
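The two categories are imbalanced (416 versus 167 records), which matters when judging a classifier on this data. The following sketch uses only the counts listed above to compute the accuracy of always predicting the larger class and a set of inverse-frequency class weights. The weighting formula `total / (number of classes * class count)` is one common convention, chosen here for illustration only.

```python
# Class counts as listed for this dataset.
counts = {"liver patient": 416, "non-liver patient": 167}

total = sum(counts.values())

# Accuracy obtained by always predicting the majority class.
majority_baseline = max(counts.values()) / total

# Inverse-frequency weights: rarer classes get larger weights.
class_weights = {
    name: total / (len(counts) * n) for name, n in counts.items()
}

print(f"total records: {total}")
print(f"majority-class baseline accuracy: {majority_baseline:.3f}")
for name, weight in class_weights.items():
    print(f"weight for {name}: {weight:.3f}")
```

Any model evaluated on this dataset should be compared against that majority-class baseline rather than against 50% accuracy.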

If you use this dataset:

Make sure to provide acknowledgements and citation to the UCI Repository according to the Citation Policy https://archive.ics.uci.edu/ml/citation_policy.html.

Keywords: Biology and Health, Liver, Classification

Megascale Cell-Cell Similarity Network

Megascale Cell-Cell Similarity Network:

This dataset contains information on mouse brain cells and is a single-cell RNA-sequencing dataset. The dataset is preprocessed, and unwanted sources of variation are filtered out. Mouse brain cells are represented by nodes, and edges represent nearest-neighbor similarities between cells based on similar gene expression.

Here is some information regarding this dataset:

  • Number of Nodes: 1,018,524

  • Number of Edges: 24,735,503
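The node and edge counts above give a quick sense of how sparse the network is. The sketch below computes the average degree and the edge density, assuming the graph is undirected and each edge is listed once; that assumption is not stated on this page.

```python
NUM_NODES = 1_018_524
NUM_EDGES = 24_735_503

# Assumption: undirected graph, each edge counted once,
# so every edge contributes to the degree of two nodes.
average_degree = 2 * NUM_EDGES / NUM_NODES

# Fraction of all possible node pairs that are connected.
density = 2 * NUM_EDGES / (NUM_NODES * (NUM_NODES - 1))

print(f"average degree: {average_degree:.2f}")
print(f"density: {density:.2e}")
```

An average degree of a few dozen on about a million nodes is what one would expect from a nearest-neighbor construction, and it means sparse adjacency formats are the practical choice for loading the graph.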

More information about the dataset and the download links are available on SNAP: http://snap.stanford.edu/biodata/datasets/10023/10023-CC-Neuron.html.

If you use this dataset, make sure to cite the papers:

M. Zitnik, R. Sosic, J. Leskovec, Prioritizing Network Communities, Nature Communications, 2018.

G. X. Zheng et al., Massively Parallel Digital Transcriptional Profiling of Single Cells, Nature Communications, 2017.

M. Zitnik, R. Sosic, S. Maheshwari, J. Leskovec, BioSNAP Datasets: Biomedical Network Dataset Collection, 2018. http://snap.stanford.edu/biodata/

Keywords: Network, Biology and Health, RNA

Infectious Disease Spread: Flu

Infectious Disease Spread: Flu

This dataset contains information about the spread of the Flu virus between healthy and infected students through close interactions. The nodes cover almost the entire school population, and the edges represent interactions of different durations. Most of the contacts are short. More information about the dataset and the download links can be found at http://sing.stanford.edu/flu/ and in the two publications on the dataset.

Here is some information regarding the dataset:

  • Number of Nodes: 788 individuals (655 students and 73 teachers, 55 staff, 5 other)

  • Number of Edges: 2,148,199 Close Proximity Records (CPRs), corresponding to 762,868 interactions with a mean duration of 2.8 CPRs (~1 min) or 118,291 interactions with a mean duration of 18.7 CPRs (~6 min)
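The durations above are given in Close Proximity Records rather than in time units. The sketch below converts CPR counts to minutes, assuming one CPR corresponds to 20 seconds; that value is inferred from the figures above (2.8 CPRs for about 1 minute, 18.7 CPRs for about 6 minutes) and should be checked against the publications before use.

```python
# Assumption: one Close Proximity Record covers 20 seconds,
# inferred from "2.8 CPRs (~1 min)" and "18.7 CPRs (~6 min)".
SECONDS_PER_CPR = 20


def cpr_to_minutes(num_cprs, seconds_per_cpr=SECONDS_PER_CPR):
    """Convert a duration in CPRs to minutes."""
    return num_cprs * seconds_per_cpr / 60


for mean_cprs in (2.8, 18.7):
    print(f"{mean_cprs} CPRs is about {cpr_to_minutes(mean_cprs):.1f} min")
```

The same conversion can be applied to per-edge durations when weighting contacts in an epidemic simulation.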

Detailed information about the dataset can be found in the papers:

M. Salathe, M. Kazandjieva, J. W. Lee, P. Levis, M. W. Feldman, J. H. Jones, A High-Resolution Human Contact Network for Infectious Disease Transmission, Proceedings of the National Academy of Sciences (PNAS), 2010.

M. Kazandjieva, J. W. Lee, M. Salathe, M. W. Feldman, J. H. Jones, P. Levis, Experiences in Measuring a Human Contact Network for Epidemiology Research, Proceedings of the ACM Workshop on Hot Topics in Embedded Networked Sensors (HotEmNets), 2010.

Keywords: Network, Biology and Health, Spreading Phenomena, Epidemic Process, Disease, Flu, Time Series