
Preprocessed datasets

In H2O Hydrogen Torch, you can access preprocessed datasets to explore supported problem types.

Import preprocessed dataset

To import a preprocessed dataset into H2O Hydrogen Torch, follow these steps:

  1. In the H2O Hydrogen Torch navigation menu, click Import dataset.
  2. In the S3 file name list, select one of the preprocessed datasets listed under Preprocessed datasets in H2O Hydrogen Torch.
  3. Click Continue.
  4. Again, click Continue.

Note

  • After importing a preprocessed dataset, you will be able to use it for an experiment.

  • To learn how to preprocess your dataset for a particular supported problem type, see Dataset Formats.

Preprocessed datasets in H2O Hydrogen Torch

Flower image classification

File name: flower_image_classification.zip

Description: The dataset contains images of dandelions, daisies, roses, tulips, and sunflowers.

To learn more about the dataset, see Flowers Dataset.

Dataset Columns: image, label

Problem Type: Image classification

Coins image regression

File name: coins_image_regression.zip

Description: The dataset contains a collection of images with one or more coins. Each image has been labeled to indicate the sum of its coins. The currency of the coins is the Brazilian Real (R$).

To learn more about the dataset, see Brazilian Coins.

Dataset Columns: image_path, label, fold

Problem Type: Image regression
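
Several of the preprocessed datasets include a fold column. The following Python sketch shows one common way such a column is used: holding out one fold for validation. It assumes that fold stores an integer fold index per row, which this page does not state. The rows and file names are made up for illustration, and split_by_fold is a hypothetical helper, not part of H2O Hydrogen Torch.

```python
# Hypothetical rows mirroring the image_path, label, and fold columns.
# Assumption: fold is an integer index that assigns each row to one fold.
rows = [
    {"image_path": "a.jpg", "label": 1.25, "fold": 0},
    {"image_path": "b.jpg", "label": 0.50, "fold": 1},
    {"image_path": "c.jpg", "label": 2.00, "fold": 2},
    {"image_path": "d.jpg", "label": 0.10, "fold": 0},
]


def split_by_fold(rows, validation_fold):
    """Return (train, validation), holding out one fold for validation."""
    train = [r for r in rows if r["fold"] != validation_fold]
    validation = [r for r in rows if r["fold"] == validation_fold]
    return train, validation


train, validation = split_by_fold(rows, validation_fold=0)
print(len(train), len(validation))
```

Rows whose fold matches the chosen index go to validation and all other rows go to training, so every row lands in exactly one of the two lists.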

Global wheat image object detection

File name: globalwheat_image_object_detection.zip

Description: The dataset contains a collection of images of wheat fields with bounding boxes for each identified wheat head.

To learn more about the dataset, see Global Wheat Dataset.

Dataset Columns: image, class_id, x_min, y_min, x_max, y_max

Problem Type: Single-class object detection
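
To illustrate how the x_min, y_min, x_max, and y_max columns describe a bounding box, here is a minimal Python sketch that derives a box's width, height, and area. It assumes that the four values are corner coordinates in pixels. The example values, the file name, and the box_size helper are hypothetical, and the packaged file may store these values differently (for example, as lists of boxes per image).

```python
def box_size(box):
    """Return (width, height, area) for a box given by its corner coordinates."""
    width = box["x_max"] - box["x_min"]
    height = box["y_max"] - box["y_min"]
    if width <= 0 or height <= 0:
        raise ValueError("x_max and y_max must exceed x_min and y_min")
    return width, height, width * height


# Hypothetical example values, not taken from the actual dataset.
box = {
    "image": "field_001.jpg",
    "class_id": "wheat",
    "x_min": 10,
    "y_min": 20,
    "x_max": 110,
    "y_max": 70,
}

width, height, area = box_size(box)
print(width, height, area)
```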

Amazon Review text classification

File name: amazon_reviews_text_classification.csv

Description: The dataset contains a collection of reviews from Amazon. Each review (in text form) includes the title of the review and the review itself. The dataset has been labeled to indicate whether a review is positive or negative.

To learn more about the dataset, see Amazon product data.

Dataset Columns: text, label

Problem Type: Text classification

Stanford bicycle image metric learning

File name: bicycle_image_metric_learning.zip

Description: The dataset contains images of online product ads for bicycles. Each ad has multiple images marked by their class ID.

To learn more about the dataset, see The Stanford Online Products dataset.

Dataset Columns: image, label, fold

Problem Type: Image metric learning

Fashion image semantic segmentation

File name: fashion_image_semantic_segmentation.zip

Description: The dataset contains images of people wearing various clothing types in multiple poses, along with segmentation masks for the fashion/apparel items.

To learn more about the dataset, see Clothing Co-Parsing Dataset.

Dataset Columns: image, class_id, rle_mask

Problem Type: Semantic segmentation

CNN/Daily mail text sequence to sequence

File name: cnn_dailymail_text_sequence_to_sequence.zip

Description: The dataset contains human-generated abstract summaries from news stories published on the CNN and Daily Mail websites.

To learn more about the dataset, see abisee/cnn-dailymail.

Dataset Columns: text, summary, id

Problem Type: Text sequence to sequence

Well-formed query text regression

File name: wellformed_query_text_regression.csv

Description: The dataset contains a collection of search queries. Each query has a rating between 0 and 1 that indicates how well-formed the query is.

To learn more about the dataset, see Query-wellformedness Dataset.

Dataset Columns: text, rating

Problem Type: Text regression

CoNLL-2003 text token classification

File name: conll2003_text_token_classification.zip

Description: The dataset contains a collection of text pieces with their named entities annotated. Named entities are abstract or physical objects, such as a person or a product, that can be referred to by a proper name.

To learn more about the dataset, see Language-Independent Named Entity Recognition (II).

Dataset Columns: id, text, pos_tags, chunk_tags, ner_tags

Problem Type: Text token classification

SQuAD text span prediction

File name: squad_text_span_prediction.zip

Description: The dataset contains questions with answers and contexts that can be used to answer the questions.

To learn more about the dataset, see The Stanford Question Answering Dataset.

Dataset Columns: question, context, answer

Problem Type: Text span prediction

Ubuntu text metric learning

File name: ubuntu_text_metric_learning.zip

Description: The dataset contains a preprocessed collection of questions from AskUbuntu.com. Questions are grouped in similar clusters (label).

To learn more about the dataset and its use in research, refer to the following arXiv paper: Semi-supervised Question Retrieval with Gated Convolutions, NAACL 2016, Tao Lei et al.

To view the original dataset from the authors, visit the following GitHub repository: AskUbuntu Question Dataset.

Dataset Columns: text, label, fold

Problem Type: Text metric learning

COCO cars image instance segmentation

File name: coco_image_instance_segmentation.zip

Description: The dataset contains a subsample of the famous Common Objects in Context (COCO) dataset. This subsample includes only a single "Car" class. In other words, all images contain a car or multiple cars.

To learn more about the dataset, see COCO Dataset.

Dataset Columns: image_id, class_id, rle_mask

Problem Type: Image instance segmentation

Environmental sound audio classification

File name: esc10_audio_classification.zip

Description: The dataset contains 5-second-long recordings organized into ten classes (with 40 examples per class). Clips in this dataset have been manually extracted from public field recordings gathered by the Freesound.org project.

To learn more about the dataset, see ESC-50: Dataset for Environmental Sound Classification.

Dataset Columns: filename, fold, label

Problem Type: Audio classification

MNIST audio regression

File name: amnist_audio_regression.zip

Description: The dataset contains a collection of 30,000 audio samples of spoken digits (0-9) from sixty different speakers.

To learn more about the dataset, see Audio MNIST.

Dataset Columns: audio, label, fold

Problem Type: Audio regression

