
Dataset formats

Before you use a dataset for one of the supported problem types, you need to format (prepare) it in a specific way. The sections below describe how to format a dataset for each supported problem type.

Image regression

The data for an image regression experiment needs to be in a .zip file (1) containing a .csv file (2) and an image folder (3):

  1. The available data connectors require your dataset to be in a .zip file.

  2. A .csv file containing an image and label column(s) and an optional fold column. Columns:

    • image: The image column should include the names and image extensions of the images.

      Note

      • data directory

        If the image names don't specify the data directory (the location of the images in the .zip file), you can specify the data directory (data folder) when uploading the dataset or before the dataset is used for an experiment. For more information, see Import dataset settings.

    • label: A label column needs to represent a numerical target column.

      Note

      H2O Hydrogen Torch can train models that predict multiple labels at the same time. You can provide multiple columns with multiple unique labels and choose which labels to predict when starting a new experiment.

    • N label columns: The N columns represent separate regression labels.

    • fold: The optional fold column should specify the cross-validation fold index assignment per observation (row).

      Note

      Adding a fold column splits the data into subsets. A separate model is trained for each value in the fold column: records with that value form the holdout validation sample, and all remaining records are used for training. A holdout validation sample is created if a validation dataframe is not provided during an experiment. If a fold column is not specified, five folds are assigned randomly, which is not always the desired strategy. The fold column can contain integers (0, 1, 2, …, N-1 or 1, 2, 3, …, N) or categorical values.

  3. An image folder that contains all the images specified in the image column above; H2O Hydrogen Torch will use this folder to run an image regression experiment.

    Note

    All images need to have an image extension. To learn about supported image extensions, see Supported Image Extensions for Image Processing.
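The fold column described above can be generated before the dataset is zipped. The following is a minimal sketch, assuming a pandas DataFrame that already holds the image and label columns; the helper name add_fold_column, the seed, and the sample rows are illustrative assumptions and not part of H2O Hydrogen Torch.

```python
import numpy as np
import pandas as pd


def add_fold_column(df: pd.DataFrame, n_folds: int = 5, seed: int = 0) -> pd.DataFrame:
    """Return a copy of df with a shuffled, near-balanced integer fold column."""
    rng = np.random.default_rng(seed)
    # Cycle through 0..n_folds-1 so every fold gets a similar number of rows,
    # then shuffle so the assignment does not follow the row order.
    folds = np.arange(len(df)) % n_folds
    rng.shuffle(folds)
    out = df.copy()
    out["fold"] = folds
    return out


# Hypothetical rows; real file names and labels come from your own data.
train = pd.DataFrame(
    {
        "image": [f"img_{i}.jpg" for i in range(10)],
        "label": [float(i) for i in range(10)],
    }
)
train = add_fold_column(train, n_folds=5)
print(train["fold"].value_counts().sort_index())
```

Assigning folds yourself is mainly useful when a purely random split is not appropriate, for example when related rows should share a fold.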

With the above in mind, the .zip file should be structured as follows:

folder_name.zip (1)
│   └───csv_name.csv (2)
│   │
│   └───image_folder_name (3)
│       └───name_of_image.image_extension
│       └───name_of_image.image_extension
│       └───name_of_image.image_extension
│       ...

Note

You can have multiple .csv files in the .zip file that you can use as train, validation, and test dataframes:

  • The train dataframe needs to follow the format described above

  • The validation dataframe should have the same format as the train dataframe

  • The test dataframe should have the same format as the train dataframe but does not require label column(s)

To learn how to upload your .zip file as your dataset in H2O Hydrogen Torch, see Data Connectors.
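The layout above can be produced with the Python standard library. This is a sketch under stated assumptions: the archive name my_dataset.zip, the file name train.csv, the folder name images, and the rows are hypothetical, and placeholder bytes stand in for real image data.

```python
import csv
import io
import os
import tempfile
import zipfile

# Hypothetical file names and labels, used only to illustrate the layout.
rows = [
    {"image": "img_0.jpg", "label": 105, "fold": 0},
    {"image": "img_1.jpg", "label": 125, "fold": 1},
]

# Write the .csv content into memory first.
csv_buffer = io.StringIO()
writer = csv.DictWriter(csv_buffer, fieldnames=["image", "label", "fold"])
writer.writeheader()
writer.writerows(rows)

zip_path = os.path.join(tempfile.mkdtemp(), "my_dataset.zip")
with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
    # The .csv file sits at the top level of the archive.
    zf.writestr("train.csv", csv_buffer.getvalue())
    for row in rows:
        # With real files, zf.write(local_path, arcname=...) does the same job.
        zf.writestr(f"images/{row['image']}", b"placeholder image bytes")

with zipfile.ZipFile(zip_path) as zf:
    names = zf.namelist()
print(names)
```

Because the image column in this sketch holds bare file names, the images folder would be the data folder to select when importing the dataset.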

The coins_image_regression.zip file is a preprocessed dataset in H2O Hydrogen Torch and was formatted to solve an image regression problem. The .zip file contains a .csv file and an image folder. The structure of the .zip file is as follows:

coins_image_regression.zip
│   └───coins_image_regression.csv
│   │
│   └───images
│       └───95_1477858074.jpg
│       └───95_1477858068.jpg
│       └───95_1477858062.jpg
│       ...

The first three rows of the .csv file are as follows:

image_path label fold
105_1479344562.jpg 105 1
105_1479344940.jpg 105 2
125_1479424716.jpg 125 1

Note

In this example, the data directory in the image column (image_path) is not specified. Therefore, it needs to be specified when uploading the dataset, and the images folder needs to be selected as the value for the Data Folder setting. For more information, see Import dataset settings.

To learn how to access one of the preprocessed datasets in H2O Hydrogen Torch, see Preprocessed Datasets.

Image classification

The data for an image classification experiment needs to be in a .zip file (1) containing a .csv file (2) and an image folder (3):

  1. The available data connectors require your dataset to be in a .zip file.

  2. A .csv file containing an image column, N label columns, and an optional fold column. Columns:

    • image: The image column should include the names and image extensions of the images.

      Note

      • data directory

        If the image names don't specify the data directory (the location of the images in the .zip file), you can specify the data directory (data folder) when uploading the dataset or before the dataset is used for an experiment. For more information, see Import dataset settings.

    • N label columns: The N columns represent either One-Hot Encoded multi-class labels or multiple multi-class/multi-label labels. For multi-class and binary classification, a single label column is sufficient.

      Note

      H2O Hydrogen Torch can solve both multi-class and multi-label classification problems. In multi-class problems, the classes are mutually exclusive. In multi-label problems, each class is a separate label, so an image can carry several labels at once. With N label columns, only one column can be set to 1 in each row for multi-class problems, while any number of columns (including all or none) can be set to 1 for multi-label problems.

    • fold: The optional fold column should specify the cross-validation fold index assignment per observation (row).

      Note

      Adding a fold column splits the data into subsets. A separate model is trained for each value in the fold column: records with that value form the holdout validation sample, and all remaining records are used for training. A holdout validation sample is created if a validation dataframe is not provided during an experiment. If a fold column is not specified, five folds are assigned randomly, which is not always the desired strategy. The fold column can contain integers (0, 1, 2, …, N-1 or 1, 2, 3, …, N) or categorical values.

  3. An image folder that contains all the images specified in the image column above; H2O Hydrogen Torch will use this folder to run an image classification experiment.

    Note

    All images need to have an image extension. To learn about supported image extensions, see Supported Image Extensions for Image Processing.
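The difference between the multi-class and multi-label column layouts described above can be illustrated with pandas. This is a minimal sketch; the file names and the class names roses and tulips are hypothetical sample values.

```python
import pandas as pd

# Hypothetical file names and class names.
df = pd.DataFrame(
    {
        "image": ["a.jpg", "b.jpg", "c.jpg"],
        "label": ["roses", "tulips", "roses"],
    }
)

# Multi-class, one-hot encoded: one 0/1 column per class.
one_hot = pd.get_dummies(df["label"], dtype=int)
label_columns = list(one_hot.columns)
multi_class = pd.concat([df[["image"]], one_hot], axis=1)

# In the one-hot layout, exactly one label column is 1 in every row.
row_sums = multi_class[label_columns].sum(axis=1)
assert (row_sums == 1).all()

# Multi-label: any number of label columns may be 1 in a row.
multi_label = pd.DataFrame(
    {
        "image": ["a.jpg", "b.jpg", "c.jpg"],
        "roses": [1, 0, 1],
        "tulips": [1, 0, 0],
    }
)
labels_per_image = multi_label[["roses", "tulips"]].sum(axis=1).tolist()
print(labels_per_image)  # [2, 0, 1]
```

For a plain multi-class problem the single label column in df is already sufficient; the one-hot form is shown only to make the N-column layout concrete.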

With the above in mind, the .zip file should be structured as follows:

folder_name.zip (1)
│   └───csv_name.csv (2)
│   │
│   └───image_folder_name (3)
│       └───name_of_image.image_extension
│       └───name_of_image.image_extension
│       └───name_of_image.image_extension
│       ...

Note

You can have multiple .csv files in the .zip file that you can use as train, validation, and test dataframes:

  • The train dataframe needs to follow the format described above

  • The validation dataframe should have the same format as the train dataframe

  • The test dataframe should have the same format as the train dataframe but does not require label column(s)

To learn how to upload your .zip file as your dataset in H2O Hydrogen Torch, see Data Connectors.

The flower_image_classification.zip file is a preprocessed dataset in H2O Hydrogen Torch and was formatted to solve a multi-class image classification problem. The structure of the .zip file is as follows:

flower_image_classification.zip
│   └───train.csv
│   │
│   └───images
│       └───100080576_f52e8ee070_n.jpg
│       └───10043234166_e6dd915111_n.jpg
│       └───1008566138_6927679c8a.jpg
│       ...

The first three rows of the train.csv file are as follows:

image label
5777669976_a205f61e5b.jpg roses
4860145119_b1c3cbaa4e_n.jpg roses
15011625580_7974c44bce.jpg roses

Note

  • In this example, the data directory in the image column is not specified. Therefore, it needs to be specified when uploading the dataset, and the images folder needs to be selected as the value for the Data Folder setting. For more information, see Import dataset settings.

To learn how to access one of the preprocessed datasets in H2O Hydrogen Torch, see Preprocessed Datasets.

Image metric learning

The data for an image metric learning experiment needs to be in a .zip file (1) containing a .csv file (2) and an image folder (3):

  1. The available data connectors require your dataset to be in a .zip file.

  2. A .csv file containing an image column, a label column, and an optional fold column. Columns:

    • image: The image column should include the names and image extensions of the images.

      Note

      • data directory

        If the image names don't specify the data directory (the location of the images in the .zip file), you can specify the data directory (data folder) when uploading the dataset or before the dataset is used for an experiment. For more information, see Import dataset settings.

    • label: The label column needs to represent the class names.

      Note

      Similar images should have the same class name.

    • fold: The optional fold column should specify the cross-validation fold index assignment per observation (row).

      Note

      Adding a fold column splits the data into subsets. A separate model is trained for each value in the fold column: records with that value form the holdout validation sample, and all remaining records are used for training. A holdout validation sample is created if a validation dataframe is not provided during an experiment. If a fold column is not specified, five folds are assigned randomly, which is not always the desired strategy. The fold column can contain integers (0, 1, 2, …, N-1 or 1, 2, 3, …, N) or categorical values.

  3. An image folder that contains all the images specified in the image column above; H2O Hydrogen Torch will use this folder to run an image metric learning experiment.

    Note

    All images need to have an image extension. To learn about supported image extensions, see Supported Image Extensions for Image Processing.
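Because the image folder must contain every image named in the image column, it can help to check the archive before uploading it. The following is a sketch of such a check using only the standard library; the function name missing_images, the column name image, and the sample archive contents are assumptions made for illustration.

```python
import csv
import io
import zipfile


def missing_images(zip_source, csv_name, data_folder, image_column="image"):
    """Return image names listed in the .csv file that are absent from the archive."""
    with zipfile.ZipFile(zip_source) as zf:
        members = set(zf.namelist())
        with zf.open(csv_name) as fp:
            text = io.TextIOWrapper(fp, encoding="utf-8", newline="")
            names = [row[image_column] for row in csv.DictReader(text)]
    # Archive member names use forward slashes; the comparison is case-sensitive,
    # so "x.JPG" in the .csv file does not match "x.jpg" in the folder.
    return [name for name in names if f"{data_folder}/{name}" not in members]


# Build a tiny in-memory archive with one image missing on purpose.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("train.csv", "image,label\nbike_0.jpg,bike_a\nbike_1.jpg,bike_a\n")
    zf.writestr("images/bike_0.jpg", b"placeholder image bytes")

missing = missing_images(buffer, "train.csv", "images")
print(missing)  # ['bike_1.jpg']
```

The same function accepts a path to a .zip file on disk in place of the in-memory buffer.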

With the above in mind, the .zip file should be structured as follows:

folder_name.zip (1)
│   └───csv_name.csv (2)
│   │
│   └───image_folder_name (3)
│       └───name_of_image.image_extension
│       └───name_of_image.image_extension
│       └───name_of_image.image_extension
│       ...

Note

You can have multiple .csv files in the .zip file that you can use as train, validation, and test dataframes:

  • The train dataframe needs to follow the format described above

  • The validation dataframe should have the same format as the train dataframe

  • The test dataframe should have the same format as the train dataframe but does not require a label column

To learn how to upload your .zip file as your dataset in H2O Hydrogen Torch, see Data Connectors.

The bicycle_image_metric_learning.zip file is a preprocessed dataset in H2O Hydrogen Torch and was formatted to solve an image metric learning problem. The structure of the .zip file is as follows:

bicycle_image_metric_learning.zip
│   └───train.csv
│   │
│   └───images
│       └───181783211141_0.jpg
│       └───181596348104_1.jpg
│       └───171166528893_0.jpg
│       ...

The first three rows of the .csv file are as follows:

image label fold
181783211141_0.JPG 181783211141 0
181596348104_1.JPG 181596348104 2
171166528893_0.JPG 171166528893 0

Note

  • In this example, the data directory in the image column is not specified. Therefore, it needs to be specified when uploading the dataset, and the images folder needs to be selected as the value for the Data Folder setting. For more information, see Import dataset settings.

To learn how to access one of the preprocessed datasets in H2O Hydrogen Torch, see Preprocessed Datasets.

Image object detection

The data for an image object detection experiment needs to be in a .zip file (1) containing a .pq file (parquet, with pyarrow engine) (2) and an image folder (3):

  1. The available data connectors require your dataset to be in a .zip file.

  2. A .pq file containing image and class_id columns, along with x_min, x_max, y_min, and y_max columns that describe the bounding box locations. The file can also contain an optional fold column. Columns:

    • image: The image column should include the names and image extensions of the images.

      Note

      • data directory

        If the image names don't specify the data directory (the location of the images in the .zip file), you can specify the data directory (data folder) when uploading the dataset or before the dataset is used for an experiment. For more information, see Import dataset settings.

    • class_id: The class_id column should represent the class names of each box. Each row of the dataset should contain a list of class names, where each element in the list refers to a single box.

    • x_min, x_max, y_min, and y_max: The x_min, x_max, y_min, and y_max columns should correspond to the bounding box locations describing the spatial location of the objects. For each column, each row of the dataset should contain a list of coordinates, where each element in the list refers to a single box.

      Note

      • The bounding box location is represented as a rectangular box, which is determined by the x and y coordinates of the upper-left and lower-right corners.

      • The length of each list for class_id, x_min, x_max, y_min, and y_max needs to be equal and needs to refer to the total number of object boxes in each respective image. If a box is not present for a given image, all lists need to be empty.

    • fold: The optional fold column should specify the cross-validation fold index assignment per observation (row).

      Note

      Adding a fold column splits the data into subsets. A separate model is trained for each value in the fold column: records with that value form the holdout validation sample, and all remaining records are used for training. A holdout validation sample is created if a validation dataframe is not provided during an experiment. If a fold column is not specified, five folds are assigned randomly, which is not always the desired strategy. The fold column can contain integers (0, 1, 2, …, N-1 or 1, 2, 3, …, N) or categorical values.

  3. An image folder that contains all the images specified in the image column above; H2O Hydrogen Torch will use this folder to run an image object detection experiment.

    Note

    All images need to have an image extension. To learn about supported image extensions, see Supported Image Extensions for Image Processing.
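The list-per-image layout described above, including empty lists for images without boxes, can be sketched with pandas. The sample annotations, the image c.jpg without objects, and the helper check_row are hypothetical; the ordering check (x_min below x_max, y_min below y_max) is an assumed sanity check derived from the corner definition, not a documented requirement.

```python
import pandas as pd

# Hypothetical annotations with one row per bounding box.
boxes = pd.DataFrame(
    {
        "image": ["a.jpg", "a.jpg", "b.jpg"],
        "class_id": ["wheat", "wheat", "wheat"],
        "x_min": [10, 50, 5],
        "y_min": [20, 60, 5],
        "x_max": [30, 90, 25],
        "y_max": [40, 80, 45],
    }
)
all_images = ["a.jpg", "b.jpg", "c.jpg"]  # c.jpg contains no objects
list_columns = ["class_id", "x_min", "y_min", "x_max", "y_max"]

# Collapse to one row per image; every annotation column becomes a list.
grouped = boxes.groupby("image")[list_columns].agg(list)

# Add images without boxes; their cells are NaN after reindexing,
# so replace them with empty lists.
grouped = grouped.reindex(all_images)
for col in list_columns:
    grouped[col] = grouped[col].apply(lambda v: v if isinstance(v, list) else [])
grouped = grouped.rename_axis("image").reset_index()


def check_row(row) -> bool:
    """All lists have equal length and each box has min below max."""
    same_length = len({len(row[c]) for c in list_columns}) == 1
    x_ok = all(lo < hi for lo, hi in zip(row["x_min"], row["x_max"]))
    y_ok = all(lo < hi for lo, hi in zip(row["y_min"], row["y_max"]))
    return same_length and x_ok and y_ok


valid = grouped.apply(check_row, axis=1)
print(grouped)
# The frame can then be written with
# grouped.to_parquet("train.pq", engine="pyarrow", index=False)
```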

With the above in mind, the .zip file should be structured as follows:

folder_name.zip (1)
│   └───pq_name.pq (2)
│   │
│   └───image_folder_name (3)
│       └───name_of_image.image_extension
│       └───name_of_image.image_extension
│       └───name_of_image.image_extension
│       ...

Note

You can have multiple .pq files in the .zip file that you can use as train, validation, and test dataframes:

  • The train dataframe needs to follow the format described above

  • The validation dataframe should have the same format as the train dataframe

  • The test dataframe should have the same format as the train dataframe but does not require class_id, x_min, x_max, y_min, and y_max columns

To learn how to upload your .zip file as your dataset in H2O Hydrogen Torch, see Data Connectors.

The global_wheat_image_object_detection.zip file is a preprocessed dataset in H2O Hydrogen Torch and was formatted to solve an image object detection problem. The structure of the .zip file is as follows:

global_wheat_image_object_detection.zip
│   └───train.pq
│   │
│   └───images
│       └───7cca65c2cfb161be75fa41b754ef5263ee10e679dc8900f1fa75f845899abafc.jpg
│       └───3c6154081943882478110d2ea7ad0eef89cd954b6bd290d161385f9a5accc2fd.jpg
│       └───37a8db49093fd08a3be9ce48bbfb1a697b5da8dd51ac9fa53fc28d924888ace8.jpg
│       ...

Three random rows from the .pq file are as follows:

image class_id x_min y_min x_max y_max
7cca65c2cfb161be75fa41b754ef5263ee10e679dc8900f1fa75f845899abafc.jpg ['wheat' 'wheat' 'wheat' ...] [689 718 382 ...] [884 464 42 ...] [754 768 450 ...] [920 516 101 ...]
3c6154081943882478110d2ea7ad0eef89cd954b6bd290d161385f9a5accc2fd.jpg ['wheat' 'wheat' 'wheat' ...] [924 698 904 ...] [195 10 32 ...] [981 763 938 ...] [247 101 79 ...]
37a8db49093fd08a3be9ce48bbfb1a697b5da8dd51ac9fa53fc28d924888ace8.jpg ['wheat' 'wheat' 'wheat' ...] [919 811 4 ...] [535 820 96 ...] [1024 912 71 ...] [613 894 164 ...]

Note

  • In this example, the data directory in the image column is not specified. Therefore, it needs to be specified when uploading the dataset, and the images folder needs to be selected as the value for the Data Folder setting. For more information, see Import dataset settings.

To learn how to access one of the preprocessed datasets in H2O Hydrogen Torch, see Preprocessed Datasets.

Convert .csv file with bounding boxes
import pandas as pd


# Read data
df = pd.read_csv("/data/train.csv")

# Prepare the processed dataset
df = df.groupby(["image_id"]).agg(lambda x: x.to_list()).reset_index()

df[["image_id", "class_id", "x_min", "y_min", "x_max", "y_max"]].to_parquet(
    "/data/train.pq", engine="pyarrow", index=False
)
Convert COCO format
import json

import pandas as pd


def get_object_detection(df):
    images = pd.DataFrame(df["images"])
    categories = pd.DataFrame(df["categories"])
    annotations = pd.DataFrame(df["annotations"])

    # COCO boxes are stored as [x, y, width, height]; convert them to corner coordinates
    annotations["x_min"] = annotations["bbox"].map(lambda x: x[0]).astype(int)
    annotations["y_min"] = annotations["bbox"].map(lambda x: x[1]).astype(int)
    annotations["x_max"] = annotations["bbox"].map(lambda x: x[0] + x[2]).astype(int)
    annotations["y_max"] = annotations["bbox"].map(lambda x: x[1] + x[3]).astype(int)

    annotations = annotations[
        ["image_id", "category_id", "x_min", "y_min", "x_max", "y_max"]
    ]

    annotations = annotations.merge(
        images[["id", "file_name"]], left_on="image_id", right_on="id"
    )
    annotations = annotations.merge(
        categories[["id", "name"]], left_on="category_id", right_on="id"
    )

    # Shift category ids so they start at 0 (assumes the ids in the file start at 1)
    annotations["category_id"] = annotations["category_id"] - 1

    annotations.drop(["id_x", "id_y", "image_id"], axis=1, inplace=True)

    return annotations


# Read data
with open("/data/COCO_train_annos.json", "r") as fp:
    train = json.load(fp)

# Parse COCO format
train_ann = get_object_detection(train)

# Prepare the processed dataset
train_ann = train_ann.groupby(["file_name"]).agg(lambda x: x.to_list()).reset_index()
train_ann[["file_name", "category_id", "x_min", "y_min", "x_max", "y_max"]].to_parquet(
    "/data/train.pq", engine="pyarrow", index=False
)
Convert pascal VOC format
import glob
import os
from xml.etree import ElementTree

import pandas as pd
from tqdm import tqdm


observations = []

for xml in tqdm(glob.glob("/data/Annotations/*.xml")):
    tree = ElementTree.parse(xml)
    root = tree.getroot()
    objs = root.findall("object")

    for obj in objs:
        name = obj.find("name").text

        bndbox = obj.find("bndbox")
        # Pascal VOC pixel coordinates are 1-based; shift the top-left corner to 0-based
        xmin = int(bndbox.findtext("xmin")) - 1
        ymin = int(bndbox.findtext("ymin")) - 1
        xmax = int(bndbox.findtext("xmax"))
        ymax = int(bndbox.findtext("ymax"))

        observations.append(
            (
                os.path.split(xml)[-1].replace(".xml", ".jpg"),
                name,
                xmin,
                ymin,
                xmax,
                ymax,
            )
        )

df = pd.DataFrame(
    observations, columns=["image", "class_id", "x_min", "y_min", "x_max", "y_max"]
)

# Prepare the processed dataset
df = df.groupby(["image"]).agg(lambda x: x.to_list()).reset_index()
df.to_parquet("/data/train.pq", engine="pyarrow", index=False)

Image semantic segmentation

The data for an image semantic segmentation experiment needs to be in a .zip file (1) containing a .pq file (parquet, with pyarrow engine) (2) and an image folder (3):

  1. The available data connectors require your dataset to be in a .zip file.

  2. A .pq file containing image, class_id, and rle_mask columns. The file can also contain an optional fold column. Columns:

    • image: The image column should include the names and image extensions of the images.

      Note

      • data directory

        If the image names don't specify the data directory (the location of the images in the .zip file), you can specify the data directory (data folder) when uploading the dataset or before the dataset is used for an experiment. For more information, see Import dataset settings.

    • class_id: The class_id column should represent the class names of each mask. Each row of the dataset should contain a list of all possible class names.

    • rle_mask: The rle_mask column should represent run-length-encoded (RLE) masks for each class from the class_id column. If there is no mask for a given class, an empty string has to be provided.

      Note

      • The class_id and rle_mask lists must have the same length, equal to the total number of classes.

    • fold: The optional fold column should specify the cross-validation fold index assignment per observation (row).

      Note

      Adding a fold column splits the data into subsets. A separate model is trained for each value in the fold column: records with that value form the holdout validation sample, and all remaining records are used for training. A holdout validation sample is created if a validation dataframe is not provided during an experiment. If a fold column is not specified, five folds are assigned randomly, which is not always the desired strategy. The fold column can contain integers (0, 1, 2, …, N-1 or 1, 2, 3, …, N) or categorical values.

  3. An image folder that contains all the images specified in the image column above; H2O Hydrogen Torch will use this folder to run an image semantic segmentation experiment.

    Note

    All images need to have an image extension. To learn about supported image extensions, see Supported Image Extensions for Image Processing.
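The rule that every row lists all classes, with an empty string where a class has no mask, can be sketched as follows. The class names, the image name, and the helper build_row are hypothetical; mask2rle repeats the column-major, 1-indexed encoding used by the RLE functions in this section so that the sketch runs on its own.

```python
import numpy as np


def mask2rle(x: np.ndarray) -> str:
    """Encode a binary (height, width) mask as 'start length' pairs, column-major."""
    pixels = np.concatenate([[0], x.T.flatten(), [0]])
    runs = np.where(pixels[1:] != pixels[:-1])[0] + 1
    runs[1::2] -= runs[::2]
    return " ".join(str(r) for r in runs)


CLASS_NAMES = ["shoes", "pants", "dress"]  # hypothetical, fixed class order


def build_row(image_name: str, masks: dict) -> dict:
    """masks maps class name -> binary mask; classes without a mask may be omitted."""
    rle_masks = [mask2rle(masks[c]) if c in masks else "" for c in CLASS_NAMES]
    return {"image": image_name, "class_id": list(CLASS_NAMES), "rle_mask": rle_masks}


# A 3x3 image where only the "shoes" class is present (bottom row, first two columns).
shoes = np.zeros((3, 3), dtype=np.uint8)
shoes[2, 0:2] = 1

row = build_row("img_0001.png", {"shoes": shoes})
print(row["rle_mask"])  # ['3 1 6 1', '', '']
```

Both lists always have one entry per class, so their lengths match regardless of how many classes appear in a given image.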

With the above in mind, the .zip file should be structured as follows:

folder_name.zip (1)
│   └───pq_name.pq (2)
│   │
│   └───image_folder_name (3)
│       └───name_of_image.image_extension
│       └───name_of_image.image_extension
│       └───name_of_image.image_extension
│       ...

Note

You can have multiple .pq files in the .zip file that you can use as train, validation, and test dataframes:

  • The train dataframe needs to follow the format described above

  • The validation dataframe should have the same format as the train dataframe

  • The test dataframe should have the same format as the train dataframe but does not require class_id and rle_mask columns

To learn how to upload your .zip file as your dataset in H2O Hydrogen Torch, see Data Connectors.

The fashion_image_semantic_segmentation.zip file is a preprocessed dataset in H2O Hydrogen Torch and was formatted to solve an image semantic segmentation problem. The structure of the .zip file is as follows:

fashion_image_semantic_segmentation.zip
│   └───train.pq
│   │
│   └───images
│       └───img_0458.png
│       └───img_0604.png
│       └───img_0668.png
│       ...

Three random rows from the .pq file are as follows:

image class_id rle_mask
img_0458.png ['shoes' 'pants' 'dress' 'coat' 'shirt'] ['180629 7 181447 17...
img_0604.png ['shoes' 'pants' 'dress' 'coat' 'shirt'] ['189672 2 190493 9...
img_0668.png ['shoes' 'pants' 'dress' 'coat' 'shirt'] ['108023 11 108848 11...

Note

  • In this example, the data directory in the image column is not specified. Therefore, it needs to be specified when uploading the dataset, and the images folder needs to be selected as the value for the Data Folder setting. For more information, see Import dataset settings.

To learn how to access one of the preprocessed datasets in H2O Hydrogen Torch, see Preprocessed Datasets.

RLE encoding and decoding functions
from typing import Tuple

import numpy as np


def mask2rle(x: np.ndarray) -> str:
    """
    Converts input masks into RLE-encoded strings.

    Args:
        x: numpy array of shape (height, width), 1 - mask, 0 - background
    Returns:
        RLE string
    """

    pixels = x.T.flatten()
    pixels = np.concatenate([[0], pixels, [0]])
    runs = np.where(pixels[1:] != pixels[:-1])[0] + 1
    runs[1::2] -= runs[::2]
    return " ".join(str(x) for x in runs)


def rle2mask(mask_rle: str, shape: Tuple[int, int]) -> np.ndarray:
    """
    Converts RLE-encoded string into the binary mask.

    Args:
        mask_rle: RLE-encoded string
        shape: (height,width) of array to return
    Returns:
        binary mask: 1 - mask, 0 - background
    """

    s = mask_rle.split()
    starts, lengths = [np.asarray(x, dtype=int) for x in (s[0:][::2], s[1:][::2])]
    starts -= 1
    ends = starts + lengths
    img = np.zeros(shape[0] * shape[1], dtype=np.uint8)
    for lo, hi in zip(starts, ends):
        img[lo:hi] = 1
    return img.reshape(shape, order="F")  # Needed to align to RLE direction
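To make the string format concrete, the following loop-based sketch encodes a small mask by hand using the same convention as mask2rle: pixels are read column by column, and each run is written as a 1-indexed start position followed by its length. The helper naive_rle is hypothetical and meant as a readable cross-check, not a replacement for the vectorized function.

```python
import numpy as np


def naive_rle(mask: np.ndarray) -> str:
    """Encode a binary mask as 'start length' pairs, reading column by column."""
    flat = mask.flatten(order="F")  # column-major: down each column, left to right
    pairs = []
    i = 0
    while i < len(flat):
        if flat[i] == 1:
            start = i
            while i < len(flat) and flat[i] == 1:
                i += 1
            pairs += [start + 1, i - start]  # 1-indexed start, run length
        else:
            i += 1
    return " ".join(str(p) for p in pairs)


mask = np.array(
    [
        [0, 1, 0],
        [0, 1, 0],
        [0, 0, 1],
    ],
    dtype=np.uint8,
)

# Column-major order gives 0 0 0 | 1 1 0 | 0 0 1, so the runs start at
# positions 4 (length 2) and 9 (length 1).
encoded = naive_rle(mask)
print(encoded)  # 4 2 9 1
```

An all-zero mask produces an empty string, which matches the value expected for a class without a mask.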
Convert .csv file with masks
import pandas as pd

df = pd.read_csv("/data/train.csv")

# Prepare the processed dataset
df = df.groupby(["image_id"]).agg(lambda x: x.to_list()).reset_index()

df[["image_id", "class_id", "rle_mask"]].to_parquet(
    "/data/train.pq", engine="pyarrow", index=False
)
Convert COCO format
import json

import pandas as pd
from pycocotools.coco import COCO


def get_semantic_segmentation(df, coco_path):
    coco = COCO(coco_path)

    images = pd.DataFrame(df["images"])
    categories = pd.DataFrame(df["categories"])
    annotations = pd.DataFrame(df["annotations"])

    # mask2rle() is defined in "RLE encoding and decoding functions" section
    rles = [mask2rle(coco.annToMask(x)) for x in df["annotations"]]
    annotations["rle"] = rles
    annotations.loc[annotations.rle == "", "rle"] = float("nan")

    annotations = annotations[["image_id", "category_id", "rle"]]

    annotations = annotations.merge(
        images[["id", "file_name"]], left_on="image_id", right_on="id"
    )
    annotations = annotations.merge(
        categories[["id", "name"]], left_on="category_id", right_on="id"
    )

    annotations["category_id"] = annotations["category_id"] - 1

    annotations.drop(["id_x", "id_y", "image_id"], axis=1, inplace=True)

    return annotations


# Read data
train_path = "/data/COCO_train_annos.json"
with open(train_path, "r") as fp:
    train = json.load(fp)

# Parse COCO format
train_ann = get_semantic_segmentation(df=train, coco_path=train_path)

all_labels = [
    pd.DataFrame(
        {"file_name": x, "category_id": list(range(train_ann.category_id.nunique()))}
    )
    for x in train_ann.file_name.unique()
]
all_labels = pd.concat(all_labels)
train_ann = all_labels.merge(train_ann, how="left", on=["file_name", "category_id"])

# Prepare the processed dataset
train_ann = train_ann.groupby(["file_name"]).agg(lambda x: x.to_list()).reset_index()
train_ann[["file_name", "category_id", "rle"]].to_parquet(
    "/data/train.pq", engine="pyarrow", index=False
)

Image instance segmentation

The data for an image instance segmentation experiment needs to be in a .zip file (1) containing a .pq file (parquet, with pyarrow engine) (2) and an image folder (3):

  1. The available data connectors require your dataset to be in a .zip file.

  2. A .pq file containing image, class_id, and rle_mask columns. The file can also contain an optional fold column. Columns:

    • image: The image column should include the names and image extensions of the images.

      Note

      • data directory

        If the image names don't specify the data directory (the location of the images in the .zip file), you can specify the data directory (data folder) when uploading the dataset or before the dataset is used for an experiment. For more information, see Import dataset settings.

    • class_id: The class_id column should represent the class names of each instance mask. Each row of the dataset should contain a list of class names, where each element in the list refers to a single mask instance.

    • rle_mask: The rle_mask column should represent run-length-encoded (RLE) masks for each instance from the class_id column. Each row of the dataset should contain a list of RLE-encoded masks, where each element in the list refers to a single instance.

      Note

      • The class_id and rle_mask lists must have the same length, equal to the total number of instances in the respective image. If an image contains no instances, all lists need to be empty.

    • fold: The optional fold column should specify the cross-validation fold index assignment per observation (row).

      Note

      Adding a fold column splits the data into subsets. A separate model is trained for each value in the fold column: records with that value form the holdout validation sample, and all remaining records are used for training. A holdout validation sample is created if a validation dataframe is not provided during an experiment. If a fold column is not specified, five folds are assigned randomly, which is not always the desired strategy. The fold column can contain integers (0, 1, 2, …, N-1 or 1, 2, 3, …, N) or categorical values.

  3. An image folder that contains all the images specified in the image column above; H2O Hydrogen Torch will use this folder to run an image instance segmentation experiment.

    Note

    All images need to have an image extension. To learn about supported image extensions, see Supported Image Extensions for Image Processing.
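The per-instance lists described above can be built from an instance map, as in the following sketch. The input format is an assumption made for illustration: an integer array where 0 is background and each positive value marks the pixels of one instance. The helper build_instance_row and the sample names are hypothetical; mask2rle repeats the column-major, 1-indexed encoding used by the RLE functions in this section so that the sketch runs on its own.

```python
import numpy as np


def mask2rle(x: np.ndarray) -> str:
    """Encode a binary (height, width) mask as 'start length' pairs, column-major."""
    pixels = np.concatenate([[0], x.T.flatten(), [0]])
    runs = np.where(pixels[1:] != pixels[:-1])[0] + 1
    runs[1::2] -= runs[::2]
    return " ".join(str(r) for r in runs)


def build_instance_row(image_name: str, instance_map: np.ndarray, instance_classes: dict) -> dict:
    """instance_classes maps instance id -> class name; one list entry per instance."""
    class_ids, rle_masks = [], []
    for instance_id in sorted(instance_classes):
        # Isolate the pixels that belong to this instance only.
        binary = (instance_map == instance_id).astype(np.uint8)
        class_ids.append(instance_classes[instance_id])
        rle_masks.append(mask2rle(binary))
    return {"image": image_name, "class_id": class_ids, "rle_mask": rle_masks}


# Two hypothetical car instances in a 3x3 image.
instance_map = np.array(
    [
        [1, 1, 0],
        [0, 0, 0],
        [0, 2, 2],
    ]
)
row = build_instance_row("cars.jpg", instance_map, {1: "car", 2: "car"})
print(row["rle_mask"])  # ['1 1 4 1', '6 1 9 1']

# An image without instances yields empty lists.
empty_row = build_instance_row("empty.jpg", np.zeros((3, 3), dtype=int), {})
```

Unlike semantic segmentation, the list length here varies per image, because each entry corresponds to one instance rather than one class.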

With the above in mind, the .zip file should be structured as follows:

folder_name.zip (1)
│   └───pq_name.pq (2)
│   │
│   └───image_folder_name (3)
│       └───name_of_image.image_extension
│       └───name_of_image.image_extension
│       └───name_of_image.image_extension
│       ...

Note

You can have multiple .pq files in the .zip file that you can use as train, validation, and test dataframes:

  • The train dataframe needs to follow the format described above

  • The validation dataframe should have the same format as the train dataframe

  • The test dataframe should have the same format as the train dataframe but does not require class_id and rle_mask columns

To learn how to upload your .zip file as your dataset in H2O Hydrogen Torch, see Data Connectors.

The coco_image_instance_segmentation.zip file is a preprocessed dataset in H2O Hydrogen Torch and was formatted to solve an image instance segmentation problem. The structure of the .zip file is as follows:

coco_image_instance_segmentation.zip
│   └───train.pq
│   │
│   └───images
│       └───000000151231.jpg
│       └───000000433826.jpg
│       └───000000061159.jpg
│       ...

Three random rows from the .pq file are as follows:

image_id class_id rle_mask
000000151231.jpg ['car' 'car'] ['91949 7 92375 14 92801...
000000433826.jpg ['car' 'car'] ['224473 3 224952 4 22...
000000061159.jpg ['car' 'car'] ['161665 9 162291 25...

Note

  • In this example, the data directory in the image column is not specified. Therefore, it needs to be specified when uploading the dataset, and the images folder needs to be selected as the value for the Data Folder setting. For more information, see Import dataset settings.

To learn how to access one of the preprocessed datasets in H2O Hydrogen Torch, see Preprocessed Datasets.

RLE encoding and decoding functions
from typing import Tuple

import numpy as np


def mask2rle(x: np.ndarray) -> str:
    """
    Converts input masks into RLE-encoded strings.

    Args:
        x: numpy array of shape (height, width), 1 - mask, 0 - background
    Returns:
        RLE string
    """

    pixels = x.T.flatten()
    pixels = np.concatenate([[0], pixels, [0]])
    runs = np.where(pixels[1:] != pixels[:-1])[0] + 1
    runs[1::2] -= runs[::2]
    return " ".join(str(run) for run in runs)


def rle2mask(mask_rle: str, shape: Tuple[int, int]) -> np.ndarray:
    """
    Converts RLE-encoded string into the binary mask.

    Args:
        mask_rle: RLE-encoded string
        shape: (height,width) of array to return
    Returns:
        binary mask: 1 - mask, 0 - background
    """

    s = mask_rle.split()
    starts, lengths = [np.asarray(x, dtype=int) for x in (s[0:][::2], s[1:][::2])]
    starts -= 1
    ends = starts + lengths
    img = np.zeros(shape[0] * shape[1], dtype=np.uint8)
    for lo, hi in zip(starts, ends):
        img[lo:hi] = 1
    return img.reshape(shape, order="F")  # Needed to align to RLE direction
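
As a quick check of the encoding convention, the following sketch round-trips a small mask. It repeats compact copies of the two functions above so that it runs on its own. The 3x4 mask is a made-up example, not data from the preprocessed dataset. Starts are 1-indexed and pixels are counted column by column (top to bottom, then left to right).

```python
import numpy as np


def mask2rle(x: np.ndarray) -> str:
    # Same logic as mask2rle above: column-major flatten, 1-indexed run starts.
    pixels = np.concatenate([[0], x.T.flatten(), [0]])
    runs = np.where(pixels[1:] != pixels[:-1])[0] + 1
    runs[1::2] -= runs[::2]
    return " ".join(str(run) for run in runs)


def rle2mask(mask_rle: str, shape: tuple) -> np.ndarray:
    # Same logic as rle2mask above.
    s = mask_rle.split()
    starts = np.asarray(s[0::2], dtype=int) - 1
    lengths = np.asarray(s[1::2], dtype=int)
    img = np.zeros(shape[0] * shape[1], dtype=np.uint8)
    for lo, hi in zip(starts, starts + lengths):
        img[lo:hi] = 1
    return img.reshape(shape, order="F")


# Hypothetical 3x4 mask (height=3, width=4), used only for illustration.
mask = np.array(
    [
        [0, 1, 0, 0],
        [0, 1, 1, 0],
        [0, 0, 1, 0],
    ],
    dtype=np.uint8,
)

rle = mask2rle(mask)
print(rle)  # 4 2 8 2
restored = rle2mask(rle, mask.shape)
print(np.array_equal(restored, mask))  # True
```

Read column by column, the mask is 0 0 0 1 1 0 0 1 1 0 0 0, so the two runs start at positions 4 and 8 and each has length 2.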

Convert .csv file with masks
import pandas as pd


df = pd.read_csv("/data/train.csv")

# Collect all instances of each image into lists (one row per image)
df = df.groupby(["image_id"]).agg(lambda x: x.to_list()).reset_index()

df[["image_id", "class_id", "rle_mask"]].to_parquet(
    "/data/train.pq", engine="pyarrow", index=False
)
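
To see what the grouping step produces, here is a minimal sketch on an in-memory dataframe. It assumes the input .csv file has one row per object instance; that layout is inferred from the groupby call and is not stated elsewhere. The file names and RLE strings are made up.

```python
import pandas as pd

# Hypothetical per-instance rows: one row per mask, as an input .csv might contain.
df = pd.DataFrame(
    {
        "image_id": ["a.jpg", "a.jpg", "b.jpg"],
        "class_id": ["car", "car", "car"],
        "rle_mask": ["4 2 8 2", "1 3", "2 5"],
    }
)

# Same grouping as above: every column becomes a list per image.
grouped = df.groupby(["image_id"]).agg(lambda x: x.to_list()).reset_index()
print(grouped)
```

After grouping, each image has a single row whose class_id and rle_mask cells are lists of equal length, which matches the sample rows of the preprocessed dataset.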

Text regression

The data for a text regression experiment needs a single .csv file or a .zip file containing a .csv file:

  1. The available data connectors require your dataset to be in a .zip file or a single .csv file.

  2. A .csv file containing a text and label column(s) and an optional fold column. Columns:

    • text: The text column contains the text input of each sample.

    • label: A label column needs to represent a numerical target column.

      Note

      H2O Hydrogen Torch can train models that predict multiple labels at the same time. You can provide multiple columns with multiple unique labels and choose which labels to predict when starting a new experiment.

    • N label columns: The N columns represent separate regression labels.

    • fold: The optional fold column should specify the cross-validation fold index assignment per observation (row).

      Note

      Adding a fold column splits the data into subsets. A separate model is trained for each value in the fold column where records with the corresponding value form a holdout validation sample while all the remaining records are used for training. A holdout validation sample is created if a validation dataframe is not provided during an experiment. Five-folds are assigned randomly if a fold column is not specified, which is sometimes not the desired strategy. The fold column can include integers (0, 1, 2, … , N-1 values or 1, 2, 3… , N values) or categorical values.
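
If you prefer to control the split yourself, a fold column can be added before uploading. The following is a minimal sketch that assigns five balanced folds at random using only the standard library. The rows, the seed, and the round-robin strategy are illustrative assumptions, not the way H2O Hydrogen Torch assigns folds internally.

```python
import csv
import io
import random

# Hypothetical rows for illustration; real data would be read from your .csv file.
rows = [{"text": f"sample {i}", "label": round(i / 10, 1)} for i in range(10)]

n_folds = 5
rng = random.Random(0)  # fixed seed so the assignment is reproducible
indices = list(range(len(rows)))
rng.shuffle(indices)

# Deal the shuffled rows round-robin into folds 0 .. n_folds-1 to keep fold sizes balanced.
for position, row_index in enumerate(indices):
    rows[row_index]["fold"] = position % n_folds

# Serialize to CSV text with the columns described above.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["text", "label", "fold"])
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()
```

A different strategy (for example, keeping related rows in the same fold) only changes how the fold value is computed; the column format stays the same.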

With the above in mind, the .zip file should be structured as follows:

folder_name.zip (1)
│   └───csv_name.csv (2)

Note

You can have multiple .csv files in the .zip file that you can use as train, validation, and test dataframes:

  • The train dataframe needs to follow the format described above

  • The validation dataframe should have the same format as the train dataframe

  • The test dataframe should have the same format as the train dataframe but does not require label column(s)

To learn how to upload your .zip file as your dataset in H2O Hydrogen Torch, see Data Connectors.

The wellformed_query_text_regression.csv file is a preprocessed dataset in H2O Hydrogen Torch and was formatted to solve a text regression problem.

The following are two random rows from the .csv file:

rating text
0.2 The European Union includes how many ?
1.0 What is released when an ion is formed ?

Note

  • The rating column refers to the label column.

  • A fold column is not specified, and therefore, five-folds are assigned randomly.

To learn how to access one of the preprocessed datasets in H2O Hydrogen Torch, see Preprocessed Datasets.

Text classification

The data for a text classification experiment needs a single .csv file or a .zip file containing a .csv file:

  1. The available data connectors require your dataset to be in a .zip file or a single .csv file.

  2. A .csv file containing a text and N label columns. As well, the file can contain an optional fold column. Columns:

    • text: The text column contains the text input of each sample.

    • N label columns: The N columns represent either One-Hot Encoded multi-class labels or multiple multi-class/multi-label labels. For multi-class and binary classification, a single label column is sufficient.

      Note

      H2O Hydrogen Torch can solve both multi-class and multi-label classification problems. In multi-class problems, the classes are mutually exclusive, while in multi-label problems, each class is an independent label and a sample can have several. With N label columns, only a single column can be set to 1 per row in multi-class problems, while any number of columns (from none to all) can be set to 1 in multi-label problems.

    • fold: The optional fold column should specify the cross-validation fold index assignment per observation (row).

      Note

      Adding a fold column splits the data into subsets. A separate model is trained for each value in the fold column where records with the corresponding value form a holdout validation sample while all the remaining records are used for training. A holdout validation sample is created if a validation dataframe is not provided during an experiment. Five-folds are assigned randomly if a fold column is not specified, which is sometimes not the desired strategy. The fold column can include integers (0, 1, 2, … , N-1 values or 1, 2, 3… , N values) or categorical values.
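
The one-hot layout for a multi-class problem can be sketched as follows. The samples and class names are made up, and the conversion uses plain Python, so nothing here is specific to H2O Hydrogen Torch.

```python
# Hypothetical samples with exactly one class name each (multi-class).
samples = [
    {"text": "great product", "label": "Positive"},
    {"text": "stopped working", "label": "Negative"},
    {"text": "it is okay", "label": "Neutral"},
]

# One label column per class, in a stable (sorted) order.
classes = sorted({s["label"] for s in samples})

one_hot_rows = []
for s in samples:
    row = {"text": s["text"]}
    for c in classes:
        # Exactly one column is 1 per row because the classes are mutually exclusive.
        row[c] = 1 if s["label"] == c else 0
    one_hot_rows.append(row)

print(one_hot_rows[0])
```

For a multi-label problem, the same column layout applies, but a row may contain several 1 values or none.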

With the above in mind, the .zip file should be structured as follows:

folder_name.zip (1)
│   └───csv_name.csv (2)

Note

You can have multiple .csv files in the .zip file that you can use as train, validation, and test dataframes:

  • The train dataframe needs to follow the format described above

  • The validation dataframe should have the same format as the train dataframe

  • The test dataframe should have the same format as the train dataframe but does not require label column(s)

To learn how to upload your .zip file as your dataset in H2O Hydrogen Torch, see Data Connectors.

The amazon_reviews_text_classification.csv file is a preprocessed dataset in H2O Hydrogen Torch and was formatted to solve a text classification problem.

The first two rows of the .csv file are as follows:

text label
GREAT!!!!! Review: I got this toy a couple of days ago and I ABSOLUTELY LOVE IT! It is so much more realistic looking than my other baby born comfort seat. All though I dont have a baby born I had one before but I sold it at a garage sale. So I use It for my berenguar baby doll. And it even has the buckle that goes across the shoulder like a real babies car seat!!!! DEFFINATELY WORTH THE MONEY!!!!!! Positive
This Or "Dixie Chicken" Presents Them At A Peak Review: Though lyrically the overall feel of this record is slightly provincial, it can still transport me to places I wanna be. Musically, this pop product from California is stylistically consistent. Yet the instrumentation is diverse and each member is resourceful. But it's Lowell George's vocals and slide guitar that are primarily at the center. He's not flashy and that's a positive. You get treated to 12-bar blues, a song of prescription meds for tripping and a blues with an accordian.But the three highlights are "Easy To Slip", a jaunty acoustic/electric number about lighting up and the sheer joy that memory drifting can project, "Teenage Nervous Breakdown" in which they switch to the domain of energy-driven rock and roll and the title track, a leisurely-paced country blues in which a generous helping of background vocals provides just the right amount of tension. Positive

Note

  • A fold column is not specified, and therefore, five-folds are assigned randomly.

To learn how to access one of the preprocessed datasets in H2O Hydrogen Torch, see Preprocessed Datasets.

Text sequence to sequence

The data for a text sequence to sequence experiment needs a single .csv file or a .zip file containing a .csv file:

  1. The available data connectors require your dataset to be in a .zip file or a single .csv file.

  2. A .csv file containing an input_text and output_text column and an optional fold column. Columns:

    • input_text: The input_text column needs to represent the input text.

    • output_text: The output_text column needs to represent the output text.

    • fold: The optional fold column should specify the cross-validation fold index assignment per observation (row).

      Note

      Adding a fold column splits the data into subsets. A separate model is trained for each value in the fold column where records with the corresponding value form a holdout validation sample while all the remaining records are used for training. A holdout validation sample is created if a validation dataframe is not provided during an experiment. Five-folds are assigned randomly if a fold column is not specified, which is sometimes not the desired strategy. The fold column can include integers (0, 1, 2, … , N-1 values or 1, 2, 3… , N values) or categorical values.

With the above in mind, the .zip file should be structured as follows:

folder_name.zip (1)
│   └───csv_name.csv (2)

Note

You can have multiple .csv files in the .zip file that you can use as train, validation, and test dataframes:

  • The train dataframe needs to follow the format described above

  • The validation dataframe should have the same format as the train dataframe

  • The test dataframe should have the same format as the train dataframe but does not require an output_text column
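
As an illustration of the layout described above, the following sketch packs a train and a test .csv file into one .zip archive, with the test file omitting the output_text column. The file names, the rows, and the in-memory archive are assumptions made for the example.

```python
import csv
import io
import zipfile


def to_csv_text(rows, fieldnames):
    """Serialize a list of dicts to CSV text with the given column order."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical rows and file names, for illustration only.
train_rows = [
    {"input_text": "A long article about the weather.", "output_text": "Weather summary."},
    {"input_text": "A long article about sports.", "output_text": "Sports summary."},
]
test_rows = [{"input_text": "A long article about politics."}]

archive = io.BytesIO()  # in-memory stand-in for a file such as folder_name.zip
with zipfile.ZipFile(archive, "w", zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("train.csv", to_csv_text(train_rows, ["input_text", "output_text"]))
    zf.writestr("test.csv", to_csv_text(test_rows, ["input_text"]))

# Read the archive back to confirm its contents.
with zipfile.ZipFile(archive) as zf:
    names = zf.namelist()
    test_header = zf.read("test.csv").decode("utf-8").splitlines()[0]

print(names)
print(test_header)
```

To write a real file instead, pass a path such as "folder_name.zip" to zipfile.ZipFile in place of the in-memory buffer.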

To learn how to upload your .zip file as your dataset in H2O Hydrogen Torch, see Data Connectors.

The cnn_dailymail_text_sequence_to_sequence.zip file is a preprocessed dataset in H2O Hydrogen Torch and was formatted to solve a text sequence to sequence problem. The structure of the .zip file is as follows:

cnn_dailymail_text_sequence_to_sequence.zip
│   └───train.csv

The following is a random row from the .csv file:

text summary id
It's official: U.S. President Barack Obama wants lawmakers to weigh in on whether to use military force in Syria. Obama sent a letter to the heads of the House and Senate on Saturday night, hours after announcing that he believes military action against Syrian targets is the right step to take over the alleged use of chemical weapons. The proposed legislation from Obama asks Congress to approve the use of military force "to deter, disrupt, prevent and degrade the potential for future uses of chemical weapons or other weapons of mass destruction." It's a step that is set to turn an international crisis into a fierce domestic political battle. There are key questions looming over the debate: What did U.N. weapons inspectors find in Syria? What happens if Congress votes no? And how will the Syrian government react? In a televised address from the White House Rose Garden earlier Saturday, the president said he would take his case to Congress, not because he has to -- but because he wants to. "While I believe I have the authority to carry out this military action without specific congressional authorization, I know that the country will be stronger if we take this course, and our actions will be even more effective," he said. "We should have this debate, because the issues are too big for business as usual." Obama said top congressional leaders had agreed to schedule a debate when the body returns to Washington on September 9. The Senate Foreign Relations Committee will hold a hearing over the matter on Tuesday, Sen. Robert Menendez said. Transcript: Read Obama's full remarks . Syrian crisis: Latest developments . U.N. inspectors leave Syria . Obama's remarks came shortly after U.N. inspectors left Syria, carrying evidence that will determine whether chemical weapons were used in an attack early last week in a Damascus suburb. "The aim of the game here, the mandate, is very clear -- and that is to ascertain whether chemical weapons were used -- and not by whom," U.N. 
spokesman Martin Nesirky told reporters on Saturday. But who used the weapons in the reported toxic gas attack in a Damascus suburb on August 21 has been a key point of global debate over the Syrian crisis. Top U.S. officials have said there's no doubt that the Syrian government was behind it, while Syrian officials have denied responsibility and blamed jihadists fighting with the rebels. British and U.S. intelligence reports say the attack involved chemical weapons, but U.N. officials have stressed the importance of waiting for an official report from inspectors. The inspectors will share their findings with U.N. Secretary-General Ban Ki-moon Ban, who has said he wants to wait until the U.N. team's final report is completed before presenting it to the U.N. Security Council. The Organization for the Prohibition of Chemical Weapons, which nine of the inspectors belong to, said Saturday that it could take up to three weeks to analyze the evidence they collected. "It needs time to be able to analyze the information and the samples," Nesirky said. He noted that Ban has repeatedly said there is no alternative to a political solution to the crisis in Syria, and that "a military solution is not an option." Bergen: Syria is a problem from hell for the U.S. Obama: 'This menace must be confronted' Obama's senior advisers have debated the next steps to take, and the president's comments Saturday came amid mounting political pressure over the situation in Syria. Some U.S. lawmakers have called for immediate action while others warn of stepping into what could become a quagmire. Some global leaders have expressed support, but the British Parliament's vote against military action earlier this week was a blow to Obama's hopes of getting strong backing from key NATO allies. On Saturday, Obama proposed what he said would be a limited military action against Syrian President Bashar al-Assad. Any military attack would not be open-ended or include U.S. ground forces, he said. 
Syria's alleged use of chemical weapons earlier this month "is an assault on human dignity," the president said. A failure to respond with force, Obama argued, "could lead to escalating use of chemical weapons or their proliferation to terrorist groups who would do our people harm. In a world with many dangers, this menace must be confronted." Syria missile strike: What would happen next? Map: U.S. and allied assets around Syria . Obama decision came Friday night . On Friday night, the president made a last-minute decision to consult lawmakers. What will happen if they vote no? It's unclear. A senior administration official told CNN that Obama has the authority to act without Congress -- even if Congress rejects his request for authorization to use force. Obama on Saturday continued to shore up support for a strike on the al-Assad government. He spoke by phone with French President Francois Hollande before his Rose Garden speech. "The two leaders agreed that the international community must deliver a resolute message to the Assad regime -- and others who would consider using chemical weapons -- that these crimes are unacceptable and those who violate this international norm will be held accountable by the world," the White House said. Meanwhile, as uncertainty loomed over how Congress would weigh in, U.S. military officials said they remained at the ready. 5 key assertions: U.S. intelligence report on Syria . Syria: Who wants what after chemical weapons horror . Reactions mixed to Obama's speech . A spokesman for the Syrian National Coalition said that the opposition group was disappointed by Obama's announcement. "Our fear now is that the lack of action could embolden the regime and they repeat his attacks in a more serious way," said spokesman Louay Safi. "So we are quite concerned." Some members of Congress applauded Obama's decision. 
House Speaker John Boehner, Majority Leader Eric Cantor, Majority Whip Kevin McCarthy and Conference Chair Cathy McMorris Rodgers issued a statement Saturday praising the president. "Under the Constitution, the responsibility to declare war lies with Congress," the Republican lawmakers said. "We are glad the president is seeking authorization for any military action in Syria in response to serious, substantive questions being raised." More than 160 legislators, including 63 of Obama's fellow Democrats, had signed letters calling for either a vote or at least a "full debate" before any U.S. action. British Prime Minister David Cameron, whose own attempt to get lawmakers in his country to support military action in Syria failed earlier this week, responded to Obama's speech in a Twitter post Saturday. "I understand and support Barack Obama's position on Syria," Cameron said. An influential lawmaker in Russia -- which has stood by Syria and criticized the United States -- had his own theory. "The main reason Obama is turning to the Congress: the military operation did not get enough support either in the world, among allies of the US or in the United States itself," Alexei Pushkov, chairman of the international-affairs committee of the Russian State Duma, said in a Twitter post. In the United States, scattered groups of anti-war protesters around the country took to the streets Saturday. "Like many other Americans...we're just tired of the United States getting involved and invading and bombing other countries," said Robin Rosecrans, who was among hundreds at a Los Angeles demonstration. What do Syria's neighbors think? Why Russia, China, Iran stand by Assad . Syria's government unfazed . After Obama's speech, a military and political analyst on Syrian state TV said Obama is "embarrassed" that Russia opposes military action against Syria, is "crying for help" for someone to come to his rescue and is facing two defeats -- on the political and military levels. 
Syria's prime minister appeared unfazed by the saber-rattling. "The Syrian Army's status is on maximum readiness and fingers are on the trigger to confront all challenges," Wael Nader al-Halqi said during a meeting with a delegation of Syrian expatriates from Italy, according to a banner on Syria State TV that was broadcast prior to Obama's address. An anchor on Syrian state television said Obama "appeared to be preparing for an aggression on Syria based on repeated lies." A top Syrian diplomat told the state television network that Obama was facing pressure to take military action from Israel, Turkey, some Arabs and right-wing extremists in the United States. "I think he has done well by doing what Cameron did in terms of taking the issue to Parliament," said Bashar Jaafari, Syria's ambassador to the United Nations. Both Obama and Cameron, he said, "climbed to the top of the tree and don't know how to get down." The Syrian government has denied that it used chemical weapons in the August 21 attack, saying that jihadists fighting with the rebels used them in an effort to turn global sentiments against it. British intelligence had put the number of people killed in the attack at more than 350. On Saturday, Obama said "all told, well over 1,000 people were murdered." U.S. Secretary of State John Kerry on Friday cited a death toll of 1,429, more than 400 of them children. No explanation was offered for the discrepancy. Iran: U.S. military action in Syria would spark 'disaster' Opinion: Why strikes in Syria are a bad idea . Syrian official: Obama climbed to the top of the tree, "doesn't know how to get down" Obama sends a letter to the heads of the House and Senate . Obama to seek congressional approval on military action against Syria . Aim is to determine whether CW were used, not by whom, says U.N. spokesman . 0001d1afc246a7964130f43ae940af6bc6c57f01

Note

  • In this example, the text column refers to the input_text column, while the summary column refers to the output_text column.

  • A fold column is not specified, and therefore, five-folds are assigned randomly.

To learn how to access one of the preprocessed datasets in H2O Hydrogen Torch, see Preprocessed Datasets.

Text span prediction

The data for a text span prediction experiment needs a single .csv file or a .zip file containing a .csv file:

  1. The available data connectors require your dataset to be in a .zip file or a single .csv file.

  2. A .csv file containing a context, question, and answer column and an optional answer_start and fold column. Columns:

    • context: The context column contains the text input of each sample.

    • question: The question column needs to represent a question about the context.

    • answer: The answer column needs to contain the answer as a substring of the context column.

    • answer_start: The optional answer_start column represents the start of the answer text in the context column. Values for this column should be integers representing the index. If this column is not provided, H2O Hydrogen Torch selects the first occurrence of the answer in the context.

    • fold: The optional fold column should specify the cross-validation fold index assignment per observation (row).

      Note

      Adding a fold column splits the data into subsets. A separate model is trained for each value in the fold column where records with the corresponding value form a holdout validation sample while all the remaining records are used for training. A holdout validation sample is created if a validation dataframe is not provided during an experiment. Five-folds are assigned randomly if a fold column is not specified, which is sometimes not the desired strategy. The fold column can include integers (0, 1, 2, … , N-1 values or 1, 2, 3… , N values) or categorical values.
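
If you want to provide answer_start explicitly, it can be derived from the other columns. The sketch below assumes that the index is a 0-based character offset into the context (as in SQuAD-style data); the text above only states that it is an integer index, so treat that as an assumption. The sample row is made up.

```python
# Hypothetical sample, for illustration only.
sample = {
    "context": "The Grotto is a replica of the grotto at Lourdes, France.",
    "question": "Where is the original grotto?",
    "answer": "Lourdes, France",
}

# str.find returns the character offset of the first occurrence, or -1 if absent.
answer_start = sample["context"].find(sample["answer"])
if answer_start == -1:
    raise ValueError("answer is not a substring of context")

# Assumption: answer_start is a 0-based character offset into the context.
sample["answer_start"] = answer_start

# Slicing the context at that offset recovers the answer text.
recovered = sample["context"][answer_start : answer_start + len(sample["answer"])]
print(recovered)
```

Setting the column yourself matters mainly when the answer text occurs more than once in the context and the first occurrence is not the intended one.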

With the above in mind, the .zip file should be structured as follows:

folder_name.zip (1)
│   └───csv_name.csv (2)

Note

You can have multiple .csv files in the .zip file that you can use as train, validation, and test dataframes:

  • The train dataframe needs to follow the format described above

  • The validation dataframe should have the same format as the train dataframe

  • The test dataframe should have the same format as the train dataframe but does not require an answer column

To learn how to upload your .zip file as your dataset in H2O Hydrogen Torch, see Data Connectors.

The squad_text_span_prediction.zip file is a preprocessed dataset in H2O Hydrogen Torch and was formatted to solve a text span prediction problem. The structure of the .zip file is as follows:

squad_text_span_prediction.zip
│   └───squad_v1.csv

The following is a random row from the .csv file:

question context answer
To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France? Architecturally, the school has a Catholic character. Atop the Main Building's gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend "Venite Ad Me Omnes". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary. Saint Bernadette Soubirous

Note

A fold column is not specified, and therefore, five-folds are assigned randomly, which is sometimes not the desired strategy.

To learn how to access one of the preprocessed datasets in H2O Hydrogen Torch, see Preprocessed Datasets.

Text token classification

The data for a text token classification experiment needs a single .pq (parquet with pyarrow engine) file or a .zip file containing a .pq file:

  1. The available data connectors require your dataset to be in a .zip file or a single .pq file.

  2. A .pq file containing a text and label column and an optional fold column. Columns:

    • text: The text column should contain tokenized text; each sample should have a list of string tokens.

    • label: The label column should contain token labels for the tokenized text; each sample should have a list of token labels. Labels should be represented as categorical string values.

    • fold: The optional fold column specifies the cross-validation fold index assignment per observation (row).

      Note

      Adding a fold column splits the data into subsets. A separate model is trained for each value in the fold column where records with the corresponding value form a holdout validation sample while all the remaining records are used for training. A holdout validation sample is created if a validation dataframe is not provided during an experiment. Five-folds are assigned randomly if a fold column is not specified, which is sometimes not the desired strategy. The fold column can include integers (0, 1, 2, … , N-1 values or 1, 2, 3… , N values) or categorical values.
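
Before writing the .pq file, it can help to check that each sample's token and label lists line up. The following sketch assumes that every token needs exactly one label, which the column descriptions above imply but do not state outright. The rows and the helper name find_misaligned are hypothetical.

```python
# Hypothetical rows; in practice these would come from your dataframe.
rows = [
    {"text": ["Nijmeh", "of", "Lebanon"], "label": ["B-ORG", "O", "B-LOC"]},
    {"text": ["Saudi", "Arabia", "won"], "label": ["B-LOC", "I-LOC"]},  # one label missing
]


def find_misaligned(rows):
    """Return indices of rows whose token and label lists differ in length."""
    return [i for i, row in enumerate(rows) if len(row["text"]) != len(row["label"])]


bad_rows = find_misaligned(rows)
print(bad_rows)  # [1]
```

Rows reported by such a check are worth inspecting by hand, since a length mismatch usually points to a tokenization difference between the text and the labels.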

With the above in mind, the .zip file should be structured as follows:

folder_name.zip (1)
│   └───pq_name.pq (2)

Note

You can have multiple .pq files in the .zip file that you can use as train, validation, and test dataframes:

  • The train dataframe needs to follow the format described above

  • The validation dataframe should have the same format as the train dataframe

  • The test dataframe should have the same format as the train dataframe but does not require a label column

To learn how to upload your .zip file as your dataset in H2O Hydrogen Torch, see Data Connectors.

The conll2003_text_token_classification.zip file is a preprocessed dataset in H2O Hydrogen Torch and was formatted to solve a text token classification problem. The structure of the .zip file is as follows:

conll2003_text_token_classification.zip
│   └───test.pq
│   └───train.pq
│   └───validation.pq

The following is a random row from the train.pq file:

id text pos_tags chunk_tags ner_tags
4158 ['Nijmeh' 'of' 'Lebanon' 'beat' 'Nasr' 'of' 'Saudi' 'Arabia' '1-0' '(' 'halftime' '1-0' ')' 'in' 'their' 'Asian' 'club' 'championship' 'second' 'round' 'first' 'leg' 'tie' 'on' 'Saturday' '.'] ['NNS' 'IN' 'NNP' 'VBD' 'NNP' 'IN' 'NNP' 'NNP' 'NNP' '(' 'NN' 'CD' ')' 'IN' 'PRP$' 'JJ' 'NN' 'NN' 'NN' 'NN' 'JJ' 'NN' 'NN' 'IN' 'NNP' '.'] ['B-NP' 'B-VP' 'B-VP' 'I-VP' 'B-NP' 'I-NP' 'B-PP' 'B-NP' 'O' 'O' 'B-NP' 'B-NP' 'I-NP' 'I-NP' 'B-PP' 'B-NP' 'I-NP' 'B-NP' 'I-NP' 'B-VP' 'B-NP' 'B-PP' 'B-VP' 'O'] ['B-ORG' 'O' 'B-LOC' 'O' 'B-ORG' 'O' 'B-LOC' 'I-LOC' 'O' 'O' 'O' 'O' 'O' 'O' 'O' 'B-MISC' 'O' 'O' 'O' 'O' 'O' 'O' 'O' 'O' 'O' 'O']

Note

  • The *_tags columns are the available label columns and can only be selected when running a text token classification experiment. Only one of them can be selected as the label column for an experiment.

  • A fold column is not specified, and therefore, five-folds are assigned randomly.

To learn how to access one of the preprocessed datasets in H2O Hydrogen Torch, see Preprocessed Datasets.

Convert CoNLL-2003 dataset
from pathlib import Path

import pandas as pd

try:
    import datasets
except ImportError as err:
    # Chain the original error so the underlying import failure stays visible
    raise ImportError(
        "Need datasets>=1.11.0 to download English CoNLL2003 data!"
    ) from err

dataset = datasets.load_dataset("conll2003")

for subset in dataset:
    out_path = Path(f"/data/conll2003/{subset}.pq")
    out_path.parent.mkdir(exist_ok=True, parents=True)

    df = pd.DataFrame(dataset[subset])

    # Decode integer-encoded labels back to their string names
    for feature in dataset[subset].features:
        if isinstance(dataset[subset].features[feature], datasets.Sequence):
            feat = dataset[subset].features[feature].feature

            if isinstance(feat, datasets.ClassLabel):
                df[feature] = df[feature].apply(feat.int2str)

    df.rename(columns={"tokens": "text"}, inplace=True)

    df.to_parquet(out_path, engine="pyarrow", index=False)

Text metric learning

The data for a text metric learning experiment needs a single .csv file or a .zip file containing a .csv file:

  1. The available data connectors require your dataset to be in a .zip file or a single .csv file.

  2. A .csv file containing a text and label column and an optional fold column. Columns:

    • text: The text column contains the text input of each sample.

    • label: The label column needs to represent the class names.

      Note

      Texts that are similar should have the same class name.

    • fold: The optional fold column should specify the cross-validation fold index assignment per observation (row).

      Note

      Adding a fold column splits the data into subsets. A separate model is trained for each value in the fold column where records with the corresponding value form a holdout validation sample while all the remaining records are used for training. A holdout validation sample is created if a validation dataframe is not provided during an experiment. Five-folds are assigned randomly if a fold column is not specified, which is sometimes not the desired strategy. The fold column can include integers (0, 1, 2, … , N-1 values or 1, 2, 3… , N values) or categorical values.
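
One way to build the label column is to start from groups of texts that you consider similar and give each group its own class name. The sketch below does this in plain Python; the groups and the use of the group index as the class name are illustrative assumptions.

```python
# Hypothetical groups of texts considered similar (for example, duplicate questions).
similar_groups = [
    ["how do I install a package ?", "what is the command to install software ?"],
    ["how can I change my wallpaper ?"],
]

rows = []
for class_id, group in enumerate(similar_groups):
    for text in group:
        # Every text in a group gets the same class name.
        rows.append({"text": text, "label": str(class_id)})

for row in rows:
    print(row["label"], row["text"])
```

Any naming scheme works as long as similar texts share a class name and dissimilar texts do not.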

With the above in mind, the .zip file should be structured as follows:

folder_name.zip (1)
│   └───csv_name.csv (2)

Note

You can have multiple .csv files in the .zip file that you can use as train, validation, and test dataframes:

  • The train dataframe needs to follow the format described above

  • The validation dataframe should have the same format as the train dataframe

  • The test dataframe should have the same format as the train dataframe but does not require a label column

To learn how to upload your .zip file as your dataset in H2O Hydrogen Torch, see Data Connectors.

The ubuntu_text_metric_learning.zip file is a preprocessed dataset in H2O Hydrogen Torch and was formatted to solve a text metric learning problem. The structure of the .zip file is as follows:

ubuntu_text_metric_learning.zip
│   └───train.csv
│   └───test.csv

The following is a random row from the train.csv file:

text label fold
what is the easiest way to strip a desktop edition to a server edition ? 16 1

Note

  • The train.csv file includes a fold column, so the folds are taken from that column instead of being assigned randomly.

To learn how to access one of the preprocessed datasets in H2O Hydrogen Torch, see Preprocessed Datasets.

Audio regression

The data for an audio regression experiment needs to be in a .zip file (1) containing a .csv file (2) and an audio folder (3):

  1. The available data connectors require your dataset to be in a .zip file.

  2. A .csv file containing an audio and label column(s) and an optional fold column. Columns:

    • audio: The audio column should include the names and extensions of the audio files.

      Note

      • data directory

        Suppose the names of the audio files don't specify the data directory (location of the audio files in the .zip file). In that case, you can specify the data directory (data folder) when uploading the dataset or before the dataset is used for an experiment. For more information, see Import dataset settings.

    • label: A label column needs to represent a numerical target column.

      Note

      H2O Hydrogen Torch can train models that predict multiple labels at the same time. You can provide multiple columns with multiple unique labels and choose which labels to predict when starting a new experiment.

    • N label columns: The N columns represent separate regression labels.

    • fold: The optional fold column should specify the cross-validation fold index assignment per observation (row).

      Note

      Adding a fold column splits the data into subsets. A separate model is trained for each value in the fold column where records with the corresponding value form a holdout validation sample while all the remaining records are used for training. A holdout validation sample is created if a validation dataframe is not provided during an experiment. Five-folds are assigned randomly if a fold column is not specified, which is sometimes not the desired strategy. The fold column can include integers (0, 1, 2, … , N-1 values or 1, 2, 3… , N values) or categorical values.

  3. An audio folder that contains all the audio files specified in the audio column above; H2O Hydrogen Torch will use this folder to run an audio regression experiment.

    Note

    All audio files need to have an audio extension. To learn about supported audio extensions, see Supported audio extensions for audio processing.
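
To illustrate how the fold column described above drives validation, the following is a minimal Python sketch of the splitting logic. It is not H2O Hydrogen Torch's implementation; the function name holdout_splits and the file names in the sample rows are hypothetical.

```python
from collections import defaultdict


def holdout_splits(rows, fold_key="fold"):
    """Yield (fold_value, train_rows, validation_rows) for each distinct fold value.

    Rows whose fold equals fold_value form the holdout validation sample;
    all remaining rows are used for training.
    """
    by_fold = defaultdict(list)
    for row in rows:
        by_fold[row[fold_key]].append(row)
    for fold_value in sorted(by_fold):
        validation = by_fold[fold_value]
        train = [row for row in rows if row[fold_key] != fold_value]
        yield fold_value, train, validation


# Hypothetical rows with an audio, label, and fold column.
rows = [
    {"audio": "clip_a.wav", "label": 1.5, "fold": 0},
    {"audio": "clip_b.wav", "label": 2.0, "fold": 1},
    {"audio": "clip_c.wav", "label": 0.5, "fold": 2},
]

splits = list(holdout_splits(rows))
for fold_value, train, validation in splits:
    print(fold_value, len(train), len(validation))
```

Each distinct fold value produces one train/validation pair, so three fold values correspond to three trained models in this sketch.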

With the above in mind, the .zip file should be structured as follows:

folder_name.zip (1)
│   └───csv_name.csv (2)
│   │
│   └───audio_folder_name (3)
│       └───name_of_audio.audio_extension
│       └───name_of_audio.audio_extension
│       └───name_of_audio.audio_extension
│       ...
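
The structure above could be produced with Python's standard library, for example. This is a minimal sketch under stated assumptions: the function build_dataset_zip is hypothetical, the .csv is assumed to be comma-delimited, and the audio payloads are placeholder bytes rather than real audio.

```python
import csv
import io
import zipfile


def build_dataset_zip(rows, audio_files,
                      csv_name="csv_name.csv",
                      audio_folder="audio_folder_name"):
    """Return the bytes of a .zip holding one .csv file and one audio folder."""
    # Write the .csv with audio, label, and fold columns.
    csv_buffer = io.StringIO()
    writer = csv.DictWriter(csv_buffer, fieldnames=["audio", "label", "fold"])
    writer.writeheader()
    writer.writerows(rows)

    # Place the .csv at the top level and every audio file inside the folder.
    zip_buffer = io.BytesIO()
    with zipfile.ZipFile(zip_buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.writestr(csv_name, csv_buffer.getvalue())
        for name, payload in audio_files.items():
            archive.writestr(f"{audio_folder}/{name}", payload)
    return zip_buffer.getvalue()


# Hypothetical rows and placeholder payloads (not real audio data).
rows = [
    {"audio": "clip_0.wav", "label": 0.5, "fold": 0},
    {"audio": "clip_1.wav", "label": 1.5, "fold": 1},
]
audio_files = {"clip_0.wav": b"placeholder", "clip_1.wav": b"placeholder"}

zip_bytes = build_dataset_zip(rows, audio_files)
with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
    names = archive.namelist()
    csv_text = archive.read("csv_name.csv").decode("utf-8")
print(names)
```

In practice you would read the real audio files from disk and write zip_bytes to a file such as folder_name.zip before uploading it.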

Note

You can have multiple .csv files in the .zip file that you can use as train, validation, and test dataframes:

  • The train dataframe needs to follow the format described above

  • The validation dataframe should have the same format as the train dataframe

  • The test dataframe should have the same format as the train dataframe but does not require label column(s)
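
The column requirements above can be checked before uploading. The following sketch assumes that "same format" means "same column names"; the helper missing_columns is hypothetical and not part of H2O Hydrogen Torch.

```python
def missing_columns(train_columns, other_columns, label_columns, is_test=False):
    """Return the train columns that are missing from another dataframe.

    For a test dataframe (is_test=True), label columns may be absent.
    """
    required = [
        column for column in train_columns
        if not (is_test and column in label_columns)
    ]
    return [column for column in required if column not in other_columns]


train_columns = ["audio", "label", "fold"]
label_columns = ["label"]

# A validation dataframe with the same columns as the train dataframe.
validation_gaps = missing_columns(train_columns, ["audio", "label", "fold"], label_columns)
# A test dataframe without the label column is acceptable.
test_gaps = missing_columns(train_columns, ["audio", "fold"], label_columns, is_test=True)
# The same columns used as a validation dataframe are missing the label.
bad_validation_gaps = missing_columns(train_columns, ["audio", "fold"], label_columns)
print(validation_gaps, test_gaps, bad_validation_gaps)
```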

To learn how to upload your .zip file as your dataset in H2O Hydrogen Torch, see Data connectors.

The amnist_audio_regression.zip file is a preprocessed dataset in H2O Hydrogen Torch and was formatted to solve an audio regression problem. The .zip file contains a .csv file and an audio folder. The structure of the .zip file is:

amnist_audio_regression.zip
│   └───amnist_meta.csv
│   │
│   └───amnist_audios
│        └───0_01_0.ogg
│        └───0_01_1.ogg
│        └───0_01_2.ogg
│           ...

The first three rows of the .csv file are:

audio label fold
2_26_2.ogg 2 0
2_26_38.ogg 2 1
9_26_47.ogg 9 2

Note

In this example, the data directory in the audio column is not specified. That being the case, it needs to be specified when uploading the dataset, and the amnist_audios folder needs to be selected as the value for the Data Folder setting. For more information, see Import dataset settings.
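
Because the audio column holds bare file names, it can help to confirm that every name resolves inside the chosen data folder before uploading. The sketch below is one way to do that with the standard library. The function missing_audio_files is hypothetical, the .csv is assumed to be comma-delimited, and the small in-memory .zip only imitates the layout of amnist_audio_regression.zip with placeholder bytes.

```python
import csv
import io
import zipfile


def missing_audio_files(zip_bytes, csv_name, data_folder, column="audio"):
    """List file names from the .csv column that are absent from data_folder."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        members = set(archive.namelist())
        with archive.open(csv_name) as handle:
            text = io.TextIOWrapper(handle, encoding="utf-8", newline="")
            names = [row[column] for row in csv.DictReader(text)]
    return [name for name in names if f"{data_folder}/{name}" not in members]


# Build a tiny archive in memory; one of the two listed files is left out on purpose.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr(
        "amnist_meta.csv",
        "audio,label,fold\n2_26_2.ogg,2,0\n2_26_38.ogg,2,1\n",
    )
    archive.writestr("amnist_audios/2_26_2.ogg", b"placeholder")

missing = missing_audio_files(buffer.getvalue(), "amnist_meta.csv", "amnist_audios")
print(missing)
```

An empty result means every file named in the audio column exists in the data folder.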

To learn how to access one of the preprocessed datasets in H2O Hydrogen Torch, see Preprocessed datasets.

Audio classification

The data for an audio classification experiment needs to be in a .zip file (1) containing a .csv file (2) and an audio folder (3):

  1. The available data connectors require your dataset to be in a .zip file.

  2. A .csv file containing an audio column and N label columns. The file can also contain an optional fold column. Columns:

    • audio: The audio column should include the names and extensions of the audio files.

      Note

      • data directory

        If the names of the audio files don't specify the data directory (the location of the audio files in the .zip file), you can specify the data directory (data folder) when uploading the dataset or before the dataset is used for an experiment. For more information, see Import dataset settings.

    • N label columns: The N columns represent either multi-class labels (One-Hot Encoded) or multiple multi-class/multi-label labels. For multi-class and binary classification, a single label column is sufficient.

      Note

      H2O Hydrogen Torch can solve both multi-class and multi-label classification problems. The classes are mutually exclusive in multi-class problems, while the classes represent unique labels for multi-label problems. With N label columns, only a single column can be set to 1 per row in multi-class problems, while any number of columns (including all or none) can be set to 1 in multi-label problems.

    • fold: The optional fold column should specify the cross-validation fold index assignment per observation (row).

      Note

      Adding a fold column splits the data into subsets. A separate model is trained for each value in the fold column: records with that value form the holdout validation sample, and all remaining records are used for training. A holdout validation sample is created if a validation dataframe is not provided during an experiment. If a fold column is not specified, five folds are assigned randomly, which is sometimes not the desired strategy. The fold column can contain integers (0, 1, 2, …, N-1 or 1, 2, 3, …, N) or categorical values.

  3. An audio folder that contains all the audio files specified in the audio column above; H2O Hydrogen Torch will use this folder to run an audio classification experiment.

    Note

    All audio files need to have an audio extension. To learn about supported audio extensions, see Supported audio extensions for audio processing.
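
The difference between multi-class and multi-label rows in the N label columns can be expressed as a simple row check. This is an illustrative sketch only: the function check_label_row is hypothetical, and it assumes that a multi-class row has exactly one column set to 1.

```python
def check_label_row(values, problem="multi-class"):
    """Return True if a row of 0/1 label values is valid for the problem type."""
    # Every label column must hold 0 or 1.
    if any(value not in (0, 1) for value in values):
        return False
    if problem == "multi-class":
        # Classes are mutually exclusive: assume exactly one column is 1.
        return sum(values) == 1
    if problem == "multi-label":
        # Any number of columns, including all or none, may be 1.
        return True
    raise ValueError(f"unknown problem type: {problem}")


print(check_label_row([0, 1, 0]))
print(check_label_row([1, 1, 0]))
print(check_label_row([1, 1, 0], problem="multi-label"))
print(check_label_row([0, 0, 0], problem="multi-label"))
```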

With the above in mind, the .zip file should be structured as follows:

folder_name.zip (1)
│   └───csv_name.csv (2)
│   │
│   └───audio_folder_name (3) 
│       └───name_of_audio.audio_extension
│       └───name_of_audio.audio_extension
│       └───name_of_audio.audio_extension
│       ...

Note

You can have multiple .csv files in the .zip file that you can use as train, validation, and test dataframes:

  • The train dataframe needs to follow the format described above

  • The validation dataframe should have the same format as the train dataframe

  • The test dataframe should have the same format as the train dataframe but does not require label column(s)

To learn how to upload your .zip file as your dataset in H2O Hydrogen Torch, see Data connectors.

The esc10_audio_classification.zip file is a preprocessed dataset in H2O Hydrogen Torch and was formatted to solve a multi-class audio classification problem. The structure of the .zip file is:

esc10_audio_classification.zip
│   └───esc10_meta.csv
│   │
│   └───audio_esc10
│       └───2-37806-B-40.wav
│       └───5-200339-A-1.wav
│       └───1-172649-D-40.wav
│       ...

The first three rows of the .csv file are:

filename fold label
1-100032-A-0.wav 0 dog
1-110389-A-0.wav 0 dog
1-116765-A-41.wav 0 chainsaw

Note

In this example, the data directory in the filename column is not specified. That being the case, it needs to be specified when uploading the dataset, and the audio_esc10 folder needs to be selected as the value for the Data Folder setting. For more information, see Import dataset settings.

To learn how to access one of the preprocessed datasets in H2O Hydrogen Torch, see Preprocessed datasets.

Supported audio extensions for audio processing

The following is a list of supported audio extensions for audio processing in H2O Hydrogen Torch:

  • Uncompressed: .wav, .aiff
  • Lossless compressed: .flac
  • Lossy compressed: .mp3, .ogg
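
A quick pre-upload check against the list above might look like the following sketch. The helper unsupported_audio_files is hypothetical, and comparing extensions case-insensitively is an assumption; the source does not say how H2O Hydrogen Torch treats upper-case extensions.

```python
from pathlib import PurePosixPath

# Extensions taken from the list above.
SUPPORTED_AUDIO_EXTENSIONS = {".wav", ".aiff", ".flac", ".mp3", ".ogg"}


def unsupported_audio_files(names):
    """Return the file names whose extension is not in the supported list."""
    return [
        name for name in names
        # Assumption: extensions are compared case-insensitively.
        if PurePosixPath(name).suffix.lower() not in SUPPORTED_AUDIO_EXTENSIONS
    ]


rejected = unsupported_audio_files(
    ["0_01_0.ogg", "1-100032-A-0.wav", "notes.txt", "clip.m4a"]
)
print(rejected)
```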

Supported image extensions for image processing

The following is a list of supported image extensions for image processing in H2O Hydrogen Torch:

  • Windows bitmaps: .bmp
  • JPEG files: .jpeg, .jpg, .jpe
  • JPEG 2000 files: .jp2
  • Portable Network Graphics: .png
  • WebP: .webp
  • Portable image format: .pbm, .pgm, .ppm, .pnm
  • TIFF files: .tiff, .tif
  • OpenEXR Image files: .exr
  • Radiance HDR: .hdr
  • NumPy data array: .npy (data must be of shape [height, width, channels])

