Data Transformation with PHP
Encoding Categorical Variables with Rubix
Categorical data, such as "color" or "size", must be converted to numerical form before most machine learning models can process it. One common approach is one-hot encoding, which represents each category as a binary vector: every distinct category of a feature gets its own column, holding a 1 if the sample belongs to that category and a 0 if it does not. With the color categories red, blue, and green, for example, "blue" becomes 0, 1, 0.
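To make the mechanics concrete, here is a minimal sketch of one-hot encoding a single column in plain PHP, without any library. The function name oneHotEncodeColumn is hypothetical and chosen only for illustration; this is not how Rubix implements its encoder, just the same idea under the assumption that categories are numbered in order of first appearance.

```php
<?php

declare(strict_types=1);

/**
 * One-hot encode a single categorical column (illustrative helper, not part of Rubix).
 *
 * @param string[] $values One value per sample
 * @return array{0: string[], 1: int[][]} The distinct categories and the encoded rows
 */
function oneHotEncodeColumn(array $values): array
{
    // Distinct categories, in order of first appearance
    $categories = array_values(array_unique($values));

    // Map each category to the index of its output column
    $index = array_flip($categories);

    $encoded = [];
    foreach ($values as $value) {
        // Start with all zeros, then set the column of this sample's category to 1
        $row = array_fill(0, count($categories), 0);
        $row[$index[$value]] = 1;
        $encoded[] = $row;
    }

    return [$categories, $encoded];
}

// The "size" column of the sample dataset
$sizes = ['small', 'medium', 'medium', 'large', 'super-large'];

[$categories, $encoded] = oneHotEncodeColumn($sizes);

foreach ($encoded as $i => $row) {
    echo str_pad($sizes[$i], 12) . implode('', $row) . "\n";
}
```

The four distinct sizes produce four columns, so "small" is encoded as 1000 and both "medium" rows as 0100. The number of output columns grows with the number of distinct categories, which is worth keeping in mind for features with many unique values.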
Dataset
red,small
blue,medium
yellow,medium
green,large
dark,super-large
Example of use:
<?php
use Rubix\ML\Datasets\Unlabeled;
use Rubix\ML\Extractors\CSV;
use Rubix\ML\Transformers\OneHotEncoder;
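// Load the dataset from a CSV file that has no header row
$dataset = Unlabeled::fromIterator(new CSV(__DIR__ . '/data/colors_and_size.csv', false));

// Learn the distinct categories of each categorical column
$encoder = new OneHotEncoder();
$encoder->fit($dataset);

// transform() modifies the array it receives in place, so encode a copy
// and keep the original samples for display
$samples = $dataset->samples();
$transformedSamples = $samples;
$encoder->transform($transformedSamples);

// Print each original color next to its full encoded row (color and size columns)
echo "\nAfter Encoding:\n";
echo "--------------\n";
foreach ($transformedSamples as $ind => $sample) {
    echo str_pad($samples[$ind][0], 10) . implode('', $sample) . "\n";
}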