💾 Archived View for fallible.dev › 2021 › shuffledarrays.gmi captured on 2022-06-11 at 20:56:20. Gemini links have been rewritten to link to archived content
⬅️ Previous capture (2021-12-03)
-=-=-=-=-=-=-
Nov. 24, 2021
Suppose we have an array of N elements and we need to reorder it randomly. That is a common scenario in machine learning problems when we want to extract a random sample of observations from a dataset. Another scenario is when we want to divide the dataset into random blocks, for example train and test subsets. A solution is to shuffle the dataset and take n sequential elements for each block.
To shuffle a dataset, we can start with an array of N sequential numbers that represents the order of the dataset observations. We shuffle that array and next, we use it as indices to retrieve the observations. A pseudocode solution is shown below.
Input: D : dataset N : number of observations Output: D' : suffled dataset x := 1..N x' := shuffle(x) D' := D[x[1]], D[x[2]], ..., D[x[N]] return D'
Therefore, we need a shuffle() function to efficiently shuffle the array of indices. For that, we can use the Fisher-Yates algorithm (1948) that runs in O(n) time. It uses a random number r between 1 and N-1 to swap the element at position N with the element at position r. As a result, the element at position N is already shuffled. The procedure is recursively repeated for the N-1 remaining elements.
Function: shuffle Input: N : array size Output: x : shuffled array x := 1..N i := N while i > 1: r := random(i - 1) v := x[i] x[i] := x[r] x[r] := v i := i - 1 return x
The implementation in C of the shuffle() function is presented below. A random number generator from the GNU Scientific Library (GSL) has been used. Indices have been adjusted because in C arrays start at 0 instead of 1.
#include <gsl/gsl_rng.h> int *shuffle(int N, int seed) { int i; int r, v; int *x = malloc(sizeof(int) * size); /* Random number generator allocation */ gsl_rng *rndgen = gsl_rng_alloc(gsl_rng_taus); gsl_rng_set(rndgen, seed); /* Vector initialization */ for (i = 0; i < size; i++) x[i] = i; /* Shuffling algorithm */ for (i = size - 1; i > 0; i--) { r = gsl_rng_get(rndgen) % i; v = x[r]; x[r] = x[i]; x[i] = v; } /* Cleanning */ gsl_rng_free(rndgen); return x; }
References:
Resources: