V(D)J recombination is the primary mechanism for generating the diverse repertoire of T-cell receptors (TCRs) that the adaptive immune system relies on to recognize a wide variety of diseases. However, modeling the TCR repertoire is computationally challenging because the total number of TCR sequences to be generated and processed can exceed 10^18. We propose a bit-wise implementation of the V(D)J recombination algorithm that reduces the memory footprint and execution time by factors of 4 and 2, respectively, compared to the state-of-the-art GPU implementation. We also present a multi-GPU implementation, experimentally identify suitable workload partitioning strategies for both single- and multi-GPU implementations, and expose the relationship between workload size and the limited scalability the algorithm offers on a cluster with up to eight GPUs. We show that the bit-wise implementation reduces the execution time from 40.5 hours to 19 hours on a single GPU and to 4.4 hours on an eight-GPU configuration.
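
As an illustration of how a bit-wise representation can yield a fourfold memory reduction, the sketch below packs each nucleotide into 2 bits instead of an 8-bit character. This is a minimal sketch under stated assumptions: the encoding table, the function names `pack` and `unpack`, and the example sequence are hypothetical and are not taken from the implementation described above, whose actual bit layout may differ.

```python
# Hypothetical 2-bit nucleotide encoding (an assumption for illustration;
# the actual bit layout of the described implementation is not specified here).
ENCODE = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}
DECODE = "ACGT"


def pack(seq: str) -> bytes:
    """Pack a nucleotide string into bytes, four bases per byte."""
    out = bytearray((len(seq) + 3) // 4)  # round up to whole bytes
    for i, base in enumerate(seq):
        # Base i occupies bits 2*(i % 4) .. 2*(i % 4) + 1 of byte i // 4.
        out[i // 4] |= ENCODE[base] << (2 * (i % 4))
    return bytes(out)


def unpack(packed: bytes, length: int) -> str:
    """Recover the first `length` bases from a packed buffer."""
    return "".join(
        DECODE[(packed[i // 4] >> (2 * (i % 4))) & 0b11] for i in range(length)
    )


if __name__ == "__main__":
    seq = "TGTGCCAGCAGT"  # made-up 12-base example sequence
    packed = pack(seq)
    # One byte per base as text versus one byte per four bases when packed.
    print(len(seq.encode("ascii")), len(packed))
    print(unpack(packed, len(seq)) == seq)
```

With one byte per base as the baseline, a 12-base sequence occupies 12 bytes as text and 3 bytes when packed, which matches the factor-of-4 footprint reduction in the simplest case. The length must be stored separately because a partially filled final byte is padded with zero bits.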