Data shuffling between distributed workers is a critical step in implementing large-scale learning algorithms. The focus of this work is to understand the fundamental trade-off between the amount of storage and the communication overhead for distributed data shuffling. We first present an information-theoretic formulation of the data shuffling problem, accounting for the underlying problem parameters (i.e., the number of workers, K, the number of data points, N, and the available storage per worker, S). Then, we derive an information-theoretic lower bound on the communication overhead for data shuffling as a function of these parameters. Next, we present a novel coded communication scheme and show that its communication overhead is within a multiplicative factor of at most 2 from the lower bound. Furthermore, we introduce an improved aligned coded shuffling scheme, which achieves the optimal storage vs. communication trade-off for K < 5 and reduces the maximum multiplicative gap to 7/6 for K ≥ 5.
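To make the problem setup concrete, the following is a minimal sketch of the shuffling scenario described above: N data points are distributed among K workers, and after each random reshuffle we count how many points must move. This toy model uses uncoded (unicast) transmission as a naive baseline, and all function names and the counting convention are illustrative assumptions, not the paper's scheme or bounds.

```python
import random

def uncoded_shuffle_overhead(K, N, seed=0):
    """Toy model of the data shuffling problem (illustrative assumption,
    not the paper's coded scheme): N data points are split evenly among
    K workers; after a fresh random permutation, count how many points
    each worker needs that it does not already hold (uncoded baseline)."""
    assert N % K == 0, "for simplicity, assume K divides N"
    rng = random.Random(seed)
    per_worker = N // K
    # Current assignment: worker k holds points [k*per_worker, (k+1)*per_worker).
    holdings = [set(range(k * per_worker, (k + 1) * per_worker))
                for k in range(K)]
    # New random assignment of points to workers (one shuffling epoch).
    perm = list(range(N))
    rng.shuffle(perm)
    new_assignment = [set(perm[k * per_worker:(k + 1) * per_worker])
                      for k in range(K)]
    # Uncoded communication cost: every point a worker needs but lacks
    # is sent as a separate (unicast) transmission.
    return sum(len(new_assignment[k] - holdings[k]) for k in range(K))

print(uncoded_shuffle_overhead(K=4, N=16))
```

Extra storage beyond each worker's own shard (S larger than N/K) lets workers cache points they may need next epoch, and coded multicasting lets one transmission serve several workers at once; quantifying that storage vs. communication trade-off is precisely the subject of this work.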