To improve Network-on-Chip (NoC) parallelism, this paper proposes a new collision-array-based workload assignment that increases data request cancellation. Using a task flow partitioning algorithm, we minimize sequential data access and then dynamically schedule tasks while minimizing router execution time. Experimental results show that this method provides an average system throughput improvement of 87.7% and an average router execution time reduction of 41.4%. The throughput improvement is a direct consequence of the collision array. A 7x improvement is reported in Fig. 7 when 32 threads are employed on a single core. The system can achieve a 2.7x speedup. By investigating the performance-overhead tradeoff across different collision array sizes, we demonstrate a maximum saving of 42.9% in energy and area overhead, at a cost of only 23.6% performance degradation in terms of router execution time.