Profiling an application executing on a microprocessor is part of the solution to numerous software and hardware optimization and design automation problems. Most current profiling techniques suffer from runtime overhead, inaccuracy, or slowness, and the traditional non-intrusive method of using a logic analyzer doesn't work for today's system-on-a-chip having embedded cores. We introduce a novel on-chip memory architecture that overcomes these limitations. The architecture, which we call ProMem, is based on a pipelined binary tree structure. It achieves single-cycle throughput, so it can keep up with today's fastest pipelined processors. It can also be laid out efficiently and scales very well, becoming more efficient the larger it gets. The memory can be used in a wide-variety of common profiling situations, such as instruction profiling, value profiling, and network traffic profiling, which in turn can be used to guide numerous design automation tasks.