The problem with something like HBM is going to be cost, so we're probably looking at it being used as a cache. And in that case, I wouldn't be surprised if main memory moved away from DDR SDRAM and towards volatile flash. Volatile because it has much better endurance than persistent flash: persistent writes are responsible for most of the wear. After all, volatile flash is already being used in servers, with standard DDR SDRAM acting as a cache, simply because flash is a lot cheaper. There is a lot of appetite for capacity in the server world, even at the expense of latency (large SDRAM modules also sacrifice latency for capacity).
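To put a rough number on the cost argument, here's a back-of-envelope sketch in Python comparing an all-DRAM configuration against a flash tier fronted by a smaller DRAM cache. Every price and size in it is a placeholder assumption, not a quote; only the shape of the comparison matters.

```python
# Back-of-envelope: all-DRAM main memory vs. a flash tier fronted by a
# smaller DRAM cache. All prices and sizes are illustrative placeholders.

CAPACITY_GB = 1024        # total main-memory capacity we want to expose
DRAM_PER_GB = 4.00        # assumed $/GB for DDR SDRAM (placeholder)
FLASH_PER_GB = 0.10       # assumed $/GB for flash (placeholder)
CACHE_FRACTION = 0.125    # DRAM cache sized at 1/8 of the flash capacity

all_dram_cost = CAPACITY_GB * DRAM_PER_GB
tiered_cost = (CAPACITY_GB * FLASH_PER_GB
               + CAPACITY_GB * CACHE_FRACTION * DRAM_PER_GB)

print(f"all-DRAM:     ${all_dram_cost:,.0f}")
print(f"flash + DRAM: ${tiered_cost:,.0f}")
print(f"cost ratio:   {all_dram_cost / tiered_cost:.1f}x")
```

With these placeholder numbers the tiered configuration comes out several times cheaper per exposed gigabyte, which is the capacity-over-latency appetite in a nutshell.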
Sure, you can change the topology: split a big processor with many cores into smaller packages and integrate those onto memory modules, creating compute modules. Yes, you'll have compute cores closer to memory, but in doing so you're also putting cores further apart from each other. In some applications this would be perfect. In others, you've got to deal with a lot of core-to-core communication, access to shared memory, and memory consistency.
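As a sketch of that trade-off (all latencies below are made up, purely illustrative): if most accesses stay on the local module, the compute-module layout wins easily; once a sizeable fraction of accesses touch data owned by another module, the inter-module hop eats the gains.

```python
# Toy model of the compute-module trade-off: cores sit next to "their" memory,
# so local accesses get faster, but data shared with cores on another module
# has to cross an inter-module link. Latencies are illustrative placeholders.

LOCAL_NS = 40        # assumed latency to memory on the same compute module
REMOTE_NS = 200      # assumed latency for data owned by another module
MONOLITHIC_NS = 90   # assumed uniform latency of a conventional big socket

def avg_latency(shared_fraction: float) -> float:
    """Average access latency when `shared_fraction` of accesses
    touch data owned by another compute module."""
    return (1 - shared_fraction) * LOCAL_NS + shared_fraction * REMOTE_NS

for shared in (0.0, 0.1, 0.3, 0.5):
    print(f"{shared:4.0%} shared accesses: {avg_latency(shared):6.1f} ns "
          f"(monolithic baseline: {MONOLITHIC_NS} ns)")
```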
Historically, there has been a trade-off between latency and capacity. That's why modern processors have multiple levels of caches, and even RAM can be seen as a cache for persistent storage. Ultimately, it's a compromise, and the benefit depends on the workload and on how well prefetching works. Sometimes you can do a very good job of hiding latency, and reducing a latency that's already hidden doesn't do you much good. There is also throughput, which is more a matter of compute density: if you're bound by it, it generally dictates how many cores it makes sense to install per socket. Fewer cores per socket means fewer cores per rack, which means more racks for the same compute power, which means more floorspace and longer cables.
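To make those two points a bit more concrete, here's a quick back-of-envelope in Python. Treat every number in it as an assumption I picked for illustration; the point is the shape of the math, not the values.

```python
# Two back-of-envelope calculations; every number is an illustrative assumption.

# 1) If caching/prefetching already hides most misses, cutting raw memory
#    latency barely moves the effective (average) access time.
HIT_NS, MISS_NS = 4, 100          # assumed hit and miss latencies

def effective_latency(hit_rate: float, miss_ns: float) -> float:
    return hit_rate * HIT_NS + (1 - hit_rate) * miss_ns

for hit_rate in (0.90, 0.99):
    before = effective_latency(hit_rate, MISS_NS)
    after = effective_latency(hit_rate, MISS_NS / 2)   # halve raw miss latency
    print(f"hit rate {hit_rate:.0%}: {before:5.1f} ns -> {after:5.1f} ns")

# 2) If you're throughput-bound, memory bandwidth caps cores per socket,
#    which cascades into racks, floorspace and cable length.
SOCKET_BW_GBS = 400               # assumed memory bandwidth per socket
CORE_DEMAND_GBS = 10              # assumed bandwidth demand per core
CORES_NEEDED = 10_000             # assumed total core count for the job
SOCKETS_PER_RACK = 40             # assumed sockets per rack

cores_per_socket = SOCKET_BW_GBS // CORE_DEMAND_GBS
racks = -(-CORES_NEEDED // (cores_per_socket * SOCKETS_PER_RACK))  # ceiling
print(f"{cores_per_socket} cores/socket -> {racks} racks for {CORES_NEEDED} cores")
```

At a 99% effective hit rate, halving the raw miss latency changes the average access time by well under a nanosecond in this model, while the bandwidth cap is what actually sets the socket, rack and floorspace budget.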
I don't know how relevant this is to consumers, unless they want to go the integrated path like Apple, which would ultimately mean less choice for consumers.