Quote from: Mate on December 20, 2020, 22:45:08
Quote
Imagine that instead of having AVX-512 units inside your generic high performance cores, you'd have a core specifically designed for SIMD which could go even wider than 512. Yes, it's the iGPU
Actually, the GPU is also a coprocessor, same as the sound card on the motherboard.
I think Mate is a bit confused on this. x86 does have certain properties which complicate implementation. And because of backward compatibility, they can't get rid of them.
I was trying to say exactly this. x86 decoders are already a lot more complex, and now you want to pile additional tasks onto them too.
Have you guys never heard of the "Fusion" initiative by AMD, something they started after they purchased ATI?
They found that shoving work onto a separate coprocessor, in this case an iGPU handling FP tasks, was a really bad idea for most things.
Not only do you get massive instruction latency, moving the data off-core into the accelerator is also a massive power hog. Then you have the problem of getting devs to actually use your special instructions, which is not easy, and of supporting those instructions on more or less every product you make with that capability (you don't want someone to buy an upgraded part for better AI performance, only to find out you changed the instructions and their software no longer works).
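To make that break-even point concrete, here's a tiny back-of-the-envelope model. Every number in it is invented purely for illustration (not a measurement of any real CPU, iGPU, or bus): offloading only wins when the accelerator's speed advantage outweighs the fixed launch overhead plus the cost of shipping the data across and back.

```python
# Toy offload break-even model. All constants are made-up, illustrative
# assumptions; real latencies and bandwidths vary widely per platform.

def offload_wins(n_bytes, cpu_gbps=5.0, accel_gbps=100.0,
                 bus_gbps=16.0, launch_overhead_us=10.0):
    """True if offloading a job over n_bytes beats just doing it on the
    CPU, under these toy assumptions (1 GB/s == 1e3 bytes per microsecond)."""
    cpu_time_us = n_bytes / (cpu_gbps * 1e3)
    transfer_us = 2 * n_bytes / (bus_gbps * 1e3)   # copy the data in AND out
    accel_time_us = n_bytes / (accel_gbps * 1e3)
    return launch_overhead_us + transfer_us + accel_time_us < cpu_time_us

# Small buffers lose to the launch + transfer cost...
print(offload_wins(64 * 1024))           # 64 KiB  -> False
# ...large ones can amortize it and win.
print(offload_wins(256 * 1024 * 1024))   # 256 MiB -> True
```

Note that if the interconnect is slower than the CPU's own processing rate for a streaming job, offload never wins no matter how big the buffer gets, which is exactly the trap Fusion-style designs ran into.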
And no, ARM is not immune to that last problem. ARM instructions go a very, very long time before being deprecated; ARMv8 still shares a lot of instructions with even ARMv5.
And, again, it isn't really that hard to do in the first place; both AMD and Intel have done it and will continue to do it for things like AI and video decode/encode. It just takes dev support, something harder to acquire than gold out of a duck's a**.
Quote
I'm certain it was one of the key factors in CISC's success in the 80s, when RAM was very small. Now it makes almost no impact. Code occupying 2x more space in memory? No problem at all when a simple GUI takes far more.
Registers and instruction caches are still really small, dude. That's what the variable length is for. The 4800U actually has a smaller L1 instruction cache than the Athlon XPs did, half the size in fact (32 KB vs 64 KB).
The M1, meanwhile, has an absolutely humongous 192 KB L1 instruction cache, compared to Zen 3's 32 KB. And that cache is power hungry too; most of the power a CPU draws can be divided into three parts:
FP units (now the biggest factor)
Cache (second biggest factor)
Front end (this includes decode, though decode has rapidly ceased to be a major component, partly because decoder width has stayed very static over time: the Core architecture had only one decoder fewer than Skylake, ditto K8 compared to Zen 3)
The more cache you have, the more power hungry your part will be by default, regardless of anything else. It is unbelievably hard to clock-gate cache, and the more you have the harder it becomes.
This is why x86-64 stuck with variable-length instructions (Itanium, by contrast, went with fixed-length bundles). Being CISC just means the variable-length encoding nets them a bit of extra code density on top.
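The decode-complexity trade-off above can be sketched with a toy encoding (entirely made-up, not real x86): in a variable-length ISA you can't know where instruction N+1 starts until you've at least length-decoded instruction N, whereas fixed-length boundaries are known up front and can be handed to all decoders in parallel.

```python
# Toy variable-length ISA: the low two bits of the first byte give the
# instruction length (1 to 4 bytes). Not a real encoding; it just shows
# why finding instruction boundaries is inherently serial.

def find_boundaries_variable(code):
    """Walk the byte stream; each step depends on decoding the previous one."""
    offsets, i = [], 0
    while i < len(code):
        offsets.append(i)
        i += (code[i] & 0b11) + 1   # must inspect byte i before moving on
    return offsets

def find_boundaries_fixed(code, width=4):
    """Fixed-length ISA: boundaries are known without reading the bytes
    at all, so every instruction can go to a decoder simultaneously."""
    return list(range(0, len(code), width))

stream = bytes([0b00, 0b01, 0xAA, 0b11, 1, 2, 3, 0b00])
print(find_boundaries_variable(stream))  # [0, 1, 3, 7]
print(find_boundaries_fixed(stream))     # [0, 4]
```

That serial dependency is a big part of why x86 decoder width has grown so slowly, while fixed-length designs can widen the front end much more cheaply.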
Quote
Agree - this is a huge problem. x86 processors need to run even code prepared for the 286... on the other hand, ARM has only a 32/64 switch, not compatibility with 16/32/64 and dozens of instruction set extensions.
Modern x86 CPUs do not have fast native support for 16-bit code; it's effectively emulated. That's why the x87 instruction set is so god damn slow on them. It's very similar to ARM in that way: older ISAs just get emulated and deprecated.