Interesting, but the token limit is not impressive. Complex problems require very high token counts. I don't get to choose my LLM, simply because there are only two that are capable of non-toy problems.
I've no problem with others using the problem sets I've devised to test LLMs. I did use DeepSeek when the problem sets were small enough, and it is very good, but its token limit needs work.