So, if any, I would say it's worse for us. Obviously, it's the completely opposite situation for corporations and executives: they are loving the AI situation so much!
Data at https://gertlabs.com/rankings
Discussions about choosing a library with the best syntactic sugar method naming is just as crazy as suggesting we type in assembly.
For a while I was running Cerebras GLM 4.7 for a bunch of tasks. Not a very smart model, but it's fantastic to be have a live prototype of a site up and be able to type "make the fonts bigger. No not that big" and see it change in real time. And MiMo 2.5 is a lot more capable than GLM 4.7.
It will be cool to measure models based on their RAW performance and measure them in terms of ROI - not some benchmark but something meaningful like we used this model to solve X.
That will be a massive mind shift and might justify the token expenditure.
The Xiaomi team really brought something to the table.
$2.61/M tokens * 1,000 tok/s = $9.40/hr
That would be pretty cheap for an 8-GPU node which would typically run around $45/hr or more. Guess this depends on how many parallel streams it can handle.
> "However, naively applying FP4 across the entire model causes degradation in complex reasoning, logic, and code generation. Given the MoE (Mixture of Experts) architecture of Xiaomi MiMo-V2.5-Pro — where Experts constitute the vast majority of parameters and exhibit the highest tolerance to quantization — we selectively quantize only the MoE Experts to FP4 while preserving original precision for all other modules. Through FP4 QAT (Quantization-Aware Training), we dramatically reduce model size and maximize hardware bandwidth utilization while keeping the model's overall capability essentially on par with the original, as shown below"