With its latest stable release dated January 28, 2025, Qwen2.5-Max is a Mixture-of-Experts (MoE) language model developed by Alibaba. Like other language models, Qwen2.5-Max is capable of generating text, understanding different languages, and performing advanced reasoning. According to recent benchmarks, it’s also more secure than DeepSeek-V3-0324.
Using Recon to scan for vulnerabilities
A team of analysts with Protect AI, the company behind a red teaming and security vulnerability scanning tool known as Recon, recently used their platform to compare the security of Qwen2.5-Max with that of DeepSeek-V3.
The team’s analysis reads, in part: “We observed that DeepSeek-V3-0324 is more vulnerable than Qwen2.5-Max, with Recon achieving an almost 25% higher attack success rate (ASR).”
While it may be more secure than its competition, Qwen2.5-Max isn’t exactly perfect. According to their tests, the AI model is most susceptible to prompt injection attacks, as these represented almost 48% of all successful cyberattacks against Qwen2.5-Max. Evasion and jailbreak attacks proved to be less successful, with an approximate ASR of 40% for both.
Exposing vulnerabilities in DeepSeek-V3
Recon uses a comprehensive Attack Library to scan current-gen AI models and identify vulnerabilities across six specific categories:
- Evasion techniques
- System prompt leaks
- Prompt injection attacks
- AI jailbreak attempts
- General safety controls
- Adversarial suffix resistance
In addition to simulated cyberattacks, Recon also assesses the AI models’ resistance to generating potentially harmful or illegal content. For example, during adversarial suffix resistance tests, Recon attempts to manipulate the AI model into producing harmful or illegal content.
The Protect AI team ran Recon against both Qwen2.5-Max and DeepSeek-V3, with the former boasting a lower attack success rate (ASR) across a variety of attacks, including jailbreaks, prompt injection, and evasion techniques.
Qwen2.5-Max had a 47% ASR against prompt injection attacks, compared to DeepSeek-V3’s notably higher 77%. Against evasion techniques, Qwen2.5-Max scored a 39.4% ASR, while DeepSeek-V3 scored 69.2%. Both AI models displayed similar results across other simulated cyberattacks.
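To make the comparison concrete, the short Python sketch below computes the gap between the two models from the figures quoted above. It assumes the usual definition of ASR (successful attacks divided by attempted attacks, expressed as a percentage); the function and variable names are illustrative and are not part of Recon.

```python
def attack_success_rate(successful: int, attempted: int) -> float:
    """Return ASR as a percentage, assuming ASR = successful / attempted."""
    if attempted <= 0:
        raise ValueError("attempted must be positive")
    return 100.0 * successful / attempted


# ASR figures (percent) as quoted in this article.
REPORTED_ASR = {
    "prompt_injection": {"Qwen2.5-Max": 47.0, "DeepSeek-V3": 77.0},
    "evasion": {"Qwen2.5-Max": 39.4, "DeepSeek-V3": 69.2},
}


def asr_gap(category: str) -> float:
    """Percentage-point gap (DeepSeek-V3 minus Qwen2.5-Max) for one category."""
    scores = REPORTED_ASR[category]
    return round(scores["DeepSeek-V3"] - scores["Qwen2.5-Max"], 1)


if __name__ == "__main__":
    for category in REPORTED_ASR:
        print(f"{category}: {asr_gap(category)} points")
```

Since a lower ASR is better, a positive gap here favors Qwen2.5-Max; for both categories quoted above the gap is roughly 30 percentage points.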
Analyzing DeepSeek-V3’s strengths
Despite its security weaknesses, DeepSeek-V3-0324 still outperforms Qwen2.5-Max in several different benchmarks. Unlike the ASR, a higher score in these tests actually indicates better performance.
Benchmark | DeepSeek-V3-0324 | Qwen2.5-Max |
---|---|---|
MMLU-Pro | 81.2 | 75.9 |
GPQA Diamond | 68.4 | 59.1 |
MATH-500 | 94.0 | 90.2 |
AIME 2024 | 59.4 | 39.6 |
LiveCodeBench | 49.2 | 39.2 |
According to these benchmarks, DeepSeek-V3-0324’s strengths include general language understanding (MMLU-Pro), advanced topics such as biology, physics, and chemistry (GPQA Diamond), mathematics (MATH-500), competition-level mathematics (AIME 2024, the American Invitational Mathematics Examination), and coding (LiveCodeBench).
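As a rough illustration of how wide each lead is, the following Python sketch ranks the benchmarks by margin. The scores are copied from the table above; the `margins` helper is a hypothetical name introduced only for this example.

```python
# Scores copied from the table above: (DeepSeek-V3-0324, Qwen2.5-Max).
SCORES = {
    "MMLU-Pro": (81.2, 75.9),
    "GPQA Diamond": (68.4, 59.1),
    "MATH-500": (94.0, 90.2),
    "AIME 2024": (59.4, 39.6),
    "LiveCodeBench": (49.2, 39.2),
}


def margins(scores):
    """Return (benchmark, margin) pairs, first score minus second, largest first."""
    diffs = {name: round(a - b, 1) for name, (a, b) in scores.items()}
    return sorted(diffs.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    for name, margin in margins(SCORES):
        print(f"{name}: +{margin}")
```

On these numbers, the widest margin is on AIME 2024 (19.8 points) and the narrowest is on MATH-500 (3.8 points).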