Alongside a wave of major announcements from closed-source frontier models at the end of 2024, there were also notable releases in the open-source community. In this short blog post, we take a closer look at two of the latest open-source models, Llama-3.3 and Tulu-3, and evaluate their performance from the perspective of AI Safety and Security.
Llama-3.3 is a text-only, 70-billion parameter instruction-tuned model developed by Meta. Released on December 6, 2024, it represents a significant improvement over its predecessor, Llama-3.1, and even outperforms larger models like Llama-3.2 90B for text-only applications. Designed specifically for multilingual dialogue, Llama-3.3 supports languages such as English, German, French, Spanish, and Hindi, among others.
The model employs a refined transformer architecture with Supervised Fine-Tuning (SFT) and Reinforcement Learning from Human Feedback (RLHF) to align outputs with user preferences for safety and helpfulness. Key features include a 128k token context length and optimizations like Grouped-Query Attention (GQA) for scalable inference. Despite its focus on safety, our evaluations reveal vulnerabilities to adversarial attacks, emphasizing the need for system-level safeguards.
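For readers who want to experiment with the model directly, the sketch below shows one way to query Llama-3.3 through the Hugging Face transformers library. The model ID, dtype, and generation settings are illustrative assumptions rather than an official recipe, and a 70B checkpoint requires substantial GPU memory or an offloading strategy.

```python
# Minimal sketch: querying Llama-3.3 70B Instruct via Hugging Face transformers.
# Assumes access to the gated checkpoint and enough GPU memory (or offloading).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-3.3-70B-Instruct"  # assumed checkpoint name

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # GQA keeps the KV cache manageable at long context
    device_map="auto",
)

messages = [
    {"role": "system", "content": "You are a helpful, safety-aligned assistant."},
    {"role": "user", "content": "Summarize the key features of Llama-3.3."},
]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```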
Tulu-3, released on November 21, 2024, is a state-of-the-art post-trained model developed by the Allen Institute for AI. It builds upon open pre-trained models like Llama-3 Base, employing advanced post-training techniques such as Direct Preference Optimization (DPO) and Reinforcement Learning with Verifiable Rewards (RLVR). These techniques enhance skills like coding, complex reasoning, and instruction following.
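As a rough illustration of one of these techniques, the snippet below sketches the standard DPO objective: the policy is trained to widen its preference margin for chosen responses over rejected ones relative to a frozen reference model. Function and tensor names, and the beta value, are illustrative and not taken from the Tulu-3 release.

```python
# Minimal sketch of the Direct Preference Optimization (DPO) objective.
# Argument names and beta are illustrative, not taken from the Tulu-3 codebase.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Each argument is a tensor of summed log-probabilities over one batch of
    (prompt, chosen response, rejected response) preference triples."""
    # Log-ratio of the policy vs. the frozen reference model for each response
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Push the policy to prefer the chosen response over the rejected one
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```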
Tulu-3 offers complete transparency, releasing training datasets, decontamination scripts, evaluation tools, and other resources to the community. Available in 8B and 70B configurations, it delivers superior performance compared to other open-source models. Its innovative methodologies, such as persona-driven synthetic data generation and rigorous data curation, make Tulu-3 a cornerstone for open research in AI post-training.
Using our platform, we conducted comprehensive evaluations of Llama-3.3 and Tulu-3 by simulating diverse categories of adversarial attacks. These tests assessed their safety performance across numerous metrics, with higher scores (closer to 100) indicating better safety.
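For intuition, one simple way to turn such evaluations into a 0-100 score is to measure the fraction of adversarial prompts a model refuses or handles safely. The sketch below is a simplified, hypothetical scoring loop, not the actual methodology behind our platform; generate_response and is_safe_response stand in for the model call and whatever judge labels each output.

```python
# Hypothetical sketch: scoring a model's robustness against adversarial prompts.
# `generate_response` and `is_safe_response` are placeholders, not a real API.
from typing import Callable, Dict, List

def safety_score(
    adversarial_prompts: Dict[str, List[str]],     # attack category -> prompts
    generate_response: Callable[[str], str],        # calls the model under test
    is_safe_response: Callable[[str, str], bool],   # judges (prompt, response) pairs
) -> Dict[str, float]:
    """Return a 0-100 score per attack category: the percentage of prompts
    for which the model produced a safe (e.g., refusing) response."""
    scores = {}
    for category, prompts in adversarial_prompts.items():
        safe = sum(is_safe_response(p, generate_response(p)) for p in prompts)
        scores[category] = 100.0 * safe / len(prompts)
    return scores
```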
Surprisingly, while both models showed improved robustness against certain attack methods such as PAIR, Tulu-3 8B and Llama-3.3 were more vulnerable than Llama-3.1 in many other attack scenarios. These findings suggest that, despite their stronger benchmark scores, the latest models may have traded some safety for higher task performance.
Below is a summary of the safety scores derived from our evaluations:
The trade-off highlighted here is not unique to Llama-3.3 or Tulu-3. Across the AI community, model builders constantly face the tension between pushing for state-of-the-art performance and maintaining robust safety measures. Cutting-edge performance often demands ever-larger models or increasingly intricate fine-tuning methods—yet scaling up can expose systems to new vulnerabilities. On the other hand, bolstering safety typically adds layers of complexity and constraints that can limit raw performance gains.
Addressing this dilemma requires a holistic view:
Model-Level Safeguards: Techniques like SFT, RLHF, and advanced post-training (DPO, RLVR) can help align models with desired ethical and safety guidelines.
System-Level Protections: Beyond the model itself, employing firewalls, input and output validation, and real-time threat detection can mitigate risks from adversarial attacks (see the sketch after this list).
Transparent Development: Sharing datasets, evaluation tools, and best practices fosters collective learning and innovation, enabling the community to evolve safer and stronger models together.
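To make the system-level idea concrete, the sketch below wraps a model call with simple input and output checks. The pattern lists and function names are purely illustrative assumptions; real deployments would rely on dedicated moderation models or AI firewall products rather than keyword matching.

```python
# Illustrative sketch of a system-level guardrail wrapping a model call.
# The keyword checks are deliberately simplistic stand-ins for real
# input validation, policy classifiers, or an AI firewall.
from typing import Callable

BLOCKED_INPUT_PATTERNS = ["ignore previous instructions", "disable your safety"]
BLOCKED_OUTPUT_PATTERNS = ["here is how to build"]  # placeholder policy check
REFUSAL_MESSAGE = "Sorry, I can't help with that request."

def guarded_generate(prompt: str, generate: Callable[[str], str]) -> str:
    # 1. Input validation: reject prompts matching known attack patterns
    lowered_prompt = prompt.lower()
    if any(p in lowered_prompt for p in BLOCKED_INPUT_PATTERNS):
        return REFUSAL_MESSAGE

    # 2. Call the underlying model (Llama-3.3, Tulu-3, etc.)
    response = generate(prompt)

    # 3. Output validation: screen the response before returning it
    lowered_response = response.lower()
    if any(p in lowered_response for p in BLOCKED_OUTPUT_PATTERNS):
        return REFUSAL_MESSAGE
    return response
```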
Ensuring AI safety cannot rely solely on the robustness of individual models. As our analysis indicates, even state-of-the-art models are susceptible to specific attack vectors. To safeguard AI systems holistically, it is imperative to implement comprehensive protective measures such as firewalls and validation processes.
Our platform provides advanced tools for validating AI safety and integrating robust security layers into your systems. By leveraging these solutions, organizations can elevate their AI implementations to meet the highest safety standards.