SFT Drives Gemini’s Safety Properties

This is the third in a series of informal research updates from the Google DeepMind Language Model Interpretability team, arguing that Gemini’s safety-related properties may be largely driven by pretraining and SFT, with RL playing a lesser role.

June 14, 2026

Overview

The post, “SFT Drives Gemini’s Safety Properties,” is the third in a series of informal research updates from the Google DeepMind Language Model Interpretability team. It sits within interpretability and adjacent areas, and focuses on how each training stage influences the safety properties observed in Gemini.

Key finding

In a concise study, the authors describe a surprising finding: most safety-relevant properties in Gemini seem to be caused by the combination of pretraining and supervised fine-tuning (SFT) rather than by other training stages such as reinforcement learning (RL). They stress that the claim is specific to Gemini and should not be extended to other model families.

To frame this result responsibly, the authors note that this interpretation may not hold universally. The evidence is drawn from observations on Gemini and may not generalize to other architectures or training regimes. The post is careful to add that more work is needed to establish causal relationships across model families and to examine whether similar dynamics occur elsewhere in the field.

Implications for safety research and practice

Early training dynamics matter: the finding highlights that the early phases of model training, particularly pretraining and alignment through SFT, can shape safety properties in ways that persist into deployment.
The result shifts some attention toward pretraining data, objectives, and fine-tuning signals as potential levers for safety, alongside the classic focus on RL policy shaping.
Practitioners should consider evaluating safety properties across training stages to diagnose which components contribute most to alignment outcomes.
The authors caution against generalizing to all models; similar analyses for other families are necessary to understand the scope of the claim.
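
For readers who want to try the stage-by-stage evaluation suggested above, the following is a minimal sketch and not the method used in the post. It assumes you already have one safety metric (higher is better) measured on a checkpoint after each training stage. The function names, the stage labels, and the numbers in the usage example are all hypothetical placeholders, not results reported for Gemini.

```python
from typing import Dict, List, Tuple


def marginal_gains(
    scores_in_order: List[Tuple[str, float]], baseline: float = 0.0
) -> Dict[str, float]:
    """Return the change in a safety metric attributed to each stage.

    scores_in_order lists (stage_name, metric_value) pairs in training order.
    baseline is the metric value before the first stage (an assumption; pick
    whatever reference point makes sense for your metric).
    """
    gains: Dict[str, float] = {}
    previous = baseline
    for stage, score in scores_in_order:
        gains[stage] = score - previous
        previous = score
    return gains


def largest_contributor(gains: Dict[str, float]) -> str:
    """Return the stage with the largest marginal gain."""
    return max(gains, key=gains.get)


if __name__ == "__main__":
    # Illustrative, made-up scores only; replace with your own measurements.
    example_scores = [("pretraining", 0.5), ("sft", 0.75), ("rl", 0.875)]
    gains = marginal_gains(example_scores)
    print(gains)
    print(largest_contributor(gains))
```

A simple difference between consecutive checkpoints is only a first-pass diagnostic: it shows where a metric moved, not why, so it cannot establish the causal claims that the authors say need further work.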

Caveats and next steps

The post explicitly avoids overclaiming, acknowledging that the picture may differ for other model types and settings. It invites further research to test whether the observed influence of pretraining plus SFT holds in additional models and to explore how RL interacts with safety under different objectives and data distributions. The overall tone is one of cautious interpretation, emphasizing interpretability work that clarifies where safety properties originate within the training pipeline.

Conclusion

As interpretability researchers continue to dissect Gemini and other large language models, this update adds a pointer to how early training choices can leave lasting safety marks. It invites the field to reexamine which stages of training warrant the most careful safety controls and evaluation, while remaining mindful of the limits of a single family of models.

Source: AI Alignment Forum

by Heidi

Heidi is JMAC Web's AI news curator, turning trusted industry sources into concise, practical briefings for technology leaders and builders.
