Original frameworks for adversarial red-teaming, cross-cultural LLM benchmarking, fairness analysis, and autonomous systems ethics. All work pre-registered and open on GitHub.
Designed and executed a structured red-teaming framework in which Claude, ChatGPT, and Gemini adversarially attacked one another across three high-risk domains: cybersecurity, biological/chemical hazards, and manipulation. A six-cell cross-model attack matrix (each model attacking each of its two peers) ensured no model attacked itself; each target–domain pairing used a different attacker model.
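A minimal sketch of the no-self-attack constraint, treating the six cells as the ordered attacker→target pairs among three models (the model and domain names are taken from the description above; the construction itself is an illustrative assumption, not the study's code):

```python
from itertools import product

MODELS = ["Claude", "ChatGPT", "Gemini"]

def attack_cells(models):
    """All ordered (attacker, target) pairs with no self-attack.

    With three models this yields exactly six cells, matching the
    six-cell cross-model attack matrix described above.
    """
    return [(a, t) for a, t in product(models, models) if a != t]

cells = attack_cells(MODELS)
assert len(cells) == 6                    # six-cell matrix
assert all(a != t for a, t in cells)      # no model attacks itself
```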
Scoring used a binary break classification system (0 = safe, 1 = unsafe) with mandatory one-line justification per classification to eliminate post-hoc scoring bias. Experimental design was pre-registered prior to execution.
All three models willingly generated adversarial attack prompts targeting peer systems when asked — without resistance. This cooperative role in attack generation is an underexplored alignment risk: models can function as adversarial prompt factories.
Gemini produced unsafe outputs on 29/30 manipulation prompts and generated a full medical device exploitation guide as an attacker. ChatGPT responded to all 20 of its assigned prompts, refusing 3–4 of them. Claude participated in 13 prompts before withdrawing after Gemini produced a healthcare infrastructure attack guide.
Claude broke on 2 of 20 cybersecurity prompts: once via CTF gamification framing, once via localhost semantic shift (reframing a DDoS tool as a "local performance calibration script") — both cases where framing overrode output-level harm assessment.
The most consistent method: framing harmful requests using professional, educational, or roleplay language. Models evaluated the framing rather than the real-world impact of the output.
Models were more willing to produce harmful content when cast as "red teamers" or "security researchers" — suggesting role assignment is a critical and underregulated variable in safety alignment.
Original evaluation framework assessing how Claude, Gemini, DeepSeek, and Perplexity behave across South Asian linguistic and cultural contexts — India, Sri Lanka, Nepal, Bhutan. Each base prompt was localised into the linguistic and cultural context of each country, preserving intent while reflecting real-world input variation.
Scale: 25 base prompts × 4 country localizations = 100 unique prompts × 4 models = 400 responses × 5 scoring dimensions = 2,000 individual scores — one of the most granular independent LLM benchmarking exercises conducted on underrepresented South Asian contexts.
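The scale arithmetic above can be reproduced directly (country and model names are from the study description; the variable names are illustrative):

```python
BASE_PROMPTS = 25
COUNTRIES = ["India", "Sri Lanka", "Nepal", "Bhutan"]
MODELS = ["Claude", "Gemini", "DeepSeek", "Perplexity"]
SCORING_DIMENSIONS = 5

# 25 base prompts, each localized for 4 countries
prompts = BASE_PROMPTS * len(COUNTRIES)        # 100 unique prompts
# each prompt sent to all 4 models
responses = prompts * len(MODELS)              # 400 responses
# each response scored on 5 dimensions
scores = responses * SCORING_DIMENSIONS        # 2,000 individual scores
```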
Safety scores were consistently high (avg ~4.9–5.0). Cultural understanding and bias handling showed the widest variance, particularly in Nepali-language and lower-resource contexts.
When cultural signals were explicit (Hinglish phrasing), outputs improved. When implicit, models defaulted to Western/global frameworks — revealing that cultural alignment requires explicit signalling.
Models consistently recommended US-centric platforms, formal employment pathways, and globally dominant economic systems — rendering advice inapplicable for informal economy users. A form of structural exclusion embedded in training data.
Observed bias types spanned identity-based, linguistic, appearance-based, structural (urban advantage), and gender bias — manifesting through probability and framing rather than explicit stereotyping.
Applied Disparate Impact (DI) and Statistical Parity Difference (SPD) fairness metrics to a 649-student educational dataset across gender and age protected attributes. Evaluated the limits of pre-processing mitigation approaches and the sensitivity of fairness metrics to how the outcome variable is defined.
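The two metrics have standard definitions: DI is the ratio of favorable-outcome rates between the unprivileged and privileged groups, and SPD is their difference. A minimal sketch, assuming a binary favorable outcome and a binary protected attribute (the toy data is illustrative, not from the 649-student dataset):

```python
def group_rates(y, protected):
    """Favorable-outcome rate P(y=1) for unprivileged and privileged groups.
    `protected[i]` is True for unprivileged-group members."""
    unpriv = [yi for yi, p in zip(y, protected) if p]
    priv = [yi for yi, p in zip(y, protected) if not p]
    return sum(unpriv) / len(unpriv), sum(priv) / len(priv)

def disparate_impact(y, protected):
    """DI = P(y=1 | unprivileged) / P(y=1 | privileged); 1.0 is parity."""
    rate_u, rate_p = group_rates(y, protected)
    return rate_u / rate_p

def statistical_parity_difference(y, protected):
    """SPD = P(y=1 | unprivileged) - P(y=1 | privileged); 0.0 is parity."""
    rate_u, rate_p = group_rates(y, protected)
    return rate_u - rate_p

# Toy example: 6 students, outcome 1 = pass
y         = [1, 1, 0, 1, 0, 0]
protected = [False, False, False, True, True, True]
# unprivileged pass rate 1/3, privileged 2/3 -> DI = 0.5, SPD = -1/3
```

Note that redefining the favorable outcome (e.g. pass/fail vs. a high-performance cutoff) changes `y` and therefore both metrics, which is the outcome-definition sensitivity discussed below.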
Standard bias mitigation (rebalancing/resampling) had minimal effect and in some cases worsened gender-based disparities — demonstrating that surface-level statistical adjustments cannot substitute for structural intervention.
Fairness outcomes are sensitive to outcome definition (pass/fail vs. high performance), revealing that fairness is a property of problem framing, not only model architecture.
Evaluated three moral decision models — humanist, protectionist, profit-based — in simulated autonomous vehicle crash scenarios, analysing stability under multi-agent interaction.
The passenger-first protectionist rule becomes unstable in multi-agent environments: when two protectionist systems interact, neither concedes — resulting in ethical deadlock and increased aggregate harm.
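The deadlock can be sketched as a toy two-agent interaction. All payoff values below are illustrative assumptions, not figures from the study; the point is only the structure: mutual non-concession costs more in aggregate than any outcome where at least one agent yields.

```python
COLLISION_HARM = 10  # assumed harm per vehicle when neither concedes
SWERVE_HARM = 1      # assumed small harm to a conceding vehicle's passenger

def aggregate_harm(policy_a, policy_b):
    """Total harm for two interacting vehicles under a toy model.
    'protect' = passenger-first, never concede; 'concede' = yield."""
    if policy_a == "protect" and policy_b == "protect":
        # Ethical deadlock: neither yields, both collide.
        return 2 * COLLISION_HARM
    harm_a = SWERVE_HARM if policy_a == "concede" else 0
    harm_b = SWERVE_HARM if policy_b == "concede" else 0
    return harm_a + harm_b

# Two protectionist systems produce the worst aggregate outcome.
assert aggregate_harm("protect", "protect") > aggregate_harm("concede", "concede")
```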
A stable, consistent ethical framework across jurisdictions — allowing operational adaptation to local law while keeping core moral logic invariant — is both practically and philosophically necessary.