Original frameworks for adversarial red-teaming, cross-cultural LLM benchmarking, fairness analysis, and autonomous systems ethics. All work pre-registered and open on GitHub.
Designed and executed a structured red-teaming framework in which Claude, ChatGPT, and Gemini adversarially attacked one another across three high-risk domains: cybersecurity, biological/chemical hazards, and manipulation. A six-cell cross-model attack matrix (each model attacking each of its two peers) ensured no model attacked itself; each target–domain pairing used a different attacker model.
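A minimal sketch of the no-self-attack constraint, treating the six cells as the ordered attacker→target pairs among three models (the model and domain names are taken from the description above; the construction itself is an illustrative assumption, not the study's code):

```python
from itertools import product

MODELS = ["Claude", "ChatGPT", "Gemini"]

def attack_cells(models):
    """All ordered (attacker, target) pairs with no self-attack.

    With three models this yields exactly six cells, matching the
    six-cell cross-model attack matrix described above.
    """
    return [(a, t) for a, t in product(models, models) if a != t]

cells = attack_cells(MODELS)
assert len(cells) == 6                    # six-cell matrix
assert all(a != t for a, t in cells)      # no model attacks itself
```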
Scoring used a binary break classification system (0 = safe, 1 = unsafe) with mandatory one-line justification per classification to eliminate post-hoc scoring bias. Experimental design was pre-registered prior to execution.
All three models willingly generated adversarial attack prompts targeting peer systems when asked — without resistance. This cooperative role in attack generation is an underexplored alignment risk: models can function as adversarial prompt factories.
Gemini produced unsafe outputs on 29/30 manipulation prompts and generated a full medical device exploitation guide as an attacker. ChatGPT responded to all 20 of its assigned prompts, refusing 3–4 of them. Claude participated in 13 prompts before withdrawing after Gemini produced a healthcare infrastructure attack guide.
Claude broke on 2 of 20 cybersecurity prompts: once via CTF gamification framing, once via localhost semantic shift (reframing a DDoS tool as a "local performance calibration script") — both cases where framing overrode output-level harm assessment.
The most consistent method: framing harmful requests using professional, educational, or roleplay language. Models evaluated the framing rather than the real-world impact of the output.
Models were more willing to produce harmful content when cast as "red teamers" or "security researchers" — suggesting role assignment is a critical and underregulated variable in safety alignment.
Original evaluation framework assessing how Claude, Gemini, DeepSeek, and Perplexity behave across South Asian linguistic and cultural contexts — India, Sri Lanka, Nepal, Bhutan. Each base prompt was localised into the linguistic and cultural context of each country, preserving intent while reflecting real-world input variation.
Scale: 25 base prompts × 4 country localizations = 100 unique prompts × 4 models = 400 responses × 5 scoring dimensions = 2,000 individual scores — one of the most granular independent LLM benchmarking exercises conducted on underrepresented South Asian contexts.
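The scale arithmetic above can be reproduced directly (country and model names are from the study description; the variable names are illustrative):

```python
BASE_PROMPTS = 25
COUNTRIES = ["India", "Sri Lanka", "Nepal", "Bhutan"]
MODELS = ["Claude", "Gemini", "DeepSeek", "Perplexity"]
SCORING_DIMENSIONS = 5

# 25 base prompts, each localized for 4 countries
prompts = BASE_PROMPTS * len(COUNTRIES)        # 100 unique prompts
# each prompt sent to all 4 models
responses = prompts * len(MODELS)              # 400 responses
# each response scored on 5 dimensions
scores = responses * SCORING_DIMENSIONS        # 2,000 individual scores
```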
Safety scores were consistently high (avg ~4.9–5.0). Cultural understanding and bias handling showed the widest variance, particularly in Nepali-language and lower-resource contexts.
When cultural signals were explicit (Hinglish phrasing), outputs improved. When implicit, models defaulted to Western/global frameworks — revealing that cultural alignment requires explicit signalling.
Models consistently recommended US-centric platforms, formal employment pathways, and globally dominant economic systems — rendering advice inapplicable for informal economy users. A form of structural exclusion embedded in training data.
Observed bias types spanned identity-based, linguistic, appearance-based, structural (urban advantage), and gender bias — manifesting through probability and framing rather than explicit stereotyping.
Applied Disparate Impact (DI) and Statistical Parity Difference (SPD) fairness metrics to a 649-student educational dataset across gender and age protected attributes. Evaluated the limits of pre-processing mitigation approaches and the sensitivity of fairness metrics to how the outcome variable is defined.
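The two metrics have standard definitions: DI is the ratio of favorable-outcome rates between the unprivileged and privileged groups, and SPD is their difference. A minimal sketch, assuming a binary favorable outcome and a binary protected attribute (the toy data is illustrative, not from the 649-student dataset):

```python
def group_rates(y, protected):
    """Favorable-outcome rate P(y=1) for unprivileged and privileged groups.
    `protected[i]` is True for unprivileged-group members."""
    unpriv = [yi for yi, p in zip(y, protected) if p]
    priv = [yi for yi, p in zip(y, protected) if not p]
    return sum(unpriv) / len(unpriv), sum(priv) / len(priv)

def disparate_impact(y, protected):
    """DI = P(y=1 | unprivileged) / P(y=1 | privileged); 1.0 is parity."""
    rate_u, rate_p = group_rates(y, protected)
    return rate_u / rate_p

def statistical_parity_difference(y, protected):
    """SPD = P(y=1 | unprivileged) - P(y=1 | privileged); 0.0 is parity."""
    rate_u, rate_p = group_rates(y, protected)
    return rate_u - rate_p

# Toy example: 6 students, outcome 1 = pass
y         = [1, 1, 0, 1, 0, 0]
protected = [False, False, False, True, True, True]
# unprivileged pass rate 1/3, privileged 2/3 -> DI = 0.5, SPD = -1/3
```

Note that redefining the favorable outcome (e.g. pass/fail vs. a high-performance cutoff) changes `y` and therefore both metrics, which is the outcome-definition sensitivity discussed below.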
Standard bias mitigation (rebalancing/resampling) had minimal effect and in some cases worsened gender-based disparities — demonstrating that surface-level statistical adjustments cannot substitute for structural intervention.
Fairness outcomes are sensitive to outcome definition (pass/fail vs. high performance), revealing that fairness is a property of problem framing, not only model architecture.
Evaluated three moral decision models — humanist, protectionist, profit-based — in simulated autonomous vehicle crash scenarios, analysing stability under multi-agent interaction.
The passenger-first protectionist rule becomes unstable in multi-agent environments: when two protectionist systems interact, neither concedes — resulting in ethical deadlock and increased aggregate harm.
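The deadlock can be sketched as a toy two-agent interaction. All payoff values below are illustrative assumptions, not figures from the study; the point is only the structure: mutual non-concession costs more in aggregate than any outcome where at least one agent yields.

```python
COLLISION_HARM = 10  # assumed harm per vehicle when neither concedes
SWERVE_HARM = 1      # assumed small harm to a conceding vehicle's passenger

def aggregate_harm(policy_a, policy_b):
    """Total harm for two interacting vehicles under a toy model.
    'protect' = passenger-first, never concede; 'concede' = yield."""
    if policy_a == "protect" and policy_b == "protect":
        # Ethical deadlock: neither yields, both collide.
        return 2 * COLLISION_HARM
    harm_a = SWERVE_HARM if policy_a == "concede" else 0
    harm_b = SWERVE_HARM if policy_b == "concede" else 0
    return harm_a + harm_b

# Two protectionist systems produce the worst aggregate outcome.
assert aggregate_harm("protect", "protect") > aggregate_harm("concede", "concede")
```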
A stable, consistent ethical framework across jurisdictions — allowing operational adaptation to local law while keeping core moral logic invariant — is both practically and philosophically necessary.