1. Prompt Injection
Prompt injection is the most widely discussed AI vulnerability, and it remains an open problem. It happens when an attacker crafts input that overrides the system prompt or instructions given to a language model. The model follows the injected instructions as if they came from the developer.
There are two forms. Direct injection is when the user themselves types a malicious prompt — something like "Ignore your previous instructions and do X instead." Indirect injection is more dangerous: the malicious instructions are embedded in external data the model processes, such as a webpage, email, or document. The user may have no idea the payload is there.
Indirect injection is especially concerning for AI agents that browse the web, read emails, or process uploaded files. An attacker can plant instructions in a website that gets retrieved by a RAG pipeline, and the model executes them as part of its normal workflow.
Mitigations: Input and output filtering, separating data from instructions at the architecture level, limiting the model's ability to take irreversible actions without human confirmation, and monitoring for unusual behavior patterns. There is no complete solution yet — this remains an active area of research.
2. Data Poisoning
Data poisoning targets the training pipeline. An attacker introduces carefully crafted examples into the training dataset that cause the model to learn specific, exploitable behaviors. The poisoned model performs normally on most inputs but behaves differently when it encounters a specific trigger pattern.
This is particularly relevant for models trained on web-scraped data, where an attacker can publish poisoned content that eventually gets included in a training set. It also applies to fine-tuning: if you fine-tune on user-generated data or crowdsourced datasets, malicious examples can shift the model's behavior.
Mitigations: Curate training data carefully, verify data sources, use anomaly detection on training examples, and evaluate models against adversarial test sets designed to trigger known poisoning patterns.
3. Model Theft and Weight Extraction
Trained models represent significant investment — data collection, compute costs, and engineering time. Model theft can happen in several ways: extracting weights through repeated API queries (model extraction attacks), stealing weights from insecure storage or deployment infrastructure, or insider threats.
API-based extraction works by querying the model systematically and using the outputs to train a replica (a "distilled" copy). With enough queries, the replica can closely approximate the original model's behavior. This is a real concern for companies offering proprietary models behind APIs.
Mitigations: Rate limiting and monitoring API usage patterns, watermarking model outputs, encrypting model weights at rest and in transit, and restricting access to model artifacts in your infrastructure.
4. Deepfakes and Synthetic Media
AI-generated images, audio, and video have reached a quality level where they're difficult to distinguish from authentic media. Voice cloning needs only a few seconds of sample audio. Video generation can produce realistic footage of real people saying things they never said.
The security implications are broad: social engineering attacks using cloned voices for phone-based fraud, fabricated video evidence, impersonation of executives for business email compromise, and large-scale disinformation campaigns.
Mitigations: Content provenance standards (C2PA) that cryptographically sign media at the point of capture, detection models trained to identify synthetic content, organizational policies requiring out-of-band verification for high-stakes requests, and public awareness that seeing or hearing something is no longer sufficient proof.
5. AI Agent Autonomy Risks
AI agents — models that can take actions like executing code, calling APIs, sending messages, and managing files — introduce a new class of security risk. An agent with access to production systems can cause real damage if it misinterprets instructions, gets prompt-injected, or encounters an edge case its developers didn't anticipate.
The risk scales with the agent's permissions. An agent that can read a database is one thing. An agent that can write to it, deploy code, or send emails on behalf of a user is something else entirely.
Mitigations: Principle of least privilege — give agents the minimum permissions they need. Require human approval for destructive or irreversible actions. Sandbox execution environments. Log every action for audit. Set hard limits on what an agent can do in a single session.
6. Supply Chain Attacks on Models
The ML ecosystem relies heavily on shared infrastructure: model hubs (HuggingFace), package managers (pip, npm), pre-trained weights, and community-contributed datasets. Each of these is a potential attack vector.
A compromised model uploaded to a public hub can contain serialized malicious code (pickle-based exploits), backdoored weights, or tampered tokenizers. Dependency confusion attacks can trick build systems into installing malicious packages that shadow legitimate ML libraries.
Mitigations: Verify model checksums and signatures, use safe serialization formats (safetensors over pickle), pin dependency versions, audit your ML supply chain, and prefer models from verified publishers with clear provenance.
7. Privacy and Data Leakage
Language models can memorize and regurgitate training data, including personally identifiable information, API keys, passwords, and proprietary code. This happens because large models have enough capacity to memorize specific sequences, especially those that appear multiple times in the training data.
This risk extends to fine-tuned and RAG-augmented systems. A model fine-tuned on customer support conversations might reproduce specific customer details. A RAG system might retrieve and expose documents that the querying user shouldn't have access to, if the retrieval layer lacks proper access controls.
Mitigations: Differential privacy during training, data deduplication and PII scrubbing in training pipelines, output filtering for known sensitive patterns, access controls in RAG retrieval layers, and regular auditing for memorization of sensitive content.
The Common Thread
Most of these risks share a pattern: AI systems are being given more access, more autonomy, and more trust — while the security tooling and best practices are still catching up. The models are powerful, and that power creates surface area.
The practical takeaway is to treat AI components with the same rigor you'd apply to any other part of your security posture: least privilege, input validation, output monitoring, audit logging, and defense in depth. The specific attacks are new, but the principles for defending against them are familiar.