File Uploads: What Could Go Wrong?

Pavel Stanoev
December 7, 2024
#security#backend#owasp#file-upload
Last updated: December 7, 2024

So you need to let users upload files. Profile pictures, maybe? Documents? "How hard could it be?" you think. Just accept the file, check if it's a JPEG, save it somewhere. Ship it Friday, celebrate with the team.

Congratulations, you've just opened a portal straight to hell.

The OWASP Wake-Up Call

Let's talk about the OWASP File Upload Cheat Sheet. If you haven't read it, bookmark this blog and go read it right now. I'll wait.

Back? Good. Scared? You should be.

OWASP maintains this cheat sheet because developers keep making the same mistakes, and attackers keep loving them for it.

Client-Side Validation: The Security Theater

Let's start with everyone's favorite: client-side validation. You know, that thing that makes you feel like you're doing security while actually doing UX.

// Basic client-side validation in React
function FileUpload() {
  const handleFileChange = (e: React.ChangeEvent<HTMLInputElement>) => {
    const file = e.target.files?.[0];
    if (!file) return;

    // File type check
    const allowedTypes = ['image/jpeg', 'image/png', 'image/webp'];
    if (!allowedTypes.includes(file.type)) {
      alert('Only JPEG, PNG, and WebP images are allowed');
      return;
    }

    // File size check (5MB limit)
    const maxSize = 5 * 1024 * 1024;
    if (file.size > maxSize) {
      alert('File size must be less than 5MB');
      return;
    }

    // Check image dimensions
    const img = new Image();
    img.onload = () => {
      if (img.width > 2000 || img.height > 2000) {
        alert('Image dimensions must be less than 2000x2000');
      } else {
        uploadFile(file);
      }
    };
    img.src = URL.createObjectURL(file);
  };

  return <input type="file" accept="image/*" onChange={handleFileChange} />;
}

This is nice for users! They get instant feedback. They don't accidentally upload their tax returns instead of their profile picture. It's friendly. It's helpful. It's completely worthless for security.

Why? Because attackers don't use your carefully crafted React component. They are calling your API directly.

That beautiful validation you wrote? Attackers are laughing at it while uploading definitely-not-malware.exe.jpg.

Client-side validation is for honest users making honest mistakes. But security? Nah.
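
To make "calling your API directly" concrete, here is a minimal sketch of how a script can build an upload request by hand. The form field name "file" and the helper name build_forged_upload are assumptions for illustration. The point is only that the filename and the declared type are whatever the sender types in.

```python
import uuid


def build_forged_upload(payload: bytes, filename: str, claimed_type: str) -> tuple[bytes, str]:
    """Build a multipart/form-data body by hand, the way a script (not a browser) would.

    Returns (body, content_type_header). The field name "file" is an assumption.
    """
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
        f"Content-Type: {claimed_type}\r\n"
        "\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return head + payload + tail, f"multipart/form-data; boundary={boundary}"


# The payload is plainly not an image, yet every label the client controls says it is.
body, header = build_forged_upload(b"<?php echo 'hi'; ?>", "avatar.jpg", "image/jpeg")
```

Nothing in the React component ever runs for a request like this, which is why every check has to be repeated on the server.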

Server-Side Validation: Where We Pretend We Know What We're Doing

Okay, NOW we're doing security. Everything you did on the client side? Do it again on the server. Every. Single. Check. I don't care if it feels redundant. That's the point.

Extension Validation (The "I Checked The Filename" Approach)

# Python/Flask example - This alone is basically waving at hackers
import os

def validate_extension(filename):
    allowed_extensions = {'.jpg', '.jpeg', '.png', '.webp'}
    ext = os.path.splitext(filename)[1].lower()
    return ext in allowed_extensions

"Look, it has .jpg in the name, must be safe!" - Developer who's about to learn a valuable lesson

Attackers have been bypassing this since before you and I were writing code:

  • Double extensions: malware.php.jpg (which file extension wins? Spoiler: not the one you want)
  • Null bytes: malware.php%00.jpg (that .jpg gets truncated and whoops, you're executing PHP)
  • Case variations: malware.PhP (because apparently computers can't read)
  • Unicode tricks: malware.php with invisible zero-width characters (yes, really)

File extensions are suggestions. Suggestions made by the attacker. Would you take security advice from someone trying to hack you? Then why are you trusting their filename?
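
If you still want a filename check as one layer among many, a stricter version can at least refuse the tricks listed above. This is a sketch, not a complete defense. The function name strict_extension_check and the "exactly one dot" rule are my own choices, and the rule will reject some legitimate names like my.holiday.photo.jpg.

```python
import os
import unicodedata

ALLOWED_EXTENSIONS = {".jpg", ".jpeg", ".png", ".webp"}


def strict_extension_check(filename: str) -> bool:
    """Reject names that use common bypass tricks; allow only plain 'name.ext'."""
    # Control characters (including NUL) and invisible format characters
    # such as zero-width spaces have no business in a filename.
    if any(unicodedata.category(ch) in ("Cc", "Cf") for ch in filename):
        return False
    # Path separators mean someone is passing a path, not a name.
    if "/" in filename or "\\" in filename:
        return False
    stem, ext = os.path.splitext(filename)
    # Exactly one dot: no double extensions, no ".php" hiding in the stem.
    if not stem or "." in stem:
        return False
    return ext.lower() in ALLOWED_EXTENSIONS
```

Even when this returns True, the name still tells you nothing about the bytes inside the file.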

MIME Type Validation (AKA: Trusting The Liar)

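# Checking the client-declared Content-Type - This is what amateurs do
# (in a multipart upload the request's own Content-Type is multipart/form-data;
# the per-file type lives on the uploaded part. The field name 'file' is an assumption.)
content_type = request.files['file'].mimetype
if content_type not in ['image/jpeg', 'image/png']:
    return 'Invalid file type', 400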

Oh sweet summer child. The Content-Type header is sent by the client. You know, the client we JUST established is controlled by the attacker who wants to ruin your day?

They can set it to image/jpeg. They can set it to image/cute-puppy. They can set it to definitely-not-malware/promise. The HTTP request doesn't care about your feelings.

The slightly better approach - actually reading the file content:

import magic

def validate_mime_type(file_path):
    # Use python-magic to detect ACTUAL MIME type from file content
    mime = magic.Magic(mime=True)
    detected_mime = mime.from_file(file_path)
    
    allowed_mimes = ['image/jpeg', 'image/png', 'image/webp']
    
    if detected_mime not in allowed_mimes:
        return f'Invalid MIME type: {detected_mime}', 400
    
    return detected_mime, 200

This is better because it reads the actual file content instead of trusting headers. But even this isn't bulletproof - an executable pretending to be an image can fool MIME detection. A PHP web shell cosplaying as a JPEG? Still possible. MIME types are just educated guesses with better accuracy.

File Signature Validation: Finally, Some Actual Security

Okay, NOW we're cooking. Most binary file formats start with a fixed signature (called "magic bytes" because apparently security people are wizards). JPEG files start with FF D8 FF, PNG files start with 89 50 4E 47.

Checking the bytes yourself means you decide exactly which signatures get through, and it's harder to fake than a header or a filename:

def validate_file_signature(file_stream):
    # Read first bytes
    header = file_stream.read(12)
    file_stream.seek(0)  # Reset stream
    
    # JPEG signatures
    if header[:3] == b'\xFF\xD8\xFF':
        return 'jpeg'
    
    # PNG signature
    if header[:8] == b'\x89PNG\r\n\x1a\n':
        return 'png'
    
    # WebP signature
    if header[:4] == b'RIFF' and header[8:12] == b'WEBP':
        return 'webp'
    
    return None

# Usage
actual_type = validate_file_signature(file)
if not actual_type:
    return 'Invalid file format', 400

This is much better! We're actually checking what the file IS, not what it CLAIMS to be.

But of course, attackers have a counter-move. There's a thing called polyglot files - files that are somehow valid in multiple formats at the same time (like someone who speaks three languages, except evil). An attacker can craft a file that passes your JPEG validation AND contains executable code.

Because file formats are complicated, and where there's complexity, there's exploitation.
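
Here is a deliberately crude illustration of why a signature check alone is not enough. A real polyglot is a fully valid image with a payload tucked into a comment or metadata segment. This stand-in would not even decode as an image, but it still sails past a check that only reads the first few bytes. The helper name looks_like_jpeg is made up for the example.

```python
def looks_like_jpeg(data: bytes) -> bool:
    """Signature-only check: true if the data starts with the JPEG magic bytes."""
    return data[:3] == b"\xFF\xD8\xFF"


# JPEG magic bytes followed by something that is very much not image data.
fake_jpeg = b"\xFF\xD8\xFF\xE0" + b"<?php system($_GET['cmd']); ?>"

passes_signature = looks_like_jpeg(fake_jpeg)
contains_php = b"<?php" in fake_jpeg
```

Both values come out True: the file "is a JPEG" as far as the signature check can tell, and it also carries PHP.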

Content Scanning: Trust Issues, The Function

Even after checking magic bytes, you should run the file through image libraries and re-encode it. Think of it as putting the file through a car wash, except instead of removing dirt, you're removing malware:

from PIL import Image
from io import BytesIO

def sanitize_image(file_stream):
    try:
        # Open and validate the image
        img = Image.open(file_stream)
        img.verify()  # Verify it's actually an image
        
        # Reopen (the image object is unusable after verify())
        file_stream.seek(0)
        img = Image.open(file_stream)
        
        # JPEG has no alpha channel or palette, so normalize the mode first
        # (otherwise saving an RGBA or palette PNG raises an error)
        if img.mode != 'RGB':
            img = img.convert('RGB')
        
        # Re-encode to strip metadata and potential exploits
        output = BytesIO()
        img.save(output, format='JPEG', quality=85)
        output.seek(0)
        
        return output
    except Exception as e:
        raise ValueError(f'Invalid image file: {e}') from e

This strips out EXIF data, comments, and any embedded nasties. It's like photocopying a photocopied document - some information gets lost in translation. The good kind of information loss.

But plot twist: even image processing libraries have vulnerabilities. Remember ImageTragick? Yeah. The tools meant to protect you can become attack vectors. Security is a nightmare and we're all just doing our best.

Filename Safety: Path Traversal is a Helluva Drug

User-provided filenames are basically attack vectors with extra steps. Check out this totally innocent filename:

../../etc/passwd

If you use this directly in a file path, congratulations! You've just let an attacker read (or overwrite) system files. Your /etc/passwd is now their /etc/passwd.

Or how about this beauty:

; rm -rf / ;.jpg

If this filename ends up in a shell command somewhere in your stack, you're not just having a bad day. You're having a "update your resume" kind of day.

NEVER. USE. USER. FILENAMES. I don't care if it makes the UX slightly worse. I don't care if your PM wants users to see their original filename. Generate your own:

import uuid
import os

def generate_safe_filename(original_filename):
    # Extract extension (AFTER validation, not before!)
    ext = os.path.splitext(original_filename)[1].lower()
    
    # Generate UUID
    unique_name = str(uuid.uuid4())
    
    return f"{unique_name}{ext}"

# Store mapping in database if you really need to show original names
# user_filename -> generated_filename

Let the user see their original filename in the UI if you must. Store it in the database. But on disk? UUID or bust.
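
Two cheap extras, sketched under assumptions: take the extension from the type you detected rather than from the user's filename, and check that any path you build still lives inside your storage directory. The names EXTENSION_FOR_TYPE, build_storage_path, and is_inside are hypothetical. The type keys match what validate_file_signature returns earlier in this post.

```python
import uuid
from pathlib import Path

# Extension comes from the type YOU detected, never from the user's filename.
EXTENSION_FOR_TYPE = {"jpeg": ".jpg", "png": ".png", "webp": ".webp"}


def build_storage_path(storage_root: str, detected_type: str) -> Path:
    """Return a UUID-named path inside storage_root for a validated file type."""
    ext = EXTENSION_FOR_TYPE[detected_type]  # KeyError for anything unexpected
    return Path(storage_root).resolve() / f"{uuid.uuid4()}{ext}"


def is_inside(storage_root: str, candidate: str) -> bool:
    """True only if candidate, once resolved, still lives under storage_root."""
    root = Path(storage_root).resolve()
    target = (root / candidate).resolve()
    return root in target.parents
```

With UUID names the containment check should never fail, which is exactly why it makes a good tripwire: if it ever does fail, something upstream is passing user input where it shouldn't.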

Storage Location: Location, Location, Exploitation

Where you store files is the difference between "we had a security incident" and "we're on the front page of Hacker News (not in a good way)".

❌ The "I Like To Live Dangerously" Approach: Storing files in your web application directory

/var/www/app/uploads/user_file.php
# If the webserver executes this... narrator: it will

✅ The "I Read The Manual" Approach: Store files outside the webroot

/var/file_storage/uploads/uuid-123.jpg
# Webserver can't execute these directly. Attacker sad.

✅ The "I Have AWS Credits" Approach: Use cloud storage

// Supabase example - let someone else deal with security
const { data, error } = await supabase.storage
  .from('user-uploads')
  .upload(`${userId}/${uuid}.jpg`, file, {
    cacheControl: '3600',
    upsert: false
  });

And for the love of all that is holy, set proper permissions:

  • Readable by the application only
  • Never directly executable (seriously, why would you even...)
  • Served through a handler that validates authorization

If your files have execute permissions "just in case", I have questions. Mainly: why do you hate your future self?
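
As a rough sketch of the first two bullets on a POSIX system, you can create the file with owner-only read/write permissions from the start instead of fixing them up afterwards. The helper names store_private and has_execute_bit are invented for this example, and the demo writes to a temporary directory.

```python
import os
import stat
import tempfile


def store_private(data: bytes, path: str) -> None:
    """Write data to path as a new file that only the owner can read or write."""
    # O_EXCL refuses to overwrite an existing file; mode 0o600 means no
    # execute bit for anyone and no access for group or others.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o600)
    with os.fdopen(fd, "wb") as f:
        f.write(data)


def has_execute_bit(path: str) -> bool:
    """True if any execute permission bit is set on the file."""
    mode = os.stat(path).st_mode
    return bool(mode & (stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH))


# Demo in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.jpg")
    store_private(b"not really a jpeg", demo_path)
    demo_has_exec = has_execute_bit(demo_path)
    demo_mode = stat.S_IMODE(os.stat(demo_path).st_mode)
```

File permissions don't replace the authorization handler from the third bullet. They just limit the damage if something else goes wrong.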

Size Limits and Denial of Service (Or: How I Learned To Stop Worrying and Love Rate Limiting)

Attackers don't just want to hack you. Sometimes they just want to ruin your day by filling up your entire server with garbage:

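# Set hard limits or cry later
from flask import Flask, request

MAX_FILE_SIZE = 5 * 1024 * 1024  # 5MB
MAX_FILES_PER_HOUR = 10  # Stop. Uploading. Things.
MAX_TOTAL_STORAGE_PER_USER = 50 * 1024 * 1024  # 50MB total

app = Flask(__name__)
# Content-Length is client-supplied and can be missing, so also let
# Flask reject oversized request bodies with a 413 on its own
app.config['MAX_CONTENT_LENGTH'] = MAX_FILE_SIZE

@app.route('/upload', methods=['POST'])
def upload():
    # Check size BEFORE reading the entire 50GB "image"
    content_length = request.content_length
    if content_length and content_length > MAX_FILE_SIZE:
        return 'File too large', 413
    
    # Rate limiting - because some people have no chill
    # (current_user_id() and get_user_upload_count() are your app's own helpers)
    user_id = current_user_id()
    user_uploads = get_user_upload_count(user_id, hours=1)
    if user_uploads >= MAX_FILES_PER_HOUR:
        return 'Rate limit exceeded', 429
    
    # ... validate, sanitize, and store the file here ...
    return 'Uploaded', 201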
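
The rate-limit lookup above is left undefined. Here is a sketch of one way to enforce the same kind of limit with an in-memory sliding window. The class name UploadRateLimiter is hypothetical, and it assumes a single process. With several workers you would keep the counters in something shared, such as a database or cache.

```python
import time
from collections import defaultdict, deque
from typing import Optional


class UploadRateLimiter:
    """In-memory sliding-window counter: at most max_uploads per window per user."""

    def __init__(self, max_uploads: int, window_seconds: float):
        self.max_uploads = max_uploads
        self.window_seconds = window_seconds
        self._events = defaultdict(deque)  # user_id -> timestamps of recent uploads

    def allow(self, user_id: str, now: Optional[float] = None) -> bool:
        """Record an upload attempt and return True if it is within the limit."""
        now = time.monotonic() if now is None else now
        events = self._events[user_id]
        # Forget uploads that have aged out of the window.
        while events and now - events[0] >= self.window_seconds:
            events.popleft()
        if len(events) >= self.max_uploads:
            return False
        events.append(now)
        return True
```

The now parameter exists only so the limiter can be exercised without waiting an hour. In a request handler you would call limiter.allow(user_id) and return 429 when it says no.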

And then there's zip bombs. Oh boy. A 42KB file that decompresses to 4.5 petabytes. It's like the TARDIS of malicious files - bigger on the inside. Always check decompressed size:

import zipfile

def safe_extract(zip_path, max_size=100 * 1024 * 1024):  # 100MB
    total_size = 0
    with zipfile.ZipFile(zip_path) as zf:
        # infolist() yields ZipInfo objects; namelist() only yields strings,
        # which have no file_size attribute
        for member in zf.infolist():
            total_size += member.file_size  # declared uncompressed size
            if total_size > max_size:
                raise ValueError('Nice try. Decompressed size exceeds limit.')
        
        # Okay, it's probably safe
        zf.extractall()
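
A belt-and-braces variant, sketched under assumptions: count bytes as they are actually written instead of relying only on the sizes the archive lists, and refuse entries whose paths escape the destination. The function name extract_with_budget and the 64 KB chunk size are arbitrary choices of mine. If it raises partway through, partially extracted files are left behind for the caller to clean up.

```python
import os
import tempfile
import zipfile


def extract_with_budget(zip_path: str, dest_dir: str, max_bytes: int = 100 * 1024 * 1024) -> int:
    """Extract zip_path into dest_dir, aborting once more than max_bytes are decompressed.

    Returns the total number of bytes written.
    """
    dest_root = os.path.realpath(dest_dir)
    written = 0
    with zipfile.ZipFile(zip_path) as zf:
        for info in zf.infolist():
            if info.is_dir():
                continue
            target = os.path.realpath(os.path.join(dest_root, info.filename))
            # Refuse entries like "../../etc/cron.d/x" that escape dest_dir.
            if os.path.commonpath([dest_root, target]) != dest_root:
                raise ValueError(f"Unsafe path in archive: {info.filename}")
            os.makedirs(os.path.dirname(target), exist_ok=True)
            with zf.open(info) as src, open(target, "wb") as dst:
                while chunk := src.read(64 * 1024):
                    written += len(chunk)
                    if written > max_bytes:
                        raise ValueError("Decompressed size exceeds limit")
                    dst.write(chunk)
    return written


# Demo: a tiny archive, extracted once within budget and once over it.
with tempfile.TemporaryDirectory() as tmp:
    archive = os.path.join(tmp, "demo.zip")
    with zipfile.ZipFile(archive, "w", zipfile.ZIP_DEFLATED) as demo_zip:
        demo_zip.writestr("a.txt", b"A" * 1000)
    within_budget = extract_with_budget(archive, os.path.join(tmp, "ok"), max_bytes=10_000)
    try:
        extract_with_budget(archive, os.path.join(tmp, "bomb"), max_bytes=100)
        over_budget_rejected = False
    except ValueError:
        over_budget_rejected = True
```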

Don't be the person who explains to their boss why AWS charged $50,000 for storage this month.

The Plot Twists You Didn't See Coming

SVG: The Image File That Betrayed Us

SVG files are images, right? They go in <img> tags. They're just vectors and paths and... wait, what's that <script> tag doing there?

<svg xmlns="http://www.w3.org/2000/svg">
  <script>
    alert('Surprise! XSS via SVG!');
    // Or: steal session tokens, perform actions as the user, general chaos
  </script>
</svg>

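SVG files are XML. XML can contain JavaScript. Browsers won't run that script when the SVG is loaded through an <img> tag, but they will when someone opens the file's URL directly or the SVG gets embedded inline. If you allow SVG uploads and serve them from your own origin with Content-Type: image/svg+xml, congratulations on your new XSS vulnerability! Your security team will love this.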

Solutions (pick your poison):

  • Don't allow SVGs (easiest, but designers will cry)
  • Sanitize them with a proper SVG sanitizer library (complex, but works)
  • Serve them as Content-Type: text/plain with X-Content-Type-Options: nosniff (they won't render, but they also won't execute)

There's no good answer here. SVG is the file format equivalent of "we have security at home".
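
For the third option, here is a sketch of the response headers involved. The first two come straight from the list above. Content-Disposition and Content-Security-Policy are extra layers I'm adding as an assumption, not something the list requires. The function name svg_download_headers is hypothetical, and the filename passed in should be one you generated, never the user's.

```python
def svg_download_headers(filename: str) -> dict[str, str]:
    """Headers for serving an untrusted SVG so browsers do not run it as a document."""
    return {
        # Plain text: the browser shows source instead of rendering the SVG.
        "Content-Type": "text/plain; charset=utf-8",
        # Stop the browser from second-guessing the declared type.
        "X-Content-Type-Options": "nosniff",
        # Offer the file as a download rather than displaying it on your origin.
        "Content-Disposition": f'attachment; filename="{filename}"',
        # Even if something does render it, scripts are not allowed to run.
        "Content-Security-Policy": "default-src 'none'; sandbox",
    }
```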

The Metadata Attack (EXIF: Exchangeable Image File Format... and Exploits)

Image files contain EXIF metadata - GPS coordinates, camera model, the date your iPhone decided to save photos with the wrong timezone. But also: malicious payloads that exploit parser vulnerabilities.

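# Strip all metadata - burn it with fire
from PIL import Image

def strip_exif(image_path):
    image = Image.open(image_path)
    data = list(image.getdata())
    # Copy pixels only; the metadata stays behind with the old image object
    clean_image = Image.new(image.mode, image.size)
    clean_image.putdata(data)
    # Palette images need their palette copied too, or the colors are lost
    if image.mode == 'P':
        clean_image.putpalette(image.getpalette())
    # Pass the format explicitly so it doesn't depend on the file extension
    clean_image.save(image_path, format=image.format)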

Is this overkill? Maybe. Is it necessary? Yes. Metadata is where attackers hide their fun surprises.

Command Injection via Filenames (Because We Haven't Learned)

If you're processing files with system commands, you're in dangerous territory:

# This is basically asking to get hacked
os.system(f"convert {filename} -resize 800x600 {output}")

# Attacker uploads: "image.jpg; rm -rf /"
# Your server executes: convert image.jpg; rm -rf / -resize 800x600 output
# Your career status: ???

Always use parameterized commands. Always. Always:

# This is how adults do it
import subprocess

# A list of arguments means no shell ever parses the filename
subprocess.run([
    'convert', 
    filename, 
    '-resize', 
    '800x600', 
    output
], check=True)

The extra syntax is annoying. Know what's more annoying? Explaining why the server's root directory is empty.
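
If you want to see the difference without risking your root directory, this sketch passes a hostile filename to a child process as a list argument. The child here is just Python echoing its first argument, standing in for an image tool. The filename is invented for the demo.

```python
import subprocess
import sys

hostile_name = "image.jpg; echo INJECTED"

# The child process only prints its first argument. No shell is involved
# because the command is a list, so the ";" is never interpreted.
result = subprocess.run(
    [sys.executable, "-c", "import sys; print(sys.argv[1])", hostile_name],
    capture_output=True,
    text=True,
    check=True,
)
received = result.stdout.strip()
```

The child receives the whole string, semicolon and all, as one argument. It's still a weird filename, but it's data now, not a command.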

You're STILL Not Secure (Sorry)

Here's the part where I ruin your day: even if you implement EVERYTHING I've mentioned - extension validation, MIME checking, signature verification, content sanitization, safe storage, rate limiting, proper error handling, security headers, and a sacrificial offering to the security gods - you're still not fully protected.

Why? Because:

  1. New vulnerabilities keep turning up in image processing libraries - That Pillow (PIL) install you're using? It might have a zero-day. ImageMagick? More like ImageTragick (yes, that's real: CVE-2016-3714 and friends).
  2. File formats are stupidly complex - The JPEG spec alone could be used as a doorstop. Complexity breeds bugs. Bugs breed exploits.
  3. Zero-day exploits exist RIGHT NOW - Somewhere, an attacker knows about a vulnerability that your validation can't catch. Because it's not public yet.
  4. Attackers are more creative than you - They have time, motivation, and an unhealthy obsession with breaking your stuff.

This is where external security services enter the chat. These are the specialized tools that make security their entire job:

  • ClamAV: Open-source antivirus scanning (free, but you get what you pay for)
  • VirusTotal API: Scan files against 70+ antivirus engines (now we're talking)
  • Amazon GuardDuty Malware Protection for S3 / Microsoft Defender for Storage malware scanning: Let cloud providers do the heavy lifting (expensive but effective)
  • Content Disarm and Reconstruction (CDR): Services that literally rebuild files from scratch, keeping only the content and yeeting everything else

import requests

def scan_with_virustotal(file_path, api_key):
    url = 'https://www.virustotal.com/api/v3/files'
    
    with open(file_path, 'rb') as f:
        files = {'file': f}
        headers = {'x-apikey': api_key}
        response = requests.post(url, files=files, headers=headers, timeout=60)
    
    # Fail loudly instead of silently returning None on a non-200 response
    response.raise_for_status()
    
    analysis_id = response.json()['data']['id']
    # Poll for results... (not shown because this is already long)
    # check_analysis() is a helper you'd write yourself
    return check_analysis(analysis_id, api_key)

These services provide:

  • Behavioral analysis: Run files in sandboxes and watch what they try to do
  • Signature databases: Match against millions of known malware signatures
  • Heuristic detection: "This file is doing something weird" algorithms
  • Machine learning: AI looking for patterns (yes, AI is actually useful for something)

The Bottom Line

File uploads are a minefield wearing a "kick me" sign. Every validation layer you skip is a door you left open with a welcome mat. Every "it's probably fine" is a future post-mortem.

The OWASP cheat sheet isn't paranoia - it's documented history. These attacks don't just "could happen". They do happen. They're happening right now to someone who thought they were being careful.

Your client-side validation is UX theater. Your server-side validation alone is security theater. Even perfect validation isn't enough without external scanning watching your back.

You need defense in depth. You need to trust nothing. You need to validate everything. And even then, you need a plan for when something sneaks through.

Because in security, it's not "if something goes wrong". It's "when".

Now go forth and implement all of this. Your future self (and your security team) will thank you.