Sending PDFs and Images Directly to gpt-5-nano

A few months ago I wrote about CramSandwich and called PDF parsing a nightmare. The fix I had in mind back then was a better extractor. The fix I shipped this week was deleting the extractor.

OpenAI’s chat completions API takes PDFs and images as input parts, the same way it takes text. Attach the file, the model reads it, you get the response. No pdf-parse. No OCR. No text-cleanup logic. The new content module is one function in 35 lines, and the same code path now handles PDFs, images, and pasted text.

This post covers the math that changed my mind, the GPT-5 settings I had to get right first, and the new pipeline shape.

The cost story on nano models has changed

I had assumed multi-modal input would be expensive. I had not done the math. Pricing on gpt-5-nano (as of mid-2026) makes the calculation different from what it was on GPT-4-class models.

With detail: 'low', every image is a flat 85 input tokens. At nano’s $0.05 per million input tokens, that is $0.0000043 per image. Ten textbook photos cost less than half a cent before counting output. With detail: 'high', a typical phone photo lands around 1,100 input tokens. Still under one cent for ten.

PDFs work the same way. Each page gets billed as both extracted text and a rasterized image. A 10-page PDF is comparable to ten image inputs at high detail. Still cents.
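For concreteness, here is the arithmetic as a quick sketch. The token counts come from the paragraphs above; the price constant is nano's published input rate.

// Back-of-envelope input cost on gpt-5-nano ($0.05 per million input tokens).
const PRICE_PER_INPUT_TOKEN = 0.05 / 1_000_000;

const lowDetailImage = 85;     // flat token cost per image at detail: 'low'
const highDetailPhoto = 1_100; // typical phone photo at detail: 'high'

console.log(10 * lowDetailImage * PRICE_PER_INPUT_TOKEN);  // ≈ $0.0000425 for ten photos
console.log(10 * highDetailPhoto * PRICE_PER_INPUT_TOKEN); // ≈ $0.00055 for ten photos or pages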

The savings on developer time were larger than any token bill. I deleted pdf-parse, the whitespace cleanup logic, the 20,000-character slice that was silently truncating long documents, and a 70-line text extraction module.

Migrating to gpt-5-nano broke things first

The savings only matter if your code runs. Two parameters that worked fine on gpt-4o-mini will throw on gpt-5-nano.

max_tokens is rejected. The new name is max_completion_tokens. Easy fix, one line.

The harder one was reasoning tokens. GPT-5 family models reserve part of the completion budget for internal reasoning, and that reasoning counts against the same max_completion_tokens ceiling. My first run with the new model returned an empty response. Logs showed finish_reason: 'length', completion_tokens: 4000, and reasoning_tokens: 4000. The model spent the entire budget thinking and had nothing left to write.
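You can catch this failure mode explicitly instead of discovering it while parsing an empty string. A sketch, assuming completion is the result of a chat.completions.create call; the usage field names are the real chat completions shape:

const choice = completion.choices[0];
if (choice.finish_reason === 'length' && !choice.message.content) {
  // The whole completion budget went to internal reasoning, none to output.
  const reasoning = completion.usage?.completion_tokens_details?.reasoning_tokens;
  throw new Error(`Budget exhausted by reasoning: ${reasoning} reasoning tokens, no output`);
}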

Two-part fix. Bump the budget to 16,000 to give actual output room. Set reasoning_effort: 'minimal' so the model does not deliberate over a structured JSON formatting task. Quiz generation does not need chain-of-thought reasoning, it needs reliable JSON.

const completion = await openai.chat.completions.create({
  model: 'gpt-5-nano',
  max_completion_tokens: 16000,
  reasoning_effort: 'minimal',
  response_format: { type: 'json_object' },
  messages,
});

If you are migrating from gpt-4o-mini to a GPT-5 model and your task is structured output rather than analysis, set reasoning_effort: 'minimal'. Reasoning is overhead you are paying for and do not want.

One code path for three input types

The shape of the request is what matters. Where I used to send a string, I now send an array of content parts.

const userContent = [
  { type: 'text', text: securePrompt },
  {
    type: 'file',
    file: {
      filename: 'study-notes.pdf',
      file_data: `data:application/pdf;base64,${pdf.toString('base64')}`,
    },
  },
  // or for photos
  {
    type: 'image_url',
    image_url: {
      url: `data:image/jpeg;base64,${img.toString('base64')}`,
      detail: 'low',
    },
  },
];

The text part carries the instructions, the file parts carry the source material.

This collapses three input types into one route. Paste text, upload a PDF, snap photos of a textbook page. The route builds a QuizPart[] array, the quiz module turns each part into the right content element, the model handles the rest. Adding the photo upload feature was an afternoon of frontend work because the backend already knew what to do with images.
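A sketch of that mapping step. The QuizPart shape here is my guess at the app's internal type, not the real definition:

type QuizPart =
  | { kind: 'text'; text: string }
  | { kind: 'pdf'; filename: string; buffer: Buffer }
  | { kind: 'image'; mime: string; buffer: Buffer };

// Turn one QuizPart into the matching chat completions content part.
function toContentPart(part: QuizPart) {
  switch (part.kind) {
    case 'text':
      return { type: 'text' as const, text: part.text };
    case 'pdf':
      return {
        type: 'file' as const,
        file: {
          filename: part.filename,
          file_data: `data:application/pdf;base64,${part.buffer.toString('base64')}`,
        },
      };
    case 'image':
      return {
        type: 'image_url' as const,
        image_url: {
          url: `data:${part.mime};base64,${part.buffer.toString('base64')}`,
          detail: 'low' as const,
        },
      };
  }
}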

Page slicing still happens, just earlier

The old pipeline ran pdf-parse with a page range. The new one slices the PDF buffer with pdf-lib before sending it to the model. Free users get 3 pages, paid users get 30. The slice happens server-side based on the user’s plan and the start and end page they typed in. The model sees a smaller PDF.

const sliced = await slicePdf(pdfBuffer, {
  startPage: 2,
  endPage: 5,
  maxPages: user.subscriptionStatus === 'free' ? 3 : 30,
});

The clamping logic is the boring kind of code that turns out to be load-bearing. If start is greater than end, swap them. If end is past the document, clamp it. If the requested range exceeds the plan limit, truncate from the end. None of this is interesting until you skip it and a free user uploads a 200-page textbook.
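Here is a sketch of slicePdf with that clamping, using pdf-lib's copyPages. The 1-based page convention is an assumption; the real module may differ:

import { PDFDocument } from 'pdf-lib';

async function slicePdf(
  buffer: Buffer,
  opts: { startPage: number; endPage: number; maxPages: number },
): Promise<Buffer> {
  const src = await PDFDocument.load(buffer);
  const pageCount = src.getPageCount();

  // Swap a reversed range, then clamp both ends to the document.
  let start = Math.min(opts.startPage, opts.endPage);
  let end = Math.max(opts.startPage, opts.endPage);
  start = Math.max(1, Math.min(start, pageCount));
  end = Math.max(start, Math.min(end, pageCount));

  // Truncate from the end when the range exceeds the plan limit.
  end = Math.min(end, start + opts.maxPages - 1);

  // Copy the surviving pages into a fresh document.
  const out = await PDFDocument.create();
  const indices = Array.from({ length: end - start + 1 }, (_, i) => start - 1 + i);
  const pages = await out.copyPages(src, indices);
  pages.forEach(page => out.addPage(page));
  return Buffer.from(await out.save());
}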

Image moderation goes upstream of the quiz call

User-uploaded photos need a content check. The model itself will refuse explicit content, but you have already paid for the input tokens by the time it refuses, and the error UX is bad. The fix is OpenAI's moderations endpoint with the omni-moderation-latest model. It is free, accepts images, and takes every uploaded image in a single batched request.

const moderation = await openai.moderations.create({
  model: 'omni-moderation-latest',
  input: images.map(img => ({
    type: 'image_url',
    image_url: { url: `data:${img.mime};base64,${img.buffer.toString('base64')}` },
  })),
});

If anything flags, the request 400s before the quiz call ever happens. Categories come back specific enough to tell the user which image was rejected and why.
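The rejection path looks roughly like this. The Express-style handler and the 400 body are my own; only the results array and its flagged and categories fields come from the moderations API:

const flaggedIndex = moderation.results.findIndex(r => r.flagged);
if (flaggedIndex !== -1) {
  // categories is an object of booleans; keep only the ones that tripped.
  const categories = Object.entries(moderation.results[flaggedIndex].categories)
    .filter(([, hit]) => hit)
    .map(([name]) => name);
  return res.status(400).json({
    error: `Image ${flaggedIndex + 1} was rejected`,
    categories,
  });
}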

I run this fail-closed. If the moderation endpoint itself errors, the request fails. Letting flagged content reach the model during a moderation outage is a worse outcome than blocking a few legitimate uploads.
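Fail-closed just means the moderation call has no permissive fallback. A sketch, reusing the input array built above; the 503 body is my choice:

let moderation;
try {
  moderation = await openai.moderations.create({ model: 'omni-moderation-latest', input });
} catch {
  // Fail closed: a moderation outage blocks the upload instead of skipping the check.
  return res.status(503).json({ error: 'Content check unavailable, try again shortly' });
}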

What changed

The new pipeline is shorter and should handle PDFs the old one mangled, including diagrams, equations, and unusual layouts. Photos work for free.

Run the numbers on whatever model tier you are using. If it is nano-class, the cost question has probably already been answered for you.