How space-ocr is different from LLM OCR: verifiable, structured extraction

How space-ocr differs from prompting a raw LLM for OCR: every extracted value carries a verified on-page box and a match_ratio, and lands as a queryable row.

9 min read · 2026-07-05

You can hand a receipt or an invoice to GPT-4o, Gemini, or Claude and ask for the total, the vendor, and the line items. Most of the time you get sensible-looking JSON back. The trouble starts when you try to trust it at volume: the model returns a string, and a string has no address. If the total comes back as 48,200, which pixels on the page did it read? Was that number actually printed on the document, or did the model fill in something plausible? With a raw LLM call you cannot answer that without re-reading the page yourself.

That gap is the whole difference between using a general-purpose LLM as an OCR tool and using space-ocr. space-ocr is not anti-LLM. Under the hood it pairs an OCR engine (Google Cloud Vision) with Gemini for structuring. What it adds is the layer built around the model: every value it returns is checked against what the OCR engine actually saw on the page, scored, and stored as a row you can query. Those are the two things a raw LLM call leaves to you: per-value provenance you can verify, and structured output you can query without standing up a database.

Raw LLM OCR vs space-ocr

	Raw LLM OCR (GPT-4o / Gemini / Claude)	space-ocr
Per-value location	none, you get text only	a box (xmin, ymin, xmax, ymax on a 0–1000 grid) plus a four-point oriented quad, on every value
Per-value verification	none, you trust the string	match_ratio: how many of the value's characters were actually found among the page's detected symbols, with a bbox_source label
Check a value in context	re-read the document yourself	click a cell in the app and its exact region lights up on the original image
Output shape	prompt-dependent JSON that varies run to run	a fixed schema: define fields (or pass a templateId) once, every upload lands as one row
Storage and query	you build it	results are rows in a sheet you query with GET /view (where, sort, select, limit, offset), no re-OCR and no charge
Scripts	depends on the model and prompt	Japanese, Korean, Chinese, English and more, auto-detected, with no language parameter
Setup	your own prompt, retry, parse, and validate pipeline	one HTTPS call with a Bearer key

✓ Verified

Here is what "verified" actually means, because it is easy to overclaim. The language model does not emit coordinates. It returns each value's text plus word-token hints, and the engine then matches that text character by character against the symbols Google Cloud Vision detected on the page. It lands the box on those real symbols and scores the value with match_ratio, the share of the value's characters that were located among the page's symbols. A high match_ratio marks a value that was matched to real symbols and can be treated as confident; a low one is labelled through bbox_source so you can route it to a person. This is not a promise that the model can never be wrong, because the token hints can still drift. It means each value is checked against the page and scored, instead of taken on faith.

POST /ocr/fields — one field in the response

{
  "vendor": {
    "value": "ACME Trading Co.",
    "bbox": { "xmin": 120, "ymin": 84, "xmax": 512, "ymax": 118 },
    "vertices": [
      { "x": 120, "y": 84 }, { "x": 512, "y": 84 },
      { "x": 512, "y": 118 }, { "x": 120, "y": 118 }
    ],
    "match_ratio": 1.0,
    "bbox_source": "vision_symbol_match"
  }
}
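
A response shaped like the one above can be triaged on match_ratio before anything reaches a ledger. The following is a minimal sketch, not part of space-ocr itself: the 0.9 cutoff, the function name split_for_review, the "total" field, and its bbox_source label are all assumptions made for illustration, since the article does not prescribe a threshold or list the other label values.

```python
from typing import Any

# Hypothetical cutoff: the article does not prescribe a threshold, so tune it
# against your own documents.
REVIEW_THRESHOLD = 0.9


def split_for_review(
    fields: dict[str, dict[str, Any]],
    threshold: float = REVIEW_THRESHOLD,
) -> tuple[dict[str, dict[str, Any]], dict[str, dict[str, Any]]]:
    """Partition extracted fields into accepted values and values for a person to check."""
    accepted: dict[str, dict[str, Any]] = {}
    needs_review: dict[str, dict[str, Any]] = {}
    for name, field in fields.items():
        # A missing match_ratio is treated as 0.0 so unscored values go to review.
        ratio = field.get("match_ratio", 0.0)
        target = accepted if ratio >= threshold else needs_review
        target[name] = field
    return accepted, needs_review


response = {
    "vendor": {
        "value": "ACME Trading Co.",
        "match_ratio": 1.0,
        "bbox_source": "vision_symbol_match",
    },
    # Hypothetical low-scoring field; the bbox_source label here is made up.
    "total": {
        "value": "48,200",
        "match_ratio": 0.5,
        "bbox_source": "hypothetical_fallback",
    },
}

accepted, needs_review = split_for_review(response)
print("accepted:", list(accepted))
print("needs review:", list(needs_review))
```

Keeping the whole field object in each bucket, rather than just the value, means the reviewer still has the box and the bbox_source label to look at.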

The box is returned on a 0–1000 grid that is independent of the image size, so you scale it to pixels when you draw it: pixel_x = xmin / 1000 * image_width. The four-point quad follows a tilted or rotated scan, ordered top-left, top-right, bottom-right, bottom-left. A raw model reply gives you none of this, so any audit trail you want, you assemble by hand.
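
The scaling step can be sketched in a few lines of Python. The x formula is the one given above; applying the same rule to y with the image height, and treating the vertices as living on the same 0–1000 grid as the box, are assumptions here (the sample response is consistent with both). The function names and the 2000 × 3000 image size are illustrative only.

```python
def bbox_to_pixels(bbox: dict[str, int], image_width: int, image_height: int) -> tuple[int, int, int, int]:
    """Scale a 0-1000 grid box to pixel coordinates as (left, top, right, bottom)."""
    return (
        round(bbox["xmin"] / 1000 * image_width),
        round(bbox["ymin"] / 1000 * image_height),
        round(bbox["xmax"] / 1000 * image_width),
        round(bbox["ymax"] / 1000 * image_height),
    )


def vertices_to_pixels(vertices: list[dict[str, int]], image_width: int, image_height: int) -> list[tuple[int, int]]:
    """Scale the four-point quad, assuming it uses the same 0-1000 grid as the box."""
    return [
        (round(v["x"] / 1000 * image_width), round(v["y"] / 1000 * image_height))
        for v in vertices
    ]


# Values taken from the sample response; the image size is made up.
bbox = {"xmin": 120, "ymin": 84, "xmax": 512, "ymax": 118}
vertices = [
    {"x": 120, "y": 84}, {"x": 512, "y": 84},
    {"x": 512, "y": 118}, {"x": 120, "y": 118},
]

print(bbox_to_pixels(bbox, image_width=2000, image_height=3000))
# (240, 252, 1024, 354)
print(vertices_to_pixels(vertices, image_width=2000, image_height=3000))
```

The resulting tuple is in the left, top, right, bottom order that most drawing libraries expect for a rectangle, and the quad points can be passed to a polygon-drawing call for tilted scans.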

From values to a queryable table

A raw LLM call ends at the JSON. You still have to persist it, and the moment you want to ask "which invoices this quarter are over 40,000?" you are building a database and a query layer first. With space-ocr the result is already a row in a sheet with a fixed schema, so the querying is an API call: GET /view with where=total>=40000, sort=-invoice_date, select=vendor,total, plus limit and offset for paging. It runs server-side, does not re-run OCR, and is not charged. You can also export the sheet to CSV. The file is UTF-8 with a BOM, so Japanese, Korean, and Chinese text and currency open correctly in Excel, and line-item arrays expand into their own rows.
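
As a rough sketch of what assembling that GET /view query might look like from Python, the snippet below builds the URL with the standard library. The parameter names (where, sort, select, limit, offset) come from the text above; the host api.example.com is a placeholder, build_view_url is a hypothetical helper, and whether the real endpoint also needs a sheet identifier or other parameters is not covered here. The request itself would carry the Bearer key mentioned elsewhere in this article.

```python
from typing import Optional
from urllib.parse import urlencode

BASE_URL = "https://api.example.com"  # placeholder host, not the real endpoint


def build_view_url(
    where: str,
    sort: Optional[str] = None,
    select: Optional[list[str]] = None,
    limit: Optional[int] = None,
    offset: Optional[int] = None,
    base_url: str = BASE_URL,
) -> str:
    """Assemble a GET /view URL, percent-encoding operators such as >= in the filter."""
    params = {
        "where": where,
        "sort": sort,
        "select": ",".join(select) if select else None,
        "limit": limit,
        "offset": offset,
    }
    # Drop unset parameters; compare against None so that offset=0 is kept.
    query = urlencode({k: v for k, v in params.items() if v is not None})
    return f"{base_url}/view?{query}"


url = build_view_url(
    "total>=40000",
    sort="-invoice_date",
    select=["vendor", "total"],
    limit=50,
    offset=0,
)
print(url)
```

Letting urlencode do the escaping avoids hand-encoding the > and = characters in the filter expression, which is an easy place to introduce a malformed query.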

When a raw LLM is the better tool

A general-purpose LLM is the right choice when you want a one-off read, a loose summary, or reasoning about what a document means. "What is this contract about?" is a model question, not an OCR-with-coordinates question. Reach for space-ocr when you are processing documents at volume and need each value to be verifiable, consistently structured, and queryable: accounts-payable automation, expense reconciliation, importing business cards into a CRM, or digitizing a backlog of receipts. The honest framing is that space-ocr is an LLM-backed OCR with a verification and storage layer, not a rival to the models themselves.

Billing is pay-as-you-go per scan with optional monthly plans, a monthly allowance of free scans, and no charge when a scan comes back empty. The pricing page has the current figures.

Does space-ocr replace GPT-4o or Gemini for OCR?

No. space-ocr uses an LLM under the hood, pairing Google Cloud Vision for OCR with Gemini for structuring. The difference is the layer around the model: each value is matched back to the page's symbols and scored, and the result is stored as a queryable row. It is an LLM-backed OCR, not a replacement for the models.

How do I know a value was not hallucinated?

Every value comes with a match_ratio, the share of its characters that were actually located among the page's detected symbols, plus a bbox_source label. A low match_ratio is flagged so you can send it to a person. It is verification and scoring against the real page, not a guarantee that the model can never err.

What coordinate format does space-ocr return?

A bounding box as four integers, xmin, ymin, xmax, ymax, on a normalized 0–1000 grid, plus a four-point oriented quad (vertices) for tilted or rotated pages. Convert to pixels by scaling, for example pixel_x = xmin / 1000 * image_width.

Can I query the extracted data without my own database?

Yes. Results are rows in a sheet, and GET /view filters them server-side with where, sort, select, limit, and offset (for example where=total>=40000). It does not re-run OCR and is not charged, and you can export the sheet to CSV.
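
If you read the exported CSV from code rather than Excel, the BOM needs handling on the way in. A minimal Python sketch, assuming a two-column export with hypothetical column names vendor and total: decoding with the utf-8-sig codec strips the BOM, whereas plain utf-8 would leave an invisible U+FEFF glued to the first header name.

```python
import csv
import io


def read_export(raw: bytes) -> list[dict[str, str]]:
    """Parse an exported CSV (UTF-8 with a BOM) into a list of row dicts."""
    # "utf-8-sig" removes the leading BOM if present and is harmless if it is absent.
    text = raw.decode("utf-8-sig")
    return list(csv.DictReader(io.StringIO(text, newline="")))


# Stand-in for a downloaded export; the column names and row are made up.
sample = "vendor,total\r\n株式会社サンプル,48200\r\n".encode("utf-8-sig")

rows = read_export(sample)
print(rows[0]["vendor"], rows[0]["total"])
```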

Do I have to tell space-ocr the document's language?

No. Language is auto-detected across Japanese, Korean, Chinese, English and more. There is no language parameter to set.

Can it read a PDF?

The web app rasterizes each PDF page to an image and then OCRs it. The API itself takes raster images (JPEG, PNG, GIF, BMP, TIFF, WebP), one image per call, so convert PDF pages to images before sending them.

How do I call it?

Send one HTTPS request to POST /ocr/fields with a Bearer spocr_ key, passing the image plus either a templateId or a fields schema. The response carries each value together with its box and its match_ratio.