How space-ocr is different from LLM OCR: verifiable, structured extraction
How space-ocr differs from prompting a raw LLM for OCR: every extracted value carries a verified on-page box and a match_ratio, and lands as a queryable row.
You can hand a receipt or an invoice to GPT-4o, Gemini, or Claude and ask for the total, the vendor, and the line items. Most of the time you get sensible-looking JSON back. The trouble starts when you try to trust it at volume: the model returns a string, and a string has no address. If the total comes back as 48,200, which pixels on the page did it read? Was that number actually printed on the document, or did the model fill in something plausible? With a raw LLM call you cannot answer that without re-reading the page yourself.
That gap is the whole difference between using a general-purpose LLM as an OCR tool and using space-ocr. space-ocr is not anti-LLM. Under the hood it pairs an OCR engine (Google Cloud Vision) with Gemini for structuring. What it adds is the layer built around the model: every value it returns is checked against what the OCR engine actually saw on the page, scored, and stored as a row you can query. Those are the two things a raw LLM call leaves to you — per-value provenance you can verify, and structured output you can query without standing up a database.
Raw LLM OCR vs space-ocr
| | Raw LLM OCR (GPT-4o / Gemini / Claude) | space-ocr |
|---|---|---|
| Per-value location | none, you get text only | a box (xmin, ymin, xmax, ymax on a 0–1000 grid) plus a four-point oriented quad, on every value |
| Per-value verification | none, you trust the string | match_ratio: how many of the value's characters were actually found among the page's detected symbols, with a bbox_source label |
| Check a value in context | re-read the document yourself | click a cell in the app and its exact region lights up on the original image |
| Output shape | prompt-dependent JSON that varies run to run | a fixed schema: define fields (or pass a templateId) once, every upload lands as one row |
| Storage and query | you build it | results are rows in a sheet you query with GET /view (where, sort, select, limit, offset), no re-OCR and no charge |
| Scripts | depends on the model and prompt | Japanese, Korean, Chinese, English and more, auto-detected, with no language parameter |
| Setup | your own prompt, retry, parse, and validate pipeline | one HTTPS call with a Bearer key |
Here is what "verified" actually means, because it is easy to overclaim. The language model does not emit coordinates. It returns each value's text plus word-token hints, and the engine then matches that text character by character against the symbols Google Cloud Vision detected on the page. It places the box on those real symbols and scores the value with match_ratio, the share of the value's characters that were located among the page's symbols. A high match_ratio indicates a symbol-matched, confident value; a low one is labelled through bbox_source so you can route it to a person. This is not a promise that the model can never be wrong: the token hints can still drift. It means each value is checked against the page and scored instead of taken on faith.
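To make the scoring idea concrete, here is a minimal Python sketch of a match_ratio-style score. It is not space-ocr's actual matcher: it ignores symbol positions and token hints, treats the page as a bag of characters, and skips whitespace. The function names and the 0.9 review threshold are assumptions made up for illustration.

```python
from collections import Counter


def match_ratio(value: str, page_symbols: list[str]) -> float:
    """Share of the value's characters that can be found among the page's symbols.

    Illustrative assumption: whitespace is ignored and each detected symbol
    can be used once. The real engine also uses positions and token hints.
    """
    wanted = Counter(ch for ch in value if not ch.isspace())
    total = sum(wanted.values())
    if total == 0:
        return 0.0
    available = Counter(page_symbols)
    # Counter & Counter keeps the minimum count per character (multiset intersection).
    found = sum((wanted & available).values())
    return found / total


def needs_review(ratio: float, threshold: float = 0.9) -> bool:
    """Flag low-scoring values for a person; 0.9 is a hypothetical cut-off."""
    return ratio < threshold


# Symbols an OCR engine might have detected on a receipt line (made-up data).
symbols = list("TOTAL48,200JPY")

printed = match_ratio("48,200", symbols)  # every character is on the page
drifted = match_ratio("48,900", symbols)  # the "9" was never detected

print(printed, needs_review(printed))
print(round(drifted, 3), needs_review(drifted))
```

The second value scores lower because one of its six characters has no counterpart on the page, which is the kind of signal that lets you send it to a reviewer instead of trusting it. In a response, the score travels with the box and the source label, as in this sample value: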
{
  "vendor": {
    "value": "ACME Trading Co.",
    "bbox": { "xmin": 120, "ymin": 84, "xmax": 512, "ymax": 118 },
    "vertices": [
      { "x": 120, "y": 84 }, { "x": 512, "y": 84 },
      { "x": 512, "y": 118 }, { "x": 120, "y": 118 }
    ],
    "match_ratio": 1.0,
    "bbox_source": "vision_symbol_match"
  }
}

The box is returned on a 0–1000 grid that is independent of the image size, so you scale it to pixels when you draw it: pixel_x = xmin / 1000 * image_width. The four-point quad follows a tilted or rotated scan, ordered top-left, top-right, bottom-right, bottom-left. A raw model reply gives you none of this, so any audit trail you want, you assemble by hand.
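The scaling rule can be sketched in a few lines of Python. The helper names are hypothetical, and scaling y values by the image height is an assumption that mirrors the x formula above. The sample box is the one from the vendor value shown earlier.

```python
def grid_to_pixel(value: float, size: int) -> int:
    """Map a coordinate on the 0-1000 grid to a pixel offset along one axis."""
    return round(value * size / 1000)


def bbox_to_pixels(bbox: dict, image_width: int, image_height: int) -> tuple[int, int, int, int]:
    """Return (left, top, right, bottom) in pixels for an axis-aligned box."""
    return (
        grid_to_pixel(bbox["xmin"], image_width),
        grid_to_pixel(bbox["ymin"], image_height),  # assumption: y scales by height
        grid_to_pixel(bbox["xmax"], image_width),
        grid_to_pixel(bbox["ymax"], image_height),
    )


def quad_to_pixels(vertices: list[dict], image_width: int, image_height: int) -> list[tuple[int, int]]:
    """Scale the four-point quad, keeping its top-left-first ordering."""
    return [
        (grid_to_pixel(v["x"], image_width), grid_to_pixel(v["y"], image_height))
        for v in vertices
    ]


bbox = {"xmin": 120, "ymin": 84, "xmax": 512, "ymax": 118}
vertices = [
    {"x": 120, "y": 84}, {"x": 512, "y": 84},
    {"x": 512, "y": 118}, {"x": 120, "y": 118},
]

# A 2000 x 3000 pixel scan, chosen only for the example.
print(bbox_to_pixels(bbox, 2000, 3000))      # (240, 252, 1024, 354)
print(quad_to_pixels(vertices, 2000, 3000))  # [(240, 252), (1024, 252), (1024, 354), (240, 354)]
```

The resulting tuples can be passed to whatever drawing library you use to outline the region on the original image.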
From values to a queryable table
A raw LLM call ends at the JSON. You still have to persist it, and the moment you want to ask "which invoices this quarter are over 40,000?" you are building a database and a query layer first. With space-ocr the result is already a row in a sheet with a fixed schema, so the querying is an API call: GET /view with where=total>=40000, sort=-invoice_date, select=vendor,total, plus limit and offset for paging. It runs server-side, does not re-run OCR, and is not charged. You can also export the sheet to CSV. The file is UTF-8 with a BOM, so Japanese, Korean, and Chinese text and currency open correctly in Excel, and line-item arrays expand into their own rows.
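As a rough sketch, the query above could be assembled in Python like this. The base URL, the API key placeholder, and the helper names are assumptions, and how a particular sheet is selected is not covered here. Only the /view path, the five parameters, and the Bearer scheme come from the text. The request is built but not sent.

```python
from urllib.parse import urlencode
from urllib.request import Request

BASE_URL = "https://api.example.com"  # hypothetical host, not the real endpoint
API_KEY = "YOUR_API_KEY"              # placeholder


def build_view_url(base_url, *, where=None, sort=None, select=None, limit=None, offset=None):
    """Build a GET /view URL, URL-encoding operators such as '>=' in the filter."""
    candidates = {
        "where": where,
        "sort": sort,
        "select": select,
        "limit": limit,
        "offset": offset,
    }
    # Drop parameters that were not supplied, but keep a legitimate offset of 0.
    params = {name: val for name, val in candidates.items() if val is not None}
    return f"{base_url}/view?{urlencode(params)}"


def build_view_request(url, api_key):
    """Attach the Bearer key; pass the result to urllib.request.urlopen to send it."""
    return Request(url, headers={"Authorization": f"Bearer {api_key}"}, method="GET")


url = build_view_url(
    BASE_URL,
    where="total>=40000",
    sort="-invoice_date",
    select="vendor,total",
    limit=50,
    offset=0,
)
request = build_view_request(url, API_KEY)
print(request.full_url)
```

Encoding the filter matters because characters such as > and = would otherwise be misread as part of the query-string syntax. To page through results, raise offset by the limit on each call.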
When a raw LLM is the better tool
A general-purpose LLM is the right choice when you want a one-off read, a loose summary, or reasoning about what a document means. "What is this contract about?" is a model question, not an OCR-with-coordinates question. Reach for space-ocr when you are processing documents at volume and need each value to be verifiable, consistently structured, and queryable: accounts-payable automation, expense reconciliation, importing business cards into a CRM, or digitizing a backlog of receipts. The honest framing is that space-ocr is an LLM-backed OCR with a verification and storage layer, not a rival to the models themselves.
Billing is pay-as-you-go per scan with optional monthly plans, a monthly allowance of free scans, and no charge when a scan comes back empty. The pricing page has the current figures.