Google Vision vs Space OCR
A fair, fact-checked comparison of Google Cloud Vision and space-ocr: raw text + pixel boundingPoly vs structured key-value fields with a per-value match_ratio, oriented boxes, a queryable sheet, and one HTTP call — proven with a live demo.
Google Cloud Vision is excellent raw OCR. Its TEXT_DETECTION and DOCUMENT_TEXT_DETECTION features return the full text plus a fullTextAnnotation organized as Pages › Blocks › Paragraphs › Words › Symbols, each carrying a boundingPoly and a recognition confidence. If you want every word and where it sits, Vision is a mature, scalable choice.
But there's a distinction people hit on day two: Cloud Vision returns text and geometry, not structured key-value fields. It will tell you the string Invoice No. INV-4471 is on the page, and the pixel box around each word — but it won't hand you { invoice_number: "INV-4471" }. That structured extraction is Google Document AI, a separate, processor-based product (Form Parser, the pretrained Invoice parser, and so on), which you create and configure per processor.
This guide is a fair comparison: where Vision (and Document AI) is the better choice, and where space-ocr fits — and it leads with a live demo you can check rather than a feature grid you have to trust.
Proof first: an extraction you can check
The one thing most OCR vendors won't put in front of you is an extraction where every value points back to the exact spot on the page it came from. Hover any field below: the box on the receipt shows where that value was read, and each value carries a match ratio for the share of its characters that were actually located on the page.

Every value carries a verified on-page location — bbox + 4-point vertices + match_ratio — on a 0–1000 normalized grid (0,0 top-left → 1000,1000 bottom-right), the same shape the live API returns. Hover a field to trace it back to the pixels it came from.
What actually differs
Vision and space-ocr both read documents and return coordinates, so a feature checkbox grid hides the real differences. They are: what shape of output you get (raw text vs. named fields), what kind of per-value score (recognition confidence vs. character coverage), which coordinate system the boxes live in, and how much you have to assemble yourself. Every cell below is a verified fact for each product — use it as a checklist against your own workload.
| Capability | Google Cloud Vision | space-ocr |
|---|---|---|
| Raw text + geometry | Yes — fullTextAnnotation (Pages › Blocks › Paragraphs › Words › Symbols), each with a boundingPoly | Yes — the structured result is built on the same detected symbols |
| Structured key-value fields | Not by default — that's a separate product, Google Document AI (Form Parser / Invoice parser) | Yes — named fields from one /ocr/fields call (a templateId or your own fields) |
| Per-value score, what kind | A recognition confidence, range [0, 1] (the model's self-reported certainty) at Page/Block/Paragraph/Word/Symbol | A match_ratio — the character coverage, the share of the value's characters actually located on the page — plus a bbox_source label |
| Coordinate system | boundingPoly vertices in the original image's pixel scale (and a vertex coordinate of 0 is omitted from the JSON) | bbox integers xmin/ymin/xmax/ymax on a 0–1000 normalized grid, plus four oriented vertices |
| Line items | Not from Vision; available via a Document AI Invoice/Form processor | An array field with children, each cell individually positioned |
| Queryable storage | You store and query results yourself | A stored sheet, queryable server-side via GET /view (where, sort, select) — no re-OCR, no charge |
| CSV export | Build it yourself from the JSON | One click — UTF-8 BOM, line items unfolded |
| CJK | Yes (Vision OCR is broadly multilingual) | Yes — Japanese, Korean, Chinese, English and more, auto-detected, in one engine |
| Setup | Google Cloud project, billing, credentials/SDK; Document AI adds creating & configuring a processor | One HTTPS call with a Bearer key; also a two-line Claude Code plugin |
About "verifiable": the coordinates aren't taken on the model's word. The language model returns each field's text — and a hint of which word tokens it used — but never the boxes themselves. The engine then character-matches that text against the symbols the vision OCR actually detected on the page, so a box lands on the real pixels those characters were found at, and each value gets a match_ratio for how much of it was located (a field is treated as a confident match at ≥ 0.85, labelled vision_symbol_match). Those token hints can be noisy — the model sometimes swaps them between repeated rows — so column- and row-consistency checks validate them rather than trusting them blindly. The point isn't that the AI can't be wrong; it's that every value is checked back against the page, with a score that says how well it matched.
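As a rough illustration of what "character coverage" means, the sketch below scores a value against the page's detected symbols using Python's `difflib`. This is not space-ocr's matcher, which the text does not specify. The function name and the in-order matching strategy are assumptions made for the example.

```python
from difflib import SequenceMatcher


def character_coverage(value: str, page_symbols: list[str]) -> float:
    """Share of the value's characters found, in order, among the page symbols.

    Illustrative only: a real matcher would also check that the matched
    symbols sit next to each other on the page, which this sketch ignores.
    """
    needle = "".join(value.split())  # ignore whitespace inside the value
    if not needle:
        return 0.0
    haystack = "".join(page_symbols)
    matcher = SequenceMatcher(None, needle, haystack, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / len(needle)


# Symbols as an OCR engine might report them: one character each, no spaces.
page_symbols = list("InvoiceNo.INV-4471")

full = character_coverage("INV-4471", page_symbols)      # every character located
partial = character_coverage("INV-4471X", page_symbols)  # trailing "X" is not on the page
none = character_coverage("ZZZ", page_symbols)           # nothing located
```

A value the model invented scores low because its characters are simply not on the page, regardless of how certain the model was.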
Recognition confidence vs. character coverage
This is the subtlest line in the table, so it's worth being precise. Cloud Vision documents a recognition confidence — defined verbatim for a block as "Confidence of the OCR results on the block. Range [0, 1]." — present at every level from Page down to Symbol. It's the model's self-reported certainty that it read the glyphs correctly. That's genuinely useful, and it's exactly what a raw-OCR engine should expose.
space-ocr's match_ratio answers a different question: of the characters in the extracted value, how many were actually found among the symbols on the page? It's a coverage measure, not a self-assessment — at or above 0.85 the value is treated as a confident vision_symbol_match; below that it's flagged low_confidence. Neither number is "better"; they tell you different things. To be fair to Vision: it documents a recognition confidence, not that it can't character-match — space-ocr simply makes character coverage the contract for each structured value.
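The threshold rule above is small enough to write down. In this sketch, the 0.85 cut-off and the two labels come from the text; the function names and the mapping of field name to value object are assumptions about how you might hold the results, not the documented response shape.

```python
CONFIDENT_THRESHOLD = 0.85  # cut-off stated in the text above


def label_for(match_ratio: float) -> str:
    """Apply the at-or-above-0.85 rule to a single value."""
    if match_ratio >= CONFIDENT_THRESHOLD:
        return "vision_symbol_match"
    return "low_confidence"


def values_to_review(fields: dict[str, dict]) -> list[str]:
    """Names of fields whose coverage falls below the threshold.

    Assumes a hypothetical {field_name: {"match_ratio": float, ...}} mapping.
    """
    return [
        name
        for name, value in fields.items()
        if label_for(value["match_ratio"]) == "low_confidence"
    ]


sample_fields = {
    "invoice_number": {"match_ratio": 1.0},
    "issue_date": {"match_ratio": 0.85},
    "total": {"match_ratio": 0.6},
}
flagged = values_to_review(sample_fields)
```

Routing only the flagged names to a human keeps manual review proportional to how much of the page could not be matched.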
Coordinates: pixels-of-this-exact-image vs. a normalized grid
Vision's boundingPoly vertices are in the original image's scale — pixel coordinates of the bytes you uploaded. That's precise, but it ties every box to that exact file: resize, recompress, or mishandle EXIF rotation and your overlays drift. There's even a serialization gotcha — when a vertex's x or y is 0 it is omitted from the JSON (a 100×100 full-image poly serializes as [{}, {"x":100}, {"x":100,"y":100}, {"y":100}]), so naive overlay code has to backfill those zeros.
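Backfilling the omitted zeros is a one-liner once you know to do it. The sketch below assumes the vertices arrive as a list of dicts, as in the JSON example above; the function name is made up for illustration.

```python
def backfill_vertices(vertices: list[dict]) -> list[dict]:
    """Restore x/y keys that were dropped from the JSON because their value was 0."""
    return [{"x": v.get("x", 0), "y": v.get("y", 0)} for v in vertices]


# The 100x100 full-image poly from the text, as serialized.
serialized = [{}, {"x": 100}, {"x": 100, "y": 100}, {"y": 100}]
complete = backfill_vertices(serialized)
```

Without this step, code that reads `vertex["x"]` directly raises a `KeyError` on any box that touches the top or left edge of the image.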
space-ocr returns a bbox of integers { xmin, ymin, xmax, ymax } on a 0–1000 normalized grid (0,0 top-left, 1000,1000 bottom-right), independent of the pixel dimensions. Convert when you actually render: pixel_x = bbox_x / 1000 * image_width. It also returns four ordered vertices (tl/tr/br/bl) for an oriented box that follows a tilted phone photo, and EXIF orientation is applied on load so the returned coordinates match the image you display.
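The conversion formula above extends to all four edges. This sketch assumes the bbox is a dict with the xmin/ymin/xmax/ymax keys described in the text; the function name and the choice to round to whole pixels are assumptions for the example.

```python
def bbox_to_pixels(bbox: dict, image_width: int, image_height: int) -> dict:
    """Scale a 0-1000 normalized bbox to the pixel size of the displayed image."""
    return {
        "xmin": round(bbox["xmin"] * image_width / 1000),
        "ymin": round(bbox["ymin"] * image_height / 1000),
        "xmax": round(bbox["xmax"] * image_width / 1000),
        "ymax": round(bbox["ymax"] * image_height / 1000),
    }


normalized = {"xmin": 100, "ymin": 250, "xmax": 500, "ymax": 300}

# The same normalized box maps cleanly onto the original and a half-size thumbnail.
on_original = bbox_to_pixels(normalized, image_width=2000, image_height=1600)
on_thumbnail = bbox_to_pixels(normalized, image_width=1000, image_height=800)
```

Because the stored box never mentions pixels, resizing the image only changes the two arguments you pass at render time.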
```bash
curl -s https://api.space-ocr.com/ocr/fields \
  -H "Authorization: Bearer $SPACE_OCR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "image": "https://example.com/invoice.jpg",
    "imageType": "url",
    "templateId": "invoice"
  }'
```

That one request returns named fields: no fullTextAnnotation to parse into key-value pairs, and no separate Document AI processor to create. Each value comes back with a bbox on the 0–1000 grid, four oriented vertices, a match_ratio, and a bbox_source. For the full coordinate model see an OCR API with bounding boxes; for receipts and invoices specifically see the invoice data extraction API guide. If AWS is your other option, there's also an Amazon Textract alternative write-up.
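For callers who would rather not shell out to curl, the same request can be assembled with Python's standard library. This is a sketch that mirrors the curl call above, not the official client; the function names are hypothetical, and the response is returned as parsed JSON without assuming its exact layout.

```python
import json
import urllib.request

API_URL = "https://api.space-ocr.com/ocr/fields"  # endpoint from the curl example


def build_fields_request(image_url: str, template_id: str, api_key: str) -> urllib.request.Request:
    """Assemble the POST request; nothing is sent until send() is called."""
    payload = {"image": image_url, "imageType": "url", "templateId": template_id}
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def send(request: urllib.request.Request) -> dict:
    """Perform the call and parse the JSON body (requires network access and a real key)."""
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)


# Built but not sent here; "spocr_example" is a placeholder, not a real key.
request = build_fields_request("https://example.com/invoice.jpg", "invoice", "spocr_example")
```

Keeping construction and sending separate makes the request easy to inspect or unit-test without touching the network.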
Where Google Vision (or Document AI) is the better choice
A fair comparison names where the incumbent wins. Reach for Google when:
- You need massive-scale raw OCR — every word and symbol with geometry — and you'll do the structuring downstream yourself.
- You're already deep in Google Cloud and want OCR that drops into your existing project, billing, IAM, and storage.
- Your structured-extraction needs are met by Document AI's processors — Form Parser, the pretrained Invoice/W2/ID processors, or a custom Document AI extractor trained on your own document types.
If that's you, Google is a strong fit and an alternative buys you little.
Where space-ocr fits instead
space-ocr earns its place when one or more of these matters:
- You want structured fields from one HTTP call — not raw text you parse, and not a separate processor to create and configure. Send an image, get named fields back inline.
- You want to verify, not just trust a recognition score. Every value returns with its on-page box and a match_ratio for character coverage, and clicking a cell highlights exactly where it was read.
- You want coordinates that survive resizing. A 0–1000 normalized grid plus oriented vertices, with EXIF handled on load, so no pixel boxes are glued to one exact file.
- You don't want to stand up storage. Results land in a sheet you can query server-side (GET /view) and export to CSV in one click, with no database to run.
- You build with Claude. A two-line Claude Code plugin and a dependency-free Python client.
The whole call is one HTTPS request. The engine takes raster images (JPEG, PNG, GIF, BMP, TIFF, WebP); in the web app you can drop a PDF and each page is converted to an image first.
How to try space-ocr alongside Cloud Vision
- Get a key, no Google Cloud project. Sign up for the free tier (100 scans a month, no credit card) and grab your spocr_ API key. There's no project, billing account, or Document AI processor to create.
- Send the image. POST the document to /ocr/fields with imageType 'url' or 'base64'. The engine takes raster images (JPEG, PNG, GIF, BMP, TIFF, WebP); language is detected automatically.
- Ask for structured fields directly. Pass templateId 'invoice' or 'receipt' for common cases, or supply your own fields, including an array field with children for line items. No fullTextAnnotation to parse and no processor to configure.
- Verify each value against the page. Read each value's bbox (0–1000 grid), vertices, match_ratio, and bbox_source. In the app, click a cell to highlight exactly where it was read; a match_ratio below 0.85 flags a value worth a closer look.
- Query or export, no storage to build. Push images into a sheet with /upload, query it server-side with GET /view (where, sort, select) at no extra charge, or download CSV with line items unfolded, with no database and no re-OCR.
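For the 'base64' variant of the send step, the request body might be built as follows. This sketch assumes the `image` field carries the plain base64 string when `imageType` is 'base64', mirroring how it carries the URL in the 'url' case; check the API reference for whether a data-URI prefix is expected. The function name is hypothetical.

```python
import base64
import json


def build_base64_payload(image_bytes: bytes, template_id: str) -> str:
    """JSON body for /ocr/fields when sending the image inline.

    Assumption: "image" holds the bare base64 text, with no data-URI prefix.
    """
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return json.dumps(
        {"image": encoded, "imageType": "base64", "templateId": template_id}
    )


# Stand-in bytes; in practice read them with open(path, "rb").read().
fake_image = b"\x89PNG\r\n\x1a\n"
body = build_base64_payload(fake_image, "receipt")
```

Sending bytes inline avoids hosting the file at a public URL, at the cost of a request body roughly a third larger than the image itself.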
See structured fields — with a checkable on-page box — on your own documents
Free tier — 100 scans a month, no credit card, no Google Cloud project. Every value comes back with its location and a match ratio.