
Google Vision vs Space OCR

A fair, fact-checked comparison of Google Cloud Vision and space-ocr: raw text + pixel boundingPoly vs structured key-value fields with a per-value match_ratio, oriented boxes, a queryable sheet, and one HTTP call — proven with a live demo.

8 min read · 2026-06-25

Google Cloud Vision is excellent raw OCR. Its TEXT_DETECTION and DOCUMENT_TEXT_DETECTION features return the full text plus a fullTextAnnotation organized as Pages › Blocks › Paragraphs › Words › Symbols, each carrying a boundingPoly and a recognition confidence. If you want every word and where it sits, Vision is a mature, scalable choice.
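To make that hierarchy concrete, here is a minimal sketch that walks a Vision REST response (already parsed into a Python dict) and yields one record per word. It assumes the JSON field names of the REST API as I understand them; in particular, inside fullTextAnnotation the geometry appears under a `boundingBox` key (of type BoundingPoly). The `iter_words` and `_word` helpers and the sample data are illustrative, not part of any SDK.

```python
from typing import Any, Iterator


def iter_words(response: dict[str, Any]) -> Iterator[dict[str, Any]]:
    """Yield text, confidence and vertices for every word in a Vision response."""
    annotation = response.get("fullTextAnnotation", {})
    for page in annotation.get("pages", []):
        for block in page.get("blocks", []):
            for paragraph in block.get("paragraphs", []):
                for word in paragraph.get("words", []):
                    # A word's text is the concatenation of its symbols.
                    text = "".join(s.get("text", "") for s in word.get("symbols", []))
                    yield {
                        "text": text,
                        "confidence": word.get("confidence"),
                        "vertices": word.get("boundingBox", {}).get("vertices", []),
                    }


def _word(text: str, confidence: float) -> dict[str, Any]:
    """Build a hand-made word entry for the sample below (illustrative data)."""
    box = [{"x": 10, "y": 5}, {"x": 90, "y": 5}, {"x": 90, "y": 25}, {"x": 10, "y": 25}]
    return {
        "symbols": [{"text": ch} for ch in text],
        "confidence": confidence,
        "boundingBox": {"vertices": box},
    }


sample_response = {
    "fullTextAnnotation": {
        "pages": [{"blocks": [{"paragraphs": [{"words": [_word("INV-4471", 0.98)]}]}]}]
    }
}

for record in iter_words(sample_response):
    print(record["text"], record["confidence"])
```

Everything past this point (deciding that a given word is an invoice number, for example) is the structuring step that Vision leaves to you.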

But there's a distinction people hit on day two: Cloud Vision returns text and geometry, not structured key-value fields. It will tell you the string Invoice No. INV-4471 is on the page, and the pixel box around each word — but it won't hand you { invoice_number: "INV-4471" }. That structured extraction is Google Document AI, a separate, processor-based product (Form Parser, the pretrained Invoice parser, and so on), which you create and configure per processor.

This guide is a fair comparison: where Vision (and Document AI) is the better choice, and where space-ocr fits — and it leads with a live demo you can check rather than a feature grid you have to trust.

Proof first: an extraction you can check

The one thing most OCR vendors won't put in front of you is an extraction where every value points back to the exact spot on the page it came from. Hover any field below: the box on the receipt is where that value was read, and each value carries a match ratio for how many of its characters were actually located on the page.

[Interactive demo: source receipts with extracted-field bounding boxes and a list of verified fields, e.g. KINSHO · 合計 2,045 and ライフ · 合計 4,286]

Every value carries a verified on-page location — bbox + 4-point vertices + match_ratio — on a 0–1000 normalized grid (0,0 top-left → 1000,1000 bottom-right), the same shape the live API returns. Hover a field to trace it back to the pixels it came from.

Each extracted field carries its own bounding box and a match ratio — not just a value, but where on the page it lives and how well it matched.

What actually differs

Vision and space-ocr both read documents and return coordinates, so a feature-checkbox grid hides the real differences. They are: what shape of output you get (raw text vs. named fields), what kind of per-value score (recognition confidence vs. character coverage), which coordinate system the boxes live in, and how much you have to assemble yourself. Each cell below states a verified fact about the respective product; use it as a checklist against your own workload.

| Capability | Google Cloud Vision | space-ocr |
| --- | --- | --- |
| Raw text + geometry | Yes: fullTextAnnotation (Pages › Blocks › Paragraphs › Words › Symbols), each with a boundingPoly | Yes: the structured result is built on the same detected symbols |
| Structured key-value fields | Not by default; that's a separate product, Google Document AI (Form Parser / Invoice parser) | Yes: named fields from one /ocr/fields call (a templateId or your own fields) |
| Per-value score, what kind | A recognition confidence, range [0, 1] (the model's self-reported certainty) at Page/Block/Paragraph/Word/Symbol | A match_ratio (character coverage: the share of the value's characters actually located on the page), plus a bbox_source label |
| Coordinate system | boundingPoly vertices in the original image's pixel scale (and a vertex coordinate of 0 is omitted from the JSON) | bbox integers xmin/ymin/xmax/ymax on a 0–1000 normalized grid, plus four oriented vertices |
| Line items | Not from Vision; available via a Document AI Invoice/Form processor | An array field with children, each cell individually positioned |
| Queryable storage | You store and query results yourself | A stored sheet, queryable server-side via GET /view (where, sort, select); no re-OCR, no charge |
| CSV export | Build it yourself from the JSON | One click: UTF-8 BOM, line items unfolded |
| CJK | Yes (Vision OCR is broadly multilingual) | Yes: Japanese, Korean, Chinese, English and more, auto-detected, in one engine |
| Setup | Google Cloud project, billing, credentials/SDK; Document AI adds creating & configuring a processor | One HTTPS call with a Bearer key; also a two-line Claude Code plugin |

About "verifiable": the coordinates aren't taken on the model's word. The language model returns each field's text — and a hint of which word tokens it used — but never the boxes themselves. The engine then character-matches that text against the symbols the vision OCR actually detected on the page, so a box lands on the real pixels those characters were found at, and each value gets a match_ratio for how much of it was located (a field is treated as a confident match at ≥ 0.85, labelled vision_symbol_match). Those token hints can be noisy — the model sometimes swaps them between repeated rows — so column- and row-consistency checks validate them rather than trusting them blindly. The point isn't that the AI can't be wrong; it's that every value is checked back against the page, with a score that says how well it matched.
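As a rough intuition for what "character coverage" means, the sketch below computes the share of a value's characters that can be found among a page's detected symbols. This is a deliberately simplified, bag-of-characters illustration and not space-ocr's actual matching algorithm, which (per the description above) also uses token hints and row/column consistency checks. The function name `character_coverage` is hypothetical.

```python
from collections import Counter


def character_coverage(value: str, page_symbols: list[str]) -> float:
    """Share of the value's non-space characters present among the page symbols.

    Simplified illustration: positions and ordering are ignored, and each
    page symbol can be used at most once.
    """
    wanted = Counter(ch for ch in value if not ch.isspace())
    total = sum(wanted.values())
    if total == 0:
        return 0.0
    available = Counter(page_symbols)
    found = sum(min(count, available[ch]) for ch, count in wanted.items())
    return found / total


# Every character of the value was detected on the page.
print(character_coverage("INV-4471", list("Invoice No. INV-4471")))  # 1.0

# The final "1" was never detected, so 7 of 8 characters are covered.
print(character_coverage("INV-4471", list("INV-447")))  # 0.875
```

A value the language model produced that has no support on the page scores low under any measure of this kind, which is the property the match_ratio is meant to expose.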

Recognition confidence vs. character coverage

This is the subtlest line in the table, so it's worth being precise. Cloud Vision documents a recognition confidence — defined verbatim for a block as "Confidence of the OCR results on the block. Range [0, 1]." — present at every level from Page down to Symbol. It's the model's self-reported certainty that it read the glyphs correctly. That's genuinely useful, and it's exactly what a raw-OCR engine should expose.

space-ocr's match_ratio answers a different question: of the characters in the extracted value, how many were actually found among the symbols on the page? It's a coverage measure, not a self-assessment. At or above 0.85 the value is treated as a confident vision_symbol_match; below that it's flagged low_confidence. Neither number is "better"; they tell you different things. To be fair to Vision: its documentation describes a recognition confidence and does not say Vision can't character-match. space-ocr simply makes character coverage the contract for each structured value.
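The 0.85 threshold lends itself to a simple review queue. The sketch below assumes a result in which each named field is a dict carrying a `match_ratio`; that response shape, and the helper names, are assumptions for illustration rather than the documented API. Only the threshold and the two labels come from the description above.

```python
CONFIDENT_THRESHOLD = 0.85  # from the text: >= 0.85 is a confident match


def label_for(match_ratio: float) -> str:
    """Map a match_ratio to the label described in the article."""
    if match_ratio >= CONFIDENT_THRESHOLD:
        return "vision_symbol_match"
    return "low_confidence"


def needs_review(fields: dict[str, dict]) -> list[str]:
    """Names of fields whose match_ratio is missing or below the threshold."""
    return [
        name
        for name, field in fields.items()
        if field.get("match_ratio", 0.0) < CONFIDENT_THRESHOLD
    ]


# Hypothetical result shape, for illustration only.
result = {
    "invoice_number": {"value": "INV-4471", "match_ratio": 1.0},
    "total": {"value": "2,045", "match_ratio": 0.6},
}

print(label_for(result["invoice_number"]["match_ratio"]))  # vision_symbol_match
print(needs_review(result))  # ['total']
```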

Coordinates: pixels-of-this-exact-image vs. a normalized grid

Vision's boundingPoly vertices are in the original image's scale — pixel coordinates of the bytes you uploaded. That's precise, but it ties every box to that exact file: resize, recompress, or mishandle EXIF rotation and your overlays drift. There's even a serialization gotcha — when a vertex's x or y is 0 it is omitted from the JSON (a 100×100 full-image poly serializes as [{}, {"x":100}, {"x":100,"y":100}, {"y":100}]), so naive overlay code has to backfill those zeros.
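Backfilling the omitted zeros is a one-liner once you know to do it. A minimal sketch, using the 100×100 example above (the helper name is illustrative):

```python
def backfill_vertices(vertices: list[dict]) -> list[tuple[int, int]]:
    """Return (x, y) tuples, treating an omitted x or y as 0."""
    return [(vertex.get("x", 0), vertex.get("y", 0)) for vertex in vertices]


# The full-image polygon exactly as it appears in the JSON.
poly = [{}, {"x": 100}, {"x": 100, "y": 100}, {"y": 100}]

print(backfill_vertices(poly))  # [(0, 0), (100, 0), (100, 100), (0, 100)]
```

Code that indexes `vertex["x"]` directly raises a KeyError on the first vertex here, which is usually how the omission gets discovered.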

space-ocr returns a bbox of integers { xmin, ymin, xmax, ymax } on a 0–1000 normalized grid (0,0 top-left, 1000,1000 bottom-right), independent of the pixel dimensions. Convert when you actually render: pixel_x = bbox_x / 1000 * image_width. It also returns four ordered vertices (tl/tr/br/bl) for an oriented box that follows a tilted phone photo, and EXIF orientation is applied on load so the returned coordinates match the image you display.
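The conversion can be sketched as a small helper. The x formula is the one given above; applying the same scaling to y with the image height is the natural counterpart and is assumed here. The function name is illustrative.

```python
def bbox_to_pixels(bbox: dict, image_width: int, image_height: int) -> tuple[int, int, int, int]:
    """Scale a 0-1000 normalized bbox to pixel coordinates of the displayed image.

    Returns (left, top, right, bottom), rounded to whole pixels.
    """
    return (
        round(bbox["xmin"] * image_width / 1000),
        round(bbox["ymin"] * image_height / 1000),
        round(bbox["xmax"] * image_width / 1000),
        round(bbox["ymax"] * image_height / 1000),
    )


bbox = {"xmin": 100, "ymin": 250, "xmax": 500, "ymax": 300}

# The same normalized box maps onto any rendering size.
print(bbox_to_pixels(bbox, 2000, 1200))  # (200, 300, 1000, 360)
print(bbox_to_pixels(bbox, 1000, 600))   # (100, 150, 500, 180)
```

Because the box is stored independently of pixel dimensions, a thumbnail and a full-resolution view can share one set of coordinates.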

Extract structured fields: one call, no processor to configure
curl -s https://api.space-ocr.com/ocr/fields \
  -H "Authorization: Bearer $SPACE_OCR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "image": "https://example.com/invoice.jpg",
    "imageType": "url",
    "templateId": "invoice"
  }'
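
The same request as the curl call above can be sketched in Python using only the standard library. The endpoint, headers, and body fields are taken from that curl example; the helper names are illustrative, and because the response schema isn't shown here, the function simply returns the parsed JSON.

```python
import json
import os
import urllib.request

API_URL = "https://api.space-ocr.com/ocr/fields"


def build_request(image_url: str, template_id: str, api_key: str) -> urllib.request.Request:
    """Build the POST request without sending it."""
    payload = {"image": image_url, "imageType": "url", "templateId": template_id}
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def extract_fields(image_url: str, template_id: str = "invoice") -> dict:
    """Send the request and return the parsed JSON response."""
    api_key = os.environ["SPACE_OCR_API_KEY"]
    request = build_request(image_url, template_id, api_key)
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)


# Usage (performs a real network call, so it is left commented out):
# fields = extract_fields("https://example.com/invoice.jpg")
```

Keeping request construction separate from sending makes the payload easy to inspect or unit-test without touching the network.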

That one request returns named fields — no fullTextAnnotation to parse into key-value pairs, and no separate Document AI processor to create. Each value comes back with a bbox on the 0–1000 grid, four oriented vertices, a match_ratio, and a bbox_source. For the full coordinate model see an OCR API with bounding boxes; for receipts and invoices specifically see the invoice data extraction API guide. If AWS is your other option, there's also an Amazon Textract alternative write-up.

Click any extracted cell and its exact region lights up on the original — the value-to-pixel link that raw text + boundingPoly leaves you to assemble.

Where Google Vision (or Document AI) is the better choice

A fair comparison names where the incumbent wins. Reach for Google when:

  • You need massive-scale raw OCR — every word and symbol with geometry — and you'll do the structuring downstream yourself.
  • You're already deep in Google Cloud and want OCR that drops into your existing project, billing, IAM, and storage.
  • Your structured-extraction needs are met by Document AI's processors — Form Parser, the pretrained Invoice/W2/ID processors, or a custom Document AI extractor trained on your own document types.

If that's you, Google is a strong fit and an alternative buys you little.

Where space-ocr fits instead

space-ocr earns its place when one or more of these matters:

  • You want structured fields from one HTTP call — not raw text you parse, and not a separate processor to create and configure. Send an image, get named fields back inline.
  • You want to verify, not just trust a recognition score. Every value returns with its on-page box and a match_ratio for character coverage, and clicking a cell highlights exactly where it was read.
  • You want coordinates that survive resizing. A 0–1000 normalized grid plus oriented vertices, with EXIF handled on load — no pixel boxes glued to one exact file.
  • You don't want to stand up storage. Results land in a sheet you can query server-side (GET /view) and export to CSV in one click — no database.
  • You build with Claude. A two-line Claude Code plugin and a dependency-free Python client.
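
For a sense of what "line items unfolded" with a UTF-8 BOM means in a CSV export, here is a minimal sketch of doing it by hand, as you would when building the export yourself from JSON. The document shape (header fields plus an `items` list), the field names, and the sample values are hypothetical and are not the space-ocr export format.

```python
import csv
import io


def unfold_to_csv(documents: list[dict], header_fields: list[str], item_fields: list[str]) -> bytes:
    """Write one CSV row per line item, repeating the document-level fields.

    The output starts with a UTF-8 BOM so spreadsheet tools detect the
    encoding (useful for CJK text).
    """
    buffer = io.StringIO(newline="")
    writer = csv.writer(buffer)
    writer.writerow(header_fields + item_fields)
    for doc in documents:
        header = [doc.get(name, "") for name in header_fields]
        # A document without line items still gets one row.
        for item in doc.get("items") or [{}]:
            writer.writerow(header + [item.get(name, "") for name in item_fields])
    return ("\ufeff" + buffer.getvalue()).encode("utf-8")


# Hypothetical sample data.
docs = [
    {
        "store": "Example Mart",
        "total": "600",
        "items": [{"name": "tea", "price": "120"}, {"name": "rice", "price": "480"}],
    }
]

data = unfold_to_csv(docs, ["store", "total"], ["name", "price"])
print(data.decode("utf-8-sig"))
```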

The whole call is one HTTPS request. The engine takes raster images (JPEG, PNG, GIF, BMP, TIFF, WebP); in the web app you can drop a PDF and each page is converted to an image first.

How to try space-ocr alongside Cloud Vision

  1. Get a key — no Google Cloud project
    Sign up for the free tier (100 scans a month, no credit card) and grab your spocr_ API key. There's no project, billing account, or Document AI processor to create.
  2. Send the image
    POST the document to /ocr/fields with imageType 'url' or 'base64'. The engine takes raster images (JPEG, PNG, GIF, BMP, TIFF, WebP); language is detected automatically.
  3. Ask for structured fields directly
    Pass templateId 'invoice' or 'receipt' for common cases, or supply your own fields — including an array field with children for line items. No fullTextAnnotation to parse and no processor to configure.
  4. Verify each value against the page
    Read each value's bbox (0–1000 grid), vertices, match_ratio, and bbox_source. In the app, click a cell to highlight exactly where it was read; a match_ratio below 0.85 flags a value worth a closer look.
  5. Query or export — no storage to build
    Push images into a sheet with /upload, query it server-side with GET /view (where, sort, select) at no extra charge, or download CSV with line items unfolded — no database and no re-OCR.

Does Google Cloud Vision return structured fields like invoice number or total?
Not by default. Cloud Vision's TEXT_DETECTION and DOCUMENT_TEXT_DETECTION return the text plus a boundingPoly per word/symbol and a recognition confidence: geometry and text, not named key-value fields. Structured field extraction is Google Document AI, a separate, processor-based product (Form Parser, the pretrained Invoice parser, and so on). space-ocr returns named fields from a single /ocr/fields call, with no separate processor to create.

What's the difference between Vision's confidence and space-ocr's match_ratio?
They measure different things. Cloud Vision documents a recognition confidence in the range [0, 1] (the model's self-reported certainty that it read the glyphs correctly) at Page, Block, Paragraph, Word, and Symbol levels. space-ocr's match_ratio is character coverage: the share of the extracted value's characters actually located among the symbols on the page (0.85 and above is treated as a confident match). Neither is strictly better; one rates the reading, the other rates how much of the value was found on the page.

Why do Vision's pixel coordinates cause overlay problems?
Vision's boundingPoly vertices are in the original image's pixel scale, so they're tied to the exact bytes you uploaded; resizing, recompressing, or mishandling EXIF rotation makes overlays drift. There's also a quirk where a vertex coordinate of 0 is omitted from the JSON, so naive code must backfill it. space-ocr instead returns bbox integers on a 0–1000 normalized grid plus oriented vertices, and applies EXIF orientation on load, so a box maps cleanly onto whatever size you display the image at via pixel_x = bbox_x / 1000 * image_width.

Do I need a Google Cloud project or Document AI processor to use space-ocr?
No. space-ocr is a standalone HTTP REST API at https://api.space-ocr.com. You authenticate each request with a single Bearer key; there is no Google Cloud project and no per-processor setup. Send an image as a URL or base64 with a templateId or your own fields, and structured values come back inline. There's also a two-line Claude Code plugin and a dependency-free Python client.

Does space-ocr handle Japanese, Korean, and Chinese?
Yes. space-ocr runs Japanese, Korean, Chinese, English, and other scripts through one engine with automatic language detection, so there's no language parameter to set. It normalizes full-width and half-width characters, hyphen variants, CJK spacing, vertical Han, and mixed scripts. Cloud Vision OCR is also broadly multilingual for raw text; the difference is that space-ocr returns structured, verified fields rather than raw text plus geometry.

See structured fields — with a checkable on-page box — on your own documents

Free tier — 100 scans a month, no credit card, no Google Cloud project. Every value comes back with its location and a match ratio.
