An OCR API that returns bounding boxes you can verify

Most OCR APIs return bounding boxes — but the coordinate systems differ, and a box only tells you where, not how sure. A developer's guide to OCR with source coordinates, plus a match ratio that says how much of each value was actually found on the page.

8 min read · 2026-06-25

Bounding boxes are how you verify OCR. A bare string tells you what the model thinks it read; a box tells you where on the page it read it, so you (or your reviewer, or your code) can check the value against the original instead of trusting it blind. If you're integrating OCR into anything that gets audited — invoices, expenses, KYC, records management — "the model returned total: 2,045" is not enough; you need to point at the pixels that 2,045 came from.

The good news: most mainstream OCR APIs do return bounding boxes. The catch is that they differ in three ways that matter once you start building — the coordinate system, whether you also get structured fields (not just raw text), and what the per-value confidence actually measures. This guide walks through all three, and shows what an OCR API with source coordinates plus a character-coverage match ratio looks like in practice.

Proof first: boxes you can hover

Here's a real extraction where every value points back to the exact region it was read from. Hover any field — the box on the receipt is the source of that value, and each one carries a match ratio for how much of the value's text was actually located on the page.

[Interactive demo: source receipts with extracted-field bounding boxes. Verified fields: KINSHO, 合計 (total) 2,045; ライフ, 合計 (total) 4,286.]

Every value carries a verified on-page location — bbox + 4-point vertices + match_ratio — on a 0–1000 normalized grid (0,0 top-left → 1000,1000 bottom-right), the same shape the live API returns. Hover a field to trace it back to the pixels it came from.

Every field returns a bounding box and a match ratio — the coordinates that make an extraction checkable instead of just plausible.

Most OCR APIs return boxes — here's how they differ

Google Cloud Vision, Tesseract, Amazon Textract, and Azure AI Document Intelligence all return geometry with their text. They diverge on the coordinate system, on whether you get structured fields or only raw text + layout, and on what the confidence number means. These are verified facts, not marketing — use them to size the integration work for your stack.

| API | Coordinate system | Structured fields | Per-value confidence |
|---|---|---|---|
| Google Cloud Vision | boundingPoly vertices in pixels of the source image | Text + geometry only (structured key-values are Google Document AI, a separate product) | Recognition confidence per word/symbol (0–1) |
| Tesseract | hOCR / TSV boxes in pixels (self-hosted, no API) | None: raw text + layout | Recognition confidence per word (0–100) |
| Amazon Textract | BoundingBox normalized 0–1 of page width/height (+ Polygon) | Forms/tables via AnalyzeDocument; receipts via AnalyzeExpense | Recognition confidence (%) per block |
| Azure Document Intelligence | Bounding polygon in pixels (images) or inches (PDF) | Prebuilt/custom models | Recognition confidence per word |
| space-ocr | bbox normalized 0–1000 (+ oriented vertices) | Built-in templates + custom fields, with line items | match_ratio (share of the value's characters found on the page), plus bbox_source |

Two things to notice. First, coordinate units aren't portable — pixel boxes from Vision/Tesseract/Azure are tied to the exact image you sent, while normalized boxes (Textract, space-ocr) survive a resize. Second, the confidence column means different things: most APIs report a recognition confidence (how sure the model is), which is not the same as measuring how much of a returned value was actually located on the page.
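To make the portability point concrete, here is a minimal sketch of mapping other engines' boxes onto a 0–1000 grid. The function names are hypothetical, and the Textract-style input assumes a box given as left/top/width/height fractions of the page; pixel boxes additionally need the dimensions of the exact image that was sent to the engine.

```python
def pixel_box_to_grid(x0, y0, x1, y1, image_width, image_height, grid=1000):
    """Map a pixel-space box onto integers on a 0..grid normalized grid.

    image_width / image_height must be the size of the image the OCR engine
    actually saw, not the size of a later resized copy.
    """
    return {
        "xmin": round(x0 / image_width * grid),
        "ymin": round(y0 / image_height * grid),
        "xmax": round(x1 / image_width * grid),
        "ymax": round(y1 / image_height * grid),
    }


def unit_box_to_grid(left, top, width, height, grid=1000):
    """Map a 0-1 box given as left/top/width/height onto the same grid."""
    return {
        "xmin": round(left * grid),
        "ymin": round(top * grid),
        "xmax": round((left + width) * grid),
        "ymax": round((top + height) * grid),
    }


# The same region measured on the original and on a half-size copy
# normalizes to the same grid box.
original = pixel_box_to_grid(300, 600, 600, 900, image_width=1200, image_height=1800)
resized = pixel_box_to_grid(150, 300, 300, 450, image_width=600, image_height=900)
```

Once every box is on one grid, overlay code no longer needs to know which engine or which image size produced it.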

How the box is derived matters as much as its format. With space-ocr, the language model returns each field's text — and a hint of which word tokens it used — but never the boxes themselves. The engine character-matches that text against the symbols the vision OCR actually detected on the page, so the box lands on the real pixels those characters were found at, and each value gets a match_ratio for how much of it was located. The token hints can be noisy (they sometimes swap between repeated rows), so column- and row-consistency checks validate them rather than trusting them blindly. That's the difference between a coordinate the model asserts and one that's checked back against the page.
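The character-coverage idea can be illustrated with a deliberately simplified sketch. This is not the engine's implementation: it ignores symbol positions, token hints, and the row/column consistency checks described above, and only counts how many of a value's characters can be paired with a detected symbol. The function name and the whitespace handling are assumptions made for illustration.

```python
from collections import Counter


def match_ratio(value: str, detected_symbols: list[str]) -> float:
    """Fraction of the value's non-space characters found among detected symbols.

    Each detected character can be used at most once, so a value that repeats
    a digit needs that digit to appear on the page at least as many times.
    """
    wanted = [ch for ch in value if not ch.isspace()]
    if not wanted:
        return 0.0
    available = Counter(
        ch for symbol in detected_symbols for ch in symbol if not ch.isspace()
    )
    found = 0
    for ch in wanted:
        if available[ch] > 0:
            available[ch] -= 1
            found += 1
    return found / len(wanted)


# Every character of "2,045" is present among the page symbols.
full = match_ratio("2,045", list("合計2,045"))
# The page is missing the trailing "5", so only 4 of 5 characters are covered.
partial = match_ratio("2,045", list("合計2,04"))
```

A value the model invented, or misread badly, has few characters to pair with and scores low, which is what makes the ratio usable as an external check.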

What space-ocr returns for every value

Alongside each extracted value you get four things, so a coordinate is never just a number you have to trust:

  • bbox — an axis-aligned rectangle { xmin, ymin, xmax, ymax } of integers on a 0–1000 normalized grid (0,0 = top-left, 1000,1000 = bottom-right), independent of the image's pixel size.
  • vertices — exactly four ordered points (top-left, top-right, bottom-right, bottom-left) forming an oriented box that follows the document's tilt, so a skewed phone photo still boxes cleanly.
  • match_ratio — the fraction (0–1) of the value's characters that were actually located on the page. A field is treated as a confident match at ≥ 0.85; 1.0 means every character was found.
  • bbox_source — a label for how the coordinate was derived (e.g. vision_symbol_match for the character-matching path, low_confidence when the match ratio falls below the threshold).
one returned value
{
  "total": {
    "value": "2,045",
    "bbox": { "xmin": 381, "ymin": 803, "xmax": 500, "ymax": 825 },
    "vertices": [
      { "x": 380, "y": 804 }, { "x": 500, "y": 801 },
      { "x": 500, "y": 823 }, { "x": 381, "y": 826 }
    ],
    "match_ratio": 1.0,
    "bbox_source": "vision_symbol_match"
  }
}
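A small sketch of reading that shape into a typed record and rejecting values that fall outside the documented ranges. It assumes the response is a JSON object keyed by field name, exactly as in the sample above; a real response may carry extra wrapping. The FieldValue class and parse_field helper are hypothetical names.

```python
import json
from dataclasses import dataclass

GRID = 1000  # bbox and vertices live on a 0-1000 normalized grid


@dataclass(frozen=True)
class FieldValue:
    value: str
    bbox: tuple[int, int, int, int]        # xmin, ymin, xmax, ymax
    vertices: tuple[tuple[int, int], ...]  # top-left, top-right, bottom-right, bottom-left
    match_ratio: float
    bbox_source: str


def parse_field(raw: dict) -> FieldValue:
    """Build a FieldValue from one returned value, validating the documented ranges."""
    box = raw["bbox"]
    bbox = (box["xmin"], box["ymin"], box["xmax"], box["ymax"])
    vertices = tuple((v["x"], v["y"]) for v in raw["vertices"])
    ratio = float(raw["match_ratio"])
    if not all(0 <= c <= GRID for c in bbox):
        raise ValueError(f"bbox outside the 0-{GRID} grid: {bbox}")
    if bbox[0] > bbox[2] or bbox[1] > bbox[3]:
        raise ValueError(f"bbox min/max are inverted: {bbox}")
    if len(vertices) != 4:
        raise ValueError(f"expected 4 vertices, got {len(vertices)}")
    if not 0.0 <= ratio <= 1.0:
        raise ValueError(f"match_ratio outside 0-1: {ratio}")
    return FieldValue(str(raw["value"]), bbox, vertices, ratio, raw["bbox_source"])


SAMPLE = """
{
  "total": {
    "value": "2,045",
    "bbox": { "xmin": 381, "ymin": 803, "xmax": 500, "ymax": 825 },
    "vertices": [
      { "x": 380, "y": 804 }, { "x": 500, "y": 801 },
      { "x": 500, "y": 823 }, { "x": 381, "y": 826 }
    ],
    "match_ratio": 1.0,
    "bbox_source": "vision_symbol_match"
  }
}
"""

fields = {name: parse_field(raw) for name, raw in json.loads(SAMPLE).items()}
```

Validating at the boundary keeps malformed coordinates out of overlay and storage code further down the pipeline.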

Pixels or normalized? Convert once, and resizes stop breaking

A recurring OCR-integration bug is that pixel coordinates are tied to the exact image you uploaded — resize or recompress it for storage, or miss an EXIF-rotation flag, and the overlaid boxes drift, crop, or land on the wrong text. Normalized coordinates avoid that whole class of bug: a 0–1000 box maps onto any rendering of the same page.

To draw a box on a displayed image, convert once:

  • SVG overlay: give the SVG viewBox="0 0 1000 1000" and draw the bbox/vertices as-is.
  • Absolute-positioned div: leftPct = xmin / 1000 * 100, topPct = ymin / 1000 * 100, widthPct = (xmax - xmin) / 1000 * 100, heightPct = (ymax - ymin) / 1000 * 100.
  • Back to pixels: pixel_x = bbox_x / 1000 * image_width, pixel_y = bbox_y / 1000 * image_height.
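The percentage and pixel conversions above can be sketched as two small helpers. The function names and the choice to round (whole pixels, three decimals for percentages) are assumptions for illustration; the arithmetic is the same as in the list.

```python
def bbox_to_css_percent(bbox: dict) -> dict:
    """Percent offsets for an absolutely positioned div over the displayed image."""
    return {
        "left": round(bbox["xmin"] / 1000 * 100, 3),
        "top": round(bbox["ymin"] / 1000 * 100, 3),
        "width": round((bbox["xmax"] - bbox["xmin"]) / 1000 * 100, 3),
        "height": round((bbox["ymax"] - bbox["ymin"]) / 1000 * 100, 3),
    }


def bbox_to_pixels(bbox: dict, image_width: int, image_height: int) -> dict:
    """Pixel box for one particular rendering of the page, rounded to whole pixels."""
    return {
        "left": round(bbox["xmin"] / 1000 * image_width),
        "top": round(bbox["ymin"] / 1000 * image_height),
        "right": round(bbox["xmax"] / 1000 * image_width),
        "bottom": round(bbox["ymax"] / 1000 * image_height),
    }


total_bbox = {"xmin": 381, "ymin": 803, "xmax": 500, "ymax": 825}
css = bbox_to_css_percent(total_bbox)
# The same bbox on a 1200x1600 rendering and on a 600x800 thumbnail.
full_size = bbox_to_pixels(total_bbox, 1200, 1600)
thumbnail = bbox_to_pixels(total_bbox, 600, 800)
```

Percent positioning has the advantage that the overlay follows the image through responsive layout changes without recomputing anything.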

The engine also applies EXIF orientation on load, so the coordinates it returns already match the image as displayed — a rotated phone photo (orientation 6/8) doesn't need a correction pass on your side.
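Because the returned coordinates already match the displayed image, an overlay can be generated straight from the vertices. The sketch below builds an SVG string with one polygon per field. The overlay_svg name, the data-field attribute, and the styling are hypothetical. One detail worth stating as an assumption about your layout: if the SVG is stretched over a non-square image, preserveAspectRatio="none" is needed so the square 1000×1000 viewBox scales to the image's aspect ratio instead of being letterboxed.

```python
from xml.sax.saxutils import quoteattr


def overlay_svg(fields: dict) -> str:
    """Return an SVG overlay with one oriented polygon per extracted field.

    `fields` maps a field name to a dict with a "vertices" list of
    {"x": ..., "y": ...} points on the 0-1000 grid.
    """
    polygons = []
    for name, field in fields.items():
        points = " ".join(f'{v["x"]},{v["y"]}' for v in field["vertices"])
        polygons.append(
            f'  <polygon points="{points}" data-field={quoteattr(name)} '
            'fill="none" stroke="red" stroke-width="2"/>'
        )
    return (
        '<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 1000 1000" '
        'preserveAspectRatio="none">\n' + "\n".join(polygons) + "\n</svg>"
    )


svg = overlay_svg({
    "total": {
        "vertices": [
            {"x": 380, "y": 804}, {"x": 500, "y": 801},
            {"x": 500, "y": 823}, {"x": 381, "y": 826},
        ]
    }
})
```

Position the resulting SVG absolutely over the image at 100% width and height, and the polygons land on the source regions at any display size.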

request fields and get coordinates back
curl -s https://api.space-ocr.com/ocr/fields \
  -H "Authorization: Bearer $SPACE_OCR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "image": "https://example.com/receipt.jpg",
    "imageType": "url",
    "fields": [
      { "name": "vendor", "type": "string" },
      { "name": "total",  "type": "string" }
    ]
  }'
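The same request can be sketched in Python with only the standard library. The endpoint, headers, and payload keys are taken from the curl call above; the function names are hypothetical, and the response is assumed to be a JSON body, since the article does not spell out the full response envelope.

```python
import json
import urllib.request

API_URL = "https://api.space-ocr.com/ocr/fields"


def build_fields_request(image_url: str, field_names: list[str], api_key: str):
    """Build (but do not send) the POST request for a field extraction."""
    payload = {
        "image": image_url,
        "imageType": "url",
        "fields": [{"name": name, "type": "string"} for name in field_names],
    }
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def extract_fields(request: urllib.request.Request, timeout: float = 30.0) -> dict:
    """Send the request and decode the JSON body (assumed response format)."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


request = build_fields_request(
    "https://example.com/receipt.jpg", ["vendor", "total"], api_key="YOUR_API_KEY"
)
# result = extract_fields(request)  # performs the network call
```

Keeping request construction separate from sending makes the payload easy to inspect or unit-test without touching the network.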

A confidence score and a match ratio aren't the same thing

This is the distinction worth internalizing. Most OCR APIs document a recognition confidence — a number that reflects how sure the engine is about its own reading, based on things like font clarity and image quality. It's useful, but it's the model grading its own homework. A match ratio measures something external: of the characters in the value the model returned, how many were actually found among the symbols the page-level OCR detected. A value can be returned with a recognition confidence and still not line up with anything on the page; a low match_ratio catches exactly that. Use it as a gate — sort or filter for match_ratio < 0.85 to surface the handful of values worth a human glance, instead of re-checking everything.
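As a sketch of that gate, the helper below returns only the fields whose match_ratio falls under the 0.85 threshold, worst first. The needs_review name and the sample values are made up for illustration; the input is assumed to be a mapping of field name to a dict that carries match_ratio.

```python
CONFIDENT_MATCH = 0.85  # threshold quoted in the text above


def needs_review(fields: dict, threshold: float = CONFIDENT_MATCH) -> list:
    """Return (field name, match_ratio) pairs below the threshold, lowest first."""
    flagged = [
        (name, field["match_ratio"])
        for name, field in fields.items()
        if field["match_ratio"] < threshold
    ]
    return sorted(flagged, key=lambda item: item[1])


# Hypothetical extraction result with two weak matches.
sample = {
    "total": {"match_ratio": 1.0},
    "vendor": {"match_ratio": 0.6},
    "date": {"match_ratio": 0.85},
    "tax": {"match_ratio": 0.4},
}
review_queue = needs_review(sample)
```

Here a value at exactly 0.85 counts as confident, matching the "≥ 0.85" wording, so only the two lower values reach the review queue.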

Verify, then query — without re-running OCR

Coordinates are most useful when the data sticks around. Push images into a sheet with POST /upload, then query it server-side with GET /view (where, sort, select, limit, offset) to pull, say, every row where match_ratio is low or total >= 40000, with no re-OCR and no extra charge. Each returned value keeps its bbox/vertices (or drop them with boxes=0 for a lighter payload). For the verification workflow in depth, see validating OCR with bounding boxes and the OCR audit trail.

Click any value and its source region lights up on the original — the same coordinates the API returns, made interactive.

How to get verifiable bounding boxes from the API

  1. Request fields
    POST the image to /ocr/fields with imageType 'url' or 'base64', and either a templateId or your own fields array. The engine takes raster images (JPEG, PNG, GIF, BMP, TIFF, WebP).
  2. Read the coordinates
    Each value returns a bbox { xmin, ymin, xmax, ymax } on a 0–1000 grid, four oriented vertices, a match_ratio, and a bbox_source.
  3. Overlay or convert
    Draw boxes with an SVG viewBox '0 0 1000 1000', or convert to pixels with pixel_x = bbox_x / 1000 * image_width. EXIF rotation is already applied, so boxes match the displayed image.
  4. Gate on the match ratio
    Treat a match_ratio of 0.85 or above as a confident match; surface anything below for a human glance instead of re-checking every value.
  5. Store and query
    Push images into a sheet with /upload and query it with GET /view (where, sort, select) — coordinates are retained, with no re-OCR and no extra charge.

Which OCR APIs return bounding boxes?
Google Cloud Vision, Tesseract, Amazon Textract, and Azure AI Document Intelligence all return geometry with their text, as does space-ocr. They differ in the coordinate system (Vision, Tesseract, and Azure use pixels; Textract uses 0–1 normalized; space-ocr uses 0–1000 normalized), in whether structured fields come back or only raw text plus layout, and in what the per-value confidence measures.
Are the bounding-box coordinates in pixels or normalized?
space-ocr returns a bbox normalized to a 0–1000 grid, independent of the image's pixel size, plus four oriented vertices. Convert to pixels with pixel_x = bbox_x / 1000 * image_width (and the same for y), or overlay directly with an SVG viewBox of '0 0 1000 1000'. Normalized coordinates survive resizing the image, which pixel coordinates from some other engines do not.
What's the difference between an OCR confidence score and a match ratio?
A recognition confidence reflects how sure the engine is about its own reading. A match_ratio measures how much of the returned value's text was actually located among the symbols the page-level OCR detected — an external check rather than a self-report. space-ocr treats a value as a confident match at a match_ratio of 0.85 and above, so you can gate on the low ones.
Can I get oriented bounding boxes for skewed or rotated photos?
Yes. Every value returns four ordered vertices (top-left, top-right, bottom-right, bottom-left) forming an oriented box that follows the document's tilt. The engine also applies EXIF orientation on load, so the coordinates already match the displayed image — a phone photo with orientation 6 or 8 doesn't need a correction pass on your side.
Does the bounding-box OCR work for Japanese, Korean, and Chinese?
Yes. One engine handles CJK and Latin scripts with automatic language detection — no language parameter to set — and returns the same bbox, vertices, and match_ratio for every value regardless of script, including full-width characters and vertical Han text.

Get source coordinates for every value

Free tier — 100 scans a month, no credit card. Every field returns a bounding box, oriented vertices, and a match ratio.