How to convert scanned documents into CSV

Learn how to convert scanned documents into CSV: define columns once, upload a photo or scan, and each document fills a row. Export as UTF-8 CSV, Excel- and CJK-safe.

7 min read · 2026-06-25

You have a stack of paper — invoices, receipts, delivery notes — and you need them as rows in a spreadsheet. Retyping is slow and error-prone, and a generic OCR tool dumps a wall of raw text that still needs untangling into columns. The job you actually want is narrower and more useful: turn a photo or scan of a document into a clean CSV row, with the right values in the right columns, every time.

This guide walks through exactly that. You define your columns once, point space-ocr at an image, and each document fills in a row automatically. When you are done, you export the whole sheet to CSV, written as UTF-8 with a byte-order mark so that Excel opens it cleanly, CJK text included. No retyping, and every value can be traced back to where it sat on the page.

The shape of the workflow

Converting scanned documents into CSV breaks into four moves:

1. Photograph or scan your documents: raster images (JPEG, PNG, TIFF, and similar). A phone photo is fine; auto-rotation handles sideways shots.
2. Define your columns once: name the fields you care about (vendor, date, total, line items…). This becomes the schema every document is read against.
3. Upload: each image is read and its values drop into a new row under your columns. No per-document configuration.
4. Export CSV: download the whole sheet as <sheetName>.csv and open it anywhere.

The payoff is consistency: because the columns are fixed up front, the tenth receipt lands in the same shape as the first.

Proof first: every value knows where it came from

Before the steps, here is why this is trustworthy. In the demo, hovering a field highlights the exact spot on the receipt that the value was read from, and each field carries a match ratio. A CSV is only useful if you can defend the numbers in it, and here every cell traces back to the source pixels.

[Figure: source receipts with extracted-field bounding boxes. Verified fields: KINSHO · 合計 (total) 2,045; ライフ · 合計 (total) 4,286.]

Every value carries a verified on-page location — bbox + 4-point vertices + match_ratio — on a 0–1000 normalized grid (0,0 top-left → 1000,1000 bottom-right), the same shape the live API returns. Hover a field to trace it back to the pixels it came from.
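
To overlay one of these boxes on your own copy of the image, the normalized coordinates need to be scaled back to pixels. The following is a minimal sketch, assuming the bbox arrives as [x_min, y_min, x_max, y_max] on the 0-1000 grid. That field layout and the function name bbox_to_pixels are assumptions for illustration, not the documented response shape.

```python
def bbox_to_pixels(bbox, image_width, image_height, grid=1000):
    """Scale a box from the 0-1000 normalized grid to pixel coordinates.

    Assumes bbox is [x_min, y_min, x_max, y_max]; the real field layout
    returned by the API may differ.
    """
    x_min, y_min, x_max, y_max = bbox
    return (
        round(x_min * image_width / grid),
        round(y_min * image_height / grid),
        round(x_max * image_width / grid),
        round(y_max * image_height / grid),
    )


# A box covering the right half of the top quarter of a 1200x1600 photo.
print(bbox_to_pixels([500, 0, 1000, 250], 1200, 1600))  # (600, 0, 1200, 400)
```

Because the grid is independent of resolution, the same box can be drawn on a thumbnail or on the full-size scan by changing only the width and height arguments.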

Upload a document image and watch each value land under the column you defined — no retyping.

How to convert scanned documents into CSV, step by step

1. Capture the documents as images

space-ocr reads raster images — JPEG, PNG, GIF, BMP, TIFF, and WebP. Photograph receipts on a desk, scan invoices to PNG, or export pages from your scanner app. Phone photos taken at an angle are fine: the engine reads EXIF orientation and corrects rotation, so a sideways shot still reads upright and the values still anchor to the right place.

2. Define your columns once

This is the step that turns OCR into a tidy table. Create a sheet with a column schema — the fields you want as CSV headers. Scalar columns are simple values (vendor, invoice_date, total); array columns capture repeating line items. You define this once, and every document you upload afterward is read against the same columns.

vendor        (string)
invoice_date  (string)
total         (string)
items         (array) → name, unit_price, qty

If you would rather not hand-build the schema, a built-in template (invoice, receipt, business_card, and others) supplies the fields and prompt for a common document type. For invoices specifically, see extracting line items from invoices.

3. Upload — rows fill themselves

With columns in place, upload your images to the sheet. Each document becomes a row: the engine reads the page and slots each value under the matching column. Drop in twenty receipts and you get twenty rows, all the same shape. Line-item arrays are kept as structured children of the row, ready to expand on export.

Values come back verbatim — 7,855 keeps its comma, full-width characters and honorifics are preserved — so your CSV mirrors what is printed rather than a reformatted guess.
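
Verbatim strings are faithful to the page, but a downstream script that needs to sum totals has to turn them into numbers. One possible approach is sketched below, assuming the comma is a thousands separator and the dot a decimal point; parse_amount is a hypothetical helper, not part of space-ocr.

```python
import re
import unicodedata
from decimal import Decimal


def parse_amount(text: str) -> Decimal:
    """Convert a verbatim amount such as '7,855' into a Decimal.

    NFKC normalization folds full-width digits and punctuation into their
    ASCII forms; everything except digits, dot, and minus is then dropped.
    Assumes ',' is a thousands separator and '.' is the decimal point.
    """
    normalized = unicodedata.normalize("NFKC", text)
    digits = re.sub(r"[^0-9.\-]", "", normalized)
    if not digits:
        raise ValueError(f"no numeric content in {text!r}")
    return Decimal(digits)


print(parse_amount("7,855"))      # 7855
print(parse_amount("４，２８６"))  # 4286
```

Keeping the original string in the CSV and parsing in a separate step means the printed form stays available for audits.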

One click exports the whole sheet to CSV — headers from your columns, line items expanded into sub-rows.

4. Export to CSV

Click export and the sheet downloads as <sheetName>.csv. The header row is built directly from your schema:

A leading # column (the row index).
Each scalar column name as written.
Each array child flattened as colName.childName — so an items array with name and unit_price produces items.name and items.unit_price columns.
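
The header rules above can be sketched in a few lines. This assumes the column schema has the same shape as the JSON used when creating a sheet (name, type, and children for arrays) and that columns keep their schema order; build_header is a hypothetical name used only for illustration.

```python
def build_header(columns):
    """Build the CSV header: '#', scalar names, then colName.childName."""
    header = ["#"]
    for column in columns:
        if column["type"] == "array":
            # Each child of an array column becomes its own flattened header.
            for child in column.get("children", []):
                header.append(f'{column["name"]}.{child["name"]}')
        else:
            header.append(column["name"])
    return header


columns = [
    {"name": "vendor", "type": "string"},
    {"name": "invoice_date", "type": "string"},
    {"name": "total", "type": "string"},
    {"name": "items", "type": "array", "children": [
        {"name": "name", "type": "string"},
        {"name": "unit_price", "type": "string"},
    ]},
]

print(build_header(columns))
# ['#', 'vendor', 'invoice_date', 'total', 'items.name', 'items.unit_price']
```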

Rows that contain a line-item array expand into sub-rows: the parent's scalar values appear once and each line item gets its own row beneath it, so an invoice with eight lines becomes eight CSV rows under one vendor and date. The file is written as UTF-8 with a byte-order mark (BOM), which lets Excel detect the encoding, so Japanese, Korean, or Chinese text opens without mangled characters.
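
If you build the CSV yourself from structured rows, the same layout can be approximated as follows. This is a sketch under stated assumptions: the first line item shares a row with the parent's scalar values, later sub-rows leave those cells blank, and the row dictionary shape and sample values are made up for illustration. The exact sub-row layout of the built-in export may differ.

```python
import csv
import io


def expand_row(index, row, scalar_names, array_name, child_names):
    """Yield CSV rows for one document: scalars once, one row per line item."""
    items = row.get(array_name) or [{}]  # a document without items still gets a row
    for position, item in enumerate(items):
        if position == 0:
            lead = [index] + [row.get(name, "") for name in scalar_names]
        else:
            lead = [""] * (1 + len(scalar_names))  # blank index and scalars
        yield lead + [item.get(child, "") for child in child_names]


def to_csv_bytes(header, rows):
    """Serialize rows to CSV and encode as UTF-8 with a leading BOM."""
    buffer = io.StringIO(newline="")
    writer = csv.writer(buffer)
    writer.writerow(header)
    writer.writerows(rows)
    # The "utf-8-sig" codec prepends the byte-order mark.
    return buffer.getvalue().encode("utf-8-sig")


header = ["#", "vendor", "invoice_date", "total", "items.name", "items.unit_price"]
document = {  # sample data, not real output
    "vendor": "Acme", "invoice_date": "2026-06-01", "total": "2,045",
    "items": [{"name": "Tea", "unit_price": "545"},
              {"name": "Rice", "unit_price": "1,500"}],
}
rows = list(expand_row(1, document, ["vendor", "invoice_date", "total"],
                       "items", ["name", "unit_price"]))
data = to_csv_bytes(header, rows)
print(data[:3])  # b'\xef\xbb\xbf'
```

Values containing commas, such as 2,045, are quoted by the csv module, so they stay in one cell when the file is opened.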

If you edited a cell by hand, your manual value overrides the original OCR value in the export, so corrections flow through to the CSV.
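
The built-in export applies this override for you. If you assemble a CSV yourself and track corrections separately, the same precedence can be expressed as below. The cell shape, with "ocr" and "manual" keys, is purely hypothetical; how the service stores manual edits is not described here.

```python
def effective_value(cell):
    """Return the manual edit when one exists, otherwise the OCR value.

    `cell` is a hypothetical dict such as {"ocr": "2,045", "manual": "2,054"}.
    A manual value of "" counts as an edit (the cell was cleared on purpose).
    """
    manual = cell.get("manual")
    return manual if manual is not None else cell.get("ocr", "")


print(effective_value({"ocr": "2,045", "manual": "2,054"}))  # 2,054
print(effective_value({"ocr": "2,045"}))                     # 2,045
```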

✓ Verified

The CSV is built from your columns, not guessed. Headers come from your schema (# + scalar names + array.child for line items), array rows expand into sub-rows, and the file ships as UTF-8 with a BOM so it opens cleanly in Excel, CJK text included. Manual edits override the OCR value in the export. The column shape is fixed the moment you define it, which is what makes every download predictable.

Doing it over the API

The same flow is available headlessly. Create a sheet with columns, upload images to it, then pull the structured rows with GET /view — server-side, with no OCR re-run and no charge. From there you write the CSV yourself, or fetch the sheet's CSV export. GET /view also lets you filter (where), sort, and select columns before you export, so you can ship only the rows you need.

create columns once, then upload

# 1. Create a sheet with the columns you want as CSV headers
curl -X POST https://api.space-ocr.com/create \
  -H "Authorization: Bearer $SPACE_OCR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "path": "/invoices",
    "type": "sheet",
    "name": "june-invoices",
    "columns": [
      { "name": "vendor", "type": "string" },
      { "name": "invoice_date", "type": "string" },
      { "name": "total", "type": "string" },
      { "name": "items", "type": "array",
        "children": [
          { "name": "name", "type": "string" },
          { "name": "unit_price", "type": "string" }
        ] }
    ]
  }'

# 2. Upload document images — each one fills a row
curl -X POST https://api.space-ocr.com/upload \
  -H "Authorization: Bearer $SPACE_OCR_API_KEY" \
  -F "path=/invoices/june-invoices" \
  -F "files=@invoice-01.png" \
  -F "files=@invoice-02.jpg"

Once the rows are in, GET /view returns them as structured JSON you can write straight to a CSV — or hand off to your accounting system. For a full walkthrough of the extraction endpoint and field specs, see the invoice data extraction API guide and the API docs.
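
For scripts that prefer Python over curl, the GET /view call might be prepared like this. It is a sketch only: the query parameter names path, where, and sort are assumptions modeled on the fields shown above, and the response format is not assumed at all. Check the API docs for the real parameters before relying on it.

```python
import urllib.parse
import urllib.request

API_BASE = "https://api.space-ocr.com"


def build_view_request(sheet_path, api_key, where=None, sort=None):
    """Prepare (but do not send) a GET /view request for one sheet.

    The query parameter names are assumptions, not confirmed API fields.
    """
    params = {"path": sheet_path}
    if where:
        params["where"] = where
    if sort:
        params["sort"] = sort
    url = f"{API_BASE}/view?{urllib.parse.urlencode(params)}"
    request = urllib.request.Request(url, method="GET")
    request.add_header("Authorization", f"Bearer {api_key}")
    return request


request = build_view_request("/invoices/june-invoices", "YOUR_API_KEY")
print(request.full_url)
# To send it: urllib.request.urlopen(request), then parse the JSON body.
```

Building the request separately from sending it keeps the URL easy to log and inspect before any network call is made.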

Scanned PDFs and a note on inputs

space-ocr's engine works on raster images, not PDF bytes. If your documents are scanned PDFs, render each page to an image (PNG or JPEG) first, then upload those. If your goal is specifically Excel rather than CSV, the same captured-image approach applies — walk through it in convert scanned PDF to Excel. CSV is the lowest-friction target: it opens everywhere, and the BOM-prefixed UTF-8 export means no encoding surprises.
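
One way to render PDF pages to images is Poppler's pdftoppm command-line tool, driven from a short script. The sketch below assumes pdftoppm is installed and on your PATH; the 300 DPI default and the function names are illustrative choices, not space-ocr requirements.

```python
import subprocess
from pathlib import Path


def pdftoppm_command(pdf_path, output_dir, dpi=300):
    """Build the pdftoppm command that renders each page to a PNG file."""
    prefix = Path(output_dir) / Path(pdf_path).stem
    return ["pdftoppm", "-png", "-r", str(dpi), str(pdf_path), str(prefix)]


def render_pdf_pages(pdf_path, output_dir, dpi=300):
    """Run pdftoppm and return the rendered page images in page order."""
    Path(output_dir).mkdir(parents=True, exist_ok=True)
    subprocess.run(pdftoppm_command(pdf_path, output_dir, dpi), check=True)
    # pdftoppm names its output <prefix>-<page number>.png
    return sorted(Path(output_dir).glob(f"{Path(pdf_path).stem}-*.png"))


print(pdftoppm_command("scan.pdf", "pages"))
```

The returned PNG files can then be uploaded to the sheet like any other scanned image, one row per page.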

The four steps in brief:

1. Capture the documents as images. Photograph or scan each document to a raster image (JPEG, PNG, TIFF, etc.). Phone photos are fine; EXIF auto-rotation corrects sideways shots. For scanned PDFs, render each page to an image first.
2. Define your columns once. Create a sheet with a column schema: scalar columns like vendor, date, and total, plus array columns for repeating line items. This becomes the CSV header and is reused for every document.
3. Upload the images. Upload your document images to the sheet. Each image is read and its values fill a new row under your columns automatically, with no per-document setup, and values are preserved verbatim.
4. Export to CSV. Export the sheet. It downloads as <sheetName>.csv with a header of # plus scalar column names plus array children as colName.childName. Line-item rows expand into sub-rows, and the file is UTF-8 with a BOM so it opens cleanly in Excel, CJK text included.

How do I convert scanned documents into CSV?

Capture each document as a raster image (a phone photo or PNG scan works), define your columns once as a sheet schema, then upload the images — each document fills a row under your columns. When you're done, export the sheet and it downloads as <sheetName>.csv. The header is the # index plus your scalar column names, with array line items flattened as colName.childName.

Will the CSV open correctly in Excel, including Japanese or Chinese text?

Yes. The export is written as UTF-8 with a byte-order mark (BOM), which is exactly what Excel needs to detect the encoding. That keeps Japanese, Korean, and Chinese characters from turning into mojibake when the file is opened.

How are line items handled in the CSV?

Array (line-item) columns are flattened in the header as colName.childName — for example an items array with name and unit_price becomes items.name and items.unit_price. Rows that contain a line-item array expand into sub-rows: the parent's scalar values appear once and each line item gets its own row beneath it.

Can I convert a scanned PDF to CSV?

The engine reads raster images, not PDF bytes, so render each PDF page to an image (PNG or JPEG) first and upload those. Each page image is then read into a row just like any other scanned document, and you export the sheet to CSV as usual.

Do I have to define columns for every document?

No — you define columns once when you create the sheet, and every document you upload afterward is read against that same schema. You can also apply a built-in template (invoice, receipt, business_card, and others) to supply the fields automatically instead of building them by hand.

Turn your scanned documents into CSV

Define your columns once, upload, export. Free tier — 100 scans a month, no credit card. Every value comes back with its on-page location.

Convert a Scanned PDF to Excel: Page Images to CSV

Extract Line Items From Invoices Automatically

API for Extracting Data From Invoices: A Developer Guide