Convert a Scanned PDF to Excel: Japanese Tables Into CSV, With No Garbled Text
A scanned PDF is just an image — you can't copy it, Japanese turns into garbled characters when you paste it into Excel, and every line item collapses into a single cell. Drop the PDF into the space-ocr app and each page is rasterized automatically; define your columns once and you get one row per page, exported as a UTF-8 BOM CSV. It opens cleanly in Excel with no garbled text, and imports straight into freee, Money Forward, and Yayoi. Failed reads aren't charged.
Open a scanned invoice or receipt PDF and the first thing you notice is this — you can't copy it. Try to select the text and nothing drags; you can't grab even a single number. The reason is simple: a scanned PDF isn't table data, it's a picture of a table. Your eyes see rows, columns, and totals lined up, but to a computer it's nothing but pixels.
So most people end up doing the same thing. Open Excel, look at the PDF, and retype it one row at a time. Or force a copy-paste and watch the result — Japanese turns into garbled characters like 譁�蟄怜喧, every line item collapses into a single cell, and the layout falls apart completely. If you searched "convert scanned PDF to Excel," "PDF table to CSV," or "fix garbled text in Excel" and landed here, that's exactly the pain — when all you really want is clean, column-aligned data you can import into your accounting software.
The short version: no retyping, no copy-paste
All you do is drop the PDF straight into the space-ocr app. The app rasterizes each page automatically and reads the values on the page as named fields. Decide your columns once (vendor, date, total, line items…) and each page stacks up as one row in a sheet, which you can then export as a CSV.
That CSV is the whole point: it's written with a UTF-8 BOM, so Japanese opens without garbling even when you just double-click it in Excel. Line items don't collapse into one cell — they expand into their own proper rows and columns. The engine detects Japanese automatically, so there's no language to specify.
Try it in 10 seconds, no upload required
The sample below is interactive with nothing to upload. It's a real result from parsing an actual scanned receipt. Hover over any field and a box lights up showing where on the image that value was read from. Each value also carries a match ratio — what fraction of its characters were actually found on the page — so you can spot-check right there before exporting.

Every value carries a verified on-page location — bbox + 4-point vertices + match_ratio — on a 0–1000 normalized grid (0,0 top-left → 1000,1000 bottom-right), the same shape the live API returns. Hover a field to trace it back to the pixels it came from.
Why a scan won't go straight into Excel
The moment you scan paper, the file is a raster image, same as a photo. There's no embedded text, so Excel's "Get Data" and copy-paste can't pick it up as a table. That's why it takes two steps:
- Read the image and structure it (each page → a row of named fields)
- Write those rows out to CSV (in a format Excel opens cleanly)
In the space-ocr app, when you throw in a PDF it rasterizes each page to PNG and then runs OCR. So there's no manual step on your end to slice pages into images — you just drop the PDF. The hard part is step 1, the reading, and getting that right is the shortest path to wiping out garbled text and collapsed line items for good.
From source image to extraction sheet: a scan becomes "columns"
The values from the document you dropped come out not as a wall of text but as named columns. You can let the engine propose a schema, pick a built-in template (invoice, receipt, purchase order, delivery note, business card, and more), or define the columns yourself. Watch a single scan turn into a labeled row.
For documents with repeating rows (line items on an invoice, products on a receipt) you declare child columns under an array-type field. Then each row on the page becomes its own separate row, in a shape you can total in Excel. That's the answer to the "line items collapse into one cell" problem. There's more on building line items in "Read invoices and delivery notes into Excel."
{
  "image": "https://example.com/scan-invoice-p01.png",
  "imageType": "url",
  "fields": [
    { "name": "vendor", "type": "string" },
    { "name": "invoice_date", "type": "string" },
    { "name": "total", "type": "string" },
    {
      "name": "line_items", "type": "array",
      "children": [
        { "name": "description", "type": "string" },
        { "name": "unit_price", "type": "string" },
        { "name": "qty", "type": "string" }
      ]
    }
  ]
}

The AI isn't inventing the coordinates from imagination. The language model returns only the value text and a hint about which word tokens it used, never the box itself. First, the engine matches each value's characters one by one against the symbols the vision OCR actually detected on the page (character-level matching is the primary path) and places the box at the real pixel location. Then it attaches a match ratio to every value: what fraction of the characters it actually found on the page. When the model does return token hints, those can override the position of some fields, but in repeating rows the hints can get mismatched, so they're not taken at face value; they're validated and corrected with column clustering and row-consistency checks. The point isn't "the AI never makes mistakes"; it's that every value is matched against the page and a score of how well it matched is kept. Coordinates come back as xmin / ymin / xmax / ymax, 0–1000 normalized (not pixels).
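Because the coordinates are normalized rather than in pixels, drawing a box on your own copy of the page image takes one scaling step. Here is a minimal sketch of that conversion. The `xmin` / `ymin` / `xmax` / `ymax` keys come from the description above; the `to_pixels` helper, the flat dict shape, and the page size are assumptions made for illustration, not the documented response format.

```python
from typing import NamedTuple


class PixelBox(NamedTuple):
    left: int
    top: int
    right: int
    bottom: int


def to_pixels(bbox: dict, width: int, height: int) -> PixelBox:
    """Scale a 0-1000 normalized box to pixel coordinates of a page image.

    (0, 0) is the top-left corner and (1000, 1000) the bottom-right.
    The flat dict shape is an assumption, not the documented API response.
    """
    return PixelBox(
        left=round(bbox["xmin"] * width / 1000),
        top=round(bbox["ymin"] * height / 1000),
        right=round(bbox["xmax"] * width / 1000),
        bottom=round(bbox["ymax"] * height / 1000),
    )


# Hypothetical box on an A4 page rasterized at 300 dpi (2480 x 3508 px).
box = {"xmin": 100, "ymin": 250, "xmax": 400, "ymax": 300}
print(to_pixels(box, width=2480, height=3508))
```

Keeping the grid normalized means the same box stays valid whatever resolution the page is rendered at; only the final scaling changes.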
Spot-check before you import
Before anything goes into Excel or your accounting software, you can verify the read on the spot. Hover over a value and the matching spot lights up on the original scan, so your eye jumps straight there without re-reading the whole document. A match ratio of 1.0 means every character was found on the page; anything below 0.85 is a sign to take a second look just in case.
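If you are working with the extracted values in a script rather than in the app, the same rule of thumb can be applied automatically. The sketch below lists every field whose match ratio falls under 0.85. The record shape (`name`, `value`, `match_ratio` keys) and the `needs_review` helper are illustrative assumptions, not the API's documented response.

```python
REVIEW_THRESHOLD = 0.85  # the article's cutoff: below this, take a second look


def needs_review(fields: list[dict], threshold: float = REVIEW_THRESHOLD) -> list[str]:
    """Return the names of fields whose match_ratio is below the threshold."""
    return [f["name"] for f in fields if f["match_ratio"] < threshold]


# Hypothetical values for one page; the dict shape is an assumption.
page = [
    {"name": "vendor", "value": "株式会社サンプル", "match_ratio": 1.0},
    {"name": "invoice_date", "value": "2026-01-15", "match_ratio": 0.9},
    {"name": "total", "value": "12,800", "match_ratio": 0.67},
]
print(needs_review(page))  # ['total']
```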
Export a CSV that opens in Excel
Once the values look right, export the sheet as CSV. The header row carries the column names, array fields expand into columnName.childName, and the repeating line items open out into sub-rows. The file is written with a UTF-8 BOM — and that's the single thing that lets Excel open Japanese, Chinese, and Korean correctly on a plain double-click. Any cells you fixed by hand override the original OCR value on export.
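To make the columnName.childName expansion concrete, here is a rough sketch of how one page record with an array field could be flattened into a header plus sub-rows. It illustrates the idea and is not the app's actual exporter: the `flatten_page` function and the record shape are hypothetical, and placing scalar values only on the first sub-row is an assumption about the layout.

```python
def flatten_page(record: dict) -> tuple[list[str], list[dict]]:
    """Expand one page record into a header and one or more sub-rows."""
    scalar_cols = [k for k, v in record.items() if not isinstance(v, list)]
    array_cols: list[str] = []  # dotted names such as "line_items.qty"
    depth = 1  # number of sub-rows: the longest array, at least one
    for key, value in record.items():
        if isinstance(value, list):
            depth = max(depth, len(value))
            for item in value:
                for child in item:
                    dotted = f"{key}.{child}"
                    if dotted not in array_cols:
                        array_cols.append(dotted)

    rows = []
    for i in range(depth):
        # Assumption: scalar values appear only on the first sub-row.
        row = {col: (record[col] if i == 0 else "") for col in scalar_cols}
        for dotted in array_cols:
            key, child = dotted.split(".", 1)
            items = record[key]
            row[dotted] = items[i].get(child, "") if i < len(items) else ""
        rows.append(row)
    return scalar_cols + array_cols, rows


# Hypothetical record for one invoice page.
record = {
    "vendor": "株式会社サンプル",
    "total": "3300",
    "line_items": [
        {"description": "用紙 A4", "unit_price": "500", "qty": "3"},
        {"description": "トナー", "unit_price": "1800", "qty": "1"},
    ],
}
header, rows = flatten_page(record)
print(header)
for row in rows:
    print(row)
```

Each line item lands in its own row with its own `line_items.unit_price` and `line_items.qty` cells, which is what makes the result summable in Excel.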
Opening it in Excel is simple — just double-click the .csv. Thanks to the BOM, Excel auto-detects it as UTF-8, with no text import wizard and no garbled characters. If you need a native workbook, just Save As → .xlsx from there.
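If you ever generate a CSV yourself and want the same double-click behavior, the BOM is just three bytes (EF BB BF) at the start of the file. A minimal sketch using Python's standard library follows; the `to_excel_csv_bytes` helper and the sample row are made up for illustration.

```python
import csv
import io


def to_excel_csv_bytes(header: list[str], rows: list[list[str]]) -> bytes:
    """Serialize rows as CSV bytes that start with a UTF-8 BOM."""
    buffer = io.StringIO(newline="")  # let the csv module control line endings
    writer = csv.writer(buffer)
    writer.writerow(header)
    writer.writerows(rows)
    # The "utf-8-sig" codec prepends the BOM when encoding.
    return buffer.getvalue().encode("utf-8-sig")


data = to_excel_csv_bytes(["vendor", "total"], [["株式会社サンプル", "3300"]])
print(data[:3] == b"\xef\xbb\xbf")  # True
```

When writing directly to disk, `open(path, "w", encoding="utf-8-sig", newline="")` achieves the same result.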
Taking it into your accounting software works the same way: the output is still a CSV. With freee, Money Forward, and Yayoi, you load it through each app's own CSV import feature (this is importing a CSV file, not automatic syncing through an official API). The receipts-only path to CSV is covered separately in "Turn receipts into CSV."
Process in bulk: run a stack of scans through the API
When you want to process a whole folder of scans, create a sheet with your column schema once and upload the page images into it. Each image is read against that schema, appended as a row, and finally exported as a single CSV. The full request/response shapes are in the API docs.
curl -X POST https://api.space-ocr.com/upload \
  -H "Authorization: Bearer $SPACE_OCR_API_KEY" \
  -F "path=/請求書 2026" \
  -F "files=@scan-p01.png" \
  -F "files=@scan-p02.png" \
  -F "wait=true"

Pricing is straightforward: pay as you go at ¥10 per page, with a 100-page free tier every month (no credit card) and a flat Pro plan at $39/month for teams. Failed reads aren't charged: if no result comes out, you don't pay. Because the cost is easy to predict, you can try it on your own scans first and then decide.
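For a quick back-of-the-envelope estimate of a pay-as-you-go month, the arithmetic can be sketched as below. The ¥10 rate and the 100 free pages come from the pricing above. How the free tier and failed reads interact on an actual invoice is an assumption here (failed reads are dropped first, then the free pages are subtracted), and `estimate_monthly_cost_yen` is a hypothetical helper, so treat the result as a rough guide only.

```python
PRICE_PER_PAGE_YEN = 10      # pay-as-you-go rate quoted above
FREE_PAGES_PER_MONTH = 100   # monthly free tier quoted above


def estimate_monthly_cost_yen(pages_submitted: int, failed_reads: int = 0) -> int:
    """Rough pay-as-you-go estimate for one month, in yen."""
    # Failed reads aren't charged, so only successful pages count.
    successful = max(0, pages_submitted - failed_reads)
    # Assumption: the free tier is used up before any page is billed.
    billable = max(0, successful - FREE_PAGES_PER_MONTH)
    return billable * PRICE_PER_PAGE_YEN


print(estimate_monthly_cost_yen(350, failed_reads=10))  # 2400
```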
How to convert a scanned PDF to Excel
- Drop the PDF into the app: Drop your scanned PDF into the space-ocr app. The app rasterizes each page to PNG automatically and then runs OCR, so you don't have to slice pages into images yourself.
- Decide your columns (the schema): Set columns like vendor, date, and total using a built-in template, your own field definitions, or an automatic suggestion. For repeating line items such as invoice products, declare child columns under an array type so each row becomes its own separate row.
- Spot-check the values: Hover a field and the matching spot lights up on the original scan. A match ratio of 1.0 means every character was found on the page; below 0.85 is a sign of a value to check or fix.
- Export the CSV: Export the sheet as CSV. It's written with a UTF-8 BOM, array line items expand into sub-rows, and any manual fixes override the original OCR value.
- Open in Excel or your accounting software: Double-click the CSV and Excel recognizes the BOM as UTF-8, opening it as clean, column-aligned rows with no garbled Japanese. If you need a native format, use Save As to get an .xlsx. For freee, Money Forward, and Yayoi, load it through each app's CSV import.
Does it handle Japanese scans too?
Does it support PDFs directly, or do I need to convert them to images?
Will the exported CSV open in Excel without garbling the Japanese?
Can I import it into freee, Money Forward, or Yayoi?
How can I check accuracy? Is my personal data safe?
Turn scanned PDFs into Excel rows
100 pages free every month, no credit card. Drop a PDF, read it into columns, and get a CSV (UTF-8 BOM) that opens straight in Excel. Failed reads aren't charged.