2 months ago
I've got a 3GB JSON file that I need to parse, and I need to parse the records carefully because I have duplicates, edge cases, small intricacies, etc. For my local DB I've devised a script that works just fine: it takes around 15 minutes, and in the end I've got a couple million records. I tried executing the same script connected to the remote DB, but it took excruciatingly long. I realized my script was way too slow, so I changed it to process things in batches and then send them to the DB. However, I think the batch size was too large; it overwhelmed the DB and it crashed. I currently have two ideas: either I reduce the batch size significantly and try again, or I export the data from my local DB and import it into the remote DB, which might be more effective. Is there anything I'm missing?
3 Replies
Also, the DB ran out of space, and I'm not sure how to increase the storage.
2 months ago
The main problem here isn’t so much “batch size” as the pattern: doing millions of inserts over a remote connection. That’s always going to be slow, and it's easy to overwhelm the DB no matter how you tweak it.
For Postgres the usual approach is:
do all the JSON parsing / dedupe locally (you already have this)
turn the cleaned records into either
rows in a local Postgres DB, or
a flat file (CSV)
then use Postgres’ bulk tools to move that into the remote DB
Two common ways:
COPY/\copy from CSV
Parse JSON → write data.csv → then:
psql "$REMOTE_DATABASE_URL" -c "\copy your_table(col1, col2, ...) \
FROM 'data.csv' WITH (FORMAT csv, HEADER true)"
\copy runs on your machine and streams the file into the remote DB in one go. It’s usually orders of magnitude faster than doing INSERT in a loop, even with batching.
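If your parsing script is in Python, the parse-dedupe-write-CSV step could look roughly like this. This is a sketch, not your actual script: `id` and `name` are placeholder column names, and it assumes the file is one big JSON array. For a 3GB file you'd probably want a streaming parser (e.g. ijson) instead of `json.load`, which reads everything into memory.

```python
import csv
import json

def write_clean_csv(json_path, csv_path, key="id"):
    """Parse the JSON dump, drop duplicates by `key`, and write a CSV
    that \\copy can stream into the remote table."""
    with open(json_path) as f:
        records = json.load(f)  # assumes one big JSON array; use a streaming parser for 3GB

    seen = set()
    with open(csv_path, "w", newline="") as out:
        # placeholder columns -- match these to your_table(col1, col2, ...)
        writer = csv.DictWriter(out, fieldnames=["id", "name"])
        writer.writeheader()
        for rec in records:
            if rec[key] in seen:
                continue  # skip duplicates, keeping the first occurrence
            seen.add(rec[key])
            writer.writerow({"id": rec["id"], "name": rec["name"]})
```

This keeps all the expensive work (parsing, dedupe, edge cases) on your machine, and the only thing that touches the remote DB is the single \copy stream.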
If you have duplicate-handling logic, a nice pattern is:
\copy into a staging table
then run:
INSERT INTO final_table (col1, col2, ...)
SELECT col1, col2, ...
FROM staging
ON CONFLICT (your_unique_key) DO UPDATE
SET ...;
Let Postgres handle dedupe/upserts instead of doing everything in the script.
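Put together, the whole staging flow is only a few statements. Table and column names below are placeholders; adjust them to your schema:

```sql
-- 1. Create a staging table with the same shape as the target
CREATE TABLE staging (LIKE your_table INCLUDING DEFAULTS);

-- 2. From your machine, stream the CSV in:
--    psql "$REMOTE_DATABASE_URL" -c "\copy staging FROM 'data.csv' WITH (FORMAT csv, HEADER true)"

-- 3. Upsert into the real table, letting Postgres resolve duplicates
INSERT INTO your_table (col1, col2)
SELECT col1, col2
FROM staging
ON CONFLICT (your_unique_key) DO UPDATE
    SET col2 = EXCLUDED.col2;

-- 4. Reclaim the space once you're done
DROP TABLE staging;
```

Dropping the staging table at the end matters here, since you're also tight on disk.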
pg_dump / pg_restore (if you already load into a local DB)
If your local DB already has the “correct” state, you can just dump and restore:
# from local DB
pg_dump -d local_db_name --no-owner --no-privileges | psql "$REMOTE_DATABASE_URL"
That avoids re-running the JSON parsing logic against the remote DB entirely.
About “DB ran out of space”: that’s separate from performance. On Railway each Postgres instance has a fixed volume size. When it’s full, you basically have to:
drop any temp / staging tables you don’t need anymore
drop large indexes before the load and recreate them afterwards
and if the data + indexes still don’t fit, bump the volume size or create a new Postgres instance with more disk on your Pro plan and import into that
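Dropping an index before the load and rebuilding it afterwards is just two statements; the index name here is a hypothetical example:

```sql
-- before the bulk load: indexes are pure overhead during COPY
DROP INDEX IF EXISTS your_table_key_idx;

-- ... run the \copy / upsert ...

-- after the load: one rebuild is much cheaper than maintaining the index row by row
CREATE INDEX your_table_key_idx ON your_table (your_unique_key);
```

Note you can't drop an index that backs the unique constraint your ON CONFLICT clause needs, so only do this for secondary indexes.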
Reducing batch size will make it crash less dramatically, but it doesn’t fix the core issues: too many remote inserts and not enough disk. Bulk-loading (COPY/pg_dump → psql) into a DB with enough storage is the path of least pain here.
2 months ago
If this solved it for you, you can mark it as the accepted solution so the thread is closed properly.