Database size limits


What are the size restrictions on the database (if any)? We’ve just guesstimated that our clinical trials registry job will be about 3GB of data stored in SQLite as it stands today (and it will grow, albeit fairly slowly).

There are none so far but a 3GB database would be our biggest to date.

To make it manageable for you I’d suggest splitting it up if you can. It’s just that really huge SQLite databases can be hard to work with, e.g. if you want to download the database to inspect it you’ll have to download 3GB each time. Loading it on your machine would probably take a lot of resources too and make it difficult to work with.

I’d also suggest becoming a partner if you’re using that heavily :wink:

@henare of course I intend to support - we are really just a few days into checking it out, and trying to work out the lay of the land in terms of running the scraper from or hosting it ourselves.

In terms of splitting it up, we might be able to split data by year or something, but then we’d need to have multiple scraper instances for the same data, right? (one database per scraper)
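If it does come to that, here is a minimal sketch of how one codebase could serve several per-year scraper instances, with each instance picking its year from an environment variable. The variable name `MORPH_TRIAL_YEAR`, the `trials` table and the record fields are all hypothetical, and `data.sqlite` is assumed to be the file each scraper writes to.

```python
import os
import sqlite3


def year_for_this_instance(env=os.environ):
    """Return the year this instance is responsible for.

    MORPH_TRIAL_YEAR is a hypothetical variable name, set per instance.
    """
    return int(env["MORPH_TRIAL_YEAR"])


def save_records(records, year, db_path="data.sqlite"):
    """Write only the records for `year`; return how many were written."""
    wanted = [(r["id"], r["year"], r["title"]) for r in records if r["year"] == year]
    conn = sqlite3.connect(db_path)
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "CREATE TABLE IF NOT EXISTS trials "
                "(id TEXT PRIMARY KEY, year INTEGER, title TEXT)"
            )
            # INSERT OR REPLACE keeps re-runs from duplicating rows.
            conn.executemany("INSERT OR REPLACE INTO trials VALUES (?, ?, ?)", wanted)
    finally:
        conn.close()
    return len(wanted)
```

Each instance would then call `save_records(scraped, year_for_this_instance())`, so the code stays identical and only the configured year differs.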

No worries, I was just being cheeky :smile:

In the spirit of not prematurely optimising you could just give it a go and then refactor it if it’s not working for you.

I was a little worried about the performance of such a big SQLite database but maybe my guess is wrong? That page suggests it should be fine.
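If you do just give it a go, a rough sketch like this could help keep an eye on how big the file is getting as the backlog loads. It only uses SQLite's `page_count` and `page_size` pragmas; the function name and the path in the usage note are illustrative.

```python
import sqlite3


def database_size_bytes(db_path):
    """Return the size of a SQLite database as pages * page size."""
    conn = sqlite3.connect(db_path)
    try:
        page_count = conn.execute("PRAGMA page_count").fetchone()[0]
        page_size = conn.execute("PRAGMA page_size").fetchone()[0]
    finally:
        conn.close()
    return page_count * page_size
```

For example, `database_size_bytes("data.sqlite") / 1024**3` gives the size in GB, which could be logged at the end of each run.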

Ok. We’ll do some more work. Right now my main concern is not stressing resources. I’ll probably sync with you on that in the next few weeks.

As I said earlier, once we get the backlog of data, the ongoing scraping activity would be quite modest. Clinical Trials is just one (the biggest) of quite a few medical trial registers we’ll be acquiring data from.

For us here, SQLite is a data staging area and we’ll be getting the new data out of it at scheduled intervals… It would be really great to be able to pass connection details to a Postgres server somewhere and write into that instead of SQLite. Have you thought about something like this?

Actually, I guess there is nothing stopping us from doing that anyway… it would just mean that the data we scrape is not available via the API…

You’re absolutely right - you could use any external database and store the connection details in secret values… but you miss out on all the nice bits like the API so there’s not much point.
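For anyone who wants to try it anyway, here is a rough sketch of that idea. It assumes the secret value reaches the scraper as an environment variable; the name `MORPH_DATABASE_URL` is hypothetical, and `psycopg2` is a third-party driver that would need to be installed. Without the variable it falls back to the local SQLite file, so the API would still have data to serve.

```python
import os
import sqlite3
from urllib.parse import urlparse


def parse_database_url(url):
    """Split a postgres://user:password@host:port/dbname URL into parts."""
    parts = urlparse(url)
    return {
        "host": parts.hostname,
        "port": parts.port or 5432,  # default Postgres port
        "user": parts.username,
        "password": parts.password,
        "dbname": parts.path.lstrip("/"),
    }


def open_connection(env=os.environ, sqlite_path="data.sqlite"):
    """Connect to Postgres if a URL is configured, else to local SQLite."""
    url = env.get("MORPH_DATABASE_URL")  # hypothetical secret value name
    if not url:
        return sqlite3.connect(sqlite_path)
    import psycopg2  # third-party; imported only when actually needed
    return psycopg2.connect(**parse_database_url(url))
```

Note that the two drivers use different SQL placeholder styles (`?` for `sqlite3`, `%s` for `psycopg2`), so the insert statements would need to account for that.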

I’m looking forward to seeing how your big scraper goes! :rocket: