Database size limits

pwalsh · October 14, 2015, 8:45am

Hi,

What are the size restrictions on the database (if any)? We’ve just guesstimated that our clinical trials registry job will be about 3GB of data stored in SQLite as it stands today (and it will grow, albeit fairly slowly).

henare · October 15, 2015, 1:08am

There are none so far but a 3GB database would be our biggest to date.

To make it manageable for you I’d suggest splitting it up if you can. It’s just that really huge SQLite databases can be hard to work with, e.g. if you want to download the database to inspect it you’ll have to download 3GB each time. Loading it on your machine would probably take a lot of resources too and make it difficult to work with.

I’d also suggest becoming a partner if you’re using morph.io that heavily.

pwalsh · October 15, 2015, 5:18am

@henare of course I intend to support morph.io - we are really just a few days into checking it out, and trying to work out the lay of the land in terms of running the scraper from morph.io or hosting it ourselves.

In terms of splitting it up, we might be able to split data by year or something, but then we’d need to have multiple scraper instances for the same data, right? (one database per scraper)

henare · October 15, 2015, 5:36am

No worries, I was just being cheeky.

In the spirit of not prematurely optimising you could just give it a go and then refactor it if it’s not working for you.

I was a little worried about the performance of such a big SQLite database but maybe my guess is wrong? That page suggests it should be fine.

pwalsh · October 15, 2015, 5:55am

Ok. We’ll do some more work. Right now my main concern is not stressing morph.io resources. I’ll probably sync with you on that in the next few weeks.

As I said earlier, once we get the backlog of data, the ongoing scraping activity would be quite modest. Clinical Trials is just one (the biggest) of quite a few medical trial registers we’ll be acquiring data from.

For us here, SQLite is a data staging area and we’ll be getting the new data out of it at scheduled intervals… It would be really great to be able to pass morph.io connection details to a postgres server somewhere and write into that instead of sqlite. Have you thought about something like this?
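To make the "staging area" idea concrete, here is a rough sketch of pulling only the rows added since the previous export. It assumes the scraper writes to a table named `data` and only ever appends rows; the function name `fetch_new_rows` and the columns in the demo are made up for illustration. It uses SQLite's implicit `rowid` as a high-water mark, so it will not notice updated rows, and `rowid` values can be reassigned by `VACUUM` on tables without an `INTEGER PRIMARY KEY`.

```python
import sqlite3


def fetch_new_rows(conn, last_seen_rowid):
    """Return rows added after last_seen_rowid, plus the new high-water mark.

    Assumes an append-only table called `data` (an assumption, not
    something confirmed in this thread).
    """
    cur = conn.execute(
        "SELECT rowid, * FROM data WHERE rowid > ? ORDER BY rowid",
        (last_seen_rowid,),
    )
    rows = cur.fetchall()
    # The first column of each result is the rowid; keep the old mark if nothing is new.
    new_mark = rows[-1][0] if rows else last_seen_rowid
    return rows, new_mark


if __name__ == "__main__":
    # Tiny in-memory demo with hypothetical columns.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE data (trial_id TEXT, title TEXT)")
    conn.execute("INSERT INTO data VALUES ('T-1', 'First trial')")
    rows, mark = fetch_new_rows(conn, 0)
    print(len(rows), mark)
```

The scheduled job would store the returned mark somewhere durable and pass it back in on the next run, so each export only moves the new rows.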

pwalsh · October 15, 2015, 6:00am

Actually, I guess there is nothing stopping us from doing that anyway… it just would mean that the data we scrape is not available via the morph.io API…

henare · October 15, 2015, 6:38am

You’re absolutely right - you could use any external database and store the connection details in secret values… but you miss out on all the nice morph.io bits like the API so there’s not much point.
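For anyone who wants to try the external database route anyway, a minimal sketch of reading the connection details from a secret value might look like the following. As far as I know, morph.io exposes secret values to the scraper as environment variables whose names start with `MORPH_`; the variable name `MORPH_DATABASE_URL`, the URL layout, and the function `external_db_settings` are all assumptions for illustration. The sketch only parses the settings and leaves the actual Postgres connection to whichever driver you use.

```python
import os
from urllib.parse import urlparse


def external_db_settings(environ=os.environ):
    """Parse connection details from a secret value, or return None if unset.

    MORPH_DATABASE_URL is a hypothetical name chosen for this example.
    """
    url = environ.get("MORPH_DATABASE_URL")
    if not url:
        return None  # Caller can fall back to the local SQLite file.
    parts = urlparse(url)
    return {
        "host": parts.hostname,
        "port": parts.port or 5432,  # Default Postgres port if none is given.
        "user": parts.username,
        "password": parts.password,
        "dbname": parts.path.lstrip("/"),
    }


if __name__ == "__main__":
    settings = external_db_settings()
    print("external database configured" if settings else "using local SQLite")
```

Returning `None` when the variable is missing keeps the scraper runnable locally without any secrets set.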

I’m looking forward to seeing how your big scraper goes!