Storing scraped binary files/making POST calls within scraper

jeffreyliu · July 21, 2017, 5:11pm

Is there a recommended way for storing binary files that are downloaded by a scraper? For example: a site with a bunch of .pdfs.

The most obvious way seems to be storing them in the SQLite database as BLOBs along with their associated metadata, but using SQLite to store BLOBs seems a little clumsy, and I wonder if there’s a better way to expose those files for an API to retrieve.

Additionally, is it possible/allowed for scrapers to make POSTs/uploads to external servers? I understand that there’s the webhook integration, but that doesn’t transmit any of the scraped data.

Thanks!

chris48s · July 22, 2017, 12:08pm

Storing in a BLOB is one option.
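A minimal sketch of the BLOB option, using only Python's standard `sqlite3` module. The table layout and the helper names `save_file` and `load_file` are illustrative assumptions, not something morph.io prescribes. The sketch also assumes the usual morph.io convention of a `data.sqlite` file with a `data` table; in a real scraper you would open that file instead of an in-memory database.

```python
import hashlib
import sqlite3


def save_file(conn, url, content, content_type="application/pdf"):
    """Insert or replace one downloaded file, keyed by its source URL."""
    conn.execute(
        """CREATE TABLE IF NOT EXISTS data (
               url TEXT PRIMARY KEY,
               content_type TEXT,
               sha256 TEXT,
               size_bytes INTEGER,
               content BLOB
           )"""
    )
    conn.execute(
        "INSERT OR REPLACE INTO data VALUES (?, ?, ?, ?, ?)",
        (url, content_type, hashlib.sha256(content).hexdigest(), len(content), content),
    )
    conn.commit()


def load_file(conn, url):
    """Return the stored bytes for a URL, or None if it was never saved."""
    row = conn.execute("SELECT content FROM data WHERE url = ?", (url,)).fetchone()
    return None if row is None else bytes(row[0])


if __name__ == "__main__":
    # In a scraper this would be sqlite3.connect("data.sqlite").
    conn = sqlite3.connect(":memory:")
    save_file(conn, "https://example.com/report.pdf", b"%PDF-1.4 example bytes")
    print(len(load_file(conn, "https://example.com/report.pdf")))
```

Keeping the hash and size in their own columns lets an API consumer list and compare files without pulling the BLOB column itself.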

I’ve also seen a scraper that downloaded files and then used the GitHub API to commit them to a GitHub repo, but I can’t find a link to it now. In principle you could apply the same approach to any third-party storage service that has an API (so you could sync files out to Dropbox, S3, or something similar). Remember you can use secret values to store your API keys.
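This is not the scraper mentioned above, just a rough sketch of the idea using GitHub's "create or update file contents" REST endpoint. The helper name `build_github_commit_request`, the environment variable `MORPH_GITHUB_TOKEN`, and the owner, repo and path values are all hypothetical. As I understand it, morph.io exposes secret values as environment variables whose names start with `MORPH_`. Updating an existing file additionally requires that file's current `sha` in the request body, which this sketch leaves out.

```python
import base64
import json
import os
import urllib.parse
import urllib.request


def build_github_commit_request(owner, repo, path, content, token, message):
    """Build a PUT request that commits `content` (bytes) as a new file."""
    url = "https://api.github.com/repos/{}/{}/contents/{}".format(
        owner, repo, urllib.parse.quote(path)
    )
    body = json.dumps(
        {
            "message": message,
            # The endpoint expects the file contents base64-encoded.
            "content": base64.b64encode(content).decode("ascii"),
        }
    ).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="PUT",
        headers={
            "Authorization": "Bearer " + token,
            "Accept": "application/vnd.github+json",
        },
    )


if __name__ == "__main__":
    # Hypothetical secret name and repository; replace with your own.
    request = build_github_commit_request(
        "example-owner",
        "example-repo",
        "pdfs/report.pdf",
        b"%PDF-1.4 example bytes",
        os.environ["MORPH_GITHUB_TOKEN"],
        "Add report.pdf",
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        print(response.status)
```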

As far as I know, you can download or create files within the runtime of the scraper, but anything you write (other than the SQLite db) won’t persist across scraper runs (it will be destroyed along with the Docker container the scraper runs in). That means you should be fine to download or write a file and then POST it to some API within a single scraper run.
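The download-then-POST flow could look roughly like the sketch below, which writes the file to a temporary directory and uploads it within the same run. The upload endpoint, the bearer-token scheme, and the secret name `MORPH_UPLOAD_TOKEN` are assumptions for illustration. A real service such as S3 or Dropbox has its own authentication and is usually easier to reach through its client library.

```python
import os
import tempfile
import urllib.request


def download_to(path, source_url):
    """Save the body of source_url to a local file."""
    with urllib.request.urlopen(source_url, timeout=30) as response:
        with open(path, "wb") as out:
            out.write(response.read())


def build_upload_request(path, upload_url, token):
    """Build a POST request that sends the file's raw bytes as the body."""
    with open(path, "rb") as f:
        body = f.read()
    return urllib.request.Request(
        upload_url,
        data=body,
        method="POST",
        headers={
            "Authorization": "Bearer " + token,
            "Content-Type": "application/octet-stream",
        },
    )


def mirror(source_url, upload_url):
    """Download one file and upload it before the temporary copy is removed."""
    token = os.environ["MORPH_UPLOAD_TOKEN"]  # hypothetical secret value
    with tempfile.TemporaryDirectory() as workdir:
        path = os.path.join(workdir, "download.bin")
        download_to(path, source_url)
        request = build_upload_request(path, upload_url, token)
        with urllib.request.urlopen(request, timeout=30) as response:
            return response.status
```

Because the temporary directory is deleted at the end of `mirror`, nothing in this flow relies on files surviving between runs.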

equivalentideas · July 27, 2017, 6:50am

Thanks for your question, @jeffreyliu. @chris48s is spot on. You can absolutely make POST calls to upload files, and using the secret values is the way to do it.

For example, here’s a scraper we use to download images, do some processing, and then upload them to S3: https://morph.io/openaustralia/australian_local_councillors_images

Good luck! I’m sure people would find it helpful if you left a note on this thread about the approach you end up using.

All the best,

Luke