iPython Notebooks using Pandas/Numpy in scraper scripts

ingmars · August 19, 2015, 10:01am

Hi there,

I’m involved in building a platform for sharing data extraction and processing scripts for energy time series data. We would like to require our developers to use iPython notebooks, and they will mostly use Pandas / Numpy to clean up the time series data after downloading it from the sources.

Now, I’m wondering whether these iPython Notebooks could be put to a second use by running them through Morph.io and executing them regularly as scraper scripts.

This is an example Notebook (not necessarily all best practice, but it shows the idea):

As far as I’ve heard, there is a problem running Numpy/Pandas in scrapers, right? Do you think this is a problem that could eventually be solved, or is it a hard constraint that will never work?

Also, I heard that there are scripts which enable executing iPython Notebooks headlessly. Would that be an option for integrating them in a scraper with maybe some glue code around it?

Thanks in advance for any helpful reply or pointer!

Cheers
ingmar

otherchirps · August 21, 2015, 5:11am

I’m afraid I don’t know much about iPython Notebooks, except how great everyone says they are, so I won’t be able to help much. Maybe someone else who knows will chime in?

Not quite sure what you’re aiming to do with your notebooks in this case. You mention, “running them through Morph.io”.

Do you mean that you want to host your notebook on morph? Or that your notebook will pull data from morph, but you’re actually hosting it somewhere else?

You won’t be able to host the notebook itself on morph. Morph doesn’t provide general file hosting.
It does provide access to the data that your scraper produces, either as a SQLite/CSV file or via the API.

If you can run your notebook headlessly (specifically, with a launch script called “scraper.py”), and the output of that is written to a sqlite database called “data.sqlite”, then you should be able to run it on morph ok.
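
To make that a bit more concrete, here is a rough sketch of what such a “scraper.py” could look like. It is only an illustration under several assumptions: that Jupyter's nbconvert command line is available in the scraper environment, that the notebook (called “cleanup.ipynb” here, a made-up name) writes its cleaned time series to a CSV file with “timestamp” and “value” columns, and that the results go into a table named “data”. I haven't tested this on morph.

```python
import csv
import sqlite3
import subprocess

NOTEBOOK = "cleanup.ipynb"   # hypothetical notebook name
CSV_OUT = "timeseries.csv"   # hypothetical file the notebook is assumed to write


def run_notebook(path):
    """Execute the notebook headlessly through the nbconvert command line."""
    subprocess.check_call([
        "jupyter", "nbconvert", "--to", "notebook", "--execute",
        "--output", "executed.ipynb", path,
    ])


def load_csv(path):
    """Read (timestamp, value) pairs from the CSV the notebook produced."""
    with open(path, newline="") as f:
        return [(row["timestamp"], float(row["value"]))
                for row in csv.DictReader(f)]


def save_rows(rows, db_path="data.sqlite"):
    """Store the pairs in a table named 'data', replacing repeated timestamps."""
    conn = sqlite3.connect(db_path)
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "CREATE TABLE IF NOT EXISTS data "
            "(timestamp TEXT PRIMARY KEY, value REAL)"
        )
        conn.executemany(
            "INSERT OR REPLACE INTO data (timestamp, value) VALUES (?, ?)",
            rows,
        )
    conn.close()


def main():
    run_notebook(NOTEBOOK)
    save_rows(load_csv(CSV_OUT))
```

The real “scraper.py” would end with a call to main(). Using the timestamp as the primary key means that re-running the scraper updates existing rows instead of duplicating them.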

As for problems running Numpy/Pandas in scrapers, I don’t know of any general ones. You can install any libraries you like via your requirements.txt file. The only hiccup I can think of would be if your number crunching needs a long time to run, or if your script eats more than 100MB of memory. Then it’ll be killed.
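
If memory is the worry, one rough way to check before deploying is to log the script's peak memory at the end of a local run. Below is a minimal sketch using Python's standard “resource” module (Unix only). The helper names are made up, and the 100 MB mark is just the figure I mentioned, not something I've verified.

```python
import resource
import sys

LIMIT_MB = 100  # figure quoted above; an assumption, not a verified limit


def peak_memory_mb():
    """Return this process's peak resident memory in megabytes."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in kilobytes on Linux and in bytes on macOS.
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return peak / divisor


def report():
    """Print the peak memory and whether it exceeds LIMIT_MB."""
    used = peak_memory_mb()
    status = "over" if used > LIMIT_MB else "under"
    print("peak memory: %.1f MB (%s the %d MB mark)" % (used, status, LIMIT_MB))
    return used
```

Calling report() as the last step of the scraper on your own machine gives a first idea of whether the clean-up is likely to fit.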

Hope that’s of some help.