Scraping Javascript Heavy Sites - PhantomJS version [With Python and Selenium]

ianibo · August 15, 2015, 1:45am

Hallo morph.io stars - I’ve a question/problem - 15th Aug 2014

I’m working on a scraper for a [horrible] JS-heavy site (http://library.sheffield.gov.uk/uhtbin/webcat, since you asked :)). Code is here: https://github.com/ianibo/SirsiDynixIBistroScraper. I’d really like this to grow into a generic iBistro scraper that can be pipelined into other projects like projectBlacklight - but that’s getting ahead of myself.

What I’m hitting is that, after entering the search term and clicking the submit button, the next button .click() always returns a “Click succeeded but Load Failed. Status: ‘fail’” error from PhantomJS on my local machine. A bit of digging indicates that this is a well-known issue with 1.9.0-1 in the default Ubuntu repo, and the stock advice is to upgrade to a later version like 1.9.8 (https://github.com/ariya/phantomjs/issues/11443), after which everything just works again…

So - question: if I build a PhantomJS executable locally and check it into my repo, then tell Selenium in my Python script to use ./phantomjs over the system default, do we expect it will work when the scraper runs at morph.io? I’m probably going to try this anyway, but firmly expect it not to. If it doesn’t… is there any way we can get a phantomjs_1_9_8 executable into the image? Bit hamstrung without it on this scraper.

Any ideas?

P.S. If I can get this going I’ll upload a Python + Selenium example for the PhantomJS documentation section.

Cheers!
e
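The "check a binary into the repo and point Selenium at it" idea from the post above can be sketched roughly as follows. This is a minimal sketch, not the code in the linked repository: the helper names `find_phantomjs` and `make_driver` are hypothetical, and it assumes a Selenium 2/3 release, since `webdriver.PhantomJS` was removed in Selenium 4.

```python
import os
import shutil


def find_phantomjs(bundled_path="./phantomjs"):
    """Return a path to a PhantomJS binary, preferring one checked into the repo.

    bundled_path is a hypothetical location; adjust it to wherever the
    binary actually lives in your repository.
    """
    candidate = os.path.abspath(bundled_path)
    # Use the bundled binary only if it exists and is marked executable.
    if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
        return candidate
    # Otherwise fall back to whatever "phantomjs" is on PATH (None if absent).
    return shutil.which("phantomjs")


def make_driver(bundled_path="./phantomjs"):
    """Start a PhantomJS-backed Selenium driver (assumes Selenium 2/3)."""
    # Imported here so the path lookup above works without Selenium installed.
    from selenium import webdriver

    binary = find_phantomjs(bundled_path)
    if binary is None:
        raise RuntimeError("no phantomjs binary found")
    # executable_path overrides the system default phantomjs.
    return webdriver.PhantomJS(executable_path=binary)
```

Falling back to the system binary means the same script can run both locally and in an environment where the bundled executable turns out not to be usable.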


ianibo · August 15, 2015, 2:40am

Update: looks like I can execute a 64-bit exe that is checked into the git project >< Not sure if this is a good thing or not in the long run, but for now it gets me a step forward. Unfortunately, local and remote scraper runs now don’t look the same: local gets past the old failure, whereas running the scraper remotely gives the results shown at https://morph.io/ianibo/SirsiDynixIBistroScraper. The same error comes back with the 1.9.2 exe and the built-in phantomjs exe, so at least it’s consistent. I’ll pick this up again tomorrow, but if anyone else has made this work with Python/Selenium I’d be interested in hearing your experiences!
TY!
e


ianibo · August 16, 2015, 12:25am

In case anyone was going to take a look at this - the server is down. Don’t worry, it wasn’t the scraper that did it - the server is just incredibly underpowered and ill-maintained. Will get back to this once the glacial wheels have turned and the server is restarted. Have added a PhantomJS 2.0 exe to the project, plus config to report the versions of the different components, as an aid to future debugging. Any feedback on Python/PhantomJS most welcome.
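Reporting component versions at the start of a run might look something like the sketch below. It is not the configuration from the repository: `component_versions` is a hypothetical helper, and it assumes Python 3.8+ (for `importlib.metadata`) and a PhantomJS binary that answers `--version`.

```python
import platform
import shutil
import subprocess
from importlib import metadata


def component_versions(phantomjs_path=None):
    """Collect version strings for the pieces involved in a scraper run."""
    versions = {"python": platform.python_version()}

    # Selenium may not be installed where this runs, so report that explicitly.
    try:
        versions["selenium"] = metadata.version("selenium")
    except metadata.PackageNotFoundError:
        versions["selenium"] = "not installed"

    # phantomjs_path is optional; default to whatever is on PATH.
    binary = phantomjs_path or shutil.which("phantomjs")
    if binary is None:
        versions["phantomjs"] = "not found"
    else:
        try:
            result = subprocess.run(
                [binary, "--version"],
                capture_output=True, text=True, timeout=30,
            )
            versions["phantomjs"] = result.stdout.strip() or "unknown"
        except (OSError, subprocess.TimeoutExpired) as exc:
            versions["phantomjs"] = "error: {}".format(exc)
    return versions


if __name__ == "__main__":
    for name, version in sorted(component_versions().items()):
        print("{}: {}".format(name, version))
```

Printing this at the top of each run makes it easier to see whether a local run and a morph.io run are actually using the same PhantomJS build.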

otherchirps · August 16, 2015, 3:58pm

Hiya,

Sounds like a hell of a corner case you’ve fallen on… I haven’t done a heap of stuff with python + phantom, but have dabbled a little.

The few that I’ve run on morph seemed to work ok with the system default phantomjs. Here’s one that’s using the python+selenium+phantomjs combo. Superficially, it doesn’t seem to be doing anything too different to your scraper. It’s not trying to call send_keys, as you are, but I wouldn’t think that should have this kind of effect.

I cloned your repo, and had a little try. Likewise, it seemed to work locally. I took your word that it wasn’t working up on morph. Instead of stepping through selenium stuff, I dropped in a library called Splinter, which is a slightly higher-level wrapper for selenium, in python. Firstly, I ran it without your custom phantom binaries, just to try the system default.

It appeared to run ok for me over here, but there’s no data saved yet (still just print statements, instead of saving the rows. I’m guessing that was next on the list?).

So… Sorry, no clues here on where that initial error you’re seeing is coming from, but maybe this is a side-step around the problem? If it works for you as well?
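For readers who have not used Splinter, the general shape of a search driven through it is sketched below. This is an illustration rather than the code that was actually dropped into the scraper: `search_catalogue` and the form-element names passed to it are hypothetical, and `Browser("phantomjs")` assumes a Splinter release old enough to still ship the PhantomJS driver.

```python
def search_catalogue(browser, url, field_name, term, submit_name):
    """Run one search through a Splinter browser and return the page HTML.

    field_name and submit_name are hypothetical form-element names; inspect
    the real page to find the right ones.
    """
    browser.visit(url)
    # Type the search term into the named input field.
    browser.find_by_name(field_name).first.fill(term)
    # Click the named submit button and let the next page load.
    browser.find_by_name(submit_name).first.click()
    return browser.html


def main():
    # Imported lazily so the helper above can be exercised without Splinter.
    from splinter import Browser

    # Assumption: "phantomjs" is an available driver name in this release.
    browser = Browser("phantomjs")
    try:
        html = search_catalogue(
            browser,
            "http://library.sheffield.gov.uk/uhtbin/webcat",
            "search-field-name",    # placeholder, not the real field name
            "some search term",
            "submit-button-name",   # placeholder, not the real button name
        )
        print(len(html))
    finally:
        browser.quit()
```

Call `main()` to try it against the live site once the placeholder names have been replaced.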

ianibo · August 16, 2015, 5:44pm

Wow - that’s amazing - thanks so much! It seems to work perfectly for me! I’ll get to saving some data shortly - now that I know it can be made to work!

Owe you a virtual beer, thanks!
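On the "saving some data" step: morph.io picks up results from a SQLite file called data.sqlite in the scraper's working directory, reading a table named data by default. A minimal standard-library sketch of replacing print statements with saved rows could look like this. The `save_rows` helper and the column names are hypothetical; the real scraper may well use the scraperwiki library instead.

```python
import sqlite3


def save_rows(rows, db_path="data.sqlite"):
    """Insert or update scraped rows in a table called "data".

    Each row is a dict with the hypothetical keys record_id, title, author.
    """
    conn = sqlite3.connect(db_path)
    try:
        conn.execute(
            "CREATE TABLE IF NOT EXISTS data ("
            "record_id TEXT PRIMARY KEY, title TEXT, author TEXT)"
        )
        # INSERT OR REPLACE keeps re-runs from creating duplicate records.
        conn.executemany(
            "INSERT OR REPLACE INTO data (record_id, title, author) "
            "VALUES (:record_id, :title, :author)",
            rows,
        )
        conn.commit()
    finally:
        conn.close()
```

Keying on a stable record identifier means a scraper that runs daily updates existing rows rather than appending copies.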

otherchirps · August 17, 2015, 2:17pm

Ah, no worries. I know how frustrating problems with phantom errors can be. The phantom error leads to a post pointing a finger at a ghostdriver problem. Then ghostdriver points the finger at some browser quirk, and then the browsers answer to no-one…

I’m glad it worked out this time.

equivalentideas · August 20, 2015, 12:09am

Nice work @otherchirps! If you’ve got a basic setup for this working well, it could be a great addition to the documentation at https://morph.io/documentation/scraping_javascript_sites, which is currently lacking a Python entry. No pressure, mate.

otherchirps · August 20, 2015, 12:22am

Will do. I’ll try over the next day or two.

equivalentideas · August 20, 2015, 7:39am

You are a hero, @otherchirps.

JasonThomasData · October 15, 2015, 1:26am

@otherchirps, how did you get phantomJS installed?
Did you use the Node Package Manager (npm)?
Did you set up a virtualenv?

I’ve seen so many ways of installing phantomJS and I’m looking for the best method, is all.

JasonThomasData · October 15, 2015, 1:35am

Of course, I realise now my question doesn’t make sense in the context of morph.
Mine is running on a VPS.

otherchirps · October 15, 2015, 3:06am

Hey @JasonThomasData - I tend to install phantomjs system-wide, separately from any python/ruby/etc project environment.

Last time I did it, I used npm to install it globally (e.g. sudo npm install -g phantomjs). But… I had odd problems with zombie phantoms (they refused to die when done, and ended up eating lots of memory).

You might have fewer issues like that if you go through the effort of building it yourself. I think this is the way morph installed it in their environment as well (correct me if I’m wrong on this).
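On the zombie phantoms mentioned above: one possible contributor, independent of how PhantomJS was installed, is a script that exits or crashes before the driver's quit() is called. A small sketch of guarding against that follows. The `quitting` helper is hypothetical, and this may not be what caused the behaviour described in this post.

```python
from contextlib import contextmanager


@contextmanager
def quitting(driver):
    """Yield a Selenium-style driver and always call quit() afterwards."""
    try:
        yield driver
    finally:
        # quit() asks the driver to shut down the browser process it started,
        # and runs here even if the scraping code raised an exception.
        driver.quit()


# Usage (assumes a Selenium 2/3 install with PhantomJS available):
#
#     from selenium import webdriver
#     with quitting(webdriver.PhantomJS()) as driver:
#         driver.get("http://example.com/")
```

Splinter's `browser.quit()` can be wrapped the same way, since the helper only relies on the object having a quit() method.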

henare · October 15, 2015, 3:48am

It’s just apt-get installed.

otherchirps · October 15, 2015, 4:30am

oh, lol, even better. Probably the same as the built version, I guess.