Scraper using Splinter works locally but not on Morph

carysmills · January 27, 2017, 3:23pm

Hi there,

I switched my scraper from using just Selenium to Splinter so I could scrape a site using JavaScript. It’s working locally and saving to the database. But when I run it on Morph, it runs for (literally) hours if I don’t stop it, and then it eventually fails.

The really weird thing is that it says it’s visiting 18 pages and that it’s scraping rcs.ncc-ccn.ca, www.youtube.com, assets.juicer.io, and 1 other domain. It should only be scraping the first one. Can anyone help me figure out what’s going on?

Thanks,
Carys

lowndsy · January 27, 2017, 4:20pm

Not sure if this is happening for you, but I very occasionally have a scraper say it is scraping 80 pages when it should only be pulling one. It tends to be when the server is struggling, and it doesn’t do any harm - the next day things are fine.

chris48s · January 27, 2017, 7:20pm

Scrapers reporting that they have scraped pages they haven’t is a known bug: https://github.com/openaustralia/morph/issues/1078 - happens to mine all the time.

If your scraper runs locally, the fact that it is running for hours/failing may be related to the ongoing queuing and disk space issues discussed in “My scraper stuck, what to do?” and “Scrapers failing with status code 128 and 255”, rather than an issue with your scraper code.

carysmills · January 27, 2017, 7:27pm

Thanks! Do you think it could be the ongoing issues even though running a more basic test scraper worked during the times I was having these issues?

chris48s · January 29, 2017, 4:08pm

Hard to say, but looking at it now, it says it has been queued for 2 days. As I understand it, if a scraper has actually legitimately been running for 24 hours, it gets killed off so I think that means it is stuck. If you ask @henare nicely (or badger him every few days in my case) he will be able to kill it and you can see if you can get a successful run out of it. Fingers crossed…

carysmills · January 29, 2017, 4:22pm

@henare are you able to kill my scraper? Do you think the issue is that it’s stuck?

chris48s · January 30, 2017, 9:52pm

Just had a look at your scraper and, now that you’ve got a clean run, it is failing with:
ImportError: No module named splinter
If you want to use third-party libraries, you’ll need to add a requirements.txt file to your repository listing any dependencies your scraper needs to run. That will allow Morph to pull in those dependencies before your scraper runs, and that should sort your problem. Good luck.
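For anyone hitting the same ImportError, here is a rough sketch of one way to check which modules are missing from an environment before deciding what to list in requirements.txt. The function name missing_modules and the list of module names are made up for illustration and are not taken from the scraper in question. Note that the name you import is not always the same as the package name that pip installs, so treat the output as a hint rather than a ready-made requirements.txt.

```python
import importlib


def missing_modules(names):
    """Return the module names from `names` that fail to import here."""
    missing = []
    for name in names:
        try:
            # Actually attempt the import, as the scraper itself would.
            importlib.import_module(name)
        except ImportError:
            missing.append(name)
    return missing


if __name__ == "__main__":
    # Hypothetical list: replace with the top-level modules your scraper imports.
    wanted = ["splinter", "selenium"]
    for name in missing_modules(wanted):
        print("not installed: " + name)
```

Running this in a fresh virtualenv (rather than your usual local setup) gives a closer picture of what a clean environment would be missing, since a local machine often has extra packages already installed.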

carysmills · January 31, 2017, 12:47pm

Thanks. It’s because I’m running my scraper elsewhere now, as I had so many problems. I previously had a requirements doc.