I have a scraper that has a long run time (about 5 minutes). The output I log to the console is displayed in a scattered manner on the morph console but displays fine on my system.
I would appreciate suggestions about the possible cause.
Can you provide a link to the scraper?
Yes.
https://morph.io/omorilewa/healthtools_ke
The line [Doctors Scraper] .... gets printed twice. Also, I tried to add a progress bar, but it is displayed one line at a time (in red). The logs are completely normal on my local machine.
I don’t really have a solution for this, but maybe an explanation:
When you run morph-cli, it sends your script to the server, and then the server sends back chunks of terminal output, which can then be rendered client-side.
I had a quick go at running your scraper with morph-cli, and I found that if I ran it as-is, it would render
[Doctors Scraper]
[timestamp] - Started Scraper.
[Doctors Scraper]
[timestamp] - Started Scraper.
(as you’ve reported)
… but I found that if I put a print statement in the main scraper loop, that didn’t happen. So, for example, if I add print "scraping page " + str(page_num) inside that loop, then running morph-cli will output
[Doctors Scraper]
[2017-07-11 18:25:23] - Started Scraper.
scraping page 1
scraping page 2
....
scraping page 234
scraping page 235
100%|##########|
and then move on to [Clinical Officers Scraper] without repeating [Doctors Scraper].
…so I think the repeated output happens because, if enough time passes without any new console output being generated, it just sends whatever was in the last output buffer again (there’s probably a bug report to be raised there). By the looks of it, you can work around that by making sure your scraper produces output more regularly.
It isn’t actually running your code twice on the server if that’s what you’re concerned about.
I suspect that a progress bar will always get returned one line at a time.
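To make the workaround concrete, here is a minimal sketch of printing plain progress lines instead of relying on a redrawing progress bar. The helper name report_progress and the every parameter are made up for illustration and are not part of the scraper; the total of 235 pages is taken from the output above. The explicit flush is an assumption on my part that output buffering could also delay lines; I haven't confirmed it matters on morph.

```python
from __future__ import print_function  # keeps this usable on Python 2 as well
import sys


def report_progress(page_num, total_pages, every=10, stream=sys.stdout):
    """Print one plain progress line every `every` pages.

    Also prints on the first and last page. Returns True if a line
    was printed, False otherwise. (Hypothetical helper, not from the scraper.)
    """
    if page_num == 1 or page_num == total_pages or page_num % every == 0:
        print("scraping page {} of {}".format(page_num, total_pages), file=stream)
        # Flush so the line leaves the process straight away
        stream.flush()
        return True
    return False


if __name__ == "__main__":
    total = 235
    for page_num in range(1, total + 1):
        # The real per-page scraping work would happen here
        report_progress(page_num, total)
```

Because each message is a complete line ending in a newline, it should render the same way locally and on the server, unlike a bar that rewrites the same line.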
@chris48s, thanks a lot.
I will stick to adding more print statements then.
Nice detective work, @chris48s. I’m guessing it’s related to this issue:
The Docker client has a read timeout for console output. We set this to be 5 minutes. This means if a scraper doesn’t output anything for 5 minutes the background worker throws an exception: Docker::Error::TimeoutError: read timeout reached.
Nice workaround
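If a single step can stay silent for a long time and there is no loop to put a print statement in, one possible variation is a background heartbeat that writes a line at a fixed interval. This is only a sketch: the Heartbeat class, its interval parameter, and the "still running..." message are invented for illustration, and I haven't verified how the morph console treats output like this.

```python
import sys
import threading


class Heartbeat(object):
    """Write a short line to `stream` every `interval` seconds while active.

    Hypothetical helper for illustration; not part of morph or the scraper.
    """

    def __init__(self, interval=60.0, stream=sys.stdout):
        self.interval = interval
        self.stream = stream
        self.beats = 0
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run)
        self._thread.daemon = True  # never keep the process alive on its own

    def _run(self):
        # Event.wait returns False on timeout and True once stop is requested
        while not self._stop.wait(self.interval):
            self.stream.write("still running...\n")
            self.stream.flush()
            self.beats += 1

    def __enter__(self):
        self._thread.start()
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        self._stop.set()
        self._thread.join()
        return False  # do not swallow exceptions from the wrapped block
```

It would be used by wrapping the slow part of the scraper, for example `with Heartbeat(interval=60):` around the call that takes a long time, so that the interval stays well under the 5 minute read timeout mentioned above.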