Chrome headless scrapers appear broken

andylolz · October 22, 2018, 7:28am

A scraper of mine seems to have broken at the same time as this update.

It fails with:

selenium.common.exceptions.WebDriverException: Message: unknown error: Chrome failed to start: crashed
(Driver info: chromedriver=2.37.544315 (730aa6a5fdba159ac9f4c1e8cbc59bf1b5ce12b7),platform=Linux 4.17.17-x86_64-linode116 x86_64)

It’s using chromedriver from Python, as described here:
https://morph.io/documentation/scraping_javascript_sites

The examples on the docs page above also seem to have broken at the same time, e.g.:
https://morph.io/wfdd/inatsisartut-scraper/history

I just tried re-running the really basic example, and that broke in the same way:

So all this seems to suggest it’s related to this update. What can be done to get this working again? Thanks!

jamezpolley · November 9, 2018, 12:54am

I’ve opened #1201 about this issue

jamezpolley · January 8, 2019, 7:17am

@andylolz Just a note to let you know I’m still working on this.

Are you able to point me at the particular error you mentioned above? I’m seeing a different error on https://morph.io/andylolz/example_ruby_chrome_headless_scraper, and https://morph.io/wfdd/inatsisartut-scraper/history seems to be mostly working (but my fork of it fails with a different error)

/app/vendor/ruby-2.5.0/lib/ruby/2.5.0/net/protocol.rb:181:in `rbuf_fill': Net::ReadTimeout (Net::ReadTimeout)
 	from /app/vendor/ruby-2.5.0/lib/ruby/2.5.0/net/protocol.rb:157:in `readuntil'
 	from /app/vendor/ruby-2.5.0/lib/ruby/2.5.0/net/protocol.rb:167:in `readline'
 	from /app/vendor/ruby-2.5.0/lib/ruby/2.5.0/net/http/response.rb:40:in `read_status_line'

I can reproduce that error in my test environment, but it’s not the error you originally reported. I want to check that I’m looking at the right problem.

andylolz · January 8, 2019, 11:03am

Hi @jamezpolley,

Thanks for your work on this!

I think wfdd/inatsisartut-scraper is now (more or less) working because it was updated to use phantomjs (presumably because of the problem with using Chrome on morph).

The error on andylolz/example_ruby_chrome_headless_scraper is the one I’m having trouble with:

/app/vendor/bundle/ruby/2.5.0/gems/selenium-webdriver-3.11.0/lib/selenium/webdriver/remote/response.rb:69:in `assert_ok': unknown error: Chrome failed to start: crashed (Selenium::WebDriver::Error::UnknownError)
   (Driver info: chromedriver=2.37.544315 (730aa6a5fdba159ac9f4c1e8cbc59bf1b5ce12b7),platform=Linux 4.18.16-x86_64-linode118 x86_64)
 	from /app/vendor/bundle/ruby/2.5.0/gems/selenium-webdriver-3.11.0/lib/selenium/webdriver/remote/response.rb:32:in `initialize'
 	from /app/vendor/bundle/ruby/2.5.0/gems/selenium-webdriver-3.11.0/lib/selenium/webdriver/remote/http/common.rb:81:in `new'
 	from /app/vendor/bundle/ruby/2.5.0/gems/selenium-webdriver-3.11.0/lib/selenium/webdriver/remote/http/common.rb:81:in `create_response'
 	from /app/vendor/bundle/ruby/2.5.0/gems/selenium-webdriver-3.11.0/lib/selenium/webdriver/remote/http/default.rb:104:in `request'
 	from /app/vendor/bundle/ruby/2.5.0/gems/selenium-webdriver-3.11.0/lib/selenium/webdriver/remote/http/common.rb:59:in `call'
 	from /app/vendor/bundle/ruby/2.5.0/gems/selenium-webdriver-3.11.0/lib/selenium/webdriver/remote/bridge.rb:164:in `execute'
 	from /app/vendor/bundle/ruby/2.5.0/gems/selenium-webdriver-3.11.0/lib/selenium/webdriver/remote/bridge.rb:97:in `create_session'
 	from /app/vendor/bundle/ruby/2.5.0/gems/selenium-webdriver-3.11.0/lib/selenium/webdriver/remote/bridge.rb:53:in `handshake'
 	from /app/vendor/bundle/ruby/2.5.0/gems/selenium-webdriver-3.11.0/lib/selenium/webdriver/chrome/driver.rb:47:in `initialize'
 	from /app/vendor/bundle/ruby/2.5.0/gems/selenium-webdriver-3.11.0/lib/selenium/webdriver/common/driver.rb:44:in `new'
 	from /app/vendor/bundle/ruby/2.5.0/gems/selenium-webdriver-3.11.0/lib/selenium/webdriver/common/driver.rb:44:in `for'
 	from /app/vendor/bundle/ruby/2.5.0/gems/selenium-webdriver-3.11.0/lib/selenium/webdriver.rb:85:in `for'
 	from /app/vendor/bundle/ruby/2.5.0/gems/capybara-2.18.0/lib/capybara/selenium/driver.rb:23:in `browser'
 	from /app/vendor/bundle/ruby/2.5.0/gems/capybara-2.18.0/lib/capybara/selenium/driver.rb:49:in `visit'
 	from /app/vendor/bundle/ruby/2.5.0/gems/capybara-2.18.0/lib/capybara/session.rb:274:in `visit'
 	from scraper.rb:8:in `<main>'

jamezpolley · April 12, 2019, 11:56am

I’ve been able to make some progress on this, and I’ve got the example scraper working again.

I found that in Python I had to add in the --no-sandbox flag to get Chrome to run inside the container.
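
For anyone hitting the same crash, here is a minimal sketch of what passing that flag can look like with the Python selenium bindings. It is not the exact code from the fixed example scraper: the helper names `chrome_arguments` and `build_driver` are made up for illustration, the `--headless` flag is an assumption based on the thread title, and it assumes chromedriver is on the PATH and a selenium version that accepts the `options=` keyword.

```python
def chrome_arguments(extra_args=()):
    """Return the command-line flags to pass to Chrome (hypothetical helper)."""
    # --headless: run without a display (assumed, since these are headless scrapers).
    # --no-sandbox: the flag mentioned above that lets Chrome start in the container.
    return ["--headless", "--no-sandbox", *extra_args]


def build_driver(extra_args=()):
    """Start Chrome through chromedriver with the flags above (hypothetical helper)."""
    # Imported here so the flag list can be inspected without selenium installed.
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    for arg in chrome_arguments(extra_args):
        options.add_argument(arg)
    return webdriver.Chrome(options=options)
```

A scraper would then call `driver = build_driver()`, fetch pages with `driver.get(url)`, and finish with `driver.quit()`.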

I’ve done the same thing in Ruby and it works there too. Capybara doesn’t seem to have any way to pass options through, so I’ve defined a new driver. It’s the same as Capybara’s own headless Chrome driver, but with the extra argument.

andylolz · November 9, 2019, 11:28pm

This sounds great – many thanks for following up on this, @jamezpolley.