Cannot Scrape HTTPS Site (SSL Error)

MichaelBone · July 9, 2018, 11:57am

Hi,

I can’t figure out how to avoid the following SSL errors when scraping an HTTPS web site using morph.io, even though the exact same scraper code works on my Windows 10 PC when run from VSCode:

write EPROTO 140366154426240:error:140770FC:SSL routines:SSL23_GET_SERVER_HELLO:unknown protocol:../deps/openssl/openssl/ssl/s23_clnt.c:827:

and (when secureProtocol is set to TLSv1_method, TLSv1_1_method or TLSv1_2_method):

write EPROTO 140587175036800:error:1408F10B:SSL routines:SSL3_GET_RECORD:wrong version number:../deps/openssl/openssl/ssl/s3_pkt.c:365:

This is my scraper:

https://morph.io/MichaelBone/city_of_burnside_sa_development_applications

And this is the web site (which I’m fairly sure is TLS 1.0, 1.1 or 1.2 because IE 11 rejects it if I don’t enable a TLS version in its options):

I’ve tried “agentOptions: { secureProtocol: "TLSv1_2_method" }” and just about every other TLS and SSL method including SSLv23_method.

I’ve tried rewriting it in Ruby and using headless Chrome (and PhantomJS): see https://morph.io/MichaelBone/city_of_burnside_south_australia_development_applications_test (based on the examples at https://morph.io/documentation/scraping_javascript_sites).

I’ve tried “--ignore-ssl-errors=yes” and “--ssl-protocol=any” when using phantomjs.

I’ve tried “process.env.NODE_TLS_REJECT_UNAUTHORIZED = '0'”.

I’ve tried different versions of node.js (including 10.6.0).

I’ve experimented with introducing certificates using ssl-root-cas. I’ve experimented with “verify_mode: false”, “use_ssl = false” and “verify_mode = OpenSSL::SSL::VERIFY_NONE”.

In all cases I’m either getting an SSL error or a “blank” page (and by a “blank” page I mean just “<html xmlns="http://www.w3.org/1999/xhtml"><head></head><body></body></html>” from capybara instead of the fairly extensive HTML of the actual web site).

I’m beginning to wonder if the mitmproxy is causing the issue (maybe similar to https://github.com/openaustralia/morph/issues/1135 or maybe https://github.com/openaustralia/buildstep is relevant). Perhaps somehow bypassing or turning off the mitmproxy would help. Or maybe the https site has SSL configured in an unusual way.
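
One way to narrow this down might be to probe the site with each TLS version pinned in turn, once from a local machine and once from inside the scraper. Both errors above typically mean that the bytes that came back were not a TLS record at all (for example, plain HTTP from something in between), so a version that succeeds locally but fails on morph.io would point at the proxy rather than the site. The following is only a rough sketch using Ruby's standard openssl library (Ruby 2.5 or later); the helper names pinned_context and probe, the 5-second connect timeout and the command-line arguments are assumptions made for illustration, not anything morph.io provides:

```ruby
require "openssl"
require "socket"

# TLS versions to probe, keyed by a label used in the report.
TLS_VERSIONS = {
  "TLS 1.0" => OpenSSL::SSL::TLS1_VERSION,
  "TLS 1.1" => OpenSSL::SSL::TLS1_1_VERSION,
  "TLS 1.2" => OpenSSL::SSL::TLS1_2_VERSION
}.freeze

# Hypothetical helper: build a context pinned to exactly one protocol version.
def pinned_context(version)
  ctx = OpenSSL::SSL::SSLContext.new
  ctx.min_version = version
  ctx.max_version = version
  # Certificate checks are switched off for this diagnostic only,
  # so that a certificate problem is not mistaken for a protocol problem.
  ctx.verify_mode = OpenSSL::SSL::VERIFY_NONE
  ctx
end

# Hypothetical helper: attempt one handshake and describe the outcome.
def probe(host, port, version)
  tcp = Socket.tcp(host, port, connect_timeout: 5)
  ssl = OpenSSL::SSL::SSLSocket.new(tcp, pinned_context(version))
  ssl.hostname = host # send SNI, as browsers do
  ssl.connect
  "ok (#{ssl.ssl_version})"
rescue OpenSSL::SSL::SSLError, SystemCallError, SocketError => e
  "failed: #{e.message}"
ensure
  tcp&.close
end

# Only touch the network when a host and port are given on the command line.
if ARGV.size == 2
  host = ARGV[0]
  port = Integer(ARGV[1])
  TLS_VERSIONS.each do |label, version|
    puts "#{label}: #{probe(host, port, version)}"
  end
end
```

Saved under a name such as probe_tls.rb (the file name is arbitrary), it would be run as "ruby probe_tls.rb <hostname> 443". Note that the handshake itself has no timeout in this sketch, so an unresponsive server could make it hang.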

Any ideas? Can you help?

thanks,
Michael

MichaelBone · July 18, 2018, 12:34am

Answering my own question

I found a workaround: use a proxy server. Define a “Secret environment variable” named, say, “MORPH_PROXY” (under the Settings button on the scraper page in morph.io) and set it to the address of a known, working proxy server that supports HTTPS, in host:port form. Then add the following code to the (Ruby) scraper script to read the secret environment variable and set the proxy server on the agent:

require 'mechanize'
# MORPH_PROXY holds the proxy address as "host:port"
host, port = ENV['MORPH_PROXY'].split(":")
agent = Mechanize.new
agent.set_proxy(host, port)

In my case this allowed the web site to be successfully scraped (but, of course, this relies on that proxy server - which is independent of morph.io - to continue to work).
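
One caveat with the snippet above: ENV['MORPH_PROXY'] is nil when the variable is not defined, and calling split on nil raises NoMethodError, so the scraper would stop if the setting were ever removed. A minimal sketch of a more forgiving version follows; parse_proxy is a hypothetical helper name, and treating a missing or malformed value as "no proxy" is an assumption about the desired behaviour:

```ruby
# Hypothetical helper: parse a "host:port" proxy setting.
# Returns [host, port] or nil when the value is missing or malformed.
def parse_proxy(value)
  return nil if value.nil? || value.strip.empty?

  host, port = value.strip.split(":", 2)
  return nil if host.nil? || host.empty?
  return nil if port.nil? || port !~ /\A\d+\z/

  [host, port.to_i]
end

proxy = parse_proxy(ENV["MORPH_PROXY"])
if proxy
  puts "Using proxy #{proxy[0]}:#{proxy[1]}"
  # With a Mechanize agent this would be: agent.set_proxy(*proxy)
else
  puts "MORPH_PROXY not set or invalid; connecting directly"
end
```

This way the same script runs whether or not the secret environment variable is defined.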

jamezpolley · March 28, 2019, 1:47am

Eight months later, we’ve pushed a newer version of mitmproxy, which seems to handle HTTPS better.

I’ve updated https://morph.io/planningalerts-scrapers/burnside - which I think is a fork of the repo you mentioned above - to disable the MORPH_PROXY setting, and confirmed that it’s now working through mitmproxy. Could you confirm if this fixes your original problem?

MichaelBone · March 28, 2019, 9:19am

Yes, great work @jamezpolley! That fixes my original problem.

(I’ve just tested by removing the proxy server setting from the 19 scrapers I had added it to, and they now all work correctly without it, i.e. no more SSL errors.)

thanks,
Michael