New scraper platform available for testing

Cross-posted from https://www.oaf.org.au/2019/04/10/a-new-era-for-morph-io/

When morph.io was built, we leveraged a lot of awesome existing work already in the open-source ecosystem. In particular, we were able to lean heavily on the work the Herokuish project had done to create a platform that can run scripts written in a variety of languages. The Herokuish project, in turn, built on the cedar-14 stack, which had been open sourced by Heroku; that in turn was built on top of Ubuntu 14.04, which builds on countless contributions from people all over the world.

Five years later, the cedar-14 stack is nearing the end of its life. As of this month (April 2019) Canonical will no longer support Ubuntu 14.04; Heroku will no longer support cedar-14 - and morph.io will no longer be able to support the platform we've relied on for so long.

Instead, we're going to be rolling out a new platform. We aren't changing too much though - the components we've been using are working well - we're simply upgrading them. The morph.io scraper platform will now be based on the current version of Herokuish, which is based on the Heroku-18 stack, which is based on Ubuntu 18.04, which is based on even more contributions from even more people around the world.

To help smooth the transition, we've added multi-platform support to morph.io. The new platform is available now for you to start testing, and to start using if you find that it works for you.

We will be changing the default platform in early May when it's ready - see the Switch default platform project on GitHub for progress and updates. At that time, any scrapers which haven't explicitly chosen to stay on the existing platform will start to run on the new platform. We'll still keep the old platform around for a while longer to help anyone who didn't manage to update their scraper before the change.

What is changing?

Supported language versions

Heroku-18 supports more recent language versions than were usable on cedar-14. However, this means that many older versions aren't supported any more. If you were relying on an older version of your language, you will likely need to check that your scraper is compatible with one of the newer versions.

Heroku-18 currently supports Ruby 2.4.6, 2.5.5, and 2.6.2, as well as a wider list of Ruby versions available through JRuby. Python versions 2.7.16, 3.6.8, and 3.7.3 are currently supported. PHP is currently supported for versions 7.1, 7.2, and 7.3. A full list of the supported versions for every language is available here.

If your scraper specifies a different language version, it's possible that Herokuish may still be able to install and run that version for you. However, we suggest that you upgrade to the supported versions wherever possible.
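As a rough illustration, the versions quoted above can be kept in a small lookup table to check a scraper's declared version before switching platforms. This is only a sketch: the `SUPPORTED` table is copied from the April 2019 list in this post and will go stale, and the `is_supported` helper is hypothetical, not part of morph.io or Herokuish.

```python
# Versions quoted in this post (April 2019); the live Heroku list may differ.
SUPPORTED = {
    "ruby": ["2.4.6", "2.5.5", "2.6.2"],
    "python": ["2.7.16", "3.6.8", "3.7.3"],
    "php": ["7.1", "7.2", "7.3"],
}


def is_supported(language, version):
    """Return True if `version` matches a listed version for `language`.

    A listed version such as "7.1" also matches more specific versions
    such as "7.1.28", because PHP is listed by minor series only.
    """
    listed = SUPPORTED.get(language.lower(), [])
    return any(version == v or version.startswith(v + ".") for v in listed)
```

A result of `False` does not necessarily mean the scraper will fail, since Herokuish may still be able to install an unlisted version; it only flags scrapers worth testing first.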

Operating system packages

The shift from Ubuntu 14.04 to Ubuntu 18.04 is a large one. Many of the packages on the base platform have been upgraded to a much newer version. However, in some cases, packages have been replaced by a newer equivalent package, or have gone away completely. In some cases, new packages have been added that add capabilities that weren't present before.

For the most part, we expect that you probably won't need to pay attention to the OS package versions. However, the full list of packages on both cedar-14 and heroku-18 is available from Heroku's support site.

What do I need to do?

Choosing a platform

You can choose which platform you want to use by including a file named platform in the top level of your scraper's repository. The contents of the file should match one of the tags in our buildstep container registry. Currently, the available tags are early_release for testing the new platform, and cedar-14 to keep your scraper using the old platform. If no platform file is found, we'll use the default release.

You can see an example of a scraper being configured for the new platform here.
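The platform file itself holds nothing but the tag name. A minimal Python sketch of creating it follows; `write_platform_file` and `KNOWN_TAGS` are hypothetical names used for illustration, the two tags are the ones named above, and the trailing newline is an assumption about what morph.io accepts.

```python
from pathlib import Path

# Tags named in this post; the registry may gain others over time.
KNOWN_TAGS = ("early_release", "cedar-14")


def write_platform_file(repo_dir, tag):
    """Write `tag` to a file named `platform` at the top level of repo_dir."""
    if tag not in KNOWN_TAGS:
        raise ValueError(f"unknown platform tag: {tag!r}")
    path = Path(repo_dir) / "platform"
    # Assumption: a single line with a trailing newline is accepted.
    path.write_text(tag + "\n")
    return path
```

Calling `write_platform_file(".", "early_release")` from the repository root and committing the resulting file would opt a scraper in to the new platform; passing `"cedar-14"` instead keeps it on the old one.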

Reporting problems

If you find bugs or run into problems on the new platform, please let us know on https://help.morph.io/

Very exciting @jamezpolley :slight_smile: Thanks for this work.

I’ve just opted this scraper in, and running it, I’m getting no output https://morph.io/austccr/australian_enterprise_bargaining_agreements

It’s hard to tell if it’s running and there’s no output, or if it’s stuck.

You’ll see my first run had an error, which was the platform missing the bundler version. Which reminded me there was an update I could opt into :slight_smile:

Hope that’s helpful. I’m happy to roll it back and use the older bundler version to get it working, but thought this might be useful for you.

Best wishes,

Luke

Hope that’s helpful. I’m happy to roll it back and use the older bundler version to get it working, but thought this might be useful for you.

I rolled back, and the scraper is producing output again.

Thanks Luke. I’ve created issue #1222, "Ruby scrapers on new platform aren't producing output", in openaustralia/morph on GitHub to track this.