Python3 Scraper with no requirements.txt fails

yuletide · March 13, 2019, 2:23am

While trying to debug a “table swvariables not found” error after upgrading to Python 3 (because of a different issue), I’m now seeing this when trying to run a Python 3.6.2 scraper without a requirements.txt specified:

> Injecting configuration and compiling...
> -----> Python app detected
> -----> Installing python-3.6.2
> -----> Installing pip
> -----> Installing requirements with pip
>        Obtaining scraperwiki from git+http://github.com/openaustralia/scraperwiki-python.git@morph_defaults#egg=scraperwiki (from -r /tmp/build/requirements.txt (line 2))
>        Cloning http://github.com/openaustralia/scraperwiki-python.git (to revision morph_defaults) to /app/.heroku/src/scraperwiki
>        Switched to a new branch 'morph_defaults'
>        Branch morph_defaults set up to track remote branch morph_defaults from origin.
>        Collecting BeautifulSoup==3.2.0 (from -r /tmp/build/requirements.txt (line 9))
>        Downloading https://files.pythonhosted.org/packages/33/fe/15326560884f20d792d3ffc7fe8f639aab88647c9d46509a240d9bfbb6b1/BeautifulSoup-3.2.0.tar.gz
>        Complete output from command python setup.py egg_info:
>        Traceback (most recent call last):
>        File "<string>", line 1, in <module>
>        File "/tmp/pip-install-aapzjpq7/BeautifulSoup/setup.py", line 22
>        print "Unit tests have failed!"
>        ^
>        SyntaxError: Missing parentheses in call to 'print'
>
> ----------------------------------------
>        Command "python setup.py egg_info" failed with error code 1 in /tmp/pip-install-aapzjpq7/BeautifulSoup/

LoveMyData · March 13, 2019, 7:04am

I think you need to specify a requirements.txt file.

https://morph.io/documentation/libraries

yuletide · March 14, 2019, 5:00pm

For context: I thought so too but was running into issues with the scraper not seeing my DB table (table swvariables not found) so I thought the requirements.txt was maybe messing up the db injection. Perhaps it was actually a version issue with SQLAlchemy instead?

I downgraded to Python 2.7.14 and it built and connected to the db without issue (without a requirements file). Now I’m getting an SSL error with requests, so I’m looking into that.

LoveMyData · March 14, 2019, 10:36pm

The requirements.txt problem may be related to this, and the SSL error may be related to this.

My understanding is that all traffic from morph goes via a proxy on the host itself. This is how morph gets the “x pages scraped” figure in the summary.

yuletide · March 19, 2019, 8:28pm

Thanks for the reply! With that buildpack issue, I’m wondering how it would prevent scraperwiki from finding the db. My Python 3 requirements.txt specified scraperwiki 0.5.1 and SQLAlchemy 1.3.1, which ran fine locally but failed when pushed. Is it using a different version somehow?

For the SSL error, does that mean the issue is due to the site cert being out of date? I also tried setting verify=False, but it still fails.

LoveMyData · March 20, 2019, 12:01am

I don’t understand it fully, but I think requirements.txt is a bit like composer.json in the PHP world, and all of my PHP scrapers that use composer are failing, see here.

Those that don’t use composer are working.

jamezpolley · April 12, 2019, 12:11pm

I’m not sure where it’s coming from, but there seems to be a requirements.txt in use here:

 Obtaining scraperwiki ... (from -r /tmp/build/requirements.txt (line 2))
 Collecting BeautifulSoup==3.2.0 (from -r /tmp/build/requirements.txt (line 9))

That’s really, really old: BeautifulSoup 3.x only works on Python 2.x, and the “Missing parentheses” error comes from Python 3 trying to run the Python 2 code in its setup.py.
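The same failure can be reproduced without pip. Here is a minimal sketch, assuming any Python 3 interpreter; the helper name `syntax_error_for` is made up for illustration, and the Python 2 line is the one shown in the traceback:

```python
def syntax_error_for(source):
    """Return the SyntaxError message from compiling `source`, or None if it compiles."""
    try:
        compile(source, "<setup.py>", "exec")
    except SyntaxError as exc:
        return exc.msg
    return None


# Line 22 of the BeautifulSoup 3.2.0 setup.py, as quoted in the build log.
py2_line = 'print "Unit tests have failed!"'
# The Python 3 spelling of the same statement.
py3_line = 'print("Unit tests have failed!")'

print(syntax_error_for(py2_line))  # a "Missing parentheses" message on Python 3
print(syntax_error_for(py3_line))  # None: this form compiles
```

The exact wording of the message varies a little between Python 3 releases, but it is a syntax error raised before any of the package's code runs, which is why pip fails at the `egg_info` step.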

Ah, here we go!

The requirements.txt seems to be this default file. This is mentioned in https://github.com/openaustralia/morph/blob/master/app/views/documentation/default_libraries/_python.html.haml - but I can’t see where that ends up in the actual live docs.

For now the best suggestion I’ve got is to just add an empty requirements.txt to make sure that the default one isn’t getting used. That’s not a great long-term solution, but it’s 10:30pm on Friday.
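As a rough sketch of that workaround, the snippet below creates an empty requirements.txt in a scraper checkout if there isn't one already. The function name `ensure_requirements_file` is hypothetical, and it assumes that a requirements.txt at the top level of the scraper repository is what takes precedence over the default file.

```python
from pathlib import Path


def ensure_requirements_file(scraper_dir):
    """Create an empty requirements.txt in `scraper_dir` unless one exists.

    Returns True if a new file was created, False if one was already there.
    """
    path = Path(scraper_dir) / "requirements.txt"
    if path.exists():
        return False  # never overwrite an existing requirements.txt
    path.touch()  # creates a zero-byte file
    return True


if __name__ == "__main__":
    # Assumption: run from the top level of the scraper repository.
    created = ensure_requirements_file(".")
    print("created requirements.txt" if created else "requirements.txt already present")
```

Note that with an empty file pip has nothing to install, so a scraper that imports scraperwiki would presumably still need that listed in the file.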