Hey guys
I’m scraping an AustLII database for a particular kind of legal decision.
http://www.austlii.edu.au/
When I fetch it with Python requests, the page returns this:
'The requested resource is no longer available on this server and there is no forwarding address. Please remove all references to this resource.'
I thought this might be a dynamic page, so I tested it in a headless web browser called Spynner, but that did not work.
So I have ended up using Selenium WebDriver to pass the request through Firefox before it’s sent to the server.
It works, and I have no idea why.
Any thoughts?
henare
Is it the home page you’re scraping or some other page?
All the AustLII pages do that. I’m scraping this series of decisions in particular: http://www.austlii.edu.au/au/cases/cth/AICmr/
henare
Wow, I get the same with Ruby Mechanize
2.0.0-p353 :003 > agent.get "http://www.austlii.edu.au/au/cases/cth/AICmr/"
Mechanize::ResponseCodeError: 410 => Net::HTTPGone for http://www.austlii.edu.au/au/cases/cth/AICmr/ -- unhandled response
from /home/henare/.rvm/gems/ruby-2.0.0-p353/gems/mechanize-2.7.3/lib/mechanize/http/agent.rb:308:in `fetch'
from /home/henare/.rvm/gems/ruby-2.0.0-p353/gems/mechanize-2.7.3/lib/mechanize.rb:440:in `get'
from (irb):3
from /home/henare/.rvm/rubies/ruby-2.0.0-p353/bin/irb:12:in `<main>'
2.0.0-p353 :004 >
And even wget!
$ wget http://www.austlii.edu.au/au/cases/cth/AICmr/
--2015-11-12 13:46:01-- http://www.austlii.edu.au/au/cases/cth/AICmr/
Resolving www.austlii.edu.au (www.austlii.edu.au)... 138.25.65.22
Connecting to www.austlii.edu.au (www.austlii.edu.au)|138.25.65.22|:80... connected.
HTTP request sent, awaiting response... 410 Gone
2015-11-12 13:46:02 ERROR 410: Gone.
$
They’re checking the User-Agent header. Spoof it:
2.0.0-p353 :005 > agent.user_agent = "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1)"
=> "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1)"
2.0.0-p353 :006 > agent.get "http://www.austlii.edu.au/au/cases/cth/AICmr/"
=> #<Mechanize::Page
{url
#<URI::HTTP:0x00000004f9a160 URL:http://www.austlii.edu.au/au/cases/cth/AICmr/>}
{meta_refresh}
{title "Australian Information Commissioner"}
{iframes}
{frames}
{links
#<Mechanize::Page::Link "AustLII" "/">
#<Mechanize::Page::Link "Home" "/">
#<Mechanize::Page::Link "Databases" "/databases.html">
#<Mechanize::Page::Link "WorldLII" "http://www.worldlii.org">
#<Mechanize::Page::Link "Search" "/forms/search1.html">
#<Mechanize::Page::Link "Feedback" "/austlii/feedback.html">
#<Mechanize::Page::Link "Help" "/austlii/help/">
#<Mechanize::Page::Link "AustLII" "/">
#<Mechanize::Page::Link "Databases" "/databases.html">
#<Mechanize::Page::Link
"Database Search"
"/form/search1.html?mask=au/cases/cth/AICmr">
#<Mechanize::Page::Link
"Name Search"
"/form/search1.html?mask=au/cases/cth/AICmr&title=1">
#<Mechanize::Page::Link "Recent Decisions" "recent.html">
#<Mechanize::Page::Link "Help" "/austlii/help/cases.html">
#<Mechanize::Page::Link "A" "toc-A.html">
#<Mechanize::Page::Link "B" "toc-B.html">
#<Mechanize::Page::Link "C" "toc-C.html">
#<Mechanize::Page::Link "D" "toc-D.html">
#<Mechanize::Page::Link "E" "toc-E.html">
#<Mechanize::Page::Link "F" "toc-F.html">
#<Mechanize::Page::Link "G" "toc-G.html">
#<Mechanize::Page::Link "H" "toc-H.html">
#<Mechanize::Page::Link "I" "toc-I.html">
#<Mechanize::Page::Link "J" "toc-J.html">
#<Mechanize::Page::Link "K" "toc-K.html">
#<Mechanize::Page::Link "L" "toc-L.html">
#<Mechanize::Page::Link "M" "toc-M.html">
#<Mechanize::Page::Link "N" "toc-N.html">
#<Mechanize::Page::Link "O" "toc-O.html">
#<Mechanize::Page::Link "P" "toc-P.html">
#<Mechanize::Page::Link "Q" "toc-Q.html">
#<Mechanize::Page::Link "R" "toc-R.html">
#<Mechanize::Page::Link "S" "toc-S.html">
#<Mechanize::Page::Link "T" "toc-T.html">
#<Mechanize::Page::Link "U" "toc-U.html">
#<Mechanize::Page::Link "V" "toc-V.html">
#<Mechanize::Page::Link "W" "toc-W.html">
#<Mechanize::Page::Link "X" "toc-X.html">
#<Mechanize::Page::Link "Y" "toc-Y.html">
#<Mechanize::Page::Link "Z" "toc-Z.html">
#<Mechanize::Page::Link "2011" "2011/">
#<Mechanize::Page::Link "2012" "2012/">
#<Mechanize::Page::Link "2013" "2013/">
#<Mechanize::Page::Link "2014" "2014/">
#<Mechanize::Page::Link "2015" "2015/">
#<Mechanize::Page::Link
"Federal Privacy Commissioner of Australia Complaint Determinations"
"http://www.austlii.edu.au/au/cases/cth/PrivCmrACD/">
#<Mechanize::Page::Link
"Australian Information Commissioner Case Notes (AICmrCN)"
"http://www.austlii.edu.au/au/cases/cth/AICmrCN/">
#<Mechanize::Page::Link "OAIC website" "http://www.oaic.gov.au">
#<Mechanize::Page::Link "Copyright Policy" "/austlii/copyright.html">
#<Mechanize::Page::Link "Disclaimers" "/austlii/disclaimers.html">
#<Mechanize::Page::Link "Privacy Policy" "/austlii/privacy.html">
#<Mechanize::Page::Link "Feedback" "/austlii/feedback.html">}
{forms}>
2.0.0-p353 :007 >
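For anyone wondering what this kind of filtering looks like from both sides, here is a rough sketch in Python using only the standard library. We can't see AustLII's server config, so the rule below (410 for anything whose user agent doesn't start with "Mozilla/") is purely an assumption, and `UAFilterHandler`, `fetch_status` and `run_demo` are made-up names for illustration. It runs a tiny local server with that rule and fetches it twice, once with Python's default user agent and once with the browser string from above.

```python
import threading
import urllib.error
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

# Same browser-like string used in the Mechanize session above.
BROWSER_UA = "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1)"


class UAFilterHandler(BaseHTTPRequestHandler):
    """Hypothetical stand-in for a server that rejects non-browser clients."""

    def do_GET(self):
        ua = self.headers.get("User-Agent", "")
        # Assumed rule: anything that does not look like a browser gets 410 Gone.
        status = 200 if ua.startswith("Mozilla/") else 410
        body = b"ok" if status == 200 else b"gone"
        self.send_response(status)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the demo quiet


def fetch_status(url, user_agent=None):
    """Return the HTTP status code, optionally sending a custom User-Agent."""
    headers = {"User-Agent": user_agent} if user_agent else {}
    request = urllib.request.Request(url, headers=headers)
    # Ignore any system proxy so the loopback request reaches the local server.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    try:
        with opener.open(request, timeout=5) as response:
            return response.status
    except urllib.error.HTTPError as error:
        return error.code  # urllib raises on 4xx/5xx; the code is what we want


def run_demo():
    """Start the local server, fetch twice, and return both status codes."""
    server = HTTPServer(("127.0.0.1", 0), UAFilterHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    url = "http://127.0.0.1:%d/" % server.server_address[1]
    try:
        return fetch_status(url), fetch_status(url, BROWSER_UA)
    finally:
        server.shutdown()
        server.server_close()
```

`run_demo()` returns the status for the default client first and the spoofed one second. Against this toy server the first request is refused and the second is served, which matches what wget and Mechanize saw on the real site.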
Spoof user agents? What the what?
I wonder if sending a spoofed User-Agent header would fix it?
Is that what you mean?
Great, using a header worked.
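For anyone finding this later, here is a minimal sketch of the header approach in Python. It uses the standard library's `urllib.request` so it has no dependencies; `build_request` and `fetch` are illustrative names, not anything from the thread, and the user agent string is simply the one from the Mechanize example. It is not necessarily the exact code that ended up in the scraper.

```python
import urllib.request

URL = "http://www.austlii.edu.au/au/cases/cth/AICmr/"
# Same user agent string used in the Mechanize example in this thread.
USER_AGENT = "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1)"


def build_request(url, user_agent=USER_AGENT):
    """Build a GET request that sends a browser-like User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch(url):
    """Fetch the page and return its body decoded as text."""
    with urllib.request.urlopen(build_request(url), timeout=30) as response:
        # Fall back to UTF-8 if the server does not declare a charset.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

Calling `fetch(URL)` should then return the index page HTML, assuming the site still filters on user agent the way it did here. With the requests library the equivalent is passing the same dictionary as `requests.get(url, headers={"User-Agent": USER_AGENT})`.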
This is great, since now I can put my code on Morph.
Winning.
Cheers, Henare