Python requests gets an error, but Firefox works?

Hey guys

I’m scraping an AustLII database for a particular kind of legal decision.
http://www.austlii.edu.au/

With Python requests, the server returns this:

'The requested resource is no longer available on this server and there is no forwarding address. Please remove all references to this resource.'

I thought this might be a dynamic page, so I tested it in a headless web browser called Spynner, but that did not work either.

So I have ended up using Selenium WebDriver to pass the request through Firefox before it’s sent to the server.
It works, and I have no idea why.

Any thoughts?

Is it the home page you’re scraping or some other page?

All the AustLII pages do that. I’m scraping this series of decisions in particular: http://www.austlii.edu.au/au/cases/cth/AICmr/

Wow, I get the same with Ruby Mechanize:

2.0.0-p353 :003 > agent.get "http://www.austlii.edu.au/au/cases/cth/AICmr/"
Mechanize::ResponseCodeError: 410 => Net::HTTPGone for http://www.austlii.edu.au/au/cases/cth/AICmr/ -- unhandled response
    from /home/henare/.rvm/gems/ruby-2.0.0-p353/gems/mechanize-2.7.3/lib/mechanize/http/agent.rb:308:in `fetch'
    from /home/henare/.rvm/gems/ruby-2.0.0-p353/gems/mechanize-2.7.3/lib/mechanize.rb:440:in `get'
    from (irb):3
    from /home/henare/.rvm/rubies/ruby-2.0.0-p353/bin/irb:12:in `<main>'
2.0.0-p353 :004 > 

And even wget!

$ wget http://www.austlii.edu.au/au/cases/cth/AICmr/
--2015-11-12 13:46:01--  http://www.austlii.edu.au/au/cases/cth/AICmr/
Resolving www.austlii.edu.au (www.austlii.edu.au)... 138.25.65.22
Connecting to www.austlii.edu.au (www.austlii.edu.au)|138.25.65.22|:80... connected.
HTTP request sent, awaiting response... 410 Gone
2015-11-12 13:46:02 ERROR 410: Gone.
$

They’re checking the User-Agent header. Spoof it:

2.0.0-p353 :005 > agent.user_agent = "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1)"
 => "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1)" 
2.0.0-p353 :006 > agent.get "http://www.austlii.edu.au/au/cases/cth/AICmr/"
 => #<Mechanize::Page
 {url
  #<URI::HTTP:0x00000004f9a160 URL:http://www.austlii.edu.au/au/cases/cth/AICmr/>}
 {meta_refresh}
 {title "Australian Information Commissioner"}
 {iframes}
 {frames}
 {links
  #<Mechanize::Page::Link "AustLII" "/">
  #<Mechanize::Page::Link "Home" "/">
  #<Mechanize::Page::Link "Databases" "/databases.html">
  #<Mechanize::Page::Link "WorldLII" "http://www.worldlii.org">
  #<Mechanize::Page::Link "Search" "/forms/search1.html">
  #<Mechanize::Page::Link "Feedback" "/austlii/feedback.html">
  #<Mechanize::Page::Link "Help" "/austlii/help/">
  #<Mechanize::Page::Link "AustLII" "/">
  #<Mechanize::Page::Link "Databases" "/databases.html">
  #<Mechanize::Page::Link
   "Database Search"
   "/form/search1.html?mask=au/cases/cth/AICmr">
  #<Mechanize::Page::Link
   "Name Search"
   "/form/search1.html?mask=au/cases/cth/AICmr&title=1">
  #<Mechanize::Page::Link "Recent Decisions" "recent.html">
  #<Mechanize::Page::Link "Help" "/austlii/help/cases.html">
  #<Mechanize::Page::Link "A" "toc-A.html">
  #<Mechanize::Page::Link "B" "toc-B.html">
  #<Mechanize::Page::Link "C" "toc-C.html">
  #<Mechanize::Page::Link "D" "toc-D.html">
  #<Mechanize::Page::Link "E" "toc-E.html">
  #<Mechanize::Page::Link "F" "toc-F.html">
  #<Mechanize::Page::Link "G" "toc-G.html">
  #<Mechanize::Page::Link "H" "toc-H.html">
  #<Mechanize::Page::Link "I" "toc-I.html">
  #<Mechanize::Page::Link "J" "toc-J.html">
  #<Mechanize::Page::Link "K" "toc-K.html">
  #<Mechanize::Page::Link "L" "toc-L.html">
  #<Mechanize::Page::Link "M" "toc-M.html">
  #<Mechanize::Page::Link "N" "toc-N.html">
  #<Mechanize::Page::Link "O" "toc-O.html">
  #<Mechanize::Page::Link "P" "toc-P.html">
  #<Mechanize::Page::Link "Q" "toc-Q.html">
  #<Mechanize::Page::Link "R" "toc-R.html">
  #<Mechanize::Page::Link "S" "toc-S.html">
  #<Mechanize::Page::Link "T" "toc-T.html">
  #<Mechanize::Page::Link "U" "toc-U.html">
  #<Mechanize::Page::Link "V" "toc-V.html">
  #<Mechanize::Page::Link "W" "toc-W.html">
  #<Mechanize::Page::Link "X" "toc-X.html">
  #<Mechanize::Page::Link "Y" "toc-Y.html">
  #<Mechanize::Page::Link "Z" "toc-Z.html">
  #<Mechanize::Page::Link "2011" "2011/">
  #<Mechanize::Page::Link "2012" "2012/">
  #<Mechanize::Page::Link "2013" "2013/">
  #<Mechanize::Page::Link "2014" "2014/">
  #<Mechanize::Page::Link "2015" "2015/">
  #<Mechanize::Page::Link
   "Federal Privacy Commissioner of Australia Complaint Determinations"
   "http://www.austlii.edu.au/au/cases/cth/PrivCmrACD/">
  #<Mechanize::Page::Link
   "Australian Information Commissioner Case Notes (AICmrCN)"
   "http://www.austlii.edu.au/au/cases/cth/AICmrCN/">
  #<Mechanize::Page::Link "OAIC website" "http://www.oaic.gov.au">
  #<Mechanize::Page::Link "Copyright Policy" "/austlii/copyright.html">
  #<Mechanize::Page::Link "Disclaimers" "/austlii/disclaimers.html">
  #<Mechanize::Page::Link "Privacy Policy" "/austlii/privacy.html">
  #<Mechanize::Page::Link "Feedback" "/austlii/feedback.html">}
 {forms}>
 
2.0.0-p353 :007 > 

Spoof user agents? What the what?
I wonder if sending a spoofed User-Agent header would fix it?
Is that what you mean?

Great, setting a User-Agent header worked.
This is great, since now I can put my code on Morph.
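
For anyone landing here later, here is a minimal sketch of the fix in Python. It uses the standard library's urllib.request rather than requests so it runs without extra installs. The names build_request, fetch and BROWSER_UA are made up for this example, and the User-Agent string is simply the one from the Mechanize session above; whether other strings pass the server's check is an assumption I have not tested.

```python
import urllib.request

# Assumption: the browser-like string from the Mechanize session above.
# Other browser User-Agent strings may work too, but that is untested here.
BROWSER_UA = "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1)"


def build_request(url, user_agent=BROWSER_UA):
    """Return a Request object that carries a browser-like User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch(url, user_agent=BROWSER_UA, timeout=30):
    """Send the request with the spoofed header and return the decoded body."""
    request = build_request(url, user_agent)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        # Fall back to UTF-8 if the server does not declare a charset.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

Calling fetch("http://www.austlii.edu.au/au/cases/cth/AICmr/") should then return the page HTML instead of raising on the 410. With requests, the equivalent is to pass the same dictionary as the headers argument, e.g. requests.get(url, headers={"User-Agent": BROWSER_UA}).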

Winning.
Cheers, Henare
