I’ve been doing quite a bit of scraping, not using Morph I’ll admit.
I just wanted to throw this topic open, on the ethics of scraping.
I generally think it’s good to add a delay between requests to a single server if you’re scraping several pages.
Otherwise you risk slowing the site down for everyone else.
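For what it’s worth, here’s a minimal sketch of the kind of throttling I mean. The helper name and the delay value are my own choices, and you’d pair it with whatever HTTP library you use (Requests, say):

```python
import time

def throttled(items, delay=2.0):
    """Yield items, sleeping `delay` seconds between each,
    so consecutive requests to one server are spaced out."""
    first = True
    for item in items:
        if not first:
            time.sleep(delay)
        first = False
        yield item

# Hypothetical usage with Requests:
# for url in throttled(urls, delay=2.0):
#     resp = requests.get(url, timeout=10)
```

Two seconds is arbitrary; the polite value depends on the site and how many pages you’re pulling.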
I just wondered if people were identifying themselves in their scrapers?
This article presents a compelling argument: if we want governments to be open and transparent, then our scraping should be transparent too.
Therefore, the article suggests:
- Identifying yourself in the User-Agent header (using Python Requests in my case).
- Checking robots.txt for permission.
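To make both points concrete, here’s a rough sketch of what I have in mind. The agent name, contact address, and sample robots.txt are all made up; the parsing uses the standard library’s `urllib.robotparser`, and the fetch itself would go through Requests:

```python
from urllib import robotparser

# Identify yourself (name and contact are hypothetical):
HEADERS = {"User-Agent": "my-scraper/1.0 (contact: me@example.org)"}

def allowed_by_robots(robots_txt, agent, url):
    """Check a robots.txt body for permission to fetch `url` as `agent`."""
    rp = robotparser.RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return rp.can_fetch(agent, url)

# A made-up robots.txt that blocks /private/ for all agents:
sample = "User-agent: *\nDisallow: /private/\n"
print(allowed_by_robots(sample, "my-scraper", "https://example.gov/data"))       # True
print(allowed_by_robots(sample, "my-scraper", "https://example.gov/private/x"))  # False

# In a real run you'd fetch robots.txt first, then the page, e.g.:
# robots = requests.get("https://example.gov/robots.txt", headers=HEADERS).text
# if allowed_by_robots(robots, "my-scraper", page_url):
#     page = requests.get(page_url, headers=HEADERS)
```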
I’m of two minds about this, since it should be up to me how I choose to consume HTML.
Do others have thoughts on these issues?
Unless there’s a very good reason to, I don’t modify the user agent. This allows the site owner to identify my scraping. I’ve never checked robots.txt but I notice it’s an option I could turn on in the library I mainly use.
In practice I’ve never found this to be an issue. But then again that could be because the type of data I’m most commonly opening should be open anyway. I think it’s a lot more problematic if you’re scraping for, say, commercial gain or competitive advantage (which I know happens a lot - there are companies devoted to exactly this).
True, regarding government data.
I should have pointed out I was talking about government data - I would actually ask a business before I scraped their site.
I know there’s nothing technically different between scraping and web browsing, but content owners might see it differently.
I was actually thinking: if governments don’t like being scraped, they should provide APIs.
Hackers gonna hack.