Koozali.org: home of the SME Server

Contribs.org Forums => General Discussion => Topic started by: ylluminate on October 12, 2011, 12:01:28 AM

Title: Proxy Setup Options?
Post by: ylluminate on October 12, 2011, 12:01:28 AM
We are presently doing a lot of scraping for some applications we run.  At the moment all of the data comes from government websites that allow free use of content, but they are blocking IPs every so often.  This is making it difficult to scrape the enormous amount of data that must be pulled down and reconstituted.

Two questions: 

Any input would certainly be appreciated.


-George
Title: Re: Proxy Setup Options?
Post by: cactus on October 12, 2011, 09:56:04 AM
Quote from: ylluminate on October 12, 2011, 12:01:28 AM
We are presently doing a lot of scraping for some applications we run.  Presently all of the data is coming from government websites that allow free use of content, but they are blocking IPs every so often.
Wouldn't it be better to contact them, find out why you are being blocked, and work out a more sensible scheme that suits you both?
There must be a reason for blocking your IPs.
Are you sure you are using their data in a properly licensed way?
Are you scraping all the data from scratch every time? If so, perhaps you can devise some sort of caching algorithm so you don't scrape everything again?
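To illustrate the caching idea, here is a minimal sketch in Python; it is not taken from anyone's actual scraper. The function name `cached_fetch`, the `fetch` callable, and the on-disk layout (one file per URL, named by a hash of the URL) are all assumptions made for the example. The point is only that a page already saved locally is never requested from the remote site again.

```python
import hashlib
from pathlib import Path
from typing import Callable


def cached_fetch(url: str, fetch: Callable[[str], bytes], cache_dir: Path) -> bytes:
    """Return the body for `url`, calling `fetch` only on a cache miss.

    `fetch` is a hypothetical download function supplied by the caller;
    `cache_dir` is where previously downloaded bodies are kept.
    """
    cache_dir.mkdir(parents=True, exist_ok=True)
    # Hash the URL so it is always safe to use as a file name.
    key = hashlib.sha256(url.encode("utf-8")).hexdigest()
    path = cache_dir / key
    if path.exists():
        # Cache hit: no request goes to the remote site.
        return path.read_bytes()
    # Cache miss: download once and keep a copy for next time.
    body = fetch(url)
    path.write_bytes(body)
    return body
```

With the standard library, `fetch` could be something like `lambda u: urllib.request.urlopen(u).read()`. A real scraper would probably also want an expiry rule for pages that change, which this sketch leaves out.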
Title: Re: Proxy Setup Options?
Post by: ylluminate on October 12, 2011, 10:06:34 AM
No, this government organization is one of those unreachable ones that doesn't reply to emails for some odd reason.

Yes, it is being used properly according to their licensing.

I have to start with an initial archival scrape to pull in all the old data. It's this initial scrape that's the bear, and we only have a few days to do it.  Otherwise, yes, we do have it programmed not to pull data again.