Koozali.org: home of the SME Server

Proxy Setup Options?

ylluminate

« on: October 12, 2011, 12:01:28 AM »
We are currently doing a lot of scraping for some applications we run.  At present all of the data comes from government websites that allow free use of their content, but they block our IPs every so often.  This makes it difficult to scrape the enormous amount of data that must be pulled down and reconstituted.

Two questions: 
  • What are the "best" and most reliable proxy services available that we could use?
  • What options exist for setting up SME Server to use such a service and act as a gateway, so that we can simply direct specific traffic through these proxies?

Any input would certainly be appreciated.


-George

Offline cactus

Re: Proxy Setup Options?
« Reply #1 on: October 12, 2011, 09:56:04 AM »
"We are currently doing a lot of scraping for some applications we run.  At present all of the data comes from government websites that allow free use of their content, but they block our IPs every so often."
Wouldn't it be better to contact them, find out why you are being blocked, and work out a more sensible scheme that suits you both?
There must be a reason for blocking your IPs.
Are you sure you are using their data in a properly licensed way?
Are you scraping all the data from scratch every time? If so, perhaps you can devise some sort of caching algorithm so that you do not scrape everything again?
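To make the caching idea concrete, here is a minimal sketch in Python of one way it could look. It is not anyone's actual scraper: the names `CACHE_DIR`, `cache_path`, `download` and `fetch_cached` are made up for illustration, and it assumes every page can be keyed by its URL and that a fixed pause before each real request is acceptable to the remote site.

```python
import hashlib
import time
import urllib.request
from pathlib import Path

# Hypothetical cache location; any writable directory would do.
CACHE_DIR = Path("scrape_cache")


def cache_path(url, cache_dir=CACHE_DIR):
    """Map a URL to a file name that is safe on any filesystem."""
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()
    return Path(cache_dir) / digest


def download(url):
    """Fetch the raw bytes of a page over the network."""
    with urllib.request.urlopen(url, timeout=30) as response:
        return response.read()


def fetch_cached(url, cache_dir=CACHE_DIR, fetch=download, delay=0.0):
    """Return the page body, hitting the network only on a cache miss."""
    path = cache_path(url, cache_dir)
    if path.exists():
        # Already scraped on an earlier run: no request is sent at all.
        return path.read_bytes()
    time.sleep(delay)  # pause before each real request to spare the server
    body = fetch(url)
    path.parent.mkdir(parents=True, exist_ok=True)
    # Write to a temporary file first so an interrupted run never
    # leaves a half-written page in the cache.
    tmp = path.with_suffix(".tmp")
    tmp.write_bytes(body)
    tmp.replace(path)
    return body
```

With something along these lines, an interrupted archival run can be restarted without requesting any page twice, and calling `fetch_cached(url, delay=2.0)` spaces out the requests that do go out. The right delay is a guess and would depend on what the site tolerates.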

Offline ylluminate

Re: Proxy Setup Options?
« Reply #2 on: October 12, 2011, 10:06:34 AM »
No, this government organization is one of those unreachable ones that doesn't reply to emails for some odd reason.

Yes, it is being used properly according to their licensing.

I have to start with an initial archival scrape to pull in all the old data, so it's this initial scrape that's the bear, and we only have a few days to do it.  Otherwise, yes, we do have it programmed not to pull data again.