With the shit that's going down in Hong Kong, Iran and dozens of other places around the world, VPNs and proxies are in high demand. Wouldn't it be great if we had a free unlimited supply of web proxies? Googling around for some free ones, I mostly found sketchy websites with sketchy paid plans, and a couple of Pastebin links with massive lists of IPs and ports...
But before we even try to automatically search Pastebin for new proxy lists, we have to make sure that we can test proxies in the first place. Thankfully, .NET comes with full proxy support and you can use it in a single line!
I've made a tester and put it on GitHub. I call it SockPuppet because it tests SOCKS web proxies.
My version is a simple Producer-Consumer setup that only keeps twice as many proxies in memory as it has worker threads, in case the lists get really big after running the scraper for a few weeks or months. I've annotated the code and encourage you to quickly read over it.
Proxies here is a BlockingCollection<Proxy>. If you've never heard of BlockingCollection<T>, it's kinda like a threadsafe List<T> that behaves like a ConcurrentQueue<T>. You foreach over its GetConsumingEnumerable() just like a List, and every iteration the current item is removed from the collection. That being said, such a foreach loop will *never return* unless someone calls CompleteAdding(). The *Collection* is going to start *Blocking* when it's empty - which is perfect for the consumer workloop of our Producer/Consumer setup.
It can also be made to block on Add(T item)! Usually we'd need a lock of some kind to keep the 'queue' from growing too big. Simply instantiate it with the overload that accepts an int for boundedCapacity like so:
var blockingColl = new BlockingCollection<T>(10);
and blockingColl.Add() will block as soon as there are 10 elements in it.
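To see both kinds of blocking in one place, here is a minimal sketch. `BoundedQueueDemo` and its parameters are made up for illustration and are not part of SockPuppet; it just pushes integers through a bounded collection and counts what comes out the other end.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading;

public static class BoundedQueueDemo
{
    // Pushes itemCount integers through a bounded BlockingCollection using
    // consumerCount consumer threads and returns how many items were consumed.
    public static int Run(int itemCount, int capacity, int consumerCount)
    {
        using (var queue = new BlockingCollection<int>(capacity))
        {
            int consumed = 0;
            var consumers = new List<Thread>();

            for (int i = 0; i < consumerCount; i++)
            {
                var thread = new Thread(() =>
                {
                    // Blocks while the queue is empty; the loop only ends once
                    // CompleteAdding() was called and the queue is drained.
                    foreach (var item in queue.GetConsumingEnumerable())
                        Interlocked.Increment(ref consumed);
                });
                thread.Start();
                consumers.Add(thread);
            }

            for (int i = 0; i < itemCount; i++)
                queue.Add(i); // blocks while 'capacity' items are already waiting

            queue.CompleteAdding(); // tell the consumers nothing else is coming

            foreach (var thread in consumers)
                thread.Join();

            return consumed;
        }
    }
}
```

Calling `BoundedQueueDemo.Run(100, 10, 4)` never holds more than 10 items in the queue at once, and every item is handed to exactly one consumer.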
    private static void WorkLoop()
    {
        // threads will be blocked at Proxies.GetConsumingEnumerable()
        // as long as everything INSIDE the foreach loop is threadsafe
        // we can throw as many threads on it as our network can handle
        foreach (var proxy in Proxies.GetConsumingEnumerable())
        {
            proxy.Test(); // connect to proxy, download website
            if (proxy.Alive && proxy.Safe) // Safe = Identical response as without proxy
                Writer?.WriteLine(proxy); // write to output file

            // write to stdout, IP starts at position 8
            Console.WriteLine($"{(proxy.Alive ? "[ Up! ]" : "[Down!]")}{proxy}");
            Trace(proxy);
        }
    }
Please note that Writer is a StreamWriter and is *not* threadsafe. A single Console.WriteLine call is, but as soon as you start changing colors, output from different threads can end up in the wrong color.
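One way to deal with both problems is to funnel all output through a single lock. The sketch below is an assumption about how that could look, not what SockPuppet currently does; `SafeOutput`, `WriteLine` and `WriteStatus` are hypothetical names.

```csharp
using System;
using System.IO;

public sealed class SafeOutput
{
    private readonly TextWriter _writer;
    private readonly object _gate = new object();

    public SafeOutput(TextWriter writer)
    {
        _writer = writer;
    }

    // Only one thread at a time gets past the lock, so lines never interleave.
    public void WriteLine(string line)
    {
        lock (_gate)
            _writer.WriteLine(line);
    }

    // Color change, write and reset happen as one unit under the same lock.
    public void WriteStatus(bool alive, string text)
    {
        lock (_gate)
        {
            Console.ForegroundColor = alive ? ConsoleColor.Green : ConsoleColor.Red;
            Console.WriteLine((alive ? "[ Up! ]" : "[Down!]") + text);
            Console.ResetColor();
        }
    }
}
```

If the file is all you care about, wrapping the StreamWriter with the built-in `TextWriter.Synchronized(writer)` should give you a threadsafe writer without a custom class.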
    private static void StartThreads(int threadCount)
    {
        Threads = new Thread[threadCount];
        for (int i = 0; i < threadCount; i++)
        {
            Threads[i] = new Thread(WorkLoop); // Target the method above

            // IsBackground = true
            // if the main thread exits kill this thread,
            // don't wait for it to exit and keep a zombie process running
            Threads[i].IsBackground = true;
            Threads[i].Start();
        }
    }
We're having an easy time again: the parser will be a breeze!
The go-to format you find on Pastebin is 'IP:PORT' like so:
    222.252.25.168:8080
    178.200.170.41:80
    50.197.38.230:60724
    ...

Let's write a parser that only reads a sane number of lines ahead while the `WorkLoop` checks them on `N` threads: great performance, great resource utilization. We don't waste RAM or sacrifice startup time by reading everything up front. Instead we do things on demand with a healthy buffer.
    while (!Reader.EndOfStream) // while there is shit to read
    {
        // always trim your lines!
        var line = Reader.ReadLine().Trim();
        var parts = line.Split(':'); // naive implementation
        var ip = parts[0]; // expecting only valid data
        var port = ushort.Parse(parts[1]); // (╯°□°)╯︵ ┻━┻
        var proxy = new Proxy(ip, port, timeout);
        Proxies.Add(proxy); // BlockingCollection<Proxy>
    }
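Pastebin dumps are rarely clean, so a single malformed line will make the naive version throw. Here's a sketch of a more forgiving variant that skips bad lines instead. `ProxyLineParser` is a hypothetical helper, and it assumes IPv4 'IP:PORT' lines only, since an IPv6 address contains colons itself.

```csharp
using System;
using System.Net;

public static class ProxyLineParser
{
    // Returns true and fills ip/port only for well-formed "IP:PORT" lines.
    public static bool TryParse(string line, out string ip, out ushort port)
    {
        ip = null;
        port = 0;

        if (string.IsNullOrWhiteSpace(line))
            return false;

        var parts = line.Trim().Split(':');
        if (parts.Length != 2)
            return false;

        // IPAddress.TryParse rejects text that isn't an address at all
        IPAddress address;
        if (!IPAddress.TryParse(parts[0], out address))
            return false;

        // ushort.TryParse rejects non-numbers and anything above 65535
        if (!ushort.TryParse(parts[1], out port) || port == 0)
        {
            port = 0;
            return false;
        }

        ip = parts[0];
        return true;
    }
}
```

In the read loop you would then only call `Proxies.Add(new Proxy(ip, port, timeout))` when `TryParse` returns true, and simply move on to the next line otherwise.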
Now, testing the Proxy is super easy. WebProxy is a built-in class, just like HttpWebRequest. Put both together and we're virtually done.
    public void Test()
    {
        try
        {
            var request = (HttpWebRequest)WebRequest.Create("https://her.st");
            var proxy = new WebProxy(IP, Port);
            proxy.BypassProxyOnLocal = false;
            request.Proxy = proxy;
            request.UserAgent = "ProxyTester Version: 1";
            request.Timeout = Timeout;

            // dispose response and reader so the connection gets released
            using (WebResponse webResponse = request.GetResponse())
            using (var reader = new StreamReader(webResponse.GetResponseStream()))
                _response = reader.ReadToEnd().Trim();

            Alive = true; // proxy is working
        }
        catch
        {
            Alive = false; // proxy is not working
        }
    }
It would make sense to do multiple rounds of sorting: by latency, by throughput, by country. We will focus on sorting by country first. There are hundreds of 'free' APIs for 'Geo-IP' lookups, but there's always one catch: after a certain number of queries, they ask for your credit card.
How do those services do it? I've done some googling and as it turns out, there are free databases available, most notably the IP2Location databases.
We will use LITE-DB5, which has a C# parser by the Taiwanese Sky Land Universal Corporation, licensed under the Unlicense. Perfect. Let's legally steal their code and hook it up.
    private static void Trace(Proxy proxy)
    {
        var location = Locator.Locate(IPAddress.Parse(proxy.IP));
        if (!OrderedProxies.ContainsKey(location.Country))
            OrderedProxies.Add(location.Country, new List<Proxy>());
        OrderedProxies[location.Country].Add(proxy);
    }
As you can see in the code above, I created a Dictionary<string, List<Proxy>> named OrderedProxies, keyed by country, which lets me group the proxies by country very easily. In the next part, I hope to have had time to refactor and rename most of the worst variable and class names. I've been facepalming too much while writing this article. I am fully aware that the Trace method makes the entire producer/consumer setup pointless.
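Since Trace is called from every worker thread and a plain Dictionary is not threadsafe, one possible fix is to swap in the concurrent collections. This is only a sketch of the idea: `CountryIndex` is a hypothetical class, and it stores proxies as plain strings to stay self-contained.

```csharp
using System.Collections.Concurrent;
using System.Collections.Generic;

public sealed class CountryIndex
{
    private readonly ConcurrentDictionary<string, ConcurrentBag<string>> _byCountry =
        new ConcurrentDictionary<string, ConcurrentBag<string>>();

    // Safe to call from many threads at once.
    public void Add(string country, string proxy)
    {
        // GetOrAdd hands every thread the same bag for a given country,
        // even if two threads race to create it.
        _byCountry.GetOrAdd(country, _ => new ConcurrentBag<string>()).Add(proxy);
    }

    // Number of proxies recorded for a country, 0 if the country is unknown.
    public int Count(string country)
    {
        ConcurrentBag<string> bag;
        return _byCountry.TryGetValue(country, out bag) ? bag.Count : 0;
    }

    public IEnumerable<string> Countries
    {
        get { return _byCountry.Keys; }
    }
}
```

A ConcurrentBag does not preserve insertion order, which should be fine here because the proxies get sorted later anyway.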
There's still a lot to be added to this little service. Off the top of my head:
see you soon