Using a Proxy with a Randomized User Agent in Python Requests

When conducting an extensive web scraping operation, or assessing your defenses against one, simulating a range of IP addresses and user agents is a useful technique.

In this post, I’ll describe how one might use the requests Python module along with a proxy server to make lots of requests in the most inconspicuous way possible.

Python requests Module

Installing is amazingly simple:

$ pip install requests

as is executing a simple GET request:

import requests

url = 'http://api.openweathermap.org/data/2.5/weather'
r = requests.get(url)
content = r.text  # that's all it takes!

Of course it’s quite simple to add parameters. Let’s get the weather in London and load the JSON as a Python object:

import requests

# want: http://api.openweathermap.org/data/2.5/weather?q=London,uk
url = 'http://api.openweathermap.org/data/2.5/weather'
params = {"q" : "London,uk"}
r = requests.get(url, params=params)
response = r.json()  # a Python dictionary
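Under the hood, requests URL-encodes the params dict into a query string and appends it to the URL. As an illustration of what that encoding looks like (this uses the standard library's urllib.parse, purely to show the result; requests handles it for you):

```python
from urllib.parse import urlencode

params = {"q": "London,uk"}
query = urlencode(params)  # the comma gets percent-encoded
full_url = "http://api.openweathermap.org/data/2.5/weather?" + query
```

Here `query` comes out as `q=London%2Cuk`, exactly the query string requests sends.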

Using Proxies

Proxies are a way to tell server P (the middleman) to contact server A and then route the response back to you. In more nefarious circles, they’re a prime way to hide your presence and pose as many clients to a website instead of just one. Websites will often block IPs that make too many requests, and proxies are one way people get around this. But even if you're only simulating an attack, you should know how it’s done.

With a proxy set up, you can route your requests through it with the following syntax:

import requests

# Proxy format:
proxy = {
    "http": "http://username:p3ssw0rd@proxyhost:port"  # replace proxyhost:port with your proxy's address
}

# now make the request
url = 'http://api.openweathermap.org/data/2.5/weather'
params = {"q" : "London,uk"}
r = requests.get(url, proxies=proxy, params=params)

Proxies will often require authentication so that random people on the internet can’t swamp them with requests and get the pool’s IPs banned by popular search engines like Google or Yahoo.
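Rather than assembling the authenticated URL by hand each time, one might wrap the format in a small helper (`build_proxies` is my own hypothetical name, not part of requests):

```python
def build_proxies(username, password, host, port):
    """Return a requests-style proxies dict with basic auth embedded in the proxy URL."""
    proxy_url = "http://{}:{}@{}:{}".format(username, password, host, port)
    # requests picks the entry whose key matches the scheme of the target URL
    return {"http": proxy_url, "https": proxy_url}

proxy = build_proxies("username", "p3ssw0rd", "proxyhost", 8080)
```

Mapping both "http" and "https" to the same proxy URL just means the proxy is used regardless of the target URL's scheme.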

If you want to find a proxy, just search online – most prices are reasonable, and providers typically offer a decent-sized pool of IP addresses in various countries. Search engine queries often cost more, since a lot of folks use them for SEO analytics, and naturally these proxy providers don’t want their precious IPs banned there.

Randomizing Your User Agent

The final step is randomizing a user agent. Amazingly, the only thing that tells a server what kind of client is making a request (a particular browser, or a script) is a string called a “user agent” which is included in the HTTP request headers.

A raw HTTP request to python.org from my Chrome browser looks like this:

GET / HTTP/1.1
Host: www.python.org
Connection: keep-alive
Cache-Control: max-age=0
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.71 Safari/537.36
Accept-Encoding: gzip, deflate, sdch
Accept-Language: en-US,en;q=0.8

There’s a lot going on here, but the important thing to note is that each line is a header with a key and a value, separated by a colon. Luckily we can have requests manage all of this for us.
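To see how a headers dict maps onto those raw lines, here is a tiny illustrative helper (pure standard library, not something requests exposes) that renders a request line and headers the way they appear on the wire:

```python
def render_request(method, path, headers):
    """Format a request line plus a header dict as raw HTTP/1.1 text."""
    lines = ["{} {} HTTP/1.1".format(method, path)]
    for key, value in headers.items():
        # each header is "Key: Value", lines separated by CRLF
        lines.append("{}: {}".format(key, value))
    return "\r\n".join(lines) + "\r\n\r\n"  # blank line ends the headers

raw = render_request("GET", "/", {"Host": "www.python.org", "Connection": "keep-alive"})
```

When you pass a `headers` dict to `requests.get`, the library does essentially this formatting for you.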

Let’s make a function that loads a list of ~900 user agents like these

"Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; WOW64; Trident/6.0)"
"Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; Trident/6.0)"
"Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; Trident/5.0)"

in order to select one at random.

import random

def LoadUserAgents(uafile="user_agents.txt"):
    """
    uafile : string
        path to text file of user agents, one per line
    """
    uas = []
    with open(uafile) as uaf:
        for ua in uaf.readlines():
            if ua:
                uas.append(ua.strip().strip('"'))
    random.shuffle(uas)
    return uas

# load the user agents, in random order
user_agents = LoadUserAgents(uafile="user_agents.txt")

Putting It All Together

Let’s grab a random user agent and make requests through a proxy.

import random
import requests

proxy = {"http": "http://username:p3ssw0rd@proxyhost:port"}  # replace proxyhost:port with your proxy's address
url = 'http://api.openweathermap.org/data/2.5/weather'
params = {"q" : "London,uk"}

# load user agents and set headers
uas = LoadUserAgents()
ua = random.choice(uas)  # select a random user agent
headers = {
    "Connection" : "close",  # another way to cover tracks
    "User-Agent" : ua}

# make the request
r = requests.get(url, proxies=proxy,
                 params=params, headers=headers)

Pretty painless!
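For a scraping loop, you’d want a fresh random user agent on every request. One way to sketch that is a helper that builds the keyword arguments for requests.get each time (`request_kwargs` is a hypothetical name of mine, not part of requests):

```python
import random

def request_kwargs(user_agents, proxy):
    """Build keyword arguments for requests.get with a freshly chosen User-Agent."""
    return {
        "proxies": proxy,
        "headers": {
            "Connection": "close",  # don't reuse the connection between requests
            "User-Agent": random.choice(user_agents),
        },
    }

kwargs = request_kwargs(["ua-a", "ua-b"], {"http": "http://proxyhost:8080"})
# usage: r = requests.get(url, params=params, **request_kwargs(user_agents, proxy))
```

Each call picks a new agent from the pool, so repeated requests don’t all present the same fingerprint.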