When conducting an extensive web scraping operation, or assessing your defenses against one, simulating many different IP addresses and user agents is a useful technique.
In this post, I'll describe how to use the requests Python module along with one or more proxy servers to make lots of requests as inconspicuously as possible.
Python requests Module
Installing is amazingly simple:
$ pip install requests
as is executing a simple GET
request:
import requests
url = 'http://api.openweathermap.org/data/2.5/weather'
r = requests.get(url)
content = r.text # that's all it takes!
Of course it’s quite simple to add parameters. Let’s get the weather in London and load the JSON as a Python object:
import requests
# want: http://api.openweathermap.org/data/2.5/weather?q=London,uk
url = 'http://api.openweathermap.org/data/2.5/weather'
params = {"q" : "London,uk"}
r = requests.get(url, params=params)
response = r.json() # a Python dictionary
Using Proxies
Proxies are a way to tell server P (the middleman) to contact server A and then route the response back to you. In more nefarious circles, it’s a prime way to hide your presence and appear to a website as many clients instead of just one. Websites will often block IPs that make too many requests, and proxies are one way people get around this. But even if you’re only simulating an attack, you should know how it’s done.
With a proxy set up, you can route your requests through it with the following syntax:
import requests
#####
# Proxy format:
# http://<USERNAME>:<PASSWORD>@<IP-ADDR>:<PORT>
#####
proxy = {
    "http": "http://username:p3ssw0rd@10.10.1.10:3128",
}
# now make the request
url = 'http://api.openweathermap.org/data/2.5/weather'
params = {"q" : "London,uk"}
r = requests.get(url, proxies=proxy, params=params)
Proxies often require authentication so that random people on the internet don’t swamp them with requests and get the proxy pool poisoned or banned by popular search engines like Google or Yahoo.
If you want to find a proxy, just search online; most prices are reasonable, and providers offer a decent-sized pool of IP addresses in various countries. Search engine queries often cost more, since a lot of folks use them for SEO analytics and the proxy providers naturally don’t want their precious IPs banned there.
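If you do have several proxies available, rotating among them spreads your requests across multiple IPs. Here is a minimal sketch; the proxy URLs below are placeholders, not real endpoints:
import random
import requests

# hypothetical pool of authenticated proxies -- substitute your own
PROXY_POOL = [
    "http://username:p3ssw0rd@10.10.1.10:3128",
    "http://username:p3ssw0rd@10.10.1.11:3128",
    "http://username:p3ssw0rd@10.10.1.12:3128",
]

def random_proxy():
    """Pick a proxy at random and format it for requests."""
    return {"http": random.choice(PROXY_POOL)}

url = 'http://api.openweathermap.org/data/2.5/weather'
params = {"q": "London,uk"}
r = requests.get(url, proxies=random_proxy(), params=params)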
Randomizing Your User Agent
The final step is randomizing the user agent. Amazingly, the only thing that tells a server what kind of client is making the request (a particular browser, a script, etc.) is a string called the “user agent” included in the HTTP request headers.
A raw HTTP request to python.org
from my Chrome browser looks like this:
GET / HTTP/1.1
Host: www.python.org
Connection: keep-alive
Cache-Control: max-age=0
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.71 Safari/537.36
Accept-Encoding: gzip, deflate, sdch
Accept-Language: en-US,en;q=0.8
There’s a lot going on here, but the important point is that each line after the first is a header with a key and a value, separated by a colon. Luckily we can have requests
manage all of this for us.
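For instance, overriding just the user agent only takes a headers dictionary; the string below is one I picked arbitrarily for illustration:
import requests

# pass whatever headers we like; requests fills in the rest of the request
headers = {"User-Agent": "Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; Trident/6.0)"}
r = requests.get('http://www.python.org', headers=headers)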
Let’s make a function that loads a list of ~900 user agents, each looking something like
"Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; WOW64; Trident/6.0)"
"Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; Trident/6.0)"
"Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; Trident/5.0)"
...
in order to select one at random.
import random

USER_AGENTS_FILE = "user_agents.txt"

def LoadUserAgents(uafile=USER_AGENTS_FILE):
    """
    uafile : string
        path to text file of user agents, one per line
    """
    uas = []
    with open(uafile, 'r') as uaf:
        for ua in uaf:
            ua = ua.strip()
            if ua:
                # drop the surrounding quotes
                uas.append(ua.strip('"'))
    random.shuffle(uas)
    return uas
# load the user agents, in random order
user_agents = LoadUserAgents(uafile="user_agents.txt")
Putting It All Together
Let’s grab a random user agent and make requests through a proxy.
import random
import requests
proxy = {"http": "http://username:p3ssw0rd@10.10.1.10:3128"}
url = 'http://api.openweathermap.org/data/2.5/weather'
params = {"q" : "London,uk"}
# load user agents and set headers
uas = LoadUserAgents()
ua = random.choice(uas) # select a random user agent
headers = {
    "Connection": "close",  # another way to cover tracks
    "User-Agent": ua}
# make the request
r = requests.get(url, proxies=proxy,
                 params=params, headers=headers)
Pretty painless!
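From here it’s a small step to a loop that rotates both pieces on every request. A rough sketch, reusing LoadUserAgents and user_agents.txt from above and the same placeholder proxies:
import random
import time
import requests

url = 'http://api.openweathermap.org/data/2.5/weather'
params = {"q": "London,uk"}

# hypothetical pool of proxies -- substitute your own
proxies = [
    {"http": "http://username:p3ssw0rd@10.10.1.10:3128"},
    {"http": "http://username:p3ssw0rd@10.10.1.11:3128"},
]
uas = LoadUserAgents(uafile="user_agents.txt")

for _ in range(10):
    headers = {"Connection": "close",
               "User-Agent": random.choice(uas)}
    r = requests.get(url, proxies=random.choice(proxies),
                     params=params, headers=headers)
    time.sleep(random.uniform(1, 5))  # pause so the requests aren't perfectly regular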