Faster HTTP Download
HTTP Download
Even though there are dedicated protocols for file transfer (FTP/SFTP), most content on the internet is served and transferred over HTTP. The data is sent back as an HTTP response body. The RFC defines no size limit for it, but in practice it is bounded by the integer Content-Length field in the HTTP header and by limits that servers sometimes impose.
As the name suggests, the protocol was designed to send text data, and using it to transmit binary file content is somewhat of an overload in my opinion.
This article is about exploring how to use HTTP download more efficiently.
All code uses a 1 GB file hosted here for download testing purposes and saves it to disk as 1GB.zip.
url = 'http://212.183.159.230/1GB.zip'
filename = '1GB.zip'
Simple Download
Doing a simple curl or wget for a large file served over HTTP opens a keep-alive connection that either streams the data or waits for the complete data to arrive at the application layer, where it is kept in memory and then handed over to the disk. At the transport level (TCP) it is just one connection, which is limited by the window size: only that much data can be in flight at a time, and the next chunk is sent once the previous one is acknowledged.
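To make the window limit concrete, here is a rough back-of-the-envelope sketch. It assumes the simplified model described above, where a connection can have at most one window of data in flight per round trip. The function name and the window and round-trip values are illustrative assumptions, not measurements from this article.

```python
def max_throughput_bytes_per_sec(window_bytes: int, rtt_seconds: float) -> float:
    """Upper bound for one TCP connection: at most one window per round trip."""
    if rtt_seconds <= 0:
        raise ValueError("rtt_seconds must be positive")
    return window_bytes / rtt_seconds


# Illustrative numbers only (assumed, not measured for this article).
window = 64 * 1024  # 64 KiB receive window
rtt = 0.1           # 100 ms round-trip time

single = max_throughput_bytes_per_sec(window, rtt)
print(f"one connection:   {single / 1024:.0f} KiB/s")
print(f"four connections: {4 * single / 1024:.0f} KiB/s")
```

Under this model the bound scales with the number of connections, which is the motivation for the parallel range download later on. Real stacks use window scaling and congestion control, so treat the numbers as an intuition aid rather than a prediction.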
A simple Python script using requests can be written to simulate what curl or wget does.
import requests

# Fetch the whole body into memory, then write it out in one go
response = requests.get(url)
with open(filename, 'wb') as fp:
    fp.write(response.content)
This keeps all the data in memory and only then writes it to disk.
Using Streams
Memory can become a limiting factor when handling large files in this fashion. Most HTTP libraries mitigate that by providing a stream handler: as soon as a chunk of a predefined size has been received, control returns to the application, which handles the data and then asks the server for more. This reduces the memory requirement but significantly increases the number of disk write system calls and the time taken to download the file.
cache = 10*1024*1024 # 10 MB
response = requests.get(url, stream=True)
with open(filename, 'wb') as fp:
    # Write each chunk to disk as soon as it arrives
    for chunk in response.iter_content(chunk_size=cache):
        fp.write(chunk)
HTTP Ranges
The aforementioned methods pose another issue: in case of failure there is no way to resume or do a partial download. For that, HTTP supports the Range request header, which specifies the start and end bytes to be fetched from the server. This provides two benefits: first, resuming a failed download, and second, opening multiple connections to overcome the TCP window limit and use the available bandwidth more efficiently. Because range support is optional, the server has to support it, and the client has to maintain state to keep track of the bytes downloaded so far.
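For the resume case, the simplest state a client can keep is the partial file itself: its size on disk is the number of bytes already downloaded. The following is a minimal sketch under that assumption. The helper name `resume_headers` is hypothetical and not part of requests or of this article's code.

```python
import os


def resume_headers(part_path: str) -> dict:
    """Build a Range header asking for everything after the bytes already on disk."""
    already = os.path.getsize(part_path) if os.path.exists(part_path) else 0
    if already == 0:
        # Nothing downloaded yet: a plain request without a Range header will do.
        return {}
    # "bytes=N-" is an open-ended range: from offset N to the end of the resource.
    return {"Range": f"bytes={already}-"}
```

The returned dictionary could be passed as `headers=` to `requests.get(url, stream=True)`, with the file opened in `'ab'` mode so new chunks are appended. Before appending, the client should confirm the server answered with status 206, since a server that ignores the header sends the whole file again with status 200.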
Checking Server Support
There are two ways to check whether a server supports HTTP ranges. The first is to send a HEAD request to the URL and check whether Accept-Ranges: bytes is present in the response headers. The second is to request the data directly with a Range header and see whether the server responds with 206 Partial Content.
# Option 1: HEAD request, look for "Accept-Ranges: bytes"
resp = requests.head(url)
supports = resp.headers.get('Accept-Ranges') == 'bytes'

# Option 2: request a small range and check for 206 Partial Content
headers = {"Range": "bytes=0-100"}
resp = requests.get(url, headers=headers)
supports = resp.status_code == 206
Getting Content Size
To properly plan a range download over multiple connections, it is essential to know the total size of the content the client is about to download. Every partial (206) response carries this information in its Content-Range header.
Content-Range: bytes start-end/totalBytes
Polling to get total bytes of the content.
headers = {"Range": "bytes=0-1"}
response = requests.get(url, headers=headers)
rangedata = response.headers.get('Content-Range')
total_bytes = int(rangedata.split('/')[1])
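Splitting on '/' works when the header is present and well formed. A slightly more defensive parse of the Content-Range format shown above might look like the sketch below. The function name `parse_content_range` is hypothetical, and the sketch assumes the `bytes start-end/total` form, where the total may also be `*` if the server does not know the full size.

```python
import re
from typing import Optional, Tuple

# Matches "bytes <start>-<end>/<total>", where <total> may be "*" (unknown size).
_CONTENT_RANGE = re.compile(r"^bytes (\d+)-(\d+)/(\d+|\*)$")


def parse_content_range(value: str) -> Tuple[int, int, Optional[int]]:
    """Return (start, end, total); total is None when the server reports '*'."""
    match = _CONTENT_RANGE.match(value.strip())
    if match is None:
        raise ValueError(f"unexpected Content-Range value: {value!r}")
    start, end, total = match.groups()
    return int(start), int(end), None if total == "*" else int(total)
```

For example, `parse_content_range("bytes 0-1/1073741824")` yields the start, the end, and the total size as integers. Anything else, including a missing header passed in as an empty string, raises a `ValueError` instead of failing later with a less obvious error.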
Managing Download
Spawning concurrent worker processes to handle the partial downloads is the way to go. Make sure, though, that the client has enough cores to handle multiple processes efficiently, and check how many parallel connections the server allows from a single IP.
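import os
from multiprocessing import Process

import requests

def downloadRange(url, start, end, partfilename, cache):
    # Assumed worker (not shown in the original): stream bytes start..end (inclusive) to a part file
    headers = {"Range": "bytes={0}-{1}".format(start, end)}
    response = requests.get(url, headers=headers, stream=True)
    with open(partfilename, 'wb') as fp:
        for chunk in response.iter_content(chunk_size=cache):
            fp.write(chunk)

# On platforms that spawn instead of fork (Windows, macOS), run the code below under `if __name__ == '__main__':`
segment_size = 100*1024*1024 # 100 MB
cache = 10*1024*1024 # 10 MB stream cache
concurrent_conn = 4 # number of parallel connections
last_byte = total_bytes - 1 # byte offsets are zero-based and inclusive
start = 0
end = min(start + segment_size, last_byte)
processes = []
part_files = []
segment = 1
while True:
    partfilename = "part_{0}.zip".format(segment)
    part_files.append(partfilename)
    p = Process(target=downloadRange, args=(url, start, end, partfilename, cache))
    segment += 1
    processes.append(p)
    p.start()
    # Once the batch is full, wait for it to finish before starting more
    if len(processes) >= concurrent_conn:
        for p in processes:
            p.join()
        processes.clear()
    start = end + 1
    if start > last_byte:
        break
    end = min(start + segment_size, last_byte)
# Wait for any workers left over from the final, partial batch
for p in processes:
    p.join()
# Stitch the parts together in order and clean up
with open(filename, 'wb') as fp:
    for f in part_files:
        with open(f, 'rb') as fpart:
            fp.write(fpart.read())
        os.remove(f)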
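The boundary arithmetic in the loop above is easy to get wrong, because HTTP byte ranges are zero-based and inclusive on both ends. One way to keep it testable is to isolate it in a small generator, as in the sketch below. The name `split_ranges` is hypothetical, and unlike the loop above it makes every segment exactly `segment_size` bytes long (except possibly the last).

```python
from typing import Iterator, Tuple


def split_ranges(total_bytes: int, segment_size: int) -> Iterator[Tuple[int, int]]:
    """Yield inclusive (start, end) byte offsets covering 0 .. total_bytes - 1."""
    if segment_size <= 0:
        raise ValueError("segment_size must be positive")
    start = 0
    while start < total_bytes:
        # The last valid offset is total_bytes - 1, so clamp before subtracting 1.
        end = min(start + segment_size, total_bytes) - 1
        yield start, end
        start = end + 1


# A 10-byte resource in 4-byte segments: two full segments and a 2-byte tail.
for first, last in split_ranges(10, 4):
    print(f"bytes={first}-{last}")
```

Each pair can be formatted directly into a Range header, and the segments neither overlap nor leave gaps, so concatenating the part files in order reproduces the original content.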
Result
As mentioned above, all tests were done using a 1 GB file hosted here. Whenever streaming was enabled the cache size was 10 MB. For the range download the segment size was 100 MB and six parallel connections were used.
- Direct : 312.41 seconds
- Direct (Stream) : 347.71 seconds
- Range (Stream) : 237.13 seconds
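To put the timings in relation to each other, the short sketch below divides each measurement by the range download time. It only does arithmetic on the three numbers listed above and adds no new measurements.

```python
# Timings in seconds, copied from the results listed above.
timings = {
    "Direct": 312.41,
    "Direct (Stream)": 347.71,
    "Range (Stream)": 237.13,
}

baseline = timings["Range (Stream)"]
# Ratio of each method's time to the range download's time.
ratios = {name: seconds / baseline for name, seconds in timings.items()}

for name, ratio in ratios.items():
    print(f"{name:<16} {timings[name]:>7.2f} s  {ratio:.2f}x the range download time")
```

In this single run the direct download took roughly 1.32 times as long as the range download, and the streamed direct download roughly 1.47 times as long.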