ias3 Internet archive S3-like API
On this page
ias3 Internet archive S3-like API#
This document is intended for a user who is comfortable in the unix command line environment. It covers the technical details of using the archive’s S3 like server API.
The internet archive has a public facing API which allows for the storage and retrieval of item data stored in the archive.
Amazon s3 documentation used when creating the IA S3 api endpoint can be found here.
Information about archive items can be found here.
To get api keys for the archive’s S3-Like API go to the API key page
What the S3-like API does:#
Items (things with details pages) get mapped to S3 Buckets. For example, http://archive.org/details/stats is also available as http://s3.us.archive.org/stats, or, per s3 dns bucket style http://stats.s3.us.archive.org/
Files within items are also available as S3 keys, For example http://stats.s3.us.archive.org/downloadsPerDay.png
Doing a
PUT
on the S3 endpoint will result in a new internet archive ItemFiles may also be uploaded to an Item in the same way keys are added, via S3
PUT
.When a file is added to an Item, it is staged in temporary storage and ingested via the Archive’s content management system. This can take some time.
ias3 has support for multipart uploads.
Python Library#
The internet archive maintains a python library for accessing this api, see https://github.com/jjjake/internetarchive .
POST Support#
- For using the POST support these documents are very useful:
How this is different from normal S3#
DELETE bucket is not allowed.
Only the HTTP 1.1 REST interface is supported.
- Archive is much more likely to issue 307 Location redirects than Amazon is.
Which means clients with good 100-Continue support are very nice to have
curl versions curl-7.19 and newer have excellent 100-continue support
curl versions curl-7.58 and newer need to add
--location-trusted
flag if you are including an Authorization headerACLs are fake. permissions are: World readable, Item uploader writable.
HTTP 1.1 Range headers are ignored (also copy range headers for multipart).
There are special features of the archive s3 connector to support activities with Internet Archive items. These are used by adding http headers to a request.
There is a combined upload and make item feature,
set the header: x-archive-auto-make-bucket:1
when doing a PUT
of a file.
- An HTTP header can specify metadata that ends up in
_meta.xml
at make bucket time. Add headers of form
x-archive-meta-$meta_name:$meta_value
(orx-amz-meta-$meta_name:$meta_value
)If you want multiple tags in _meta.xml you can put numbers in front:
x-amz-meta01-$meta_name:$meta_value_a
x-amz-meta02-$meta_name:$meta_value_b
Meta headers are sorted prior to tag generation when placed in the xml
Meta headers are interpreted as having utf-8 character encoding
Because rfc822 http headers disallow
_
in names, in $meta_name two hyphens in a row (--
) will be translated to an underscore(_
).Some http clients do not allow the full range of utf-8 bytes to appear in http headers. As a work around, one can encode a utf-8 meta header with uri encoding. To do this write all the header data like so:
uri($payload_as_uri_encoded_utf8)
For example, to set the title of an item to include the unicode snowman:x-archive-meta-title:uri(This%20is%20a%20snowman%20%E2%98%83)
to update metadata use the Item Metadata API API
Skip request signing#
There is an optional simple way to authorize a request; Authorization header
can be of form Authorization: LOW $accesskey:$secret
Be sure to use https when using the simple mode.
Skip derive process#
Normally a PUT (aka file upload) to IA will cause the derive
process to be queued for that bucket/item. The derive process
produces a number of secondary files from an upload to make
an upload more usable on the web. For example, the derive process
creates image data to make a PDF file viewable in the in browser bookviewer
on the archive.org website. You can prevent this (heavyweight process)
by specifying a header like so with each upload: x-archive-queue-derive:0
Delete derived files when an original is deleted#
DELETE normally deletes a single file, additionally all the
derivatives and originals related to a file can be
automatically deleted by specifying a header with the DELETE
like so: x-archive-cascade-delete:1
Keep old versions of files#
Normally PUT and DELETE do not keep old versions of files around.
To have the archive keep old versions of the object you can
add the header: x-archive-keep-old-version:1
Saved versions will be placed in history/files/$key.~N~
(For multipart, the x-archive-keep-old-version
header must be
specified at the time the multipart upload is completed)
Hint the archive about the final size of an item#
For large items a size hint can be given to the IA content
management system at make bucket time.
Units are in bytes, for example: x-archive-size-hint:19327352832
Express queue#
For uploads which need to be available ASAP in the content
management system, an interactive user’s upload for example,
one can request interactive queue priority:
x-archive-interactive-priority:1
Do not perform multiple transactions on items with mixed settings for
x-archive-interactive-priority
. Doing so can result in reordering
of the application of requests to the item state.
Dealing with API errors#
To help developers test the error processing of software interacting with the S3-like API,
there is an error simulation feature.
To simulate errors s3 supports a special ‘error this request’ header.
For example, to simulate a Slowdown error that the s3 api may
generate you can set an http header like so (in addition to any
other headers you may normally send in a request):
x-archive-simulate-error:SlowDown
For example:
$ curl s3.us.archive.org -v -H x-archive-simulate-error:SlowDown
To see a list of errors s3 can simulate, you can do:
$ curl s3.us.archive.org -v -H x-archive-simulate-error:help
Use Limits#
Sometimes the task queue system which processes PUTs and DELETEs becomes overloaded, and the endpoint returns a 503 SlowDown error instead of processing an upload or delete. To check if an upload would fail because of overload you can call:
$ curl http://s3.us.archive.org/?check_limit=1&accesskey=$accesskey&bucket=$bucket
- The result is a json object with 4 fields:
bucket
,accesskey
,over_limit
, anddetail
detail
contains internal information about the current rate limiting scheme, it may change at any time.The
over_limit
field will be either 0 to indicate that the queue is ready for more uploads or deletes, or 1, indicating that uploads or deletes are likely to get a 503 SlowDown error. The fields bucket and accesskey are the query arguments passed in.
Examples#
These features combined allow single command document upload with curl. (Most users would probably use the internetarchive command line tool instead, these are examples for developers of new library code.)
Text item (a PDF will be OCR’d):
curl --location --header 'x-amz-auto-make-bucket:1' \
--header 'x-archive-meta01-collection:opensource' \
--header 'x-archive-meta-mediatype:texts' \
--header 'x-archive-meta-sponsor:Andrew W. Mellon Foundation' \
--header 'x-archive-meta-language:eng' \
--header "authorization: LOW $accesskey:$secret" \
--upload-file /home/samuel/public_html/intro-to-k.pdf \
http://s3.us.archive.org/sam-s3-test-08/demo-intro-to-k.pdf
Movie item (Will get video player on details page):
curl --location --header 'x-amz-auto-make-bucket:1' \
--header 'x-archive-meta01-collection:opensource_movies' \
--header 'x-archive-meta-mediatype:movies' \
--header 'x-archive-meta-title:Ben plays piano.' \
--header "authorization: LOW $accesskey:$secret" \
--upload-file ben-2009-05-09.avi \
http://s3.us.archive.org/ben-plays-piano/ben-plays-piano.avi
Upload a file to an existing item:
curl --location \
--header "authorization: LOW $accesskey:$secret" \
--upload-file /home/samuel/public_html/intro-to-k.pdf \
http://s3.us.archive.org/sam-s3-test-08/demo-intro-to-k.pdf
Fast GET downloads#
Although the s3 interface supports GET and HEAD, high performance downloads are achieved via the archive web infrastructure:
curl --location http://archive.org/download/sam-s3-test-08/demo-intro-to-k.pdf
note: curl versions curl-7.58 and newer need to add --location-trusted
flag if you are including an Authorization header.
Bucket activity#
After an object had been PUT into a bucket, many things happen in the archive’s petabox content management system (called the catalog). You can see the catalog page for a bucket by looking at:
https://archive.org/history/$bucket
Questions?#
Mail info@archive.org, with the string s3help appearing somewhere in the subject line.