Using DataFS with AWS’s S3
This tutorial builds a DataFS server system that uses MongoDB as the archive manager and AWS’s S3 (Simple Storage Service) for storage.
Running this example
To run this example:
- Create a MongoDB server by following MongoDB’s installation and startup instructions.
- Start the MongoDB server (e.g. mongod --dbpath . --nojournal).
- Follow the steps below.
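Before continuing, it can help to confirm that something is listening where MongoDB is expected. The sketch below is not part of DataFS. It assumes MongoDB's default port 27017 on localhost, and the helper name mongo_is_reachable is made up for this example. It only checks that a TCP connection can be opened, not that the listener is really MongoDB.

```python
import socket


def mongo_is_reachable(host="localhost", port=27017, timeout=1.0):
    """Return True if a TCP connection to host:port can be opened.

    Hypothetical helper: 27017 is MongoDB's default port, but this only
    shows that *something* accepts connections there.
    """
    try:
        # create_connection raises OSError on refusal, timeout, or DNS failure.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Calling mongo_is_reachable() with no arguments checks the default local address. If mongod was started on a different port, pass it explicitly.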
Set up the workspace
We need a few things for this example:
>>> from datafs.managers.manager_mongo import MongoDBManager
>>> from datafs import DataAPI
>>> from fs.tempfs import TempFS
>>> import os
>>> import tempfile
>>> import shutil
>>>
>>> # overload unicode for python 3 compatibility:
>>>
>>> try:
...     unicode = unicode
... except NameError:
...     unicode = str
This time, we’ll import PyFilesystem’s S3 Filesystem abstraction:
>>> from fs.s3fs import S3FS
Additionally, you’ll need MongoDB and pymongo installed and a MongoDB instance running.
Create an API
Begin by creating an API instance:
>>> api = DataAPI(
...     username='My Name',
...     contact = 'my.email@example.com')
Attach Manager
Next, we’ll choose an archive manager. DataFS currently supports MongoDB and DynamoDB managers. In this example we’ll use a local MongoDB manager. Make sure you have a MongoDB server running, then create a MongoDBManager instance:
>>> manager = MongoDBManager(
...     database_name = 'MyDatabase',
...     table_name = 'DataFiles')
If this is the first time you’ve set up this database, you’ll need to create a table:
>>> manager.create_archive_table('DataFiles', raise_on_err=False)
All set. Now we can attach the manager to our DataAPI object:
>>> api.attach_manager(manager)
Attach Service
Now we need a storage service. Let’s attach the S3FS filesystem we imported:
>>> s3 = S3FS(
...     'test-bucket',
...     aws_access_key='MY_KEY',
...     aws_secret_key='MY_SECRET_KEY')
>>>
>>> api.attach_authority('aws', s3)
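The keys in the call above are placeholders. One way to avoid writing real credentials into a script is to read them from the environment. The sketch below is a minimal example of that idea, assuming the standard AWS variable names AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY are set. The helper s3_credentials_from_env is hypothetical and is not part of DataFS or PyFilesystem.

```python
import os


def s3_credentials_from_env(environ=None):
    """Build the two keyword arguments used in the S3FS call above.

    Hypothetical helper. Reads AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY
    from the given mapping, or from os.environ when none is passed.
    """
    environ = os.environ if environ is None else environ
    required = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY")
    missing = [name for name in required if not environ.get(name)]
    if missing:
        raise KeyError("missing environment variables: " + ", ".join(missing))
    return {
        "aws_access_key": environ["AWS_ACCESS_KEY_ID"],
        "aws_secret_key": environ["AWS_SECRET_ACCESS_KEY"],
    }


# Intended use (needs the S3FS import and bucket from this tutorial):
#   s3 = S3FS('test-bucket', **s3_credentials_from_env())
```

Failing early with a KeyError makes a missing variable obvious before any request is sent to the storage service.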
Create archives
Now we can create our first archive. An archive must have an archive_name. In addition, you can supply a metadata dictionary, which is stored with the archive. To suppress errors on re-creation, use the raise_on_err=False flag.
>>> api.create(
...     'my_remote_archive',
...     metadata = dict(description = 'My test data archive'))
<DataArchive aws://my_remote_archive>
View all available archives
Let’s see what archives we have available to us.
>>> print(next(api.filter()))
my_remote_archive
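Note that next() returns only the first name. Because the call above applies next() to the result of api.filter(), that result can be treated as an iterator. The sketch below uses a stand-in generator (fake_filter, hypothetical, so that it runs without a database) to show how to list every archive and how to avoid StopIteration when there are none. The name another_archive is also made up for illustration.

```python
def fake_filter(names):
    """Stand-in for api.filter(): yields archive names one at a time."""
    for name in names:
        yield name


# List every archive rather than only the first one.
all_names = list(fake_filter(["my_remote_archive", "another_archive"]))
print(all_names)

# next() with a default avoids StopIteration on an empty result.
first_or_none = next(fake_filter([]), None)
print(first_or_none)
```

With the real API object, the same two patterns would be list(api.filter()) and next(api.filter(), None).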
Retrieve archive metadata
Now that we have created an archive, we can retrieve it from anywhere as long as we have access to the correct service. When we retrieve the archive, we can see the metadata that was created when it was initialized.
>>> var = api.get_archive('my_remote_archive')
We can access the metadata for this archive through the archive’s get_metadata() method:
>>> print(var.get_metadata()['description'])
My test data archive
Add a file to the archive
An archive is simply a versioned history of data files. So let’s get started adding data!
First, we’ll create a local file, test.txt, and put some data in it:
>>> with open('test.txt', 'w+') as f:
...     f.write('this is a test')
Now we can add this file to the archive:
>>> var.update('test.txt')
This file just got sent into our archive! Now we can delete the local copy:
>>> os.remove('test.txt')
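The workspace imports tempfile and shutil, while the steps above write test.txt into the current directory. One way to keep scratch files out of the working directory is sketched below. The helper name write_scratch_file is hypothetical, and the var.update(...) call appears only as a comment because it needs the archive created earlier.

```python
import os
import shutil
import tempfile


def write_scratch_file(directory, name="test.txt", text="this is a test"):
    """Write text to directory/name and return the full path."""
    path = os.path.join(directory, name)
    with open(path, "w") as f:
        f.write(text)
    return path


tmpdir = tempfile.mkdtemp()
try:
    local_path = write_scratch_file(tmpdir)
    # var.update(local_path)  # upload to the archive, as in the step above
    with open(local_path) as f:
        contents = f.read()
finally:
    # Remove the directory and everything in it, even if the upload failed.
    shutil.rmtree(tmpdir)
```

The try/finally block replaces the separate os.remove call: the scratch directory is deleted whether or not the upload succeeds.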
Reading from the archive
Next we’ll read from the archive. The file object returned by var.open() can be read just like a regular file:
>>> with var.open('r') as f:
...     print(f.read())
...
this is a test
Updating the archive
Open the archive and write to the file:
>>> with var.open('w+') as f:
...     res = f.write(unicode('this is the next test'))
Retrieving the latest version
Now let’s make sure we’re getting the latest version:
>>> with var.open() as f:
...     print(f.read())
...
this is the next test
Looks good!
Cleaning up
>>> var.delete()
>>> api.manager.delete_table('DataFiles')
Next steps
Using Other Services describes setting up DataFS for other filesystems, such as sftp or http.