Configuring DataFS for your Team

This tutorial walks through the process of creating the specification files and setting up resources for use on a large team. It assumes a basic level of familiarity with the purpose of DataFS, and also requires administrative access to any resources you’d like to use, such as AWS.

Set up a connection to AWS

To use AWS resources, you’ll need credentials. These are most easily specified in a credentials file.

We’ve provided a sample file here:

[aws-test]
aws_access_key_id=MY_AWS_ACCESS_KEY_ID
aws_secret_access_key=MY_AWS_SECRET_ACCESS_KEY

This file is located at ~/.aws/credentials by default, but for the purposes of this example we'll tell AWS how to find a local copy using an environment variable:

>>> import os
>>>
>>> # Change this to wherever your credentials file is:
... credentials_file_path = os.path.join(
...     os.path.dirname(__file__),
...     'credentials')
...
>>> os.environ['AWS_SHARED_CREDENTIALS_FILE'] = credentials_file_path
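If you want to confirm that the credentials can actually be found before moving on, you can load the profile directly with boto3, the AWS SDK that the S3 and DynamoDB services configured below rely on. This is an optional sanity check, not a DataFS call:

>>> import boto3
>>>
>>> session = boto3.Session(profile_name='aws-test')
>>> session.get_credentials() is not None
True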

Configure DataFS for your organization/use

Now that you have a connection to AWS, you can specify how you want DataFS to work. DataFS borrows the idea of profiles, allowing you to have multiple pre-configured file managers at once.

We’ll set up a test profile called “example” here:

# Specify a default profile
default-profile: example

# Configure your profiles here
profiles:

    # Everything under this key specifies the example profile
    example:
        
        api:
            
            # Enter user data for each user
            user_config:
                contact: me@email.com
                username: My Name
            
            
        # Add multiple data filesystems to use as 
        # the authoritative source for an archive
        authorities:
            
            # The authority "local" is an OSFS 
            # (local) filesystem, and has the relative
            # path "example_data_dir" as it's root.
            local:
                service: OSFS
                args: [example_data_dir]
            
            # The authority "remote" is an AWS S3FS
            # filesystem, and uses the "aws-test"
            # profile in the aws config file to 
            # connect to resources on Amazon's us-east-1
            remote:
                service: S3FS
                args: ['test-bucket']
                kwargs:
                    region_name: us-east-1
                    profile_name: 'aws-test'
        
        
        # Add one manager per profile
        manager:
            
            # This manager accesses the table 
            # 'my-test-data' in a local instance
            # of AWS's DynamoDB. To use a live 
            # DynamoDB, remove the endpoint_url 
            # specification.
            class: DynamoDBManager
            kwargs:
                resource_args:
                    endpoint_url: 'http://localhost:8000/'
                    region_name: 'us-east-1'
                    
                session_args:
                    profile_name: aws-test
                    
                table_name: my-test-data
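Save this specification to a YAML file; the examples below assume it has been saved as examples/preconfigured/.datafs.yml. Because the manager above points at a local DynamoDB instance on port 8000, it can also be worth verifying that the instance is reachable before continuing. This is an optional check that uses boto3 directly rather than DataFS:

>>> import boto3
>>>
>>> session = boto3.Session(profile_name='aws-test')
>>> dynamodb = session.client(
...     'dynamodb',
...     region_name='us-east-1',
...     endpoint_url='http://localhost:8000/')
>>>
>>> 'TableNames' in dynamodb.list_tables()
True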

Set up team managers and services

Make sure that the directories, buckets, and other resources your services connect to actually exist:

>>> if not os.path.isdir('example_data_dir'):
...     os.makedirs('example_data_dir')
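The "remote" authority above points at the S3 bucket 'test-bucket', which also has to exist before it can be used. Creating buckets is outside DataFS's scope; a minimal sketch using boto3 (assuming the 'aws-test' profile and the us-east-1 region; other regions also require a CreateBucketConfiguration argument) might look like this:

>>> import boto3
>>>
>>> s3 = boto3.Session(profile_name='aws-test').resource(
...     's3', region_name='us-east-1')
>>>
>>> if 'test-bucket' not in [b.name for b in s3.buckets.all()]:
...     bucket = s3.create_bucket(Bucket='test-bucket')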

Now, boot up an API and create the archive table on your manager that corresponds to the one specified in your configuration file:

>>> import datafs
>>> api = datafs.get_api(
...     profile='example',
...     config_file='examples/preconfigured/.datafs.yml')
>>>
>>> api.manager.create_archive_table('my-test-data')

Finally, we’ll set some basic reporting requirements that will be enforced when users interact with the data.

We can require user information when writing or updating an archive. set_required_user_config() allows administrators to set user configuration requirements and to provide a prompt to help users:

>>> api.manager.set_required_user_config({
...     'username': 'your full name',
...     'contact': 'your email address'})

Similarly, set_required_archive_metadata() sets the metadata that is required for each archive:

>>> api.manager.set_required_archive_metadata({
...     'description': 'a long description of the archive'})

Attempts by users to create/update archives without these attributes will now fail.

Using the API

At this point, any user with properly specified credentials and config files can use the data API.

From within Python:

>>> import datafs
>>> api = datafs.get_api(
...     profile='example',
...     config_file='examples/preconfigured/.datafs.yml')
>>>
>>> archive = api.create(
...     'archive1',
...     authority_name='local',
...     metadata = {'description': 'my new archive'})
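Once the archive exists, users can write data to it. The sketch below assumes the file-like open() interface shown in DataFS's quickstart documentation:

>>> with archive.open('w+') as f:
...     res = f.write(u'first version of my data')
...
>>> with archive.open('r') as f:
...     print(f.read())
first version of my data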

Note that the metadata requirements you specified are enforced. If a user tries to skip the description, an error is raised and the archive is not created:

>>> archive = api.create(
...     'archive2',
...     authority_name='local')
Traceback (most recent call last):
...
AssertionError: Required value "description" not found. Use helper=True or
the --helper flag for assistance.
>>>
>>> print(next(api.filter()))
archive1

Setting User Permissions

Users can be managed using IAM policies in AWS's admin console. An example policy that allows users to create, update, and find archives, without allowing them to delete archives or modify the required metadata specification, is provided here:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "dynamodb:BatchGetItem",
                "dynamodb:BatchWriteItem",
                "dynamodb:DescribeTable",
                "dynamodb:GetItem",
                "dynamodb:PutItem",
                "dynamodb:Query",
                "dynamodb:Scan",
                "dynamodb:UpdateItem"
            ],
            "Resource": [
                "arn:aws:dynamodb:*:*:table/my-test-data"
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "dynamodb:BatchGetItem",
                "dynamodb:DescribeTable",
                "dynamodb:GetItem",
                "dynamodb:Query",
                "dynamodb:Scan"
            ],
            "Resource": [
                "arn:aws:dynamodb:*:*:table/my-test-data.spec"
            ]
        }
    ]
}
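How the policy gets attached to users is up to your organization; the AWS console is the simplest route. If you would rather script it, a sketch using boto3's IAM client could look like the following. The user name 'datafs-user', the policy name 'datafs-user-policy', and the file 'user_policy.json' (containing the JSON shown above) are all placeholders:

>>> import boto3
>>>
>>> iam = boto3.Session(profile_name='aws-test').client('iam')
>>>
>>> with open('user_policy.json') as f:
...     policy_document = f.read()
...
>>> policy = iam.create_policy(
...     PolicyName='datafs-user-policy',
...     PolicyDocument=policy_document)
>>>
>>> resp = iam.attach_user_policy(
...     UserName='datafs-user',
...     PolicyArn=policy['Policy']['Arn'])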

A user with AWS access keys using this policy will see an AccessDeniedException when attempting to take restricted actions:

>>> import datafs
>>> api = datafs.get_api(profile='user') 
>>>
>>> archive = api.get_archive('archive1')
>>>
>>> archive.delete() 
Traceback (most recent call last):
...
botocore.exceptions.ClientError: An error occurred (AccessDeniedException)
when calling the DeleteItem operation: ...

Teardown

A user with full privileges can completely remove archives and manager tables:

>>> api.delete_archive('archive1')
>>> api.manager.delete_table('my-test-data')
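DataFS does not remove the underlying storage for you, so if you want a completely clean slate you can also delete the local data directory and the S3 bucket yourself. A sketch, assuming the resources created earlier in this tutorial:

>>> import shutil
>>> import boto3
>>>
>>> shutil.rmtree('example_data_dir')
>>>
>>> s3 = boto3.Session(profile_name='aws-test').resource(
...     's3', region_name='us-east-1')
>>> bucket = s3.Bucket('test-bucket')
>>>
>>> # a bucket must be emptied before it can be deleted
... resp = bucket.objects.all().delete()
>>> resp = bucket.delete()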