Searching and Finding Archives

DataFS allows you to search and locate archives with the following methods: listdir(), filter(), and search(). Let's look at each method to see how it works.

Using listdir()

listdir() works much like a typical Unix-style ls in the sense that it returns all objects subordinate to the specified directory. If your team has used / to organize archive naming, you can explore the archive namespace just as you would explore a directory tree in a filesystem.

For example, if we provide impactlab/conflict/global as an argument to listdir(), we get the following:

>>> api.listdir('impactlab/conflict/global')
[u'conflict_glob_day.csv']

It looks like we have only one archive, conflict_glob_day.csv, in this directory.

Let’s see what kind of archives we have in our system.

>>> api.listdir('')
[u'impactlab']

It looks like our top-level directory is impactlab.

If we then use impactlab as an argument, we see several directory-like groupings below it.

>>> api.listdir('impactlab') 
['labor', 'climate', 'conflict', 'mortality']

Let’s explore conflict to see what kind of namespace groupings we have in there.

>>> api.listdir('impactlab/conflict')
[u'global']

OK. Just one. Now let’s have a look inside the impactlab/conflict/global namespace.

>>> api.listdir('impactlab/conflict/global')
[u'conflict_glob_day.csv']
>>> api.listdir('impactlab/conflict/global/conflict_glob_day.csv')
[u'0.0.1']

We see that when we give a full archive path with its file extension, we get the version numbers of that archive.
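
Putting this together, you can walk the entire namespace programmatically using nothing but listdir(). The helper below is a minimal sketch, assuming / is the separator and that a name containing a file extension is a leaf archive (whose listing, as shown above, is its version numbers):

>>> def walk(api, path=''):
...     # Recursively print the archive namespace using listdir().
...     # Stop descending at names with a file extension, since
...     # listdir() on a full archive path lists version numbers.
...     for name in api.listdir(path):
...         child = '/'.join([path, name]) if path else name
...         print(child)
...         if '.' not in name:
...             walk(api, child)

Calling walk(api) would print every grouping and archive name under the top-level impactlab directory.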

Using filter()

DataFS also lets you filter, which limits the search space of archive names. With filter() you can filter archives using the prefix option or a pattern matched by the path, str, or regex engines. Let's look at the prefix option, which matches the beginning string of a set of archive names, using the prefix project1_variable1_. First, let's see how many archives we have in total by calling filter() with no arguments.

>>> len(list(api.filter()))
125
>>> filtered_list1 = api.filter(prefix='project1_variable1_')
>>> list(filtered_list1) 
[u'project1_variable1_scenario1.nc', u'project1_variable1_scenario2.nc',
u'project1_variable1_scenario3.nc', u'project1_variable1_scenario4.nc',
u'project1_variable1_scenario5.nc']

We see there are 125 archives in total. By filtering with our prefix we narrow this down to just five.

We can also filter on path. In this case we want all NetCDF files matching a specific path pattern, so we set our engine value to path and pass in our search pattern.

>>> filtered_list2 = api.filter(pattern='*_variable4_scenario4.nc',
...     engine='path')
>>> list(filtered_list2) 
[u'project1_variable4_scenario4.nc', u'project2_variable4_scenario4.nc',
u'project3_variable4_scenario4.nc', u'project4_variable4_scenario4.nc',
u'project5_variable4_scenario4.nc']

We can also filter for archive names containing a specific string by setting engine to str. In this example we want all archives whose names contain the string variable2. The query returns 25 items; let's look at the first few.

>>> filtered_list3 = list(api.filter(pattern='variable2', engine='str'))
>>> len(filtered_list3)
25
>>> filtered_list3[:4] 
[u'project1_variable2_scenario1.nc', u'project1_variable2_scenario2.nc',
u'project1_variable2_scenario3.nc', u'project1_variable2_scenario4.nc']