Searching and Finding Archives

DataFS allows you to search and locate archives with the following methods: listdir(), filter(), and search(). Let's look at each method to see how it works.

Using listdir()

listdir() works much like a typical Unix-style ls in the sense that it returns all objects subordinate to the specified directory. If your team has used / to organize archive naming, you can explore the archive namespace just as you would explore a directory tree in a filesystem.

For example, if we provide impactlab/conflict/global as an argument to listdir(), we get the following:

>>> api.listdir('impactlab/conflict/global')
[u'conflict_glob_day.csv']

It looks like we have only one archive, conflict_glob_day.csv, in this directory.

Let’s see what kind of archives we have in our system.

>>> api.listdir('')
[u'impactlab']

It looks like our top-level directory is impactlab.

If we then use impactlab as an argument, we see several directory-like groupings below it.

>>> api.listdir('impactlab') 
['labor', 'climate', 'conflict', 'mortality']

Let’s explore conflict to see what kind of namespace groupings we have in there.

>>> api.listdir('impactlab/conflict')
[u'global']

OK. Just one. Now let’s have a look inside the impactlab/conflict/global namespace.

>>> api.listdir('impactlab/conflict/global')
[u'conflict_glob_day.csv']
>>> api.listdir('impactlab/conflict/global/conflict_glob_day.csv')
[u'0.0.1']

We see that when we give a full archive path with its file extension, we get the version numbers of that archive.
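
Putting this together, you can walk the entire namespace programmatically using nothing but listdir(). The helper below is a minimal sketch, assuming / is the separator and that a name containing a file extension is a leaf archive (whose listing, as shown above, is its version numbers):

>>> def walk(api, path=''):
...     # Recursively print the archive namespace using listdir().
...     # Stop descending at names with a file extension, since
...     # listdir() on a full archive path lists version numbers.
...     for name in api.listdir(path):
...         child = '/'.join([path, name]) if path else name
...         print(child)
...         if '.' not in name:
...             walk(api, child)

Calling walk(api) would print every grouping and archive name under the top-level impactlab directory.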

Using filter()

DataFS also lets you filter, which limits the search space of archive names. With filter() you can filter archives using the prefix option or a pattern matched by the path, str, or regex engines. Let's look at the prefix option, which matches the beginning string of a set of archive names, using the prefix project1_variable1_. First, let's see how many archives we have in total by calling filter() with no arguments.

>>> len(list(api.filter()))
125
>>> filtered_list1 = api.filter(prefix='project1_variable1_')
>>> list(filtered_list1) 
[u'project1_variable1_scenario1.nc', u'project1_variable1_scenario2.nc',
u'project1_variable1_scenario3.nc', u'project1_variable1_scenario4.nc',
u'project1_variable1_scenario5.nc']

We see there are 125 archives in total. By filtering with our prefix we narrow this down to just five.

We can also filter on path. In this case we want all NetCDF files matching a specific path pattern, so we set our engine value to path and pass in our search pattern.

>>> filtered_list2 = api.filter(pattern='*_variable4_scenario4.nc',
...     engine='path')
>>> list(filtered_list2) 
[u'project1_variable4_scenario4.nc', u'project2_variable4_scenario4.nc',
u'project3_variable4_scenario4.nc', u'project4_variable4_scenario4.nc',
u'project5_variable4_scenario4.nc']

We can also filter for archive names containing a specific string by setting engine to str. In this example we want all archives whose names contain the string variable2. The query returns 25 items; let's look at the first few.

>>> filtered_list3 = list(api.filter(pattern='variable2', engine='str'))
>>> len(filtered_list3)
25
>>> filtered_list3[:4] 
[u'project1_variable2_scenario1.nc', u'project1_variable2_scenario2.nc',
u'project1_variable2_scenario3.nc', u'project1_variable2_scenario4.nc']