Searching and Finding Archives
DataFS lets you search for and locate archives with the following methods: listdir(), filter(), and search(). Let’s look at each method to see how it works.
Using listdir()
listdir() works like the typical Unix-style ls in the sense that it returns all objects subordinate to the specified directory. If your team has used / to organize archive naming, then you can explore the archive namespace just as you would explore a directory in a filesystem.
For example, if we provide impactlab/conflict/global as an argument to listdir(), we get the following:
>>> api.listdir('impactlab/conflict/global')
[u'conflict_glob_day.csv']
It looks like we only have one file, conflict_glob_day.csv, in our directory.
Let’s see what kind of archives we have in our system.
>>> api.listdir('')
[u'impactlab']
It looks like our top-level directory is impactlab.
Then, if we use impactlab as an argument, we see that we have several directory-like groupings below it.
>>> api.listdir('impactlab')
['labor', 'climate', 'conflict', 'mortality']
Let’s explore conflict to see what kind of namespace groupings we have in there.
>>> api.listdir('impactlab/conflict')
[u'global']
OK. Just one. Now let’s have a look inside the impactlab/conflict/global namespace.
>>> api.listdir('impactlab/conflict/global')
[u'conflict_glob_day.csv']
>>> api.listdir('impactlab/conflict/global/conflict_glob_day.csv')
[u'0.0.1']
We see that if we give the full path to an archive, including its file extension, we get the version numbers of that archive.
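To make the directory-like behavior concrete, here is a minimal sketch in plain Python of how an ls-style listing can be derived from /-separated archive names. It does not use DataFS and is not how DataFS implements listdir(); the helper list_children and every archive name except conflict_glob_day.csv are hypothetical, chosen only to mirror the groupings shown above. The sketch also does not model version listing for a full archive path.

```python
def list_children(archive_names, path=''):
    """Return the sorted names directly below ``path``, similar to ``ls``."""
    prefix = path.strip('/')
    prefix = prefix + '/' if prefix else ''
    children = set()
    for name in archive_names:
        if name.startswith(prefix):
            # Keep only the first path component after the prefix.
            remainder = name[len(prefix):]
            if remainder:
                children.add(remainder.split('/', 1)[0])
    return sorted(children)


# Hypothetical archive names; only the conflict archive appears in the text.
names = [
    'impactlab/conflict/global/conflict_glob_day.csv',
    'impactlab/labor/example_a.csv',
    'impactlab/climate/example_b.nc',
    'impactlab/mortality/example_c.csv',
]

print(list_children(names, ''))
print(list_children(names, 'impactlab'))
print(list_children(names, 'impactlab/conflict/global'))
```

Each call returns only the next level of the namespace, which is why walking from '' down to impactlab/conflict/global takes one call per level.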
Using filter()
DataFS also lets you filter archive names to limit the search space. With filter() you can use the prefix, path, str, and regex pattern options to filter archives.
Let’s look at filtering with project1_variable1_ as the prefix option, which matches the beginning string of a set of archive names. Let’s also see how many archives we have in total by filtering without arguments.
>>> len(list(api.filter()))
125
>>> filtered_list1 = api.filter(prefix='project1_variable1_')
>>> list(filtered_list1)
[u'project1_variable1_scenario1.nc', u'project1_variable1_scenario2.nc',
u'project1_variable1_scenario3.nc', u'project1_variable1_scenario4.nc',
u'project1_variable1_scenario5.nc']
We see there are 125 archives in total. Filtering with our prefix narrows this to 5, significantly reducing the number of archives we are looking at.
We can also filter on path. In this case we want all NetCDF files whose names match a specific pattern. We need to set engine to path and pass in our search pattern.
>>> filtered_list2 = api.filter(pattern='*_variable4_scenario4.nc',
... engine='path')
>>> list(filtered_list2)
[u'project1_variable4_scenario4.nc', u'project2_variable4_scenario4.nc',
u'project3_variable4_scenario4.nc', u'project4_variable4_scenario4.nc',
u'project5_variable4_scenario4.nc']
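The pattern above is a shell-style glob. As a rough illustration of those matching rules, here is a sketch that uses Python’s standard fnmatch module on a generated list of names. The list is an assumption built to mirror the 125 archives in this example, and the sketch is not a claim about how DataFS implements the path engine.

```python
import fnmatch

# Assumed naming scheme: 5 projects x 5 variables x 5 scenarios = 125 names.
names = [
    'project{}_variable{}_scenario{}.nc'.format(p, v, s)
    for p in range(1, 6)
    for v in range(1, 6)
    for s in range(1, 6)
]

# '*' matches any run of characters, so every project prefix is accepted.
matches = fnmatch.filter(names, '*_variable4_scenario4.nc')

print(len(names))
for name in matches:
    print(name)
```

The glob leaves only the project number free, so one name per project survives.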
We can also filter for archives whose names contain a specific string by setting engine to str. In this example we want all archives with the string variable2. The filtering query returns 25 items. Let’s look at the first few.
>>> filtered_list3 = list(api.filter(pattern='variable2', engine='str'))
>>> len(filtered_list3)
25
>>> filtered_list3[:4]
[u'project1_variable2_scenario1.nc', u'project1_variable2_scenario2.nc',
u'project1_variable2_scenario3.nc', u'project1_variable2_scenario4.nc']
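The regex option is listed above but not demonstrated, so here is a sketch in plain Python that contrasts substring matching with regular-expression matching on the same kind of names. The name list and the regular expression are assumptions made for illustration, and whether DataFS applies the pattern with search or match semantics is not confirmed here.

```python
import re

# Assumed naming scheme mirroring the 125 archives in this example.
names = [
    'project{}_variable{}_scenario{}.nc'.format(p, v, s)
    for p in range(1, 6)
    for v in range(1, 6)
    for s in range(1, 6)
]

# Substring test, analogous to the str engine shown above.
containing = [n for n in names if 'variable2' in n]

# A hypothetical regular expression: variable2 archives from projects 1 and 2.
pattern = re.compile(r'^project[12]_variable2_scenario\d\.nc$')
regex_matches = [n for n in names if pattern.search(n)]

print(len(containing))
print(len(regex_matches))
```

A regular expression can express constraints, such as "project 1 or 2", that a plain substring or a single glob cannot.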
Using search()
DataFS search() capabilities are enabled via tagging of archives. The arguments of the search() method are tags associated with a given archive. If archives are not tagged, they cannot be found with the search() method. See Tagging Archives for info on how to tag archives.
If we use search() without arguments, it is the same as calling filter() without arguments.
Let’s see this in action.
>>> archives_search = list(api.search())
>>> archives_filter = list(api.filter())
>>> len(archives_search)
125
>>> len(archives_filter)
125
Our archives have been tagged with team1, team2, or team3. Let’s search for some archives with tag team3. It brings back 41 archives, so we’ll just look at a few.
>>> tagged_search = list(api.search('team3'))
>>> len(tagged_search)
41
>>> tagged_search[:4]
[u'project1_variable1_scenario2.nc', u'project1_variable2_scenario1.nc',
u'project1_variable2_scenario3.nc', u'project1_variable3_scenario2.nc']
And let’s look at some of these archives to see what their tags are. We’ll use get_tags().
>>> tags = []
>>> for arch in tagged_search[:4]:
...     tags.append(api.manager.get_tags(arch)[0])
>>> tags
[u'team3', u'team3', u'team3', u'team3']
And how about tag team1? We see that there are 42 archives with the team1 tag.
>>> tagged_search_team1 = list(api.search('team1'))
>>> len(tagged_search_team1)
42
>>> tagged_search_team1[:4]
[u'project1_variable1_scenario1.nc', u'project1_variable1_scenario4.nc',
u'project1_variable2_scenario2.nc', u'project1_variable2_scenario5.nc']
And let’s use get_tags() to confirm the tags are team1.
>>> tags = []
>>> for arch in tagged_search_team1[:4]:
...     tags.append(api.manager.get_tags(arch)[0])
>>> tags
[u'team1', u'team1', u'team1', u'team1']
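To show why untagged archives cannot be found this way, here is a minimal sketch of tag-based lookup using an inverted index in plain Python. It is not DataFS’s implementation: build_tag_index, search_by_tags, and the reviewed tag are hypothetical, and treating several tags as "all must match" is an assumption rather than documented behavior.

```python
from collections import defaultdict


def build_tag_index(archive_tags):
    """Map each tag to the set of archive names that carry it."""
    index = defaultdict(set)
    for archive, tags in archive_tags.items():
        for tag in tags:
            index[tag].add(archive)
    return index


def search_by_tags(index, *tags):
    """Return the sorted archives carrying every tag in ``tags``."""
    if not tags:
        # No tags given: return every archive known to the index.
        return sorted(set().union(*index.values()))
    result = set(index.get(tags[0], ()))
    for tag in tags[1:]:
        result &= index.get(tag, set())
    return sorted(result)


# Hypothetical tag assignments; 'reviewed' is an invented second tag.
archive_tags = {
    'project1_variable1_scenario1.nc': ['team1'],
    'project1_variable1_scenario2.nc': ['team3'],
    'project1_variable2_scenario1.nc': ['team3', 'reviewed'],
    'project1_variable2_scenario2.nc': ['team1'],
}

index = build_tag_index(archive_tags)
print(search_by_tags(index, 'team3'))
print(search_by_tags(index, 'team3', 'reviewed'))
```

An archive with no tags never enters the index, so no tag query in this sketch can return it.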