The nilearn.datasets package embeds tools to fetch and load datasets. It comes with a set of several datasets that can be easily downloaded.
The purpose of this API is to download datasets and preprocess them if needed, but only once, so the datasets must be stored somewhere on disk.
There are 3 simple ways to determine where datasets are stored. Here are the rules, ordered by priority (the first rule overrides the others, and so on):
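As a rough sketch of how such a priority order can be resolved in code, the function below assumes a common ordering: an explicit argument first, then an environment variable, then a default folder. The name `resolve_data_dir`, the variable name `NILEARN_DATA` and the default folder `nilearn_data` are assumptions made for illustration, not a description of the library's actual lookup.

```python
import os


def resolve_data_dir(data_dir=None, env_var="NILEARN_DATA", environ=None):
    """Return the directory where datasets should be stored.

    Hypothetical helper: the three rules and their names are assumptions.
    """
    environ = os.environ if environ is None else environ
    # 1. An explicit argument overrides every other rule.
    if data_dir is not None:
        return os.path.abspath(os.path.expanduser(data_dir))
    # 2. Otherwise, use the environment variable if it is set and non-empty.
    if environ.get(env_var):
        return os.path.abspath(os.path.expanduser(environ[env_var]))
    # 3. Otherwise, fall back to a default folder in the home directory.
    return os.path.join(os.path.expanduser("~"), "nilearn_data")
```

Passing `environ` explicitly keeps the function easy to test without modifying the real process environment.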
A generic dataset fetching function is available (fetch_dataset). Please see its documentation to learn how to use it.
If you are considering using an online public dataset, do not hesitate to follow the steps below to create a dataset fetching function for it. Any pull request is welcome.
Writing a dataset fetching function is rather easy if the data do not require complex preprocessing. Pay special attention to the sharing conditions of the dataset you want to load. If registration is required, contact the dataset provider to find out whether there is a way around it (for example, putting the data on a public server, or creating an account that the script will use to download it).
Creating your function is straightforward. Your function has to take at least two parameters:
You don’t have to worry about these parameters: they just have to be passed on to some helper functions. You can of course add custom parameters to fit your needs. For example:
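A minimal sketch of such a signature is given here. The two required parameters are not named in this text, so `data_dir` and `resume` are assumed names; `n_subjects` is a hypothetical custom parameter, and `fetch_my_dataset` is a placeholder function name.

```python
def fetch_my_dataset(data_dir=None, resume=True, n_subjects=1):
    # data_dir, resume: assumed names for the two required parameters;
    # they are only forwarded to the helpers, never interpreted here.
    # n_subjects: hypothetical custom parameter specific to this dataset.
    if n_subjects < 1:
        raise ValueError("n_subjects must be at least 1")
    # Placeholder body: a real function would go on to locate, download
    # and load the files. Here the options are simply returned.
    return {"data_dir": data_dir, "resume": resume, "n_subjects": n_subjects}
```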
The function definition comes with its docstring. In addition to the parameter definitions, any information about the data structure and/or paper references is welcome.
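One possible docstring layout, in the numpydoc style common in scientific Python projects, is sketched here. The function name, parameter names and field names are all hypothetical placeholders.

```python
def fetch_my_dataset(data_dir=None, resume=True, n_subjects=1):
    """Download and load a hypothetical example dataset.

    Parameters
    ----------
    data_dir : str, optional
        Directory where the data are stored (assumed parameter name).
    resume : bool, optional
        Whether to resume an interrupted download (assumed parameter name).
    n_subjects : int, optional
        Number of subjects to load (hypothetical custom parameter).

    Returns
    -------
    data : Bunch
        Dictionary-like object. Describe each field here, for example
        'func' (paths to functional images) and 'labels'.

    Notes
    -----
    Describe the data structure and the sharing conditions here.

    References
    ----------
    List the paper(s) to cite when using the dataset here.
    """
    raise NotImplementedError("docstring sketch only")
```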
The first step is to define which files make up the dataset. A simple list is enough.
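For instance, the file list could be built as follows. This is only an illustration: the folder layout and file names (`my_dataset`, `func.nii.gz`, `labels.mat`) are made up.

```python
import os

n_subjects = 2  # hypothetical custom parameter

# One functional image per subject, with paths relative to the data directory.
files = [
    os.path.join("my_dataset", "sub%02d" % i, "func.nii.gz")
    for i in range(1, n_subjects + 1)
]
# Plus one file shared by all subjects.
files.append(os.path.join("my_dataset", "labels.mat"))
```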
Now, we try to load the files specified above. If they cannot be found, we try to download the dataset. All these steps can be done simply with helper functions:
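The library's own helpers are not named in this text, so the following is a self-contained sketch of the "load, else download" logic rather than the actual helper API. `load_or_download` and its `download` callback are hypothetical names. The callback is expected to place the files in `data_dir`.

```python
import os


def load_or_download(files, data_dir, download):
    """Return full paths to `files`, calling `download` if any is missing.

    Hypothetical helper: `download` is any callable taking the target
    directory and writing the dataset files into it.
    """
    paths = [os.path.join(data_dir, f) for f in files]
    if not all(os.path.exists(p) for p in paths):
        # At least one file is missing: fetch the dataset.
        os.makedirs(data_dir, exist_ok=True)
        download(data_dir)
        # Fail loudly if the download did not produce the expected files.
        missing = [p for p in paths if not os.path.exists(p)]
        if missing:
            raise IOError("Files still missing after download: %s" % missing)
    return paths
```

Because the files are checked on disk first, a second call with the same arguments returns immediately without downloading again.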
If needed, you can preprocess the dataset. As many datasets are in MATLAB or NIfTI format, converting them to a more user-friendly format (such as NumPy arrays) is encouraged.
A convenient way to return a dataset is to use the Bunch structure, which encapsulates it and provides easy access to all data fields.
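To illustrate the idea, here is a minimal stand-in for such a structure: a dictionary whose keys can also be read as attributes. It is a sketch written for this example, not the class the library uses (scikit-learn provides a comparable `sklearn.utils.Bunch`), and the field names are hypothetical.

```python
class Bunch(dict):
    """Dictionary whose keys are also accessible as attributes."""

    def __init__(self, **kwargs):
        super().__init__(kwargs)

    def __getattr__(self, key):
        # Only called when normal attribute lookup fails.
        try:
            return self[key]
        except KeyError:
            raise AttributeError(key)

    def __setattr__(self, key, value):
        self[key] = value


# Hypothetical dataset fields, for illustration only.
dataset = Bunch(func=["sub01/func.nii.gz"], labels=[0, 1, 1, 0])
```

Fields can then be read either as `dataset.func` or as `dataset["func"]`.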