What is a dataset?
Introduction
A dataset is generally understood as a “collection of related data”. Unfortunately, no more specific definition of a geographical dataset has been agreed upon, and different people, if asked what a dataset is, might easily give different answers. The perception of a GIS dataset often depends on the project objectives and on the institutional and software environment in which the dataset originated. Sometimes the term refers to a collection of different themes that belong to the same geographic area, sometimes to a collection of similar themes that belong to different geographic areas. There is no single definition of a dataset because there are many different schemes for constructing one. GIS organizations create data for specific objectives within a project and manage them according to their own technical needs. Different GIS datasets might therefore differ in content and hierarchical structure.
The structure of the elements belonging to a dataset might reveal a more or less complex hierarchy. Elements of a dataset that have similar characteristics can be grouped to delineate the organization of the dataset. The term granularity defines the level of detail at which the data collection is considered. The granularity of a dataset ranges from the whole dataset down to the individual features in the dataset. Data granularity is particularly useful for organizing the publication of metadata for a dataset. Within a dataset, different features might share many of the attributes that need to be described in the metadata. Determining dataset granularity might seem confusing, but it can lead to a faster and more efficient implementation of metadata for the whole dataset.
The CSDGM standard, like other GIS metadata standards, defines the criteria to characterize individual features, leaving full freedom to choose the granularity at which the standard is implemented within a dataset. To apply a metadata standard correctly to a dataset, it helps to understand what the individual elements have in common and how they fit together inside the dataset.
Efficient metadata management for a geographical dataset is implemented so that most of the metadata information can flow from the coarse level of granularity down to the individual elements of the dataset. Most of the data within an organization’s dataset share two required metadata attributes: contact information and distribution information. These can be recorded once for the whole dataset and inherited by all its elements. Various metadata software packages support metadata templates that hold common contact and distribution information. Inheriting information through metadata templates can simplify data entry, update and reporting within a dataset.
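The inheritance described above can be illustrated with a minimal Python sketch. It is not how any particular metadata editor works: the field names (contact, distribution, title, date), the sample values and the helper inherit are all hypothetical and are not CSDGM element names.

```python
# Dataset-level template: information recorded once for the whole dataset.
# All field names and values are hypothetical, chosen for illustration only.
DATASET_TEMPLATE = {
    "contact": "GIS Unit, Example Organization",
    "distribution": "Available on request from the GIS Unit",
}

def inherit(template, element_metadata):
    """Return a complete record for one element.

    Template values are copied first; values recorded on the element
    itself take precedence over the inherited ones.
    """
    record = dict(template)          # copy, so the template is not modified
    record.update(element_metadata)  # element-specific values win
    return record

# Element-level metadata only records what differs from the template.
elements = [
    {"title": "Roads", "date": "2004-05-01"},
    {"title": "Rivers", "date": "2004-06-15", "contact": "Hydrology Unit"},
]

records = [inherit(DATASET_TEMPLATE, element) for element in elements]

for record in records:
    print(record["title"], "|", record["contact"], "|", record["distribution"])
```

In this sketch the Rivers layer overrides the inherited contact while still inheriting the distribution information, so a change to the template only needs to be made in one place.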
Series-level Metadata
As a common practice, GIS data creators write metadata documentation separately for each individual image or coverage in the database. However, by considering data granularity it is possible to gather GIS data under an umbrella parent, where common metadata is derived for a series of related spatial layers and that metadata, or at least a relevant part of it, is inherited by the different layers of the dataset. If there are only a few differences between members of the dataset, such as name, date, or geographical coordinates, it is worthwhile to simply create series-level metadata. Series-level metadata is one document for multiple data items; the metadata pertains to the series of data rather than to specific records. Whenever possible, it is advisable to apply one metadata record to one set of related layers, which provides a complete and continuous description rather than a set of unrelated metadata records.
The global digital SRTM topographic dataset distributed from the CSI site is an example where series-level metadata can be applied. The whole dataset is subdivided into a set of 5x5 degree tiles. Each tile could be considered a single independent feature of the dataset, or the entire collection of tiles can reasonably be considered a single coherent entity. In fact, the only difference between tiles is the spatial extent of the area each one covers. Two strategies are possible for creating metadata for this dataset: create a metadata document for each tile, or create one metadata document for the whole SRTM dataset with a reference to the geographical variation of each tile. Both are valid alternatives; however, creating a single metadata document for the whole dataset is more time-efficient, easier to implement, and gives users a better overview of the whole SRTM dataset. Individual differences between the tiles can be stored in a simple table inside the metadata. The tiles of the SRTM dataset are a good candidate for series-level metadata, but there are many others, such as sequences of satellite images taken of the same area on different dates.
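As a rough sketch of the second strategy, the Python fragment below keeps one series-level record with a small table of per-tile differences and derives a tile's description from it on demand. The tile identifiers, corner coordinates, field names and the helper tile_record are invented for illustration; only the 5 degree tile size comes from the text.

```python
TILE_SIZE_DEG = 5  # tile size stated in the text (5x5 degrees)

# One series-level record: shared information plus a table of tile differences.
# Tile ids and corner coordinates are made up for this example.
series_metadata = {
    "title": "SRTM digital topographic dataset (tile series)",
    "tile_size_degrees": TILE_SIZE_DEG,
    "tiles": [
        {"tile_id": "tile_A", "west": 10, "south": 40},
        {"tile_id": "tile_B", "west": 15, "south": 40},
    ],
}

def tile_record(series, tile_id):
    """Build the metadata view for one tile from the series-level record."""
    for tile in series["tiles"]:
        if tile["tile_id"] == tile_id:
            # Everything except the tile table is shared by all tiles.
            shared = {k: v for k, v in series.items() if k != "tiles"}
            size = series["tile_size_degrees"]
            bounds = {
                "west": tile["west"],
                "south": tile["south"],
                "east": tile["west"] + size,
                "north": tile["south"] + size,
            }
            return {**shared, "tile_id": tile_id, "bounds": bounds}
    raise KeyError(tile_id)

print(tile_record(series_metadata, "tile_B"))
```

Because only the lower-left corner is stored per tile, the table stays short and the shared description is written once.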
Feature-level Metadata
This is the most common form of metadata, describing each element of a dataset as a single feature class. Feature-level metadata focuses on the description of the individual basic elements in a dataset. There are different types of feature-level data: remote sensing images, shapefiles, grids, TINs, tables, etc. Each of these GIS layers contains a single feature class. ArcInfo coverages and CAD files are slightly more complex: they are vector formats that can store several types of features, and so might contain more than one feature class. An ArcInfo coverage that contains a feature class for polygons will also contain a feature class for the lines that delineate the polygon boundaries. Although CAD files and ArcInfo coverages can contain more than one feature class, they are conventionally treated as feature-level data for metadata publishing purposes. Metadata editors such as ArcCatalog follow this convention and create one metadata record for the entire ArcInfo coverage or CAD file. Within the metadata record, information is stored to document each feature class that makes up the ArcInfo coverage or the CAD file.
Geospatial and attribute features of geographic data
Geographic data use different feature types (raster, points, lines or polygons) to uniquely identify the location and/or the geographical boundaries of spatial entities that exist on the earth’s surface. Point features locate spatial entities such as cities or rain gauge stations; lines represent linear features such as streams or roads; polygons identify the boundaries of areas where a theme has a uniform value and are used for entities such as land use or soil classification. In addition to spatial properties, geographic data store attribute data that describe the properties and characteristics of the spatial features. In a GIS, attribute data are stored in a table and linked to the spatial features. Attribute data are often referred to as non-spatial data since they do not themselves represent any of the spatial characteristics of the features. Spatial data describe where the feature is, while attribute data describe what the feature is. The accompanying figure shows an example of the linkage between spatial and attribute data. It illustrates a polygon coverage describing soil types. One specific soil polygon is selected and the linked attributes are highlighted in yellow in the attribute table. The attribute table contains several attributes that describe different soil characteristics, such as landform, surface and subsurface texture, depth of the bedrock, drainage, surface permeability, infiltration, and so on.
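The link between spatial features and their attribute table can also be sketched in a few lines of Python. This is a simplified illustration, not the storage model of any GIS package: the polygon coordinates, attribute names and values, and the helper select_feature are all made up, and the shared feature id plays the role of the link.

```python
# Spatial part: feature id -> polygon ring as (x, y) vertices (made-up values).
soil_polygons = {
    1: [(0, 0), (4, 0), (4, 3), (0, 3), (0, 0)],
    2: [(4, 0), (8, 0), (8, 3), (4, 3), (4, 0)],
}

# Attribute table: one row per feature, keyed by the same feature id.
soil_attributes = {
    1: {"landform": "plain", "surface_texture": "loam", "drainage": "good"},
    2: {"landform": "hill", "surface_texture": "clay", "drainage": "poor"},
}

def select_feature(feature_id):
    """Return the geometry (where) and the attributes (what) of one feature."""
    return {
        "id": feature_id,
        "geometry": soil_polygons[feature_id],
        "attributes": soil_attributes[feature_id],
    }

selected = select_feature(2)
print(selected["attributes"])
```

Selecting a polygon and reading its row, as in the soil example, amounts to looking up the same id in both structures.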

Tabular data are a simple and powerful complement to spatial data, allowing any GIS user to display the spatial patterns described by the attributes in the table. Tabular data need to be schematic, efficient and fast to read, so attribute definitions and attribute values are shortened and often coded. Attribute tables can be rich in information but also extremely confusing for novice users who have no experience with the methodology used to define the attribute names and values. Geographical data have often been poorly documented with respect to the meaning of the attributes and codes used in the tabular data. Tabular data without a proper description are almost useless. If metadata for the geographical data is available, users can find all the information necessary to interpret the attributes in the table, such as the coding used for attribute values and the methodology used to define classes of values (such as high, medium and low). It is very important that the characteristics of the tabular data are completely and clearly described in the metadata.
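To make the point concrete, the following Python sketch shows how attribute descriptions held in the metadata could be used to turn coded table values into readable ones. The column names (DRAIN, TEXT_S), the codes, their meanings and the helper decode_row are hypothetical; a real dataset would take them from its own metadata.

```python
# Attribute documentation as it might be recorded in the metadata.
# Column names, codes and meanings are hypothetical.
attribute_metadata = {
    "DRAIN": {
        "definition": "Drainage class of the soil unit",
        "codes": {"H": "High", "M": "Medium", "L": "Low"},
    },
    "TEXT_S": {
        "definition": "Surface texture",
        "codes": {"1": "Sand", "2": "Loam", "3": "Clay"},
    },
}

# Rows of a coded attribute table; "9" is deliberately undocumented.
rows = [
    {"DRAIN": "H", "TEXT_S": "2"},
    {"DRAIN": "L", "TEXT_S": "9"},
]

def decode_row(row, metadata):
    """Replace each coded value with its documented meaning, if there is one."""
    decoded = {}
    for name, value in row.items():
        codes = metadata.get(name, {}).get("codes", {})
        decoded[name] = codes.get(value, f"undocumented code {value!r}")
    return decoded

for row in rows:
    print(decode_row(row, attribute_metadata))
```

The second row shows what happens when a code is missing from the metadata: the value cannot be interpreted, which is the situation the text warns about.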