Procedures for data linkage in Switzerland
Linkage of administrative data: procedure of the Federal Statistical Office
When at least one of the datasets to be linked comes from the Federal Administration, the linkage is carried out by the Federal Statistical Office (FSO). Data linking for third parties is subject to a clearly defined procedure to guarantee a high level of data protection.
To submit a linkage application to the FSO, the procedure is as follows (the detailed procedure is available on the FSO website):
- The user of the data completes a linkage request form, which is available on the FSO website. The form can be accessed and downloaded by following this link.
- The completed form should be returned to this address.
- The FSO examines the application and, if necessary, contacts the user for additional information.
For the application to be validated and for the FSO to carry out a linkage on behalf of a third party, the following criteria must be met (source: FSO):
- Linkages should be used exclusively for public statistics-related or scientific purposes, not for administrative or other objectives, regardless of who may be the applicant.
- Linkages must conform to legal requirements and the data included must be in accordance with the Federal Statistics Act (FStatA).
- Security and data protection must be guaranteed, especially for sensitive data:
  - Only anonymous data is supplied, i.e. data that does not allow an individual to be identified.
  - Data should not be de-anonymized or linked with other data.
  - Once the analysis has been made the data is to be deleted or returned to the FSO.
Methodological and technical criteria:
- The data to be linked and the data resulting from the linkage are of sufficient quality, the chosen method is appropriate and the data used are suitable for the topic under study.
- The data sources to be linked contain identical (anonymised) identifiers.
The request for linkage must also meet the following criteria to be taken into consideration:
- The application relates to the applicant’s work for a recognised research institute (e.g. a university) or for a federal, cantonal or local authority organisation.
- The application is made in the framework of a project that has a statistical (not administrative) objective. In the case of scientific objectives these should be briefly summarised in the application, with the relevance and general interest of the research (including aim of the project, detailed description of the data to be used as output, details of specific utilisation and any plans for publication).
- The application concerns statistical data of the Federal Administration. (Source: FSO, https://www.bfs.admin.ch/bfs/en/home/services/data-linkages/for-third-parties.html)
Once the request has been made:
- If the FSO accepts the application, it sends the user a contract, which must be signed by both parties. Researchers or users who want to link the administration’s datasets with their own datasets must send the latter to the FSO so that the FSO can carry out the linkage. The FSO creates a linkage key, performs the linkage and then transmits the anonymised data to the user.
- If the FSO refuses the application, it informs the user of the reasons for its negative decision.
Linkage of research and private data
The linkage of research data and private data is not regulated in the way that the linkage of administrative data is. There are no standards, or even principles, governing how such data should be used for research.
If the data are made available on a data archiving platform, the user can download and use them in accordance with the specified restrictions and conditions of use. Otherwise, the user may contact the owner of the data to request access.
Methodology for data linkage
Linkage of individual data
Data linking involves pairing the records that refer to the same individual or entity in the different files to be matched. A variety of methods exist to identify which individuals or entities match across files.
Deterministic data linkage
Within datasets, individuals are often described by an identifier, or failing that, by one or more identifying variables, such as information on civil status, surnames and forenames, places of residence, etc. Data linkage is said to be deterministic if a common unique identifier or a combination of variables allows an exact comparison between individuals or entities in different datasets.
- When unique identifiers are common to the different datasets to be matched, a simple deterministic linkage can be performed: the data are linked on the basis of these shared unique identifiers.
- In the absence of a unique identifier, deterministic linkage can be carried out using indirect identifiers, which are then a combination of variables allowing the identification of individuals or similar entities in different datasets. These variables must be complete, accurate and robust.
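The two cases above can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed method: the records, the field names (`id`, `surname`, `forename`, `birth_year`) and the helper `deterministic_link` are all hypothetical, and the rule of keeping only unambiguous matches is one possible design choice.

```python
# Hypothetical records; field names and values are illustrative assumptions.
survey = [
    {"id": "A1", "surname": "MEIER", "forename": "ANNA", "birth_year": 1980, "income": 72000},
    {"id": "B2", "surname": "ROSSI", "forename": "LUCA", "birth_year": 1975, "income": 65000},
    {"id": "C3", "surname": "FAVRE", "forename": "MARC", "birth_year": 1990, "income": 58000},
]
register = [
    {"id": "A1", "surname": "MEIER", "forename": "ANNA", "birth_year": 1980, "canton": "ZH"},
    {"id": "B2", "surname": "ROSSI", "forename": "LUCA", "birth_year": 1975, "canton": "TI"},
    {"id": "D4", "surname": "FAVRE", "forename": "MARC", "birth_year": 1991, "canton": "VD"},
]


def deterministic_link(left, right, keys):
    """Pair records whose values agree exactly on every field in `keys`."""
    # Index the right-hand file by the tuple of key values.
    index = {}
    for rec in right:
        index.setdefault(tuple(rec[k] for k in keys), []).append(rec)

    pairs = []
    for rec in left:
        candidates = index.get(tuple(rec[k] for k in keys), [])
        # Keep only unambiguous matches: a key shared by several
        # right-hand records does not identify a single individual.
        if len(candidates) == 1:
            pairs.append((rec, candidates[0]))
    return pairs


# Case 1: a common unique identifier.
by_id = deterministic_link(survey, register, ["id"])

# Case 2: a combination of indirect identifiers.
by_combo = deterministic_link(survey, register, ["surname", "forename", "birth_year"])
```

In this toy data both calls pair the first two persons; the third is left unlinked because neither the identifier nor the year of birth agrees exactly, which shows how sensitive exact comparison is to a single differing value.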
To perform linkage based on a combination of variables, data harmonisation is an essential preliminary step. It ensures that the potential identifiers of the different data sources can match each other.
Some basic rules for harmonising data:
– Clean up the data by converting letters to capital letters and eliminating accents.
– Remove unnecessary or redundant words and unwanted elements from strings.
– Convert words to standardised spelling.
– Recode strings to normalise them around common values. For example, if the same entity is sometimes named with a diminutive and sometimes not, the diminutive can be replaced by the full name.
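The rules above could be applied to name strings roughly as follows. This is a sketch under stated assumptions: the function `harmonise` and the lookup tables `NOISE_WORDS` and `STANDARD_FORMS` are hypothetical and would have to be built from the actual data sources.

```python
import re
import unicodedata

# Illustrative lookup tables (assumptions, not an official list).
NOISE_WORDS = {"DR", "PROF", "MR", "MRS"}
STANDARD_FORMS = {"LIZ": "ELIZABETH", "BILL": "WILLIAM", "SEPP": "JOSEF"}


def harmonise(name):
    """Return a harmonised version of a name string."""
    # Rule 1: remove accents (decompose, then drop combining marks) and capitalise.
    decomposed = unicodedata.normalize("NFKD", name)
    no_accents = "".join(c for c in decomposed if not unicodedata.combining(c))
    upper = no_accents.upper()
    # Rule 2: replace punctuation and other unwanted characters with spaces,
    # then drop redundant words.
    cleaned = re.sub(r"[^A-Z0-9 ]+", " ", upper)
    tokens = [t for t in cleaned.split() if t not in NOISE_WORDS]
    # Rules 3 and 4: map variant spellings and diminutives to one common value.
    tokens = [STANDARD_FORMS.get(t, t) for t in tokens]
    return " ".join(tokens)


print(harmonise("Dr. Zoé  Müller"))   # ZOE MULLER
print(harmonise("Liz Favre-Dupont"))  # ELIZABETH FAVRE DUPONT
```

Applying the same function to every data source before linkage gives differently written versions of a name a chance to compare as equal.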
To go further on data harmonisation:
- QuickCharmStats, developed by GESIS to “reduce the time and effort that researchers devote to harmonising and recoding variables for statistical analysis”.
- Biobank tools of the European FP7 programme BioSHaRE (Biobank Standardisation and Harmonisation for Research Excellence in the European Union) developed in the field of medical sciences to assess the compatibility of the data collected and the link between cohorts from different European countries.
Probabilistic data linkage
In the absence of a single variable or combination of variables that would allow for deterministic linkages, another family of methods consists of assessing the probability that a pair of separate file records may correspond to the same individual or entity. Probabilistic linkages assign weights to each pair of records indicating the probability of an actual match. There are a wide variety of approaches to probabilistic data linkage.
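One common way of computing such weights is the Fellegi-Sunter approach, which the text above does not name and which is used here only as an illustration. The sketch assumes that each field either agrees or disagrees, and that the m-probabilities (agreement among true matches) and u-probabilities (agreement among non-matches) are known; the values in `FIELD_PARAMS` and the prior match rate are invented for the example.

```python
import math

# Assumed parameters per field (illustrative values, not estimates):
# m = P(field agrees | same person), u = P(field agrees | different persons).
FIELD_PARAMS = {
    "surname": (0.95, 0.01),
    "forename": (0.90, 0.05),
    "birth_year": (0.98, 0.02),
}


def match_weight(rec_a, rec_b):
    """Sum of log2 likelihood ratios over all compared fields."""
    total = 0.0
    for field, (m, u) in FIELD_PARAMS.items():
        if rec_a[field] == rec_b[field]:
            total += math.log2(m / u)              # agreement raises the weight
        else:
            total += math.log2((1 - m) / (1 - u))  # disagreement lowers it
    return total


def match_probability(weight, prior=0.001):
    """Convert a weight into a match probability, given an assumed prior match rate."""
    odds = (prior / (1 - prior)) * 2 ** weight
    return odds / (1 + odds)


a = {"surname": "MEIER", "forename": "ANNA", "birth_year": 1980}
b = {"surname": "MEIER", "forename": "ANNE", "birth_year": 1980}
c = {"surname": "ROSSI", "forename": "LUCA", "birth_year": 1975}

for other in (a, b, c):
    w = match_weight(a, other)
    print(other["forename"], round(w, 2), round(match_probability(w), 4))
```

Pairs are then typically classified as links, non-links or cases for clerical review by comparing the weight with two thresholds. The packages listed below estimate the parameters from the data and usually allow partial agreement on strings rather than the strict equality used here.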
Several tools can assist with probabilistic linkage.
The GitHub page lists a range of record-matching software:
- AtyImo https://github.com/pierrepita/atyimo
- Dedupe https://github.com/dedupeio/dedupe
- fastLink https://cran.r-project.org/web/packages/fastLink/index.html
- FEBRL https://sourceforge.net/projects/febrl/
- FRIL http://fril.sourceforge.net/
- FuzzyMatcher https://pypi.python.org/pypi/fuzzymatcher
- JedAI http://jedai.scify.org/
- PIRL https://github.com/LSHTM-ALPHAnetwork/PIRL_RecordLinkageSoftware
- Python Record Linkage Toolkit (recordlinkage) https://github.com/J535D165/recordlinkage
- RELAIS https://www.istat.it/en/methods-and-tools/methods-and-it-tools/process/processing-tools/relais
- ReMaDDer http://remadder.findmysoft.com/
- Splink https://github.com/moj-analytical-services/splink
- The Link King http://www.the-link-king.com/
The article “Karr, A. F., Taylor, M. T., West, S. L., Setoguchi, S., Kou, T. D., Gerhard, T., & Horton, D. B. (2019). Comparing record linkage software programs and algorithms using real-world data. PloS one, 14(9), e0221459.” https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0221459 also offers a comparison of the following four linkage software packages:
- R (Version 3.4.0, RecordLinkage package) https://cran.r-project.org/web/packages/RecordLinkage/index.html
- Merge ToolBox (MTB, Version 0.75) https://www.uni-due.de/~hq0215/documents/mtb_gettingstarted.pdf
- Curtin University Probabilistic Linkage Engine (CUPLE, shortened in figures and tables to CU) https://healthsciences.curtin.edu.au/health-sciences-research/research-institutes-centres/centre-for-data-linkage/
- Link Plus (LP, Version 2.0) https://www.cdc.gov/cancer/npcr/tools/registryplus/lp.htm
To perform data linkage with R, you can also consult the tutorials here (https://cran.r-project.org/web/packages/RecordLinkage/index.html) and here (https://cran.r-project.org/web/packages/RecordLinkage/RecordLinkage.pdf).
Statistics Canada also offers the tool G-Link (G-Coup in French): https://www150.statcan.gc.ca/n1/fr/catalogue/10H0036
Contextual data linkage
What are contextual data?
Contextual data refer to the environment in which an individual or entity evolves. The environment is a macro level that results from the aggregation of the practices of individuals at the micro level, and from the interactions between individuals. The environment described by contextual data can be a territory at different scales (neighbourhood, city, district, canton, country, etc.) or an institutional structure (organisation such as a company, school, etc.) with specific rules and processes that govern the way individuals act (Johnson et al. 2002). The environment can also be a network formed by interactions between individuals or entities (Pumain, 2003).
What is the purpose of contextual data?
Within this environment, implicit practices and behaviours induce co-evolution or behavioural interdependence among the individuals or entities that belong to it (Huckfeldt and Sprague, 1993; Johnson et al. 2002). The macro level of territories and organisations can thus sometimes be more relevant than the individual level for understanding socio-behavioural phenomena or phases of history (Sprague, 1982; Pumain, 2007); in the medical sciences, it provides a better understanding of the links between the environment and the development of diseases. Contextual data and the resulting multi-level analysis make it possible to objectify the dependence between an individual’s behaviour and contextual processes, and to take into account the impact of location, the delimitation of contexts or environments, and the effects of time on the results of the analysis (Grossetti, 2011).
Grossetti, M. (2011). L’espace à trois dimensions des phénomènes sociaux. Échelles d’action et d’analyse. SociologieS. https://journals.openedition.org/sociologies/3466
Huckfeldt, R., Plutzer, E., & Sprague, J. (1993). Alternative contexts of political behavior: Churches, neighborhoods, and individuals. The Journal of Politics, 55(2), 365-381. https://www.journals.uchicago.edu/doi/abs/10.2307/2132270
Johnson, M., Shively, W. P., & Stein, R. M. (2002). Contextual data and the study of elections and voting behavior: connecting individuals to environments. Electoral Studies, 21(2), 219-233. https://www.sciencedirect.com/science/article/abs/pii/S0261379401000191
Pumain, D. (2003). Du local au global, une géographie sans échelles ? Cybergeo: European Journal of Geography. Éditoriaux, mis en ligne le 12 septembre 2003 http://journals.openedition.org/cybergeo/594
Linkage with contextual data then consists of linking information about an individual (micro level) with information about his or her environment (macro level).
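As a minimal sketch of this micro-to-macro linkage, the example below attaches canton-level indicators to individual records. The records, the indicator names and the helper `attach_context` are hypothetical, and the figures are made up.

```python
# Hypothetical micro-level records and macro-level context; values are invented.
individuals = [
    {"person": 1, "canton": "VD", "voted": True},
    {"person": 2, "canton": "ZH", "voted": False},
    {"person": 3, "canton": "VD", "voted": True},
    {"person": 4, "canton": "XX", "voted": False},  # context code not in the macro file
]
canton_context = {
    "VD": {"unemployment_rate": 4.1, "urban_share": 0.75},
    "ZH": {"unemployment_rate": 2.3, "urban_share": 0.85},
}


def attach_context(records, context, key):
    """Add the macro-level variables to each micro-level record.

    Many individuals may share one context (a many-to-one linkage).
    Records whose context code is unknown are returned separately so
    that they can be inspected rather than silently dropped.
    """
    linked, unmatched = [], []
    for rec in records:
        ctx = context.get(rec[key])
        if ctx is None:
            unmatched.append(rec)
        else:
            linked.append({**rec, **ctx})
    return linked, unmatched


linked, unmatched = attach_context(individuals, canton_context, "canton")
```

Each linked record now carries both individual and contextual variables, which is the data structure that multi-level analysis requires.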
Linkage of contextual data in the form of classes/typology
Socio-professional typologies and nomenclatures of socio-economic activities are sets of jobs or economic activities grouped into categories, the construction of which is based on the degree of proximity between the jobs or economic activities they describe. These categories are often divided into sub-categories, which describe sets of jobs or activities that are closer to each other than to other jobs or activities in their category. These categories are described by multi-digit codes. The length of the code (number of digits) varies according to the level of the category. The main categories are described by a number, usually a 1-digit code, and the n-sub-categories are described by an n-digit code.
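Because each sub-category code extends the code of its parent, moving between levels amounts to truncating the code. The sketch below assumes purely prefix-based codes stored as strings; the codes and helper names are hypothetical, and some real nomenclatures use letters or other conventions at the top level.

```python
def ancestors(code):
    """Return the code at every level, from the main category down to `code`."""
    return [code[:n] for n in range(1, len(code) + 1)]


def aggregate_to_level(codes, level):
    """Count records per category at the requested level (number of digits)."""
    counts = {}
    for code in codes:
        if len(code) < level:
            continue  # recorded at a coarser level than requested
        parent = code[:level]
        counts[parent] = counts.get(parent, 0) + 1
    return counts


# Codes are kept as strings so that leading zeros are preserved.
observed = ["2411", "2412", "2310", "5120", "5"]

print(ancestors("2411"))                # ['2', '24', '241', '2411']
print(aggregate_to_level(observed, 2))  # {'24': 2, '23': 1, '51': 1}
print(aggregate_to_level(observed, 1))  # {'2': 3, '5': 2}
```

Truncation makes it possible to link two datasets coded at different levels of detail by bringing both to the coarser level first.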
The linkage with contextual data can be a simple deterministic linkage based on the code of the categories of socio-professional and economic activities. If such a code is not included in the datasets, they can be linked based on a series of indicators relating, for example, to the names and location of professional activities and enterprises. As with individual data, given the variability in the writing of names of enterprises, activities and occupations, the data should be harmonised before linkage (see the basic rules for harmonising data under “Deterministic data linkage”).
Linkage of spatialised data – using GIS
For territorial data, the process can be similar if the data can be associated with a territorial code (e.g. the code of the municipality or canton), or with identifying variables such as the names of the municipalities and of the higher territorial units they belong to. What is specific to territorial data is that linkage can also be carried out using a geographic information system (GIS). When the datasets share no common code, this avoids the errors that a linkage based on a set of name variables can introduce. Linking with a GIS is achieved by geolocating individuals and spatialised contextual information, projecting them into a GIS and performing a spatial join.
1/ Geolocation of information:
Geolocate the information, whether addresses or names of municipalities, districts, etc. Geocoding consists of assigning a latitude and longitude so that a locality can be projected on a map or in a GIS. In the case of an address, the geolocation is precise. In the case of a commune or district name, the latitude and longitude correspond to the centroid of the territorial unit.
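To make the notion of a centroid concrete, the sketch below computes the area-weighted centroid of a simple polygon with the shoelace formula. It is an illustration only: the function name and the coordinates are hypothetical, and the coordinates are treated as planar, which is a reasonable approximation for small units or projected coordinates but not for large areas in latitude and longitude.

```python
def polygon_centroid(vertices):
    """Area-weighted centroid of a simple polygon given as (x, y) vertices."""
    area2 = 0.0  # twice the signed area of the polygon
    cx = cy = 0.0
    n = len(vertices)
    for i in range(n):
        x0, y0 = vertices[i]
        x1, y1 = vertices[(i + 1) % n]  # wrap around to close the ring
        cross = x0 * y1 - x1 * y0
        area2 += cross
        cx += (x0 + x1) * cross
        cy += (y0 + y1) * cross
    if area2 == 0:
        raise ValueError("degenerate polygon: zero area")
    # Centroid = sum / (6 * area) = sum / (3 * area2)
    return cx / (3 * area2), cy / (3 * area2)


# Hypothetical boundary of a territorial unit, as (x, y) pairs.
unit = [(0.0, 0.0), (2.0, 0.0), (2.0, 2.0), (0.0, 2.0)]
print(polygon_centroid(unit))  # (1.0, 1.0)
```

GIS software computes this point directly from the boundary layer, so in practice the calculation rarely has to be written by hand.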
Free tools that can be used directly online to geolocate data include:
- Batchgeo (https://fr.batchgeo.com/features/geocode-addresses/)
- French open public data platform (https://adresse.data.gouv.fr/csv)
Geocoding can also be carried out with GIS software:
- QGIS, with the Geocode and MMQGIS plugins. Tutorials for address geocoding with QGIS are available (in French) on the idgeo blog and on the blog “SIG and Territory”: https://www.sigterritoires.fr/index.php/geocodage-dadresses-avec-qgis-2-8/
- ArcGIS Desktop, with the ArcMap address geocoding dialog box: https://desktop.arcgis.com/fr/arcmap/latest/manage-data/geocoding/geocoding-a-table-of-addresses-about.htm
- ArcGIS Pro, with the Geocode Addresses tool in the Geoprocessing window: https://pro.arcgis.com/en/pro-app/latest/help/data/geocoding/tutorial-geocode-a-table-of-addresses.htm
2/ General approach to making a spatial join:
- Download the files of the territorial delimitation layers (countries, cantons, districts, communes, etc.) in a GIS-readable format such as Shapefile. Once the data have been geocoded and the layers downloaded, a spatial join in the GIS links the geocoded data to the territorial delimitation layers.
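What a spatial join does can be sketched in plain Python: each geocoded point receives the code of the territorial unit whose polygon contains it. The example uses a ray-casting point-in-polygon test; the area codes, coordinates and function names are hypothetical, and a real GIS additionally handles projections, holes, multi-part polygons and points lying exactly on a boundary.

```python
def point_in_polygon(point, polygon):
    """Ray-casting test: is `point` inside the polygon given as (x, y) vertices?"""
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x0, y0 = polygon[i]
        x1, y1 = polygon[(i + 1) % n]
        # Does this edge straddle the horizontal line through the point?
        if (y0 > y) != (y1 > y):
            # x-coordinate at which the edge crosses that line
            x_cross = x0 + (y - y0) * (x1 - x0) / (y1 - y0)
            if x < x_cross:
                inside = not inside  # each crossing to the right flips the state
    return inside


def spatial_join(points, areas):
    """Give each point the code of the first area that contains it, or None."""
    joined = {}
    for name, point in points.items():
        joined[name] = next(
            (code for code, poly in areas.items() if point_in_polygon(point, poly)),
            None,
        )
    return joined


# Hypothetical territorial units (two adjacent rectangles) and geocoded points.
areas = {
    "A": [(0, 0), (4, 0), (4, 4), (0, 4)],
    "B": [(4, 0), (8, 0), (8, 4), (4, 4)],
}
points = {"p1": (1, 1), "p2": (6, 2), "p3": (9, 9)}

print(spatial_join(points, areas))  # {'p1': 'A', 'p2': 'B', 'p3': None}
```

Once each point carries a territorial code, the contextual variables of that unit can be attached with an ordinary deterministic linkage on the code.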
3/ Tutorials for creating a spatial join in QGIS, ArcGIS Desktop and ArcGIS Pro:
Find contextual data on Switzerland
- Territorial typologies
|Name||Information provided||Digits and classes||Link||Institution|
|NUTS (Nomenclature of Territorial Units for Statistics)||NUTS 1: major socio-economic regions; NUTS 2: basic regions for the application of regional policies; NUTS 3: small regions for specific diagnoses||NUTS 1: 3 characters (2 letters, 1 number); NUTS 2: 4 characters (2 letters, 2 numbers); NUTS 3: 5 characters (2 letters, 2 numbers, 1 letter)||https://ec.europa.eu/|
|Agglomerations and centres outside agglomerations (2012)||Agglomerations and number of communes contained in agglomerations.||Agglomeration code: 3 to 4 digits; codes of centres outside agglomerations: 5 digits|
|Institutional levels: Municipalities/Communes||For municipalities in agglomerations: agglomeration code. For oriented communes: code of the first and second agglomeration centre.||SFSO code of commune: 2 to 4 digits||https://www.bfs.admin.ch/bfs/fr/home/statistiques/themes-transversaux/analyses-spatiales/niveaux-geographiques/typologies-territoriales.assetdetail.188853.html||SFSO|
- Socio-professional typologies
|Name||Information provided||Digits and classes||Link||Institution|
|Socio-professional categories 2010||17 categories||Codes of 1 to 3 digits depending on the level of detail||https://www.bfs.admin.ch/bfs/fr/home/statistiques/travail-remuneration/nomenclatures/spk2010.assetdetail.3962878.html||SFSO|
|International Standard Classification of Occupations – ISCO 88 (COM)||9 categories||Codes of 1 to 4 digits depending on the level of detail||https://www.bfs.admin.ch/bfs/fr/home/statistiques/travail-remuneration/nomenclatures/isco88com.html||SFSO|
|Swiss Nomenclature of Occupations 2000||9 categories||Codes of 1 to 5 digits depending on the level of detail||https://www.bfs.admin.ch/bfs/fr/home/statistiques/travail-remuneration/nomenclatures/sbn2000.html||SFSO|
|Swiss Nomenclature of Occupations CH-ISCO-19||9 categories||Codes of 1 to 5 digits depending on the level of detail||https://www.bfs.admin.ch/bfs/fr/home/statistiques/travail-remuneration/nomenclatures/ch-isco-19.html||SFSO|
- Typologies of economic activities
|Name||Information provided||Digits and classes||Link||Institution|
|General Classification of Economic Activities / Nomenclature générale des activités économiques (NOGA)||Socio-professional categories 2010. Integrated into the Register of Companies and Establishments||1 to 5 digits; 794 economic activities||https://www.bfs.admin.ch/bfs/fr/home/statistiques/travail-remuneration/nomenclatures/spk2010.assetdetail.3962878.html||SFSO|
|International Standard Classification of Occupations (ISCO-08)||The ISCO-08 structure is the result of the aggregation of the ISCO 88 unit groups.||436 unit groups; 130 minor groups; 43 sub-major groups||https://www.bfs.admin.ch/bfs/fr/home/statistiques/travail-remuneration/nomenclatures/isco88com.html||SFSO|
|Catalogues and databases||Population, economy, land use, environment. Country, communes, cantons, districts. Data in the form of interactive tables, selectable by theme, survey, geographical level and keyword||https://www.bfs.admin.ch/bfs/fr/home/statistiques/catalogues-banques-donnees/donnees.html||SFSO|
|GEOSTAT geodata||Communal boundaries, population statistics aggregated by hectare, structural business statistics, land use statistics, soil data||https://www.bfs.admin.ch/bfs/fr/home/services/geostat/geodonnees-statistique-federale.html||SFSO|
|Basic data on administrative units||Coordinate reference frames: series of coordinates (x, y, z), horizontal and vertical geodetic points, AGNES, geoid in CH1903, geoid in ETRS89. Administrative units: Alpine Convention, national border, housing inventory. Addresses: official directory of localities with the associated postcode and perimeter. Cadastral parcels||https://www.geo.admin.ch/fr/geoinformation-suisse/repertoire-inspire/donnees-de-base.html||SFSO|
|geocat.ch, Swiss geometadata catalogue||Catalogue of metadata for Swiss geodata||https://www.swisstopo.admin.ch/fr/cartes-donnees-en-ligne/catalogue-metadonnees-geocat.html||swisstopo|
|[INSPIRE] directory||All available digital geodata in one central place, subdivided by theme.||https://www.geo.admin.ch/fr/geoinformation-suisse/repertoire-inspire.html||swisstopo|
- Socio-economic and geographical data at the level of the Cantons