Process and Analyze

@Ríma

  • Using Spending Monitor data as an indicator for household spending

    Louise Julie Bille, Statistics Denmark, ljb@dst.dk

    Jonas Dan Petersen, Statistics Denmark, jop@dst.dk

    Households continually change their spending patterns, and times of crisis are no exception. As a producer of quarterly national accounts statistics, it is important to adapt to these changes in order to continuously enhance the quality of the resulting estimates and statistics. In Denmark, COVID-19 significantly affected one of the key indicators used to compile quarterly household expenditure. Therefore, an alternative indicator, supplementing the current indicators, was examined for compiling the contribution of household spending to gross domestic product. Spending Monitor data provided by a bank (of significant size, measured by market share) is both more frequent than the existing sources and a good indicator of the final consumption expenditure of households. The bank produces aggregated data per day in pre-defined consumption groups and makes the data available on a weekly basis. However, the work with the alternative indicator is at an early stage, and the pre-defined consumption groups need to be made compliant with the Classification of Individual Consumption According to Purpose before they can be used in national accounts calculations. Still, Spending Monitor data has been found to be a good indicator for a range of aligned consumption groups in the national accounts department. This especially applies to consumption groups such as clothing, footwear, restaurant and hotel spending, and other services paid for by households. In addition, these consumption groups are, in Denmark, often subject to revisions from flash to revised estimates in the quarterly national accounts and onwards to the annual estimates. However, it should be considered what bank card data is suited to measure, as not all household final consumption is captured by card transactions; this holds for e.g. purchases of vehicles, electricity, water etc. Overall, continued work on implementing Spending Monitor data as an indicator is considered an advantage for enhancing the quality of estimates of household spending.

    Keywords: Household expenditure, alternative indicator, bank card data

    Session 2.1. Machine learning

    When: Tuesday, August 23 at 13:00 - 13:55

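    As a rough illustration of the alignment step described above, the minimal R sketch below maps a bank's pre-defined consumption groups to COICOP divisions and aggregates daily card spending into a quarterly indicator. The group names, the mapping and the figures are invented assumptions, not the bank's or Statistics Denmark's actual definitions.

      # Minimal R sketch: align daily Spending Monitor aggregates with COICOP
      # and sum them to a quarterly indicator (all values are illustrative)
      library(dplyr)

      spending <- data.frame(
        date   = as.Date(c("2021-01-04", "2021-01-04", "2021-04-05")),
        group  = c("Clothing & footwear", "Restaurants", "Clothing & footwear"),
        amount = c(1.8, 2.4, 2.1)          # DKK millions, aggregated by the bank
      )
      # Hypothetical correspondence between the bank's groups and COICOP divisions
      mapping <- data.frame(
        group  = c("Clothing & footwear", "Restaurants"),
        coicop = c("03", "11")
      )

      indicator <- spending %>%
        inner_join(mapping, by = "group") %>%
        mutate(quarter = paste0(format(date, "%Y"), quarters(date))) %>%
        group_by(quarter, coicop) %>%
        summarise(amount = sum(amount), .groups = "drop")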

  • Machine learning methods for estimating the Census population

    Margherita Zuppardo, Statistics Iceland, margherita.zuppardo@hagstofa.is

    Violeta Calian, Statistics Iceland, violeta.calian@hagstofa.is

    Ómar Harðarson, Statistics Iceland, omar.hardarson@hagstofa.is

    The dramatic development of machine learning and A.I. in the last decade opens up many new possibilities of improvement for the 2021 register-based Icelandic Census. We focus here on one of the main purposes of the census, which is to accurately describe the population of residents of Iceland. However, identifying the individuals belonging to this population is not trivial since many people do not notify national registers about their change of residence when moving abroad. Ignoring this phenomenon may create biases in statistical estimates of demographic or social characteristics e.g. age distributions, fertility or mortality rates, migration flows, employment and education profiles.

    This is a binary classification problem where the status of any individual may be either in or out of the country. In this paper, we propose a systematic solution based on rigorous statistical methods, implemented as machine learning algorithms using open-source R packages. To our knowledge, such techniques have not previously been applied to demographic problems of this type. The data set used for training and testing the algorithms was built by combining information on the presence/absence of individuals from surveys with register data regarding, for instance, employment status, income and taxes, education level, changes in civil and residency status, family composition and previous migration events ("signs of life").

    We trained several classification models such as random forests, classification trees, neural networks as well as their stacked versions. We assessed their performance according to measures which include sensitivity, specificity, confusion matrices, accuracy and information rates and their confidence intervals. We discuss the results obtained by applying these methods to the Census data.

    Keywords: Census 2021, machine learning, signs of life, open source code

    Session 2.1. Machine learning

    When: Tuesday, August 23 at 13:00 - 13:55

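    To make the classification setup above concrete, here is a minimal R sketch of training a random forest on synthetic "signs of life" features and evaluating it with a confusion matrix, sensitivity and specificity. The features, the label and the simulated relationship are assumptions for illustration only; they are not the authors' data or final model.

      # Minimal R sketch: binary in/out-of-country classifier on simulated data
      library(randomForest)
      library(caret)          # confusionMatrix() with sensitivity/specificity

      set.seed(1)
      n <- 2000
      signs_of_life <- data.frame(
        employed     = rbinom(n, 1, 0.6),       # register-based "signs of life"
        income       = rlnorm(n, 10, 1),
        educ_years   = rpois(n, 12),
        moved_before = rbinom(n, 1, 0.1)
      )
      p_in <- plogis(-1 + 2 * signs_of_life$employed + 0.1 * signs_of_life$educ_years)
      signs_of_life$present <- factor(ifelse(rbinom(n, 1, p_in) == 1, "in", "out"))

      idx  <- sample(n, 0.7 * n)
      fit  <- randomForest(present ~ ., data = signs_of_life[idx, ], ntree = 500)
      pred <- predict(fit, newdata = signs_of_life[-idx, ])
      confusionMatrix(pred, signs_of_life$present[-idx], positive = "in")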

  • Using non-survey big data to improve the quality of the household budget survey

    Marius Runningen Larsson, Statistics Norway, riu@ssb.no

    Li Chun Zhang, Statistics Norway

    The household budget survey (HBS) is resource-heavy, both in terms of the resources used by national statistical offices (NSOs) and in terms of the response burden placed on households. The long duration and the diary component make it prone to non-response errors and incorrect records. To improve the quality of the HBS, we present an alternative method for collecting and processing household grocery expenditure. The method takes advantage of non-survey big data consisting of electronic grocery receipts and debit card transactions. The receipts are combined with their corresponding transactions using multiple key variables. This makes it possible to allocate receipts to households via de-identified administrative records. Using data containing more than half a billion receipts, we were able to allocate approximately 70 per cent of the receipts to households. The data cover 96 per cent of the Norwegian grocery market for 2018. The integrated data can be used to improve the quality of the HBS, either by replacing the food and non-alcoholic beverages diary component to reduce the response burden or as auxiliary information to improve the survey-based expenditure estimates. The method is transferable to countries where grocery transactions are mainly carried out with payment cards, and high market concentration in a country’s grocery market increases its feasibility.

    Keywords: Non-survey big data, Household budget survey, transactional data, record linkage, model-based estimation

    Session 2.1. Machine learning

    When: Tuesday, August 23 at 13:00 - 13:55

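    The linkage idea above can be illustrated with a minimal R sketch that joins receipts to card transactions on shared key variables (store and amount) and a narrow time window. The key variables, the time window and all values are assumptions made for illustration; the actual method uses multiple keys and de-identified administrative records to reach households.

      # Minimal R sketch: match electronic receipts to card transactions
      library(dplyr)

      receipts <- data.frame(
        receipt_id = 1:3,
        store_id   = c("S1", "S1", "S2"),
        amount     = c(199.50, 43.20, 87.00),
        time       = as.POSIXct(c("2018-03-01 10:01:30", "2018-03-01 10:05:10",
                                  "2018-03-01 11:00:00"))
      )
      transactions <- data.frame(
        card_id  = c("C9", "C7", "C3"),
        store_id = c("S1", "S1", "S2"),
        amount   = c(199.50, 43.20, 87.00),
        time     = as.POSIXct(c("2018-03-01 10:01:35", "2018-03-01 10:05:12",
                                "2018-03-01 11:00:02"))
      )

      linked <- receipts %>%
        inner_join(transactions, by = c("store_id", "amount"),
                   suffix = c("_rec", "_trx")) %>%
        filter(abs(as.numeric(difftime(time_rec, time_trx, units = "secs"))) < 120)
      # card_id can then be linked to a household via de-identified registers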

  • Predicting a food product’s missing nutritional values using machine learning and matching algorithms from natural language processing

    Annabelle Redelmeier, Norwegian Computing Center, anr@nr.no

    Anders Løland, Norwegian Computing Center, anderslo@nr.no

    Norway’s most sold food products can paint a picture of the nutrition trends of society as a whole. However, analysing nutrition trends is only possible if we have complete nutritional information on the food products purchased each month. Although manual data labelling based on the product names is possible to some extent, the data set in this study consists of 58 500 food products in which 433 000 (67%) of the nutritional values are missing, making this a very time-consuming task.

    One way to automatically label missing values is to use imputation methods. However, these tools can be overly simplistic or assume that other features are known. In this study, we do not assume products have any known features and instead rely on the product names. We make the assumption that products with similar product names will have similar nutritional values. For example, knowing the average amount of energy resulting from drinking 'Tine whole milk 3,5% 1L' can give a good energy estimate for a different whole milk product.

    We train independent random forest models to predict each nutritional value, where the features are entirely based on the attributes of similar products. We find these similar products by deriving the Jaccard similarity index on pairs of product names. Then, for pairs with a Jaccard similarity over a threshold, we use K-nearest neighbours on the known nutritional values to find the best match. We train and validate each model on a subset of data where the given nutritional value is known. Finally, we compare the results of the machine learning model with results from a simple substitution imputation model.

    Keywords: imputation, machine learning, Jaccard similarity, k-nearest-neighbours, data quality

    Session 2.2. Refining databases

    When: Tuesday, August 23 at 14:00 - 14:55

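    A minimal R sketch of the matching step is given below: product names are tokenised, Jaccard similarity is computed between name pairs, and the missing value is filled from the most similar known product. The product names, the threshold and the use of a single nearest neighbour are simplifying assumptions; the study itself feeds attributes of similar products into random forest models.

      # Minimal R sketch: Jaccard similarity on product names for imputation
      jaccard <- function(a, b) length(intersect(a, b)) / length(union(a, b))

      products <- data.frame(
        name   = c("tine whole milk 3,5% 1l", "q whole milk 1l", "whole milk 1,5% 1l"),
        energy = c(264, 260, NA),          # kJ per 100 g; NA = missing value
        stringsAsFactors = FALSE
      )
      tokens <- strsplit(tolower(products$name), "\\s+")

      target <- which(is.na(products$energy))                 # product to impute
      sims   <- sapply(tokens, jaccard, b = tokens[[target]])
      known  <- which(!is.na(products$energy) & sims > 0.2)   # assumed threshold

      # Impute from the most similar known product (k = 1 nearest neighbour)
      best <- known[which.max(sims[known])]
      products$energy[target] <- products$energy[best]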

  • Using machine learning to classify theft offences to International Classification of Crimes

    Kimmo Haapakangas, Statistics Finland, kimmo.haapakangas@stat.fi

    The use of the International Classification of Crime for Statistical Purposes (ICCS) is gradually increasing. However, mapping an existing national classification to correspond to ICCS classes has its own challenges. National classifications are often based on national criminal law and its definitions, whereas the ICCS is based on internationally agreed definitions. National laws might not always correspond to the ICCS definitions.

    Producing statistics according to ICCS would often require someone to read the definition of each reported offence or given sentence. Use of Natural Language Processing (NLP) and Machine Learning (ML) techniques might provide a solution to this problem.

    Statistics Finland has gained access to prosecutors’ text descriptions of offences for the years 2019 and 2020. There are around 24,000 descriptions for theft, aggravated theft, and petty theft in the data. 2,000 of these are manually read and classified into ICCS 0501 and 0502 subcategories. Based on these observations, a random forest model is trained to classify the texts. The overall accuracy of the model is 78.9 per cent. The model predicts theft from a shop (ICCS 050231) and theft of personal property from a person (ICCS 050221) very well, with sensitivity and specificity of over 93 per cent.

    Due to heavy class imbalance, some infrequent classes, like theft of public property, have only 15 per cent sensitivity. However, specificity is over 95 per cent.

    Once trained, machine learning models are useful tools for mapping ICCS, and these models can be reused easily. For example, the distribution of theft offences in 2020 in ICCS is similar to the 2019 distribution.

    As a drawback, training an NLP model can take some time and running it requires computational power, but in this case the investment is well worth it.

    Keywords: International Classification of Crimes, machine learning, text classification

    Session 2.2. Refining databases

    When: Tuesday, August 23 at 14:00 - 14:55

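    The core of the set-up above can be sketched in a few lines of R: a bag-of-words matrix built from offence descriptions and a random forest predicting ICCS subcategories. The example texts, labels and model settings are invented for illustration and are far smaller and simpler than the actual training data.

      # Minimal R sketch: bag-of-words text classification with a random forest
      library(randomForest)

      texts  <- c("stole goods from a shop",  "took wallet from a person",
                  "stole goods from a store", "took phone from a person")
      labels <- factor(c("050231", "050221", "050231", "050221"))

      tokens <- strsplit(tolower(texts), "\\s+")
      vocab  <- sort(unique(unlist(tokens)))
      dtm    <- t(sapply(tokens, function(tk) as.integer(vocab %in% tk)))
      colnames(dtm) <- vocab                     # 0/1 document-term matrix

      fit <- randomForest(x = dtm, y = labels, ntree = 200)
      predict(fit, dtm)   # in practice: evaluate on held-out, manually coded texts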

  • From Manual to Machine: challenges in machine learning for COICOP coding

    Susie Jentoft, Statistics Norway, susie.jentoft@ssb.no

    Boriska Toth, Statistics Norway, boriska.toth@ssb.no

    Daniel Muller, Norwegian University of Life Sciences, daniel.milliam.muller@nmbu.no

    The classification of data is a time-consuming task performed by most statistical bureaus. Manually converting text to classifications can provide good quality data but requires both expert coders with good knowledge of the standards and many resource hours. These problems are amplified as larger data sources are incorporated into official statistics. Advances in machine learning algorithms, and their increased accessibility, are opening up new opportunities for classification workflows.

    Here, we provide a case study for using machine learning algorithms to classify COICOP (classification of individual consumption according to purpose) in the Norwegian Household Budget Survey (HBS). The 2022 survey represents a new paradigm for the Norwegian HBS in combining a sample survey with novel big data sources and underscores the need to automate the classification process in modern surveys. Both survey and big data contain text fields with goods names that need to be coded to COICOP groups under the new UN COICOP classification.

    A major hurdle within supervised machine learning is access to good quality training data. We devised several approaches to address this. Data from heterogeneous auxiliary sources at Statistics Norway were used to generate a large dataset of item names with COICOP labels; however, methods had to be developed to match the same item written in different ways in different sources. These ranged from superficial rule-based methods to deeper NLP methods for semantic matching. Further, we devised a human-in-the-loop workflow that used the learning algorithms’ prediction probabilities to make several significant improvements.

    Different algorithms were tested including random forest, logistic regression and support vector machines (SVM). Overall, random forests performed the best for predicting COICOP classification. Future work includes quality assurance and further balancing of the training data, implementing these algorithms in production and determining the training frequency and workflow.

    Keywords: Supervised machine learning, classification, household budget survey, language processing, big data in official statistics

    Session 2.2. Refining databases

    When: Tuesday, August 23 at 14:00 - 14:55

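    One element that lends itself to a short example is the human-in-the-loop step: using the classifier's predicted class probabilities to decide which items can be auto-coded and which should go to an expert coder. The probability matrix, the COICOP codes and the 0.80 threshold below are assumptions for illustration.

      # Minimal R sketch: route low-confidence predictions to manual coding
      prob <- matrix(c(0.95, 0.03, 0.02,        # simulated class probabilities
                       0.40, 0.35, 0.25),       # from any trained classifier
                     nrow = 2, byrow = TRUE,
                     dimnames = list(c("item_1", "item_2"),
                                     c("01.1.1", "01.1.2", "02.1.0")))

      threshold <- 0.80                          # assumed confidence cut-off
      top_prob  <- apply(prob, 1, max)
      top_class <- colnames(prob)[apply(prob, 1, which.max)]

      auto_coded    <- data.frame(item   = rownames(prob)[top_prob >= threshold],
                                  coicop = top_class[top_prob >= threshold])
      manual_review <- rownames(prob)[top_prob < threshold]   # to expert coders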

  • A new framework for identifying the drivers of change in the labor market

    Trond Christian Vigtel, Statistics Norway, tcv@ssb.no

    Stine Bakke, Statistics Norway, eba@ssb.no

    Øyvind Bruer-Skarsbø, Statistics Norway, obr@ssb.no

    Magnus Berglund Johnsen, Statistics Norway, mjo@ssb.no

    Knut Håkon Grini, Statistics Norway, gri@ssb.no

    Thomas von Brasch, Statistics Norway, tly@ssb.no

    As statisticians we regularly publish information that shows how aggregates change over time, often broken down by different subgroups. Whilst this gives us information about the different subgroups, it is seldom clear to what extent factors like changing demographics or composition are driving these changes in the aggregate.

    After the outbreak of COVID-19 in 2020, separating these drivers of change became even more important for understanding the large changes that took place in the labor market. For example, mean earnings change from one period to the next, and we can think of this change as twofold: a change in the composition of subgroups (demographics) that impacts the aggregate change in earnings (a compositional effect), and a "pure" change, not caused by changing group sizes, that reflects the actual change in earnings (a price effect). This type of information can be of great value for end users, especially in light of the expected demographic changes in the coming years.

    Statistics Norway has, in a collaboration between our research department and the labor force and earnings statistics units, developed an exact additive method for calculating these two effects for a change in any weighted arithmetic mean. This calculation proved vital for understanding how changes in the composition of the labor market during the COVID-19 pandemic, driven by the restrictions imposed by the government, affected the measurement of earnings. We argue that this method is easy to use and has a clear interpretation, which makes it very suitable for official statistics.

    In this paper we present how this method works and how Statistics Norway has applied it in our publications on earnings and sickness absence. The method can be generalized to other areas and could prove useful for other countries as well. We will also present the method as an R package, so that anyone can apply it either to their own data or to published statistics.

    Keywords: Labor Market Statistics, Earnings Statistics, Covid-19, Research

    Session 2.3. Labour market issues: Gender pay gap, hours worked and different drivers

    When: Tuesday, August 23 at 15:30 - 16:25

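    One exact additive decomposition of this kind (a Bennet-type split, which may differ in detail from the authors' formula) can be written in a few lines of R. The group shares and means below are invented; the point is that the price and composition effects sum exactly to the total change.

      # Minimal R sketch: split the change in a weighted mean into a "price"
      # effect and a compositional effect (exactly additive by construction)
      decompose_change <- function(w0, m0, w1, m1) {
        # w: group shares (sum to 1), m: group means, in periods 0 and 1
        price       <- sum((w0 + w1) / 2 * (m1 - m0))
        composition <- sum((m0 + m1) / 2 * (w1 - w0))
        c(total = sum(w1 * m1) - sum(w0 * m0),
          price = price, composition = composition)
      }

      # Two subgroups, e.g. industries with different mean earnings
      decompose_change(w0 = c(0.6, 0.4), m0 = c(40000, 50000),
                       w1 = c(0.5, 0.5), m1 = c(41000, 51000))
      #  total: 2000, price: 1000, composition: 1000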

  • Multilevel modelling for gender wage gap analysis

    Violeta Calian, Statistics Iceland, violeta.calian@hagstofa.is

    Kristín Arnórsdóttir, Statistics Iceland, kristin.arnorsdottir@hagstofa.is

    The purpose of this paper is to estimate the difference in hourly wages between female and male employees and its time evolution in Iceland, while accounting for a set of measured individual and employment characteristics. These include work experience in a given company, several demographic attributes of employees, education, occupation, economic activity of employer, female/male proportion of employees in the occupation category, economic sector and activity, size and location of company.

    We find, by using multilevel models of wages (with interaction and both frequentist and Bayesian estimates), that the observed, total gap is explained by (i) the differences in average characteristics of men and women, and (ii) the differences in the effects of these characteristics on wages for the two genders. Certain covariates have a statistically significant advantageous effect on the wages of female/male employees and in addition this effect evolves with time.

    For instance, women gain less than men do with increasing age, with increasing length of employment in the same company, by being a supervisor or by being married. On the other hand, women gain more than men do by being in a labour union, by being highly educated, and by working in the government sector or for municipalities. Both genders gain comparable wage advantages by working in occupations with a balanced mixture of men and women versus occupations dominated by women or by men, and/or by working for a company with equal pay certification. Differences in the effects of occupational age composition are non-significant in Iceland, unlike the differences observed in other countries, where significant advantages are reported for men.

    Keywords: multilevel models, gender pay gap, open R code

    Session 2.3. Labour market issues: Gender pay gap, hours worked and different drivers

    When: Tuesday, August 23 at 15:30 - 16:25

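    As a rough illustration of the modelling approach, the R sketch below fits a multilevel wage model with a gender-age interaction and random intercepts by occupation using lme4 (a frequentist fit; a Bayesian version could use, for example, brms). The simulated data and the exact specification are assumptions and are far simpler than the models in the paper.

      # Minimal R sketch: multilevel model of log hourly wages
      library(lme4)

      set.seed(1)
      n <- 3000
      d <- data.frame(
        gender     = factor(sample(c("female", "male"), n, replace = TRUE)),
        age        = runif(n, 20, 65),
        occupation = factor(sample(paste0("occ", 1:20), n, replace = TRUE))
      )
      occ_effect <- rnorm(20, 0, 0.1)            # occupation-level variation
      d$log_wage <- 7.5 + 0.010 * d$age - 0.05 * (d$gender == "female") +
        occ_effect[as.integer(d$occupation)] + rnorm(n, 0, 0.2)

      # Random intercepts by occupation; the interaction lets the age effect
      # (and hence the gap) differ between women and men
      fit <- lmer(log_wage ~ gender * age + (1 | occupation), data = d)
      summary(fit)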

  • Model estimation of number of hours worked

    Daniel Lennartsson, Statistics Sweden, daniel.lennartsson@scb.se

    Susanne Gullberg Brännström, Statistics Sweden, susanne.gullbergbrannstrom@scb.se

    There is a general interest in the number of hours worked in the Swedish economy. Hours worked are needed to calculate work effort and productivity in the economy, but also to analyze and evaluate the economy more broadly.

    The purpose of the present work is to present two models that have been developed to estimate the number of hours worked in the Swedish economy: an estimate that is not based on the LFS and that describes hours worked using data covering the whole working population. The work has been made possible by Statistics Sweden receiving monthly employer declarations at the individual level (PAYE) from the Swedish Tax Agency (SKV) since 2019. PAYE, together with information from Statistics Sweden's business register (BR) and short-term statistics on wages and salaries (KL), has been used in the estimation of each model. An estimate of this kind, of good quality, is also planned to be included in the new register-based labor market statistics (BAS).

    To validate the models and their estimates, the paper includes an analysis section comparing the hours worked estimated by the two models against hours worked according to the National Accounts (NA).

    Keywords: model based, hours worked

    Session 2.3. Labour market issues: Gender pay gap, hours worked and different drivers

    When: Tuesday, August 23 at 15:30 - 16:25

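    Purely as an illustration of how these sources could be combined (and not necessarily how either of the two models in the paper works), the R sketch below derives hours from monthly PAYE pay and an hourly wage rate taken from the short-term wage statistics by industry. All variable names and figures are invented.

      # Illustrative R sketch only: hours as PAYE pay divided by an hourly wage
      library(dplyr)

      paye <- data.frame(                 # individual-level employer declarations
        person   = 1:4,
        industry = c("C", "C", "G", "G"),
        pay      = c(38000, 41000, 29000, 31000)   # SEK per month
      )
      kl <- data.frame(                   # short-term wage statistics (KL)
        industry    = c("C", "G"),
        hourly_wage = c(230, 180)                  # SEK per hour
      )

      hours <- paye %>%
        left_join(kl, by = "industry") %>%
        mutate(hours_worked = pay / hourly_wage) %>%
        group_by(industry) %>%
        summarise(total_hours = sum(hours_worked), .groups = "drop")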

  • Anonymization and anonymized text data in statistical production

    Matti Kokkonen, Statistics Finland, matti.kokkonen@stat.fi

    Katja Löytynoja, Statistics Finland, katja.loytynoja@stat.fi

    Henna Ylimaa, Statistics Finland, henna.ylimaa@stat.fi

    According to the General Data Protection Regulation's data minimization principle, personal data should be limited to what is necessary in relation to the purposes of processing. The current process for statistics on road traffic accidents partly relies on humans reading text documents. Anonymizing these documents while keeping them logical is a challenge. In this paper we:

    1. present briefly two tools for anonymization of free text fields

    2. describe the results of testing of the anonymization tools

    3. examine the effect of the anonymized data on statistical production

    The tested tools were Anoppi, a Ministry of Justice tool using language-technology-based artificial intelligence, and NameFinder, a tool created at Statistics Finland that uses a combination of machine learning, a morphological analyzer, and name lists.

    NameFinder was tested with original statistical data. Anoppi was tested with simulated data based on investigation reports of the Safety Investigation Authority of Finland (SIAF) and names from the Population Information System. The SIAF data was selected for its similarity with the road traffic accident data. Both tools produced confusion matrices.

    The NameFinder-anonymized data was tested by simulating the steps of the statistical production process. The simulated steps include 1. checking the geopositions of the accidents, 2. completing the tabular data with the data from the free text, and 3. controlling the tabular information by making free-text queries. Anoppi could not yet be tested in statistical production because of data protection rules, as the tool is not physically located on Statistics Finland's premises.

    The percentage of correct anonymizations was nearly the same for the two tools. However, Anoppi produced fewer false positives while keeping the documents more logical and human-readable than NameFinder. Anonymization had a small effect on geopositioning, since keywords were sometimes anonymized. However, the effect on the final statistics would be insignificant.

    Keywords: Anonymization, free text, General Data Protection Regulation, Machine learning, Natural language processing

    Session 2.4. Creating analytical databases

    When: Wednesday, August 24 at 10:30 - 11:25

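    The simplest ingredient of such a tool, replacing names from a name list with placeholders, can be sketched in R as below. The real NameFinder additionally uses machine learning and a morphological analyzer (important for inflected Finnish word forms), and Anoppi is a separate language-technology system; neither is reproduced here.

      # Minimal R sketch: name-list-based pseudonymisation of free text
      name_list <- c("Matti", "Maija")             # illustrative name list
      text <- "Matti drove north when Maija crossed the road at the junction."

      anonymize <- function(txt, names) {
        for (nm in names) {
          txt <- gsub(paste0("\\b", nm, "\\b"), "[NAME]", txt, perl = TRUE)
        }
        txt
      }
      anonymize(text, name_list)
      # "[NAME] drove north when [NAME] crossed the road at the junction."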

  • Matched Educational Data: Methods for matching and analysis

    Alex Skøtt Nielsen, Statistics Denmark, axn@dst.dk

    Jens Bjerre, Statistics Denmark, jbe@dst.dk

    Statistics Denmark is currently developing several new registries of ‘Matched Educational Data’ (MED), i.e., relational data that links students, teachers and activities together. These registries are of great interest to policy makers and researchers, since they allow for a more detailed understanding of teacher effects, peer effects and much more. This paper presents MED-products under development as well as two available products already in use by researchers.

    A central purpose of the MED-registries is to provide data for the study of teacher effects. To this end, Statistics Denmark is currently using administrative school data to develop a registry of school activities covering all school subjects and non-formal activities (e.g. school trips) across the school year from 2020 and beyond. To capture teacher effects historically, Statistics Denmark employs two already available data sources: 1) Statistics Denmark's registry of primary school students and 2) the Danish Ministry of Children and Education's data on primary school teachers' formal teaching competency and planned teaching hours.

    Another application of the MED-registries is the study of peer effects at the individual and group level. With the aforementioned administrative school data it is possible to study peer effects at the level of groups and subjects, and even at the level of single activities. To cover peer effects historically, Statistics Denmark has produced a 'Class-ID' linking groups of primary school students together across time (school years) and space (schools). The Class-ID is a computed variable based on student composition in primary school classes.

    This paper describes the methods for producing the MED-registries and how they can be used in analysis. We present methods for using administrative school data in official statistics and for assessing the quality of such data. Finally, we present results of two published articles based on these data.

    Keywords: Education, Analysis, Policy-development, Matching, Peer effects, Primary school, Teacher effects

    Session 2.4. Creating analytical databases

    When: Wednesday, August 24 at 10:30 - 11:25

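    The relational structure described above can be illustrated with a minimal R sketch that joins an activity table to a teacher table and computes, per class, the share of planned hours taught by a teacher with formal competency in the subject. The tables, variables and values are invented; the real registries also link students and cover far more detail.

      # Minimal R sketch: linking activities and teachers in MED-style data
      library(dplyr)

      activities <- data.frame(
        activity_id   = 1:3,
        class_id      = c("A", "A", "B"),
        subject       = c("maths", "danish", "maths"),
        teacher_id    = c("T1", "T2", "T1"),
        planned_hours = c(120, 150, 120)
      )
      teachers <- data.frame(
        teacher_id        = c("T1", "T2"),
        formal_competency = c("maths", "english")
      )

      activities %>%
        left_join(teachers, by = "teacher_id") %>%
        group_by(class_id) %>%
        summarise(share_competent_hours =
                    sum(planned_hours[subject == formal_competency]) /
                    sum(planned_hours),
                  .groups = "drop")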

  • Measuring intangibles: Using register-based data as an additional source to survey data for measuring R&D, ICT and organisational capital

    Marina Rybalka, Statistics Norway, marina.rybalka@ssb.no

    Intangible capital is becoming increasingly important in economic research, especially due to its contribution to productivity growth. While the core categories of intangible capital are now widely accepted, their measurement still represents a significant challenge, mainly due to limited data availability. Recently, registry-based sources have been innovatively used within the H2020 project GLOBALINTO to develop measures of intangible capital using an occupation-based measurement approach built on a linked employer-employee dataset and the occupation classification ISCO08. To operationalize the concepts of R&D, ICT and organisational capital (OC), the GLOBALINTO team has applied measures of investments in R&D, ICT and OC that are based on wage costs related to specific skills.

    This paper presents a methodological exercise which explores whether GLOBALINTO's occupation-based measurement of R&D can be applied as an additional measure alongside those based on survey data. Given that the R&D survey does not cover the smallest firms (with 0-9 employees), this conceptualization might be very useful both for extending the existing statistics with respect to small firms and as a data source for economic research.

    The analysis finds that the occupation-based measure of R&D is more generous, in the sense of identifying more firms as R&D active, the larger the firms are. For small firms and firms in manufacturing and R&D services, the aggregated occupation-based and survey-based measures of R&D are reasonably comparable, while the largest differences are observed for large firms and firms in ICT services. Based on these findings, the main recommendation is that the occupation-based measure of R&D can be used as a complementary data source to official R&D data to gain information on small firms, keeping in mind the challenges with the R&D definition for the ICT sector. Given that Norway is part of the European Statistical System, the proposed methodology can potentially be used by other countries as well.

    Keywords: measurement of intangibles, R&D, ICT, survey, register data

    Session 2.4. Creating analytical databases

    When: Wednesday, August 24 at 10:30 - 11:25

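    The occupation-based idea can be illustrated with a minimal R sketch that sums the wage costs of employees in occupations treated as R&D-related in a linked employer-employee dataset. The ISCO08 codes flagged below are illustrative assumptions, not the GLOBALINTO definitions.

      # Minimal R sketch: occupation-based R&D investment per firm
      library(dplyr)

      lee <- data.frame(                       # linked employer-employee records
        firm_id = c("F1", "F1", "F2", "F2"),
        isco08  = c("2111", "3343", "2512", "5223"),
        wage    = c(720000, 480000, 800000, 420000)   # annual wage costs
      )
      rd_occupations <- c("2111", "2512")      # assumed R&D-related ISCO08 codes

      rd_investment <- lee %>%
        mutate(rd_wage = ifelse(isco08 %in% rd_occupations, wage, 0)) %>%
        group_by(firm_id) %>%
        summarise(rd_investment = sum(rd_wage), .groups = "drop")
      # Firms with rd_investment > 0 would be flagged as R&D active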