Exploring and Correcting the Bias in the Estimation of the Gini Measure of Inequality
Abstract
The Gini index is probably the most commonly used indicator to measure inequality. For continuous distributions, the Gini index can be computed using several equivalent formulations. However, this is not the case with discrete distributions, where controversy remains regarding the expression to be used to estimate the Gini index. We attempt to bring a better understanding of the underlying problem by regrouping and classifying the most common estimators of the Gini index proposed in both infinite and finite populations, and focusing on the biases. We use Monte Carlo simulation studies to analyse the bias of the various estimators under a wide range of scenarios. Extremely large biases are observed in heavy-tailed distributions with high Gini indices, and bias corrections are recommended in this situation. We propose the use of some (new and traditional) bootstrap-based and jackknife-based strategies to mitigate this bias problem. Results are based on continuous distributions often used in the modelling of income distributions. We describe a simulation-based criterion for deciding when to use bias corrections. Various real data sets are used to illustrate the practical application of the suggested bias corrected procedures.
1 Introduction
Lorenz (1905) and Gini (1912) were the first to develop measures of inequality. More than a century after these contributions, inequality analysis remains an active and essential topic in numerous fields. The Gini index, also often referred to as the Gini coefficient, is probably the most commonly used indicator to measure inequality. This indicator ranges between 0 and 1, where 0 indicates perfect equality, and 1 the opposite. Inequality is of special interest in economic studies (Piketty, 2015; Tridico, 2018), and the Gini index is used especially to measure income inequality. Many studies indicate that income inequality has been increasing overall in recent years (Bonacini et al., 2021), and marked differences can be observed across various countries. For instance, results from the EU-SILC (European Union Statistics on Income and Living Conditions) survey show that Slovakia has the smallest Gini index estimate (0.21), while Turkey (0.43) has the highest of all the countries in the EU-SILC. At a global level, the World Bank indicates that South Africa has the highest Gini index (0.63). These results are associated with countries, but more extreme values are expected at regional level, in subpopulations, small areas, etc.
The aforementioned differences across countries indicate that institutions and policies may have an important role to play in reducing inequality. In fact, reducing inequality is one of the 17 Sustainable Development Goals of the United Nations 2030 Agenda for Sustainable Development.
The Gini index is a common statistical tool employed by the 2030 Agenda for measuring inequality (see Szymańska, 2021). However, it should be noted that in order to reduce inequality, it is crucial to be able to accurately measure this phenomenon without biases and/or errors. Indeed, Efron (1990) argued that a large bias is usually an undesirable aspect of an estimator's performance. The relevance of the Gini index has also been demonstrated by its use to describe inequality in many fields and among different socioeconomic groups, such as length of life or well-being (Wang et al., 2020), educational opportunity (Bulle, 2016), housing prices (Villar and Raya, 2015), gender inequality (Larraz, 2015; Larraz et al., 2019), input- and outcome inequality (Jasso, 2021), and horizontal and vertical inequality (Canelas and Gisselquist, 2019).
There is a formal theoretical definition of the Gini index for continuous distributions, and many equivalent formulations have been proposed in the literature. As discussed by Davidson (2009), there is no disagreement about the definition of the Gini index for continuous distributions, since the various existing expressions provide the same outcome. However, for discrete distributions, many different formulations have been suggested in the extensive literature, and there has been notable controversy surrounding the appropriate version to use in this scenario (see also Langel and Tillé, 2013). For discrete distributions, various expressions of the Gini index are plug-in formulations of theoretical definitions of the Gini index for continuous distributions. Throughout this article, we refer to this value derived from a continuous distribution as the true value of the Gini index, and formulations of the Gini index for discrete distributions are referred to as empirical versions or estimators of the true Gini index. A highly debated topic in the literature is whether or not to use a specific bias corrected estimator, which is denoted as $\hat{G}_n^c$ in this paper. Jasso (1979), Deltas (2003) and Davidson (2009) provide some arguments in favour of $\hat{G}_n^c$.
Statistical techniques can be based on infinite or finite populations. Classical statistical theory assumes that sampled units are independently selected from an infinite population, whereas survey sampling theory (see Särndal et al., 2003) considers that samples are selected from a finite population. Survey sampling has specific features, and this implies that statistical techniques designed for infinite populations must be modified so that they can be used for finite populations. For instance, the usual assumption of independence is not satisfied in finite populations when samples are selected without replacement. Note also that the use of continuous probabilistic distributions to model income and wealth distributions is common practice in many real-world applications. For instance, the Dagum, Pareto, Weibull and Gamma distributions are used, respectively, by Pérez and Alaiz (2011), Atkinson (2017), Bakar and Pathmanathan (2020) and Salem and Mount (1974). The Lognormal distribution is often used to model household income in many countries (see Clementi and Gallegati, 2005).
This paper describes, in Section 2, the most common formulations for calculating the Gini index from both discrete and continuous distributions, and in scenarios of infinite and finite populations. The first aim of this paper is to regroup and classify existing empirical versions of the Gini index, and provide a better overview of the problem of estimating this parameter. The second aim is to analyse, in Section 3, the biases of different versions of the Gini index. For this purpose, we consider a variety of Gini indices and various probabilistic distributions commonly used to model income distributions. Our results reveal that extremely large biases may appear, especially for heavy-tailed distributions and large Gini indices, and bias correction procedures are recommended in this situation. As expected, the bias problem is more serious in small samples, as is the case of rural studies (Wan, 2001), small areas (Fabrizi and Trivisano, 2016), subpopulations (Särndal et al., 2003, p. 386), etc. The third contribution is to describe, in Section 4, bias correction procedures that may reduce the aforementioned large biases. Bootstrap and jackknife methods are considered, and a novel empirical bootstrap is also adapted to the problem of estimating the Gini index. In Section 5, the bias correction procedures are analysed using Monte Carlo simulation studies. Section 6 describes a simulation-based criterion for deciding when to use bias correction procedures, which are then illustrated, in Section 7, by application to various real data sets. Finally, a brief discussion is presented in Section 8.
The supplementary material contains: (i) the selected parameters of the analysed probabilistic distributions; (ii) results from simulation studies based on large samples ($n = 500$); (iii) description of the bias functions suggested in Section 4, and information on their percentages of use; and (iv) efficiency and bias ratios of estimators of the Gini index, which are explored in Sections 5 and 6.
2 The Gini index
Definition
We assume that inequality is analysed using a variable of interest $Y$, which is a nonnegative continuous random variable. A popular formulation of the Gini index is defined in terms of the average absolute difference between each possible pair of individuals (Qin et al., 2010), i.e.,
$$G = \frac{1}{2\mu}\int_0^\infty\!\!\int_0^\infty |x - y|\,dF(x)\,dF(y), \qquad (1)$$
where
$$\mu = E[Y] = \int_0^\infty y f(y)\,dy = \int_0^\infty y\,dF(y)$$
is the mean of $Y$, and $F(y) = P(Y \le y)$ and $f(y)$ are, respectively, the distribution function and the probability density function of $Y$. A formulation of $G$ based on the distribution function is (Qin et al., 2010; Berger and Gedik-Balay, 2020):
$$G = \frac{1}{\mu}\int_0^\infty \{2F(y) - 1\}\,y\,dF(y). \qquad (2)$$
Anand (1983) showed that the Gini index $G$ can be computed as $2/\mu$ times the covariance between $Y$ and the distribution function $F(y)$, i.e.,
$$G = \frac{2}{\mu}\,\mathrm{cov}\{Y, F(y)\}. \qquad (3)$$
Finally, Yitzhaki (1998) and Berger and Gedik-Balay (2020) consider the expression
$$G = 1 - \frac{\mu_Z}{\mu},$$
where $\mu_Z = E(Z) = \int_0^\infty \{1 - F(z)\}^2\,dz$ is the expectation of the minimum $Z = \min\{Y_1, Y_2\}$, and $Y_1$ and $Y_2$ are two independent random variables with the same distribution as $Y$. For continuous distributions, the Gini index can be defined in many other ways, as can be seen in Yitzhaki (1998), Giorgi and Gigliarano (2017), etc. In practice, the value of $G$ is estimated by means of a sample $s$, with size $n$, which can be selected from either infinite or finite populations (Langel and Tillé, 2013). The estimation of $G$ under both scenarios is discussed in Section 2.2.
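As a quick check that these formulations agree, consider a worked example that is not part of the original text: a unit exponential variable, chosen here only because its integrals are elementary. With $F(y) = 1 - e^{-y}$ and mean equal to 1, the distribution-function form and the minimum-based form both give a Gini index of 0.5.

```latex
% Worked example (illustrative assumption: Y ~ Exponential(1), so \mu = 1)
\begin{align*}
G &= \frac{1}{\mu}\int_0^\infty \{2F(y) - 1\}\, y\, dF(y)
   = \int_0^\infty (1 - 2e^{-y})\, y\, e^{-y}\, dy \\
  &= \int_0^\infty y e^{-y}\, dy - 2\int_0^\infty y e^{-2y}\, dy
   = 1 - 2\cdot\tfrac{1}{4} = \tfrac{1}{2}, \\[4pt]
E[\min\{Y_1, Y_2\}] &= \int_0^\infty \{1 - F(z)\}^2\, dz
   = \int_0^\infty e^{-2z}\, dz = \tfrac{1}{2},
\qquad
G = 1 - \frac{1/2}{1} = \tfrac{1}{2}.
\end{align*}
```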
Estimation
For infinite populations, $\{Y_i : i \in s\}$ are considered as a sequence, with size $n$, of nonnegative random variables with the same distribution as the variable of interest $Y$. The Gini index is estimated using an estimator of $G$ based on the observations of individuals selected in the sample $s$, which are denoted as $\{y_i : i \in s\}$. Such estimators are usually defined as plug-in formulations derived from a theoretical definition of $G$. This methodology may introduce a bias in comparison to the true parameter $G$, especially for extreme values of the Gini index. As can be seen in Section 3, a notable example is the plug-in expression of Equation (2), which is defined as (see Qin et al., 2010; Berger and Gedik-Balay, 2020):
$$\hat{G}_n^a = \frac{1}{n\bar{y}}\sum_{i \in s}\{2\hat{F}_n(y_i) - 1\}\,y_i = \frac{2}{n\bar{y}}\sum_{i \in s} y_i \hat{F}_n(y_i) - 1, \qquad (4)$$
where $\bar{y} = n^{-1}\sum_{i \in s} y_i$ is the sample mean, $\hat{F}_n(t) = n^{-1}\sum_{i \in s}\delta(y_i \le t)$ is the sample (empirical) distribution function, and $\delta(\cdot)$ is the indicator variable that takes the value 1 if its argument is true and 0 otherwise. The classical empirical version of $G$ (Giorgi and Gigliarano, 2017) is the plug-in expression of Equation (1), i.e.:
$$\hat{G}_n^b = \frac{1}{2n^2\bar{y}}\sum_{i \in s}\sum_{j \in s}|y_i - y_j|. \qquad (5)$$
Note that many equivalent versions of $\hat{G}_n^b$ have been suggested in the extensive literature on the Gini index. For instance, Sen (1973) proposed the popular formulation
$$\hat{G}_n^b = \frac{2}{n^2\bar{y}}\sum_{i=1}^{n} i\,y_{(i)} - \frac{n+1}{n} = \frac{2}{n^2\bar{y}}\sum_{i \in s} r_i y_i - \frac{n+1}{n},$$
where $y_{(i)}$ are the values $y_i$ sorted in increasing order and $r_i$ is the rank of unit $i$ in the sample $s$.
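As an illustration, the plug-in estimators of Equations (4) and (5) and Sen's rank-based form can be sketched in plain Python. The function names (`gini_a`, `gini_b_pairwise`, `gini_b_sen`) are ours and do not come from the paper; the sketch assumes a non-empty list of nonnegative values with a positive mean.

```python
from bisect import bisect_right


def gini_a(y):
    """Equation (4): plug-in estimator based on the empirical distribution function."""
    n = len(y)
    mean = sum(y) / n
    ys = sorted(y)
    # bisect_right counts the observations <= v, so F_n(v) = count / n.
    weighted = sum(v * bisect_right(ys, v) / n for v in y)
    return 2.0 * weighted / (n * mean) - 1.0


def gini_b_pairwise(y):
    """Equation (5): absolute differences over all n^2 ordered pairs."""
    n = len(y)
    mean = sum(y) / n
    total = sum(abs(a - b) for a in y for b in y)
    return total / (2.0 * n * n * mean)


def gini_b_sen(y):
    """Sen's form: ranks i = 1..n applied to the sorted values y_(i)."""
    n = len(y)
    mean = sum(y) / n
    ranked = sum(i * v for i, v in enumerate(sorted(y), start=1))
    return 2.0 * ranked / (n * n * mean) - (n + 1) / n


if __name__ == "__main__":
    incomes = [1.0, 2.0, 3.0, 4.0]
    print(gini_a(incomes), gini_b_pairwise(incomes), gini_b_sen(incomes))
```

For the four values 1, 2, 3, 4 the sketch gives 0.5 for Equation (4) and 0.25 for both forms of Equation (5). The pairwise form costs on the order of $n^2$ operations, whereas the rank-based form only needs a sort, which is why the latter is usually preferred in simulation work.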
Similarly, the Gini index can be defined using the regression coefficient of an ordinary least squares regression (see Ogwang, 2000). This is the idea behind
$$\hat{G}_n^b = \frac{2\hat{\beta}}{n} - \frac{n+1}{n},$$
which assumes the regression model $i = \beta + u_i$, where the heteroscedastic error $u_i$ has variance $\sigma^2 / y_{(i)}$. The least squares estimator of $\beta$ is given by
$$\hat{\beta} = \frac{\sum_{i=1}^{n} i\,y_{(i)}}{\sum_{i=1}^{n} y_{(i)}}. \qquad (6)$$
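Finally, an equivalent version of $\hat{G}_n^b$ is the empirical version of Equation (3), i.e.,
$$\hat{G}_n^b = \frac{2}{n\bar{y}}\,\widehat{\mathrm{cov}}\{i, y_{(i)}\},$$
where
$$\widehat{\mathrm{cov}}\{i, y_{(i)}\} = \frac{1}{n}\sum_{i=1}^{n} i\,y_{(i)} - \frac{n+1}{2}\,\bar{y}.$$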
The estimator $\hat{G}_n^b$ and its equivalent versions satisfy the symmetry axiom of Sen (1973), which establishes that an estimator of $G$ based on a set of observations, say $\{y_i : i \in s\}$, must coincide with the Gini index estimated by means of the same approach but using the sample $s$ exactly replicated, i.e., doubled in size (see Davidson, 2009). Alternatively, the bias corrected estimator
$$\hat{G}_n^c = \frac{n}{n-1}\,\hat{G}_n^b, \qquad (7)$$
is often used instead of $\hat{G}_n^b$. Some equivalent expressions of $\hat{G}_n^c$ are:
$$\hat{G}_n^c = \frac{2}{n(n-1)\bar{y}}\sum_{i=1}^{n} i\,y_{(i)} - \frac{n+1}{n-1}; \qquad \hat{G}_n^c = \frac{1}{2\bar{y}}\binom{n}{2}^{-1}\sum_{i \in s}\sum_{j > i}|y_i - y_j|; \qquad \hat{G}_n^c = 1 - \frac{\bar{z}}{\bar{y}},$$
where $\bar{z} = n^{-1}\sum_{i \in s} z_{i\cdot}$ and
$$z_{i\cdot} = \frac{1}{n-1}\sum_{j \in s,\, j \ne i}\min\{y_i, y_j\}.$$
Jasso (1979) suggested the use of the first expression, Wang et al. (2016) consider the second, and the third is used by Berger and Gedik-Balay (2020). Giles (2004) and Davidson (2009) provide theoretical justifications for the use of $\hat{G}_n^c$ to reduce the bias of $\hat{G}_n^b$. As can be seen in Section 3, $\hat{G}_n^a$ may result in serious biases for small Gini indices, but this problem can be easily solved by replacing $\hat{F}_n(t)$ in Equation (4) with the smooth (or midpoint) distribution function $\hat{F}_n^*(t) = n^{-1}\sum_{i \in s}[\delta(y_i < t) + 0.5\,\delta(y_i = t)]$, and the resulting estimator coincides with $\hat{G}_n^b$ (see Berger, 2008). In addition, $\hat{G}_n^a$, $\hat{G}_n^b$ and $\hat{G}_n^c$ are related when $y_i \ne y_j$ for all $i \ne j \in s$, since
$$\hat{G}_n^b = \hat{G}_n^a - \frac{1}{n}, \qquad (8)$$
if this condition is satisfied, and
$$\hat{G}_n^c = \frac{n}{n-1}\left(\hat{G}_n^a - \frac{1}{n}\right) \qquad (9)$$
according to Equations (7) and (8). Note that expressions (5) and (7), or their equivalent formulations, are more frequently used in practice, and practitioners must be aware of the bias of $\hat{G}_n^a$ when $G$ is small. The use of the smooth distribution function in Equation (4) will prevent this bias problem. For empirical distributions, additional formulations of the Gini index can be seen in Giorgi and Gigliarano (2017).
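The relations among the three empirical versions can be checked numerically. The following is a minimal sketch with function names of our own choosing: it computes the bias corrected estimator of Equation (7), its minimum-based equivalent, and the plug-in estimator of Equation (4) with either the ordinary or the midpoint distribution function.

```python
from bisect import bisect_left, bisect_right


def gini_b(y):
    """Rank-based form of Equation (5); note that n^2 * ybar = n * sum(y)."""
    n = len(y)
    ranked = sum(i * v for i, v in enumerate(sorted(y), start=1))
    return 2.0 * ranked / (n * sum(y)) - (n + 1) / n


def gini_c(y):
    """Equation (7): n / (n - 1) times the estimator of Equation (5)."""
    n = len(y)
    return n / (n - 1) * gini_b(y)


def gini_c_min(y):
    """Equivalent form 1 - zbar / ybar, with z_i the mean of min(y_i, y_j), j != i."""
    n = len(y)
    mean = sum(y) / n
    z = [sum(min(y[i], y[j]) for j in range(n) if j != i) / (n - 1)
         for i in range(n)]
    return 1.0 - (sum(z) / n) / mean


def gini_a(y, midpoint=False):
    """Equation (4); midpoint=True swaps in the smooth distribution function."""
    n = len(y)
    ys = sorted(y)

    def cdf(t):
        below, upto = bisect_left(ys, t), bisect_right(ys, t)
        if midpoint:
            return (below + 0.5 * (upto - below)) / n
        return upto / n

    return 2.0 * sum(v * cdf(v) for v in y) / sum(y) - 1.0


if __name__ == "__main__":
    distinct = [1.0, 2.0, 3.0, 4.0]
    tied = [1.0, 1.0, 2.0, 4.0]
    print(gini_c(distinct), gini_c_min(distinct))            # two forms of (7)
    print(gini_b(distinct), gini_a(distinct) - 1 / 4)        # Equation (8)
    print(gini_b(tied), gini_a(tied, midpoint=True))         # midpoint version
```

The last line uses a sample with ties on purpose: Equation (8) requires distinct observations, whereas the midpoint distribution function reproduces the estimator of Equation (5) with or without ties.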
For a finite population $U$ with $N$ individuals, $\{Y_i : i \in U\}$ denotes a sequence of nonnegative random variables with the same distribution function $F(y)$, and $\{y_i : i \in U\}$ are the population values of the variable of interest. In practice, social surveys are used to estimate the Gini index, and they are generally based on complex sampling designs with unequal probabilities. Therefore, the sample $s$ is now selected from $U$ by using a sampling design with survey weights $w_i = \pi_i^{-1}$, where $\pi_i = P(i \in s)$ are the inclusion probabilities, with $i \in U$. The problem of estimating $G$ from finite populations thus entails two steps. First, an empirical version of $G$ based on the population values $\{y_i : i \in U\}$ is required. We denote the population empirical versions of $G$ in finite populations as $G_N^a$, $G_N^b$ and $G_N^c$, and they are defined as $\hat{G}_n^a$, $\hat{G}_n^b$ and $\hat{G}_n^c$, respectively, after substituting the sample values with the population values in Equations (4), (5) and (7). The second step is to estimate the selected population empirical version ($G_N^a$, $G_N^b$ or $G_N^c$) using weighted estimators. Some that can be found in the literature are:
$$\hat{G}_w^a = \frac{2}{\hat{N}\bar{y}_w}\sum_{i \in s} w_i y_i \hat{F}_w(y_i) - 1; \qquad (10)$$
$$\hat{G}_w^b = \frac{1}{2\hat{N}^2\bar{y}_w}\sum_{i \in s}\sum_{j \in s} w_i w_j |y_i - y_j|; \qquad (11)$$
and
$$\hat{G}_w^c = 1 - \frac{\bar{z}_w}{\bar{y}_w}, \qquad (12)$$
where
$$\hat{N} = \sum_{i \in s} w_i, \quad \bar{y}_w = \hat{N}^{-1}\sum_{i \in s} w_i y_i, \quad \bar{z}_w = \hat{N}^{-1}\sum_{i \in s} w_i z_{i\cdot}, \quad z_{i\cdot} = \frac{1}{\hat{N} - w_i}\sum_{j \in s,\, j \ne i} w_j \min\{y_i, y_j\},$$
and $\hat{F}_w(t) = \hat{N}^{-1}\sum_{i \in s} w_i\,\delta(y_i \le t)$. Note that Equations (10), (11) and (12) reduce, respectively, to Equations (4), (5) and (7) under simple random sampling without replacement (SRSWOR).
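The weighted estimators of Equations (10) to (12) can be sketched as follows. This is a direct, quadratic-time transcription meant for illustration only; the function name `weighted_gini` and its argument layout are our assumptions, and the weights are taken to be positive.

```python
def weighted_gini(y, w):
    """Return the weighted estimators of Equations (10), (11) and (12) as a tuple."""
    pairs = list(zip(y, w))
    n_hat = sum(w)                                    # estimated population size
    mean_w = sum(wi * yi for yi, wi in pairs) / n_hat

    def cdf_w(t):
        # Weighted empirical distribution function.
        return sum(wj for yj, wj in pairs if yj <= t) / n_hat

    g_a = 2.0 * sum(wi * yi * cdf_w(yi) for yi, wi in pairs) / (n_hat * mean_w) - 1.0

    cross = sum(wi * wj * abs(yi - yj) for yi, wi in pairs for yj, wj in pairs)
    g_b = cross / (2.0 * n_hat ** 2 * mean_w)

    z_bar = 0.0
    for i, (yi, wi) in enumerate(pairs):
        z_i = sum(wj * min(yi, yj)
                  for j, (yj, wj) in enumerate(pairs) if j != i) / (n_hat - wi)
        z_bar += wi * z_i / n_hat
    g_c = 1.0 - z_bar / mean_w

    return g_a, g_b, g_c


if __name__ == "__main__":
    # Equal weights mimic SRSWOR, so the output matches Equations (4), (5) and (7).
    print(weighted_gini([1.0, 2.0, 3.0, 4.0], [5.0, 5.0, 5.0, 5.0]))
```

With equal weights the three values coincide with the unweighted estimators of the same four observations (0.5, 0.25 and 1/3), which is the reduction under SRSWOR stated above.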
3 Simulation studies to analyse the bias
In this section, we analyse the bias of $\hat{G}_n^a$, $\hat{G}_n^b$ and $\hat{G}_n^c$ in comparison to the true (asymptotic) value $G$, using samples selected from infinite populations. This analysis is equivalent to the problem of analysing the bias of $\hat{G}_w^a$, $\hat{G}_w^b$ and $\hat{G}_w^c$ in comparison to the true (asymptotic) value $G$.
Description
We consider various continuous probabilistic distributions (Pareto, Dagum, Lognormal, Weibull and Gamma) often used in the modelling of income distributions. For each probabilistic distribution, parameters involved in the theoretical formulation of $G$ are selected such that $G$ takes the values $\{0.1, 0.2, \ldots, 0.8\}$, thus allowing us to examine different levels of inequality. Additional parameters required in distributions are also fixed, and all of them can be seen in the supplementary material (Table A1). For the Dagum distribution, the theoretical value of $G$ depends on both shape parameters $a$ and $p$, and for this reason the values $p = \{0.5, 20\}$ are also fixed, and such distributions are denoted, respectively, as Dagum-p0.5 and Dagum-p20. The aim is to analyse the biases of the various estimators of the Gini index under the described scenarios, with these estimators being calculated using samples randomly drawn from an underlying continuous distribution with a true value $G$ for the Gini index. This framework is also adopted by Deltas (2003), Davidson (2009), Berger and Gedik-Balay (2020), etc. We analyse both small and large sample sizes, specifically, $n = \{50, 500\}$. This study is equivalent to analysing the biases for samples, with size $n$, selected under SRSWOR from a large finite population ($N \to \infty$), with population values drawn from the analysed probabilistic distributions.
Let $\hat{\theta}$ be a given statistic for the unknown parameter $\theta$, based on the observations $\{y_i : i \in s\}$. Throughout this article, the expected value based on $R$ replications of $\hat{\theta}$ is defined as
$$E[\hat{\theta}] = \bar{\hat{\theta}} = \frac{1}{R}\sum_{r=1}^{R}\hat{\theta}^{(r)}, \qquad (13)$$
where $\hat{\theta}^{(r)}$ is the statistic $\hat{\theta}$ evaluated at the $r$-th pseudo original sample $s^{(r)}$, which is also selected, with size $n$, from the distribution function $F(y)$. $R = 1000$ replications are considered in simulation studies. The empirical measures can be expressed in terms of either the true Gini index $G$ or the expected values of estimators. We use the expected values because large biases can be obtained and $G$ is unknown in practice. Reporting the results in terms of $G$ makes it more difficult for empirical researchers to assess the performance of estimators for the specific data that they are analysing. Finally, note that various figures in this paper require only a customary estimator of $G$, and we use $\hat{G}_n^c$ because it is less biased than its competitors ($\hat{G}_n^a$ and $\hat{G}_n^b$).
In this section, we first use Monte Carlo simulations to investigate the relative bias ($RB$) of the various empirical versions ($\hat{G}_n^a$, $\hat{G}_n^b$ and $\hat{G}_n^c$) in comparison to the true (asymptotic) value $G$. For a given statistic $\hat{\theta}$, this measure is defined as
$$RB[\hat{\theta}] = 100 \times \frac{B[\hat{\theta}]}{\theta},$$
where the empirical bias is given by $B[\hat{\theta}] = E[\hat{\theta}] - \theta = \bar{\hat{\theta}} - \theta$. Comparisons are based on distributions with different levels of skewness because the value of the coefficient of skewness may have an impact on the bias of estimators of the Gini index. For discrete distributions, the coefficient of skewness is defined as:
$$\hat{\gamma} = \frac{m_3}{S^3}, \qquad (14)$$
where $S = (m_2)^{1/2}$ is the sample standard deviation, and $m_\alpha = n^{-1}\sum_{i \in s}(y_i - \bar{y})^\alpha$ is the $\alpha$-th central moment based on $s$. The aim of Figure 1 is to investigate the skewness for the probabilistic distributions considered in this paper, so this figure displays the expected values $\bar{\hat{\gamma}}$ versus the expected values $\bar{\hat{G}}_n^c$, where $\bar{\hat{\gamma}}$ and $\bar{\hat{G}}_n^c$ are calculated using Equation (13) after substituting $\hat{\theta}^{(r)}$ with $\hat{\gamma}^{(r)}$ and $\hat{G}_n^{c(r)}$, respectively, which are computed using Equations (14) and (7) at the $r$-th pseudo original sample $s^{(r)}$.
Figure 1: Expected values of the coefficient of skewness ($\bar{\hat{\gamma}}$) based on samples with sizes $n = \{50, 500\}$, and randomly selected from various continuous probabilistic distributions (infinite populations). The x-axes show the expected values of the estimator $\hat{G}_n^c$ ($\bar{\hat{G}}_n^c$). [Two panels, $n = 50$ and $n = 500$; y-axes: expected skewness; series: Pareto, Dagum-p20, Dagum-p0.5, Lognormal, Weibull, Gamma.]
From Figure 1 we observe that the Pareto distribution is the most highly skewed distribution, followed by the Dagum-p20, Dagum-p0.5 and Lognormal distributions, in that order. The Weibull and Gamma distributions have similar values of $\bar{\hat{\gamma}}$, and they are the least skewed distributions in this study. For highly skewed distributions, serious biases can be observed when $G$ is large, as a result of which the maximum value of $\bar{\hat{G}}_n^c$ is far from $G = 0.8$, the maximum Gini index used in this study. This is not the case with less skewed distributions, since accurate estimates are also obtained when $G$ is large, and the expected values $\bar{\hat{G}}_n^c$ are close to the required Gini index. For the various probabilistic distributions, the expected skewness increases as the sample size rises. This can be explained by the upper bound for $\hat{\gamma}$ based on the sample size and suggested by Cramer (1957). This bound indicates that estimates for the coefficient of skewness may underestimate the true value when the sample size is small (Dorić et al., 2009). For the Dagum distribution, the expected skewness increases as its shape parameter $p$ increases.
In this section we also analyse the impact of the skewness on the bias using box plots for estimates of $\hat{G}_n^c$. Thus, we illustrate this relationship between skewness and bias by comparing two different distributions in terms of skewness (Pareto and Gamma, as can be seen in Figure 1).
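The simulation design described in this subsection can be sketched with the Python standard library. The snippet below is not the authors' code: the parameter values of the paper are in its supplementary material, so here we assume a Pareto (type I) distribution with shape chosen through the standard relation $G = 1/(2\alpha - 1)$, and a unit exponential (a Gamma distribution with shape 1, for which $G = 0.5$) as a less skewed comparison. The helper names are ours.

```python
import random
from statistics import fmean


def gini_c(y):
    """Bias corrected estimator of Equation (7), via the rank-based form of (5)."""
    n = len(y)
    ranked = sum(i * v for i, v in enumerate(sorted(y), start=1))
    g_b = 2.0 * ranked / (n * sum(y)) - (n + 1) / n
    return n / (n - 1) * g_b


def skewness(y):
    """Equation (14): third central moment over the cubed standard deviation."""
    n, mean = len(y), fmean(y)
    m2 = sum((v - mean) ** 2 for v in y) / n
    m3 = sum((v - mean) ** 3 for v in y) / n
    return m3 / m2 ** 1.5


def monte_carlo(draw, true_gini, n=50, reps=1000, seed=12345):
    """Return (relative bias of G^c in percent, expected skewness) over `reps` samples."""
    rng = random.Random(seed)
    ginis, skews = [], []
    for _ in range(reps):
        sample = [draw(rng) for _ in range(n)]
        ginis.append(gini_c(sample))
        skews.append(skewness(sample))
    expected = fmean(ginis)                       # Equation (13)
    return 100.0 * (expected - true_gini) / true_gini, fmean(skews)


def pareto_draw(true_gini):
    # ASSUMPTION: Pareto type I, whose Gini index is 1 / (2 * alpha - 1).
    alpha = (1.0 / true_gini + 1.0) / 2.0
    return lambda rng: rng.paretovariate(alpha)


if __name__ == "__main__":
    print("Pareto, G = 0.8:", monte_carlo(pareto_draw(0.8), 0.8))
    print("Exponential, G = 0.5:", monte_carlo(lambda rng: rng.expovariate(1.0), 0.5))
```

Passing the random generator into `draw` keeps each run reproducible for a given seed, which makes it easier to compare estimators on exactly the same pseudo original samples.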
Results and conclusions
Figure 2 displays the $RB$s of $\hat{G}_n^a$, $\hat{G}_n^b$ and $\hat{G}_n^c$ when $n = 50$. First, we analyse the results from the less skewed distributions (Weibull and Gamma). The bias of $\hat{G}_n^c$ is negligible for the various expected values of estimators. The bias of $\hat{G}_n^b$ is slightly larger, in absolute terms, than that of $\hat{G}_n^c$, but lies within a reasonable range. Biases of both $\hat{G}_n^b$ and $\hat{G}_n^c$ do not seem to be affected by the value of the Gini index. $\hat{G}_n^a$ is severely biased when the expected values of estimators are small, with values of $RB$ that can be close to 20%. This empirical version must be modified to correct this bias, and two simple solutions are discussed in Section 2.2. First, the distribution function $\hat{F}_n(t)$ can be replaced, in Equation (4), by the smooth distribution function $\hat{F}_n^*(t)$. This adjustment allows empirical versions $\hat{G}_n^a$ and $\hat{G}_n^b$ to be equivalent. Second, we can use one of the transformations described in Equations (8) and (9) when all observations are different. As the Gini index increases, the $RB$ of $\hat{G}_n^a$ decreases and $\hat{G}_n^a$ and $\hat{G}_n^b$ have similar $RB$s.
For heavy-tailed distributions (Pareto, Dagum-p20, Dagum-p0.5 and Lognormal), biases of $\hat{G}_n^a$, $\hat{G}_n^b$ and $\hat{G}_n^c$ seem to be affected by the value of the Gini index, and serious negative $RB$s are obtained as the expected values of estimators increase (as much as -25%). We also observe a strong relationship between the $RB$ and the coefficient of skewness. The largest $RB$s, in absolute terms, are produced by the Pareto distribution, which is the most skewed (see Figure 1), and biases, in absolute terms, decrease as the values of $\bar{\hat{\gamma}}$ decrease.
For larger sample sizes, readers are referred to the supplementary material, where Figure A1 replicates Figure 2 for samples with size $n = 500$. We point out that the $RB$, in absolute terms, decreases as the sample size increases. For less skewed distributions, the bias of $\hat{G}_n^c$ is negligible, and the $RB$ of $\hat{G}_n^a$ can be close to 2% for the various distributions. Non-negligible biases are also observed for heavy-tailed distributions, with values of $RB$ close to -15% when $n = 500$. In Figure 3 we investigate the effect of the skewness on the bias of $\hat{G}_n^c$ using box plots and various Gini indices, with the most (Pareto) and the least (Gamma) skewed distributions from this study.
From Figure 2 we observe that the bias of $\hat{G}_n^c$ is negligible for the various Gini indices when samples are selected from the Gamma distribution, while Figure 3 confirms that the estimates are concentrated, with a low variability, around the target value $G$. This is not the case with the Pareto distribution, which shows highly biased estimates and marked variability. From Figures 2 and 3 we observe that the bias of $\hat{G}_n^c$, in absolute terms, increases as the Gini index rises, while from Figure 3 we see that the variability of estimates also becomes higher as $G$ increases, with values of $\hat{G}_n^c$ that can be larger than 0.9 when $G = 0.4$, or smaller than 0.3 when $G = 0.8$.
Figure 2: [Relative bias ($RB$, in %) of $\hat{G}_n^a$, $\hat{G}_n^b$ and $\hat{G}_n^c$ against the expected values of estimators, for samples with size $n = 50$. Panels: Pareto, Dagum-p20, Dagum-p0.5, Lognormal, Weibull and Gamma.]
4 Bias correction procedures
Results from Section 3 indicate that the customary estimators of the Gini index can be severely biased, especially for heavy-tailed distributions and large Gini indices, and that the use of a bias correction procedure may alleviate this bias problem. Various bias correction procedures are presented in this section, before being analysed in Section 5, and then applied to various real data sets in Section 7. Section 6 describes a criterion for deciding when to use bias corrections.
Bootstrap and jackknife techniques (see Efron and Tibshirani, 1993, and Wolter, 2007) can be used as bias correction procedures. Some authors who have demonstrated the capacity of these methods to correct biases are Pfeffermann and Correa (2012) and Jiao and Han (2020). When it comes to the problem of estimating the Gini index, such statistical techniques have been used mainly for the construction of confidence intervals and variance estimation (see Moran, 2006, and Larraz et al., 2020, for bootstrap techniques, and Berger, 2008, and Davidson, 2009, for jackknife techniques). The estimator $\hat{G}_n^c$ emerges as the bias corrected version of $\hat{G}_n^b$ (Deltas, 2003; Davidson, 2009), but Section 3 shows that $\hat{G}_n^c$ can also be severely biased. Van Ourti and Clarke (2011) investigated a bias correction method for the Gini index, but focussing on the bias due to grouped data. We now explore correction procedures for the biases discussed in Section 3. Bias corrections are applied to $\hat{G}_n^c$ and $\hat{G}_w^c$ (defined, respectively, for infinite and finite populations) because they are less biased than the alternative empirical versions described in Section 2. However, bias correction procedures can also be applied to any other empirical version.
Figure 3: [Box plots of the estimates of the Gini index against $G$ for the Pareto and Gamma distributions. y-axis: estimation of the Gini index; x-axis: $G$.]
For an infinite population, we first suggest the jackknife technique proposed by Ogwang (2000), which can be easily implemented by a fast algorithm. Langel and Tillé (2013) showed that this method has desirable properties for the variance estimation of the Gini index. Ogwang (2000) proposed the application of the jackknife technique on $\hat{G}_n^b$, with jackknife estimates defined as
$$\hat{G}_n^b(k) = \hat{G}_n^b + \frac{2}{n\bar{y} - y_{(k)}}\left[\frac{y_{(k)}\hat{\beta}}{n} + \frac{\sum_{i=1}^{n} i\,y_{(i)}}{n(n-1)} - \frac{n\bar{y} - \sum_{i=1}^{k} y_{(i)} + k\,y_{(k)}}{n-1}\right] - \frac{1}{n(n-1)},$$
where $\hat{\beta}$ is the regression coefficient defined by Equation (6). Note that $\hat{G}_n^b(k)$ is equivalent to applying $\hat{G}_n^b$ successively to the observations $\{y_{(i)} : i \in s\}$ after removing the $k$-th unit. The bias corrected estimator applied to $\hat{G}_n^b$ and based on Ogwang's jackknife is defined as
$$\hat{G}_{n.jack}^b = n\hat{G}_n^b - (n-1)\bar{\hat{G}}_{n.jack}^b,$$
where $\bar{\hat{G}}_{n.jack}^b = n^{-1}\sum_{k=1}^{n}\hat{G}_n^b(k)$. The bias corrected estimator applied to $\hat{G}_n^c$ and based on jackknife is given by
$$\hat{G}_{n.jack}^c = n\hat{G}_n^c - (n-1)\bar{\hat{G}}_{n.jack}^c, \qquad (15)$$
where $\bar{\hat{G}}_{n.jack}^c = n^{-1}\sum_{k=1}^{n}\hat{G}_n^c(k)$, and $\hat{G}_n^c(k)$ is the estimator $\hat{G}_n^c$ computed from the observations $\{y_{(i)} : i \in s\}$ after removing the $k$-th unit. Note that $\hat{G}_{n.jack}^c$ is one of the two bias corrected estimators that we report in the results from infinite populations.
Pfeffermann and Correa (2012) proposed an empirical bootstrap bias correction procedure based on pseudo original and bootstrap samples selected from plausible parameters. This method was used to estimate the prediction mean square error in small area estimation of proportions. We also propose the adaptation of this empirical bootstrap method to the problem of estimating the bias of $\hat{G}_n^c$, thus giving rise to a novel bias corrected estimator of $G$.
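To make the jackknife correction concrete, the sketch below computes the leave-one-out estimates in two ways: by brute-force recomputation and with Ogwang's closed-form update, which only needs running sums over the sorted sample. It then forms the jackknife bias corrected estimator of Equation (15). Function names are ours, and the sketch assumes at least three observations with a positive total.

```python
def gini_b(y):
    """Rank-based form of the estimator in Equation (5)."""
    n = len(y)
    ranked = sum(i * v for i, v in enumerate(sorted(y), start=1))
    return 2.0 * ranked / (n * sum(y)) - (n + 1) / n


def leave_one_out_brute(y):
    """Recompute the estimator on each sample of size n - 1 (sorted order)."""
    ys = sorted(y)
    return [gini_b(ys[:k] + ys[k + 1:]) for k in range(len(ys))]


def leave_one_out_ogwang(y):
    """Ogwang's update: all n leave-one-out values from one pass over sorted data."""
    ys = sorted(y)
    n = len(ys)
    total = sum(ys)                                   # equals n * ybar
    ranked = sum(i * v for i, v in enumerate(ys, start=1))
    g = 2.0 * ranked / (n * total) - (n + 1) / n
    beta = ranked / total                             # Equation (6)
    out, cum = [], 0.0
    for k, yk in enumerate(ys, start=1):
        cum += yk                                     # sum of the k smallest values
        inner = (yk * beta / n
                 + ranked / (n * (n - 1))
                 - (total - cum + k * yk) / (n - 1))
        out.append(g + 2.0 * inner / (total - yk) - 1.0 / (n * (n - 1)))
    return out


def jackknife_gini_c(y):
    """Equation (15): n * G^c minus (n - 1) times the mean leave-one-out G^c."""
    n = len(y)
    g_c = n / (n - 1) * gini_b(y)
    # On n - 1 units the correction factor of Equation (7) is (n - 1) / (n - 2).
    loo_c = [(n - 1) / (n - 2) * g for g in leave_one_out_ogwang(y)]
    return n * g_c - (n - 1) * sum(loo_c) / n


if __name__ == "__main__":
    data = [3.0, 1.0, 4.0, 1.5, 9.0, 2.6]
    print(leave_one_out_brute(data))
    print(leave_one_out_ogwang(data))
    print(jackknife_gini_c(data))
```

The closed-form update avoids re-sorting and re-summing for every deleted unit, so the whole set of leave-one-out values costs one sort plus a linear pass.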
The empirical bootstrap procedure considers a set of plausible parameters, which are randomly generated from a confidence interval for $G$. For each plausible parameter, a pseudo original sample is generated from the underlying distribution of the original sample data. This method uses a cross-validation procedure that splits the various pseudo original samples into two groups: training and validation. In Section 3 we observed that both the Gini index and the coefficient of skewness have an impact on the bias of $\hat{G}_n^c$ for heavy-tailed distributions. For the training group, we suggest various functions underlying the bias correction, which depend on the estimates of $G$ and $\gamma$ computed from each pseudo original sample and on the expected values of $\hat{G}_n^c$ and $\hat{\gamma}$ based on bootstrap samples. Efron and Tibshirani (1993) and Hall and Maiti (2006) indicate that bias corrections may increase the variance, so the validation group is used to choose the optimum bias function that minimizes the mean square error (MSE) of the suggested bias corrected estimator. In Section 5, we also investigate the impact of using bias corrections on the MSE. For an infinite population, the algorithm for estimating the bias of $\hat{G}_n^c$ and for computing the suggested bias corrected estimator is described in detail as follows:
Step 1 (Plausible parameters). Select at random $H$ plausible values for the target parameter $G$ from a Uniform distribution, i.e., $G_h \sim Un(G_L, G_U)$, with $h = 1, \ldots, H$, and where $G_L$ and $G_U$ are, respectively, the lower and upper limits of a confidence interval for the true Gini index $G$.
Step 2 (Pseudo original samples for training and validation groups). Generate a pseudo original sample $s_h$, with size $n$, from $f(y; G_h)$ and for each $h = 1, \ldots, H$, where $f(y; G_h)$ is the probability density function $f(y)$ with a Gini index equal to $G_h$. Then, split the $H$ samples at random into two groups, the training group $\Omega_T$ and the validation group $\Omega_V$, such that $\Omega_T$ contains a set of $T$ samples ($s_t$, with $t = 1, \ldots, T$), $\Omega_V$ contains $V$ samples ($s_v$, with $v = 1, \ldots, V$) and $H = T + V$.
Step 3 (Estimates from the pseudo original samples). For the training group $\Omega_T$, compute $\hat{G}_{n,t}^c$ and $\hat{\gamma}_t$ for each sample $s_t$, using, respectively, Equations (7) and (14). Similarly, for the validation group $\Omega_V$, compute $\hat{G}_{n,v}^c$ and $\hat{\gamma}_v$ using the samples $s_v$.
Step 4 (Training phase: expected values based on bootstrap samples). For the training group $\Omega_T$, generate $B$ bootstrap samples $s_t^{(b)}$, with size $n$, from each sample $s_t$, with $b = 1, \ldots, B$. Compute $\hat{G}_{n,t}^{c(b)}$ and $\hat{\gamma}_t^{(b)}$ for each bootstrap sample $s_t^{(b)}$, using Equations (7) and (14), respectively. The expected values of $\hat{G}_{n,t}^c$ and $\hat{\gamma}_t$ based on bootstrap samples are denoted, respectively, as $\bar{\hat{G}}_{n,t}^{c*}$ and $\bar{\hat{\gamma}}_t^{*}$, and are computed using $\hat{G}_{n,t}^{c(b)}$ and $\hat{\gamma}_t^{(b)}$ in Equation (13), after substituting $r$ and $R$ with $b$ and $B$, respectively.
Step 5 (Training phase: expected values based on pseudo original samples). For the training group $\Omega_T$, generate $R$ pseudo original samples $s_t^{(r)}$, with size $n$, from $f(y; G_t)$ and for each $t = 1, \ldots, T$, with $r = 1, \ldots, R$. Compute $\hat{G}_{n,t}^{c(r)}$ for each sample $s_t^{(r)}$, using Equation (7). The expected value based on pseudo original samples is denoted as $\bar{\hat{G}}_{n,t}^{c}$, and is computed using Equation (13).
Step 6 (Training phase: coefficient estimates). For the training group $G_T$, estimate the unknown coefficients of a set of eligible bias functions $g_l(\hat{G}_t^{c}; \bar{G}_t^{c.B}; \hat{\gamma}_t; \bar{\gamma}_t^{B})$, with $l = 1, \ldots, L$, that predict the variable $D_t = \bar{G}_t^{c.P} - G_t$. An example of bias function is the linear expression
$$\bar{G}_t^{c.P} - G_t = a + b\,(\bar{\gamma}_t^{B} - \hat{\gamma}_t). \tag{16}$$
Step 7 (Validation phase: bias corrected estimators). For the validation group $G_V$ and for each function $g_l$, compute the suggested bias corrected estimator of $G_v$, defined by
$$\hat{G}_v^{c.Bp}(l) = \hat{G}_v^{c} - \hat{g}_l(\hat{G}_v^{c}; \bar{G}_v^{c.B}; \hat{\gamma}_v; \bar{\gamma}_v^{B}),$$
where $\hat{g}_l$ is the bias function $g_l$ after substituting its coefficients with the estimates computed in Step 6.
Step 8 (Validation phase: optimum function). For the validation group $G_V$, identify the optimum function $\hat{g}_{opt}$ that minimizes the MSE of the estimators $\hat{G}_v^{c.Bp}(l)$, which is defined as
$$MSE_l = \frac{1}{V} \sum_{v=1}^{V} \left( \hat{G}_v^{c.Bp}(l) - G_v \right)^{2}.$$
Step 9 (Bias corrected estimator). Compute the suggested bias corrected estimator
$$\hat{G}_n^{c.Bp} = \hat{G}_n^{c} - \hat{g}_{opt}(\hat{G}_n^{c}; \bar{G}_n^{c.B}; \hat{\gamma}; \bar{\gamma}^{B}), \tag{17}$$
where $\bar{G}_n^{c.B}$ and $\bar{\gamma}^{B}$ are, respectively, the expected values of $\hat{G}_n^{c}$ and $\hat{\gamma}$ based on bootstrap samples and derived from the original sample $s$. $\square$

$\hat{G}_n^{c.Bp}$ is the second bias corrected estimator computed for infinite populations. Any method can be used to construct the confidence interval required in Step 1. Some existing confidence intervals for the Gini index are based on bootstrap (Qin et al., 2010), jackknife (Berger, 2008), linearization (Deville, 1999) or empirical likelihood (Berger and Gedik-Balay, 2020). Variance estimators for the Gini index (Langel and Tillé, 2013) can also be used to construct confidence intervals based on the normality assumption. For the sake of simplicity, we use the traditional bootstrap method with confidence interval limits given by $G_L = \hat{G}_{(0.005)}^{c.B}$ and $G_U = \hat{G}_{(0.995)}^{c.B}$, where $\hat{G}_{(\alpha)}^{c.B}$ is the $\alpha$-th quantile of the bootstrap estimates $\hat{G}_n^{c(b)}$. The latter are computed using the estimator $\hat{G}_n^{c}$ on the bootstrap sample $s^{(b)}$, which is taken, with size $n$, from the original sample $s$. As discussed by Pfeffermann and Correa (2012), the selected interval must be broad enough to contain the target parameter $G$. We use a confidence level of 99% because of the serious biases detected in Section 3. Pfeffermann and Correa (2012) also argue that the size of this confidence interval has no direct effect on the bound of the bias, and give a discussion on the number of parameters that should be included in the training and validation groups.
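The interval construction and Step 1 can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`gini`, `bootstrap_limits`, `plausible_values`) are hypothetical, the sorted-data plug-in formula is only an assumed stand-in for Equation (7), and the Lognormal toy sample is not taken from the paper.

```python
import random

def gini(y):
    """Plug-in Gini estimate via the sorted-data formula (assumed stand-in for Equation (7))."""
    ys = sorted(y)
    n = len(ys)
    weighted = sum((i + 1) * v for i, v in enumerate(ys))
    return 2.0 * weighted / (n * sum(ys)) - (n + 1.0) / n

def bootstrap_limits(y, B=1000, level=0.99, rng=None):
    """Percentile bootstrap limits (G_L, G_U) for the Gini index."""
    rng = rng or random.Random(0)
    n = len(y)
    # Resample n units with replacement and sort the B bootstrap estimates.
    boot = sorted(gini([rng.choice(y) for _ in range(n)]) for _ in range(B))
    alpha = (1.0 - level) / 2.0  # 0.005 for a 99% interval
    lower = boot[int(alpha * (B - 1))]
    upper = boot[int(round((1.0 - alpha) * (B - 1)))]
    return lower, upper

def plausible_values(g_low, g_upp, H=200, rng=None):
    """Step 1: draw H plausible Gini values uniformly inside the interval."""
    rng = rng or random.Random(1)
    return [rng.uniform(g_low, g_upp) for _ in range(H)]

rng = random.Random(42)
sample = [rng.lognormvariate(0.0, 1.0) for _ in range(50)]  # toy data, not from the paper
g_low, g_upp = bootstrap_limits(sample, B=500, rng=rng)
G_h = plausible_values(g_low, g_upp, H=200, rng=rng)
```

Any other interval (jackknife, linearization, empirical likelihood) could replace `bootstrap_limits` without changing `plausible_values`.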
For the training group, Step 6 requires bias functions $g_l$ that predict $D_t = \bar{G}_t^{c.P} - G_t$ with the aim of estimating the bias of $\hat{G}_n^{c}$, i.e., $B[\hat{G}_n^{c}] = E[\hat{G}_n^{c}] - G$. In Section 3, we concluded that both the Gini index and the coefficient of skewness may have an impact on the RB of $\hat{G}_n^{c}$, so we suggest bias functions that depend on $\hat{G}_n^{c}$ and $\hat{\gamma}$. The expected values $\bar{G}_n^{c.B}$ and $\bar{\gamma}^{B}$ based on bootstrap samples are also considered. Table A2 from the supplementary material describes the $L = 7$ candidate bias functions considered in Step 6 of the suggested algorithm. For the sake of simplicity, we only consider multiple linear regression functions, but more complex functions and/or additional statistics can also be used, and are expected to yield more accurate results.
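Steps 6 to 8 can be sketched as follows, under strong simplifying assumptions. Only two candidate functions are compared (a constant and the linear form of Equation (16)), the records are synthetic rather than generated from pseudo original samples, and all names (`fit_linear`, `mse_after_correction`, `synthetic_record`) are hypothetical.

```python
import random

def fit_linear(x, d):
    """Least-squares fit of d = a + b * x; returns (a, b)."""
    n = len(x)
    mx, md = sum(x) / n, sum(d) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxd = sum((xi - mx) * (di - md) for xi, di in zip(x, d))
    b = sxd / sxx
    return md - b * mx, b

def mse_after_correction(predict, records):
    """Step 8: MSE of the corrected estimates (g_hat - predicted bias) against the plausible G."""
    errors = [(r["g_hat"] - predict(r["x"]) - r["g_true"]) ** 2 for r in records]
    return sum(errors) / len(errors)

rng = random.Random(7)

def synthetic_record():
    """Synthetic stand-in for one pseudo original sample (assumed bias model, not real data)."""
    g_true = rng.uniform(0.3, 0.7)                 # plausible Gini value
    x = rng.uniform(-2.0, 0.0)                     # bootstrap mean skewness minus skewness estimate
    d = -0.01 + 0.03 * x + rng.gauss(0.0, 0.002)   # bias target D
    g_hat = g_true + d + rng.gauss(0.0, 0.01)      # uncorrected estimate
    return {"g_true": g_true, "x": x, "d": d, "g_hat": g_hat}

training = [synthetic_record() for _ in range(60)]     # T = 60
validation = [synthetic_record() for _ in range(140)]  # V = 140

# Step 6: estimate the coefficients on the training group.
a, b = fit_linear([r["x"] for r in training], [r["d"] for r in training])
mean_d = sum(r["d"] for r in training) / len(training)
candidates = {
    "constant": lambda x: mean_d,
    "linear": lambda x: a + b * x,
}

# Steps 7-8: correct the validation estimates and keep the function with the smallest MSE.
scores = {name: mse_after_correction(f, validation) for name, f in candidates.items()}
best = min(scores, key=scores.get)
```

Selecting on the validation group rather than the training group is what guards against a correction that lowers bias but inflates variance.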
For an infinite population, additional bias correction procedures can be computed (see Wolter, 2007). For instance, we also calculated, in Section 5, bootstrap methods based on additive and multiplicative corrections (see Hall and Maiti, 2006, and Pfeffermann and Correa, 2012, for detailed definitions), but we omitted them because the bias corrected estimators (15) and (17) are less biased. Pfeffermann and Correa (2012) also argue that the aforementioned additive and multiplicative corrections may yield non-negligible biases with small samples, meaning alternative bias correction procedures may be preferable.
Bootstrap and jackknife techniques were originally designed for infinite populations, and do not have a direct application to finite populations due to the inherent features of survey sampling. Adjustments are thus required to apply these methods to finite populations (Quatember, 2015). For finite populations, the rescaled bootstrap technique (Rao et al., 1992) can be used for bias correction of a given empirical version of $G$. This method has been used in many research studies (Berger and Muñoz, 2015; Moya et al., 2020; etc.) in many areas (see Yang et al., 2010; Muñoz et al., 2018; etc.). Simplicity is the main advantage of the rescaled bootstrap over alternative bootstrap methods, which can be more computationally intensive. The rescaled bootstrap consists in computing a new set of weights (named bootstrap weights) for each bootstrap sample, which are obtained by applying a scale adjustment to the original survey weights $w_i$. Specifically, the bootstrap weights are given by
$$w_i^{(b)} = w_i \, m_i^{(b)} \, \frac{n}{n-1},$$
with $i \in s$ and $b = 1, \ldots, B$, where $m_i^{(b)}$ denotes the number of times that the $i$-th unit is selected in the bootstrap sample $s^{(b)}$. For a finite population, we first consider the additive bias corrected estimator of $G$ based on the rescaled bootstrap, which is defined as
$$\hat{G}_w^{c.Br} = \hat{G}_w^{c} - (\bar{G}_w^{c.B} - \hat{G}_w^{c}) = 2\hat{G}_w^{c} - \bar{G}_w^{c.B}, \tag{18}$$
where $\bar{G}_w^{c.B}$ is the expected value of $\hat{G}_w^{c}$ based on the bootstrap estimates $\hat{G}_w^{c(b)}$, which are defined as $\hat{G}_w^{c}$ after substituting the original survey weights $w_i$ with the bootstrap weights $w_i^{(b)}$.
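The rescaled bootstrap weights and the additive correction of Equation (18) can be sketched as below. Two points are assumptions rather than statements from the text: each bootstrap sample draws $n - 1$ units with replacement (the usual Rao et al. construction), and `weighted_gini` is a generic weighted mean-difference estimator used as a stand-in for Equation (12). All function names and the toy data are hypothetical.

```python
import random

def weighted_gini(y, w):
    """Weighted Gini via the weighted mean absolute difference (stand-in for Equation (12))."""
    n_hat = sum(w)
    mean_w = sum(wi * yi for wi, yi in zip(w, y)) / n_hat
    diff = sum(wi * wj * abs(yi - yj) for wi, yi in zip(w, y) for wj, yj in zip(w, y))
    return diff / (2.0 * n_hat ** 2 * mean_w)

def rescaled_weights(w, rng):
    """Bootstrap weights w_i * m_i * n / (n - 1), with m_i the selection count of unit i."""
    n = len(w)
    counts = [0] * n
    for _ in range(n - 1):  # assumed: n - 1 draws with replacement
        counts[rng.randrange(n)] += 1
    return [wi * mi * n / (n - 1.0) for wi, mi in zip(w, counts)]

def additive_correction(y, w, B=200, rng=None):
    """Return (corrected, original, bootstrap mean) following 2 * G_hat - G_bar."""
    rng = rng or random.Random(0)
    g_hat = weighted_gini(y, w)
    g_bar = sum(weighted_gini(y, rescaled_weights(w, rng)) for _ in range(B)) / B
    return 2.0 * g_hat - g_bar, g_hat, g_bar

rng = random.Random(11)
y = [rng.lognormvariate(0.0, 1.0) for _ in range(40)]  # toy study variable
w = [rng.uniform(50.0, 150.0) for _ in range(40)]      # toy survey weights
g_corr, g_hat, g_bar = additive_correction(y, w, B=200, rng=rng)
```

Units that are not selected in a bootstrap sample receive a zero weight, so they drop out of the weighted estimator automatically.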
Second, we also consider the aforementioned empirical bootstrap bias correction. We now describe an extension of this method to finite populations. It requires the rescaled bootstrap technique along with confidence intervals and estimators based on survey weights. A confidence interval that can be used in Step 1 is given by the limits $G_L = \hat{G}_{w(0.005)}^{c.B}$ and $G_U = \hat{G}_{w(0.995)}^{c.B}$, where $\hat{G}_{w(\alpha)}^{c.B}$ is the $\alpha$-th quantile of the weighted estimates $\hat{G}_w^{c(b)}$ derived from the rescaled bootstrap and based on $\hat{G}_w^{c}$. The following Step 1-b must be included between Steps 1 and 2:
Step 1-b (Pseudo original finite population). Generate a pseudo original population $U^{*}$ with observations given by $\{y_i^{*} : i \in U^{*}\}$ and selected from $f(y; G_h)$, with $i = 1, \ldots, K$.
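Generating data from $f(y; G_h)$ requires mapping a Gini value to distribution parameters. A minimal sketch for the Lognormal case is given below, using the known identity $G = 2\Phi(\sigma/\sqrt{2}) - 1$ for a Lognormal with shape $\sigma$. The function names, the population size and the choice of a zero location parameter are assumptions for illustration; other distributions (Pareto, Dagum, Fisk) need their own mapping.

```python
import math
import random
from statistics import NormalDist

def lognormal_sigma(gini_index):
    """Shape parameter sigma of a Lognormal distribution whose Gini index equals gini_index."""
    return math.sqrt(2.0) * NormalDist().inv_cdf((1.0 + gini_index) / 2.0)

def pseudo_population(gini_index, size=10000, rng=None):
    """Draw a pseudo original population from a Lognormal with the requested Gini index."""
    rng = rng or random.Random(0)
    sigma = lognormal_sigma(gini_index)
    # The location parameter is set to 0 because the Gini index is scale invariant.
    return [rng.lognormvariate(0.0, sigma) for _ in range(size)]

def gini(y):
    """Plug-in Gini estimate via the sorted-data formula, used here only as a check."""
    ys = sorted(y)
    n = len(ys)
    weighted = sum((i + 1) * v for i, v in enumerate(ys))
    return 2.0 * weighted / (n * sum(ys)) - (n + 1.0) / n

population = pseudo_population(0.5, size=10000, rng=random.Random(5))
check = gini(population)  # should be close to the requested value 0.5
```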
The pseudo original samples of Steps 2 and 5 are selected from $U^{*}$ instead of $f(y; G_h)$, using the same sampling design as for the original sample $s$. The weighted coefficient of skewness is defined as
$$\hat{\gamma}_w = \frac{m_{3.w}}{\hat{\sigma}_w^{3}}, \tag{19}$$
where $\hat{\sigma}_w = (m_{2.w})^{1/2}$ and $m_{k.w} = \hat{N}^{-1} \sum_{i \in s} w_i (y_i - \bar{y}_w)^{k}$. In Step 3, $\hat{G}_t^{c}$ and $\hat{\gamma}_t$ are replaced by $\hat{G}_w^{c}$ and $\hat{\gamma}_w$, which are calculated using the sample $s_t$ and Equations (12) and (19), respectively. In Step 4, bootstrap estimates are substituted with $\hat{G}_w^{c(b)}$ and $\hat{\gamma}_w^{(b)}$, which are obtained using the rescaled bootstrap method. The expected values of $\hat{G}_w^{c}$ and $\hat{\gamma}_w$ based on the rescaled bootstrap are denoted, respectively, as $\bar{G}_w^{c.B}$ and $\bar{\gamma}_w^{B}$, and they are computed using $\hat{G}_w^{c(b)}$ and $\hat{\gamma}_w^{(b)}$ in Equation (13), after substituting $r$ and $R$ with $b$ and $B$, respectively. The same set of eligible functions is used in Step 6, but they depend on weighted quantities, i.e., $g_l(\hat{G}_w^{c}; \bar{G}_w^{c.B}; \hat{\gamma}_w; \bar{\gamma}_w^{B})$. In Step 7, the suggested bias corrected estimators of $G_v$ are defined by $\hat{G}_w^{c.Bp}(l) = \hat{G}_w^{c} - \hat{g}_l(\hat{G}_w^{c}; \bar{G}_w^{c.B}; \hat{\gamma}_w; \bar{\gamma}_w^{B})$, and they are used to identify, in Step 8, the optimum function $\hat{g}_{opt}$. Finally, the suggested bias corrected estimator that we compute in finite populations is given by
$$\hat{G}_w^{c.Bp} = \hat{G}_w^{c} - \hat{g}_{opt}(\hat{G}_w^{c}; \bar{G}_w^{c.B}; \hat{\gamma}_w; \bar{\gamma}_w^{B}). \tag{20}$$
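The weighted coefficient of skewness in Equation (19) can be sketched in a few lines of Python. This is an illustration under the assumption that the population size is estimated by the sum of the weights; the function name `weighted_skewness` is hypothetical.

```python
def weighted_skewness(y, w):
    """Third weighted central moment divided by the cube of the weighted standard deviation."""
    n_hat = sum(w)  # assumed estimator of the population size
    mean_w = sum(wi * yi for wi, yi in zip(w, y)) / n_hat
    m2 = sum(wi * (yi - mean_w) ** 2 for wi, yi in zip(w, y)) / n_hat
    m3 = sum(wi * (yi - mean_w) ** 3 for wi, yi in zip(w, y)) / n_hat
    return m3 / m2 ** 1.5

# A unit with weight 2 counts the same as two identical units with weight 1.
skew_weighted = weighted_skewness([1.0, 2.0, 5.0], [2.0, 1.0, 1.0])
skew_expanded = weighted_skewness([1.0, 1.0, 2.0, 5.0], [1.0, 1.0, 1.0, 1.0])
```

With equal weights the function reduces to the usual moment-based coefficient of skewness.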
For finite populations, additional bias correction procedures can also be computed. For instance, we also calculated, in Section 5, Campbell's (1980) jackknife (see Berger and Skinner, 2005, and Berger, 2008) and the multiplicative bootstrap method (see Hall and Maiti, 2006), but we omitted them because Campbell's jackknife is more biased than (18) and (20), and additive and multiplicative methods give similar results.
5 Simulation studies to analyse the bias correction procedures
Description
We now evaluate various bias correction procedures by means of the RB measure defined in Section 3, using the probabilistic distributions and sample sizes also described in that section. As discussed in Section 4, bias correction procedures substantially mitigate the detected biases, but the price to pay is a possible increase in the MSE. For this reason, we use the relative root mean square error (RRMSE) to investigate the effect of bias corrections on efficiency. For a given statistic $\hat{\theta}$, the corresponding RRMSE based on $R$ replications is defined as
$$RRMSE[\hat{\theta}] = 100 \times \frac{(MSE[\hat{\theta}])^{1/2}}{\theta},$$
where the empirical mean square error is given by $MSE[\hat{\theta}] = R^{-1} \sum_{r=1}^{R} (\hat{\theta}^{(r)} - \theta)^{2}$. Sections 6 and 8 give more detailed discussions on the importance of both bias and MSE measures. In Section 3, we observed that $\hat{G}_n^{a}$ yields extremely large values of RB when $G$ is small, and $\hat{G}_n^{b}$ is slightly more biased than $\hat{G}_n^{c}$. For the sake of clarity, $\hat{G}_n^{a}$ and $\hat{G}_n^{b}$ and the corresponding bias corrected estimators are omitted from the figures in this section, but they can also be computed, as discussed in Section 4. Similarly, for finite populations, the weighted estimators $\hat{G}_w^{a}$ and $\hat{G}_w^{b}$ are omitted. For infinite populations, the percentage of times that each eligible bias function is selected as the optimum function of the suggested algorithm described in Section 4 can be seen in the supplementary material (Table A3). For the various probabilistic distributions, the bias function defined in Equation (16) is the one most often selected as the optimum function.
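The two performance measures can be computed from Monte Carlo replications as in the short sketch below. The RB is defined in Section 3, which is not reproduced here, so the form $100 \times (\bar{\theta} - \theta)/\theta$ is an assumption; the function names and the three toy replications are hypothetical.

```python
import math

def relative_bias(estimates, theta):
    """Assumed RB: 100 * (mean of the replications - theta) / theta."""
    mean = sum(estimates) / len(estimates)
    return 100.0 * (mean - theta) / theta

def rrmse(estimates, theta):
    """RRMSE: 100 * sqrt(empirical MSE) / theta."""
    mse = sum((e - theta) ** 2 for e in estimates) / len(estimates)
    return 100.0 * math.sqrt(mse) / theta

replications = [0.4, 0.5, 0.6]  # toy values of the statistic over R = 3 replications
rb_value = relative_bias(replications, 0.5)
rrmse_value = rrmse(replications, 0.5)
```

In this toy case the estimator is unbiased on average (RB of zero) yet its RRMSE is about 16%, which is why both measures are reported side by side.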
$B = 1000$ bootstrap samples are used in bootstrap methods. Following Pfeffermann and Correa (2012), we consider $H = 200$ plausible parameters, of which $T = 60$ and $V = 140$ are used for the training and validation groups, respectively. Samples are selected from finite populations with size $N = 10000$, which in turn are drawn from the investigated continuous distributions. We consider unequal inclusion probabilities by using the randomized systematic sampling design (Wu and Thompson, 2020). The effect of the design is increased by generating inclusion probabilities $\pi_i$ with a correlation of 0.7 between $\pi_i$ and $y_i$ (Berger and Gedik-Balay, 2020).

[Figure: RB (vertical axis) against the expected values of the estimators (horizontal axis), based on samples with size $n = 50$, and randomly selected from various continuous probabilistic distributions (infinite populations). Panels: Pareto, Dagum-p20, Dagum-p0.5, Lognormal.]
A simulation-based criterion for deciding when to use bias correction
[Figure: RRMSE (vertical axis) against the expected values of the estimators (horizontal axis) for $\hat{G}_n^{c}$, $\hat{G}_n^{c.Jo}$ and $\hat{G}_n^{c.Bp}$. Panels: Pareto, Dagum-p20, Dagum-p0.5, Lognormal.]

Results from Section 3 indicate that the three common empirical versions of $G$ can be biased for heavy-tailed distributions, which may be a serious issue when the Gini index is large. An important problem that arises in practice is determining when to use bias correction procedures. Note that the bias and the MSE (or equivalently the RB and the RRMSE) are two relevant measures to evaluate the quality of estimators. However, as noted by Särndal et al. (2003, p. 164), the bias must also be small relative to the standard error, since failure to meet this requirement may result in invalid confidence intervals and/or undesirable coverage probabilities. This ratio between the bias and the standard error is popularly referred to as the bias ratio. For a given statistic $\hat{\theta}$ and $R$ replications, the empirical bias ratio (BR) is defined as
$$BR[\hat{\theta}] = \frac{B[\hat{\theta}]}{(V[\hat{\theta}])^{1/2}},$$
where the empirical variance is given by $V[\hat{\theta}] = R^{-1} \sum_{r=1}^{R} (\hat{\theta}^{(r)} - \bar{\theta})^{2}$. Like the MSE, the BR also involves both bias and variance of the estimator. Särndal et al. (2003, p. 41) advise empirical researchers to avoid estimators that are considerably biased, and instead seek out estimators with small biases, and then choose one with a small variance. Following this idea, the suggested criterion consists of analysing both RB and BR measures for the estimator $\hat{G}_n^{c}$, and bias corrections are suggested when non-negligible biases are observed. RRMSE values can be used to choose the most efficient bias correction estimator. Särndal et al. (2003, p. 165) indicate that the effect of the bias ratio on the coverage probability can be ignored when $|BR| < 0.1$, and the use of bias corrected estimators is not justified in that case. The effect on the coverage probability is not extremely pronounced when $|BR| \leq 0.5$, but it can be a serious problem otherwise. On the other hand, absolute values of RB lower than 2% can be considered negligible, and bias correction procedures are not recommended if this is the case. In summary, bias corrections are suggested when the estimator $\hat{G}_n^{c}$ satisfies $|BR| \geq 0.1$ and $|RB| \geq 2\%$.
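The criterion summarised above can be expressed as a short sketch. The bias ratio follows the definition in the text; the RB form is assumed as $100 \times$ bias$/\theta$, and the function names and toy replications are hypothetical.

```python
import math

def bias_ratio(estimates, theta):
    """Empirical bias divided by the empirical standard deviation of the replications."""
    r = len(estimates)
    mean = sum(estimates) / r
    variance = sum((e - mean) ** 2 for e in estimates) / r
    return (mean - theta) / math.sqrt(variance)

def relative_bias(estimates, theta):
    """Assumed RB (in %): 100 * (mean of the replications - theta) / theta."""
    return 100.0 * (sum(estimates) / len(estimates) - theta) / theta

def needs_correction(br, rb):
    """Bias correction is suggested only when |BR| >= 0.1 and |RB| >= 2%."""
    return abs(br) >= 0.1 and abs(rb) >= 2.0

replications = [0.44, 0.46, 0.48]  # toy replications of an estimator of theta = 0.5
br = bias_ratio(replications, 0.5)
rb = relative_bias(replications, 0.5)
decision = needs_correction(br, rb)
```

Here the estimator underestimates the target by about 8% with a bias roughly 2.4 standard deviations wide, so both conditions hold and a correction would be suggested.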
The aim of Figure 7 is to show that the customary estimator $\hat{G}_n^{c}$ can yield poor bias ratios, with bias corrections justified because they substantially reduce this problem. For infinite populations and samples with size $n = 50$, $\hat{G}_n^{c}$ yields absolute values of BR close to 1.4, and poor coverage probabilities are expected. Furthermore, the vertical lines in Figure 7 indicate the first expected value ($\bar{G}_n^{c}$) with non-negligible biases, i.e., with absolute values of RB larger than 2%. We see that the condition imposed by the bias ratio ($|BR| < 0.1$) is more demanding than the condition based on the relative bias ($|RB| < 2\%$), i.e., the first value of $\bar{G}_n^{c}$ with $|BR| \geq 0.1$ is smaller than the first value of $\bar{G}_n^{c}$ with $|RB| \geq 2\%$. For example, non-negligible biases are observed for the Pareto distribution when $\bar{G}_n^{c} \approx 0.2$, and the absolute value of BR is larger than 0.1 in this situation. These results reveal the presence of a mild bias problem, which can be solved using bias correction procedures, as can be seen in Figures 4 and 7. From Figure 7 we also observe that the BR values of the corrected estimators, in absolute terms, are generally smaller than 0.5, and are substantially smaller than those of $\hat{G}_n^{c}$. The desirable properties in terms of both BR and RB measures and the negligible impact on the efficiency (see the Pareto distribution in Figure 6 when $\bar{G}_n^{c} = 0.2$) indicate that correction procedures are recommended to mitigate the detected biases.

[Figure 7: $|BR|$ (vertical axis) against the expected values of the estimators (horizontal axis) for $\hat{G}_n^{c}$, $\hat{G}_n^{c.Jo}$ and $\hat{G}_n^{c.Bp}$, based on samples with size $n = 50$, and randomly selected from various continuous probabilistic distributions (infinite populations). Using the estimator $\hat{G}_n^{c}$, horizontal and vertical dotted lines are fixed, respectively, at $|BR| = 0.1$ and at the first expected value with $|RB| > 2\%$. Panels: Pareto, Dagum-p20, Dagum-p0.5, Lognormal.]
In Figure 8 we suggest a simulation-based criterion for deciding when to use bias correction procedures. This method is based on the expected values $\bar{G}_n^{c}$ and $\bar{\gamma}$, since the Gini index and the skewness have a direct effect on the bias. Samples, with sizes between 50 and 1000, are drawn from the most skewed probabilistic distributions described in Section 3.
Using the estimator $\hat{G}_n^{c}$, this criterion is based on two conditions: $|BR| \geq 0.1$ and $|RB| \geq 2\%$. A grading scale classifies the non-negligible biases into three categories: mild ($2 \leq |RB| < 5$), moderate ($5 \leq |RB| < 10$) and severe ($|RB| \geq 10$). This scale can be used to identify the scenarios where bias corrections are either weakly or strongly recommended. Thus, although mild biases are not a serious issue, the use of bias correction procedures is suggested to reduce them. Bias corrections are highly recommended in the case of moderate biases. The bias is a serious problem in the presence of severe biases, meaning bias corrections are strongly advised.

[Figure 8: Expected skewness (vertical axis) against the expected values of $\hat{G}_n^{c}$ (horizontal axis), with regions graded as Mild, Moderate and Severe. Panels: n=50, n=200, n=500, n=1000.]

[...] mild biases are expected when $\bar{G}_n^{c} \geq 0.39$, and moderate biases are obtained when $\bar{G}_n^{c} \geq 0.47$. For samples with size $n = 500$, mild and moderate biases are expected when $\bar{G}_n^{c} \geq 0.48$ and $\bar{G}_n^{c} \geq 0.56$, respectively. Finally, for $n = 1000$, bias corrections could be applied when $\bar{G}_n^{c} \geq 0.49$, and moderate biases are observed when $\bar{G}_n^{c} \geq 0.57$. Severe biases are not expected when $n \geq 1000$ and $G \leq 0.8$ (the maximum true Gini index considered in this study). In summary, for small sample sizes (e.g., $n = 50$), bias correction procedures may be required when the estimates of the Gini index are greater than 0.2, and can be highly recommended when they exceed 0.37. For larger sample sizes (e.g., $n = 500$), mild biases can be expected when the estimates of the Gini index are greater than 0.48, and bias corrections are highly advisable when the estimates of the Gini index are larger than 0.56. For different sample sizes and Gini indices associated with specific data that empirical researchers are analysing, the aim of Figure 8 is to depict the values of the coefficient of skewness that would require the use of bias corrections.
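The grading scale can be written as a small helper. This is only an illustrative sketch: the function name `grade_bias` and the "negligible" label for values below 2% are naming choices made here, while the cut-offs (2, 5 and 10, in percent) are those stated in the text.

```python
def grade_bias(rb_percent):
    """Classify a relative bias (in %) using the mild / moderate / severe cut-offs."""
    size = abs(rb_percent)
    if size < 2.0:
        return "negligible"
    if size < 5.0:
        return "mild"
    if size < 10.0:
        return "moderate"
    return "severe"

# The sign of the bias does not matter, only its absolute size.
grades = [grade_bias(rb) for rb in (-1.5, -3.0, 7.2, -12.0)]
```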
Applications to real data sets
In this section, bias correction procedures are applied for estimating the Gini index in a total of six subpopulations with sizes between 26 and 503, and derived from three real data sets (see Table 1 ). A common goal in most surveys is not only to provide estimates for the whole population, but also for specific subpopulations (also named domains). For instance, estimates of unemployment in labour-force surveys are provided at national level, but this information is also of special interest at provincial and local levels. In household surveys, subpopulations are usually created on the basis of household sizes or consumption units. Age, sex and occupational groups are also often used to create subpopulations in many studies.
The first real data set consists of total net household incomes extracted from the 2019 Spanish Survey on Income and Living Conditions (ES-SILC). Subpopulations, with sizes $n = \{26, 51\}$, are created using different consumption units, with the aim of using the Gini index to estimate income inequality. The second real data set is obtained from the World Bank's Enterprise Survey (WBES), which has been used extensively in international management studies (Vendrell-Herrero et al., 2022; Gomes et al., 2018; etc.). For private sector firms from over 130 developed and developing countries, the WBES contains information on a broad range of topics including competition performance, corruption, financial data, infrastructure, technology, etc. Using this survey, we estimate the Gini index of the labour productivity per hour worked in Argentinean firms for the years 2017 and 2018. The sizes of the resulting subpopulations are $n = \{61, 503\}$.
Finally, the third real data set (named WATER) consists of a survey on shower habits conducted in Andalusia, a region in southern Spain facing water scarcity. The interest is to analyse the inequality in time spent showering, creating subpopulations, with sizes $n = \{38, 74\}$, using the number of inhabitants at provincial level.
The bias corrected estimator $\hat{G}_n^{c.Bp}$ is based on a continuous probabilistic distribution, and for this reason the Kolmogorov-Smirnov (KS) goodness-of-fit test is used to fit distributions to the various subpopulations used in this study. From Table 1 we observe that the Lognormal, Fisk, Dagum and Weibull distributions yield KS p-values above the usual significance level (5%), and the null hypothesis that data come from the corresponding continuous probabilistic distribution is not rejected. For the various subpopulations, the simulation-based criterion described in Section 6 indicates that the use of a bias correction procedure is recommended, since non-negligible biases and BRs greater than 0.1 are expected according to Figure 8. In particular, the estimator $\hat{G}_n^{c}$ is expected to underestimate the true Gini index, with higher estimates expected from the bias corrected estimators.
For the subpopulation with size $n = 51$ derived from ES-SILC, we observe that estimates of the coefficient of skewness and the Gini index are, respectively, $\hat{\gamma} = 3.51$ and $\hat{G}_n^{c} = 0.476$. These results indicate that moderate biases are expected according to Figure 8. As we expected, bias corrected estimators provide higher estimates than $\hat{G}_n^{c}$, with values as much as 4.8% larger than $\hat{G}_n^{c} = 0.476$ (see the estimation of $\hat{G}_n^{c.Bp}$ based on the Dagum distribution). For $n = 503$ in the WBES population, the estimates $\hat{\gamma} = 15.85$ and $\hat{G}_n^{c} = 0.734$ indicate the presence of serious biases, and the difference with respect to the estimator $\hat{G}_n^{c}$ goes from 2.5% ($\hat{G}_n^{c.Bp} = 0.752$) to 6.8% ($\hat{G}_n^{c.Jo} = 0.784$). For the various subpopulations in this study, we observe that estimates derived from the bias correction procedures are larger than estimates based on $\hat{G}_n^{c}$, a result which coincides with the findings of Sections 3 and 5.
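The percentage differences quoted above can be checked directly from the estimates reported in Table 1. The helper name `percent_increase` is hypothetical; the input values are those given in the text.

```python
def percent_increase(corrected, original):
    """Relative difference (in %) of a corrected estimate with respect to the original one."""
    return 100.0 * (corrected / original - 1.0)

es_silc_dagum = percent_increase(0.499, 0.476)  # ES-SILC, n = 51, Dagum-based correction
wbes_low = percent_increase(0.752, 0.734)       # WBES, n = 503, smaller corrected estimate
wbes_high = percent_increase(0.784, 0.734)      # WBES, n = 503, larger corrected estimate
```

Rounded to one decimal place, these give 4.8%, 2.5% and 6.8%, matching the figures in the paragraph.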
Discussion
The Gini index is a very popular indicator to measure inequality that has been used in many economic studies. For discrete distributions, the Gini index is usually estimated using a plug-in formulation of a given theoretical definition of the Gini index for continuous distributions. This methodology may introduce a serious bias in comparison to the true (asymptotic) value of the Gini index. Note that the Gini index can also be estimated using techniques such as empirical likelihood (Owen, 2001) , but there is no simple application of this method to complex sampling designs. The analysis of alternative estimation methodologies is beyond the scope of this paper, i.e., we assume the classical formulations derived from theoretical definitions of the Gini index.
First, this paper attempts to provide a better overview of the problem of estimating the Gini index by regrouping and classifying the most common empirical versions proposed for discrete distributions, and defined under the two existing statistical theories (infinite and finite populations). Second, this paper identifies the scenarios where the bias may be a serious issue, and such scenarios are based on common continuous distributions often used in the modelling of income distributions. For instance, $\hat{G}_n^{a}$ (denoted as $\hat{G}_w^{a}$ in finite populations) yields large biases when the Gini index and the sample size are small, but this bias problem can be easily solved by using the midpoint distribution function in the definition of $\hat{G}_n^{a}$. When all the sample observations are different, another solution is to use one of the transformations described in Equations (8) and (9). In addition, results derived from this study indicate that the various empirical versions of $G$ produce serious biases in the presence of heavy-tailed distributions and large Gini indices. Accordingly, bias correction procedures are suggested to mitigate this bias problem, and they are investigated using Monte Carlo simulation studies. We also describe a simulation-based criterion for deciding when to use bias corrections. Finally, bias corrected procedures are illustrated by application to the problem of estimating the Gini index in various real data sets.
The empirical bootstrap obtains less biased estimates than alternative bias correction procedures. With infinite populations, the traditional jackknife performs well in terms of relative bias. For finite populations, the rescaled bootstrap may reduce the bias of the existing empirical versions of $G$. It is important to note that the empirical bootstrap is a parametric procedure that requires generating sets of data from the probabilistic distribution fitted to the original sample. However, the use of continuous distributions in the modelling of income distributions is a common practice in many real-world applications, and the empirical bootstrap can thus be implemented if this is the case. In addition, it should be noted that for the sake of simplicity the empirical bootstrap bias correction is based only on standard regression functions, but alternative bias functions can also be used, and they may potentially improve the performance of this method. Finally, the empirical bootstrap is more computationally intensive than alternative procedures, but this is not a problem with current computing facilities.
The outcome of the grading scale described in Section 6 can help empirical researchers decide whether the specific data they are analysing have non-negligible biases and large bias ratios, meaning the use of bias corrections would therefore be recommended. For heavy-tailed distributions, non-negligible biases may appear in small samples (e.g., $n = 50$) from low estimates of the Gini index (e.g., $\hat{G}_n^{c} \geq 0.2$). For samples with sizes $n = 200$ and $n = 1000$, non-negligible biases can be expected for estimates of the Gini index greater than 0.4 and 0.5, respectively. Severe biases are not expected when the sample size is larger than 1000. Figure 8 gives a more precise understanding of the conditions required in practice to apply a bias correction, which depend on the sample size and estimates of both the coefficient of skewness and the Gini index.
Both bias and MSE measures are important to evaluate the quality of estimators. Numerous authors indicate that the use of bias correction procedures may have an impact on the efficiency of bias corrected estimators. This issue has also been investigated in this paper, with the results indicating that said impact is not relevant, especially as the sample size increases. The empirical bootstrap is more efficient than alternative bias correction procedures, but slightly less efficient than the customary empirical versions of $G$, and may even have the smallest MSEs for large Gini indices. Conventional advice in the literature is to avoid estimators that are considerably biased, so empirical researchers should seek estimators with smaller biases, and then choose one with a small variance. Following this idea, the empirical bootstrap can be a good choice for estimating the Gini index in the scenarios discussed in Sections 5 and 6. However, alternative bias correction procedures also perform well in terms of bias and efficiency in many situations, and they may be preferable in terms of simplicity.
For less skewed distributions (e.g., Weibull and Gamma), the bias of $\hat{G}_n^{c}$ is not a problem, and the bias of $\hat{G}_n^{b}$ lies within a reasonable range. This implies that bias correction procedures are not required for less skewed distributions. Bias corrections are applied to $\hat{G}_n^{c}$ because it shows the best performance in this study. However, such procedures can easily be applied to any other estimation method in the literature.
The observed biases may have an important impact on the coverage rates of confidence intervals of the Gini index, especially in the case of the large bias ratios obtained by the estimator $\hat{G}_n^{c}$. This implies that bias corrected estimators are highly recommended for the construction of confidence intervals, since intervals based on uncorrected estimators can be invalid and/or yield undesirable coverage probabilities in the case of moderate or severe biases. Large biases are also observed for the bias corrected estimators in the case of large Gini indices and highly skewed distributions. These arguments represent promising directions for future research. For instance, the interval estimation based on bias corrected estimators can be investigated to analyse when such confidence intervals have desirable empirical coverages. Alternative estimation methodologies can also be used to improve the estimation of the Gini index. In particular, it would be interesting to reduce the biases that still remain in the aforementioned extreme situations (highly skewed distributions with large Gini indices). For instance, information from auxiliary variables can be incorporated at the estimation stage, and more accurate results are expected.
Government of Andalusia and the European Regional Development Fund (project P18-RT-576) and two grants of the University of Granada (Unidad Científica de Excelencia "Desigualdad, Derechos Humanos y Sostenibilidad -DEHUSO" del Plan Propio; and Programa de Ayudas a la revisión de textos científicos de la Facultad de Ciencias Económicas y Empresariales).
Table 1: [...] is based on probabilistic distributions that fit the data.

| Population | $n$ | $\hat{\gamma}$ | $\hat{G}_n^{c}$ | $\hat{G}_n^{c.Jo}$ | $\hat{G}_n^{c.Bp}$ | Distribution | KS p-value |
|---|---|---|---|---|---|---|---|
| ES-SILC | 26 | 2.81 | 0.518 | 0.534 | 0.540 | Fisk | 0.85 |
| | | | | | 0.526 | Lognormal | 0.65 |
| | 51 | 3.51 | 0.476 | 0.486 | 0.488 | Fisk | 0.99 |
| | | | | | 0.480 | Lognormal | 0.62 |
| | | | | | 0.499 | Dagum | 0.11 |
| WBES | 61 | 3.22 | 0.505 | 0.513 | 0.509 | Lognormal | 0.38 |
| | | | | | 0.516 | Fisk | 0.37 |
| | 503 | 15.85 | 0.734 | 0.784 | 0.752 | Fisk | 0.57 |
| WATER | 38 | 3.92 | 0.358 | 0.368 | 0.370 | Dagum | 0.07 |
| | | | | | 0.359 | Weibull | 0.06 |
| | | | | | 0.366 | Fisk | 0.06 |
| | | | | | 0.363 | Lognormal | 0.05 |
| | 74 | 3.09 | 0.435 | 0.439 | 0.442 | Fisk | 0.10 |
| | | | | | 0.442 | Dagum | 0.10 |
References
- Inequality and poverty in Malaysia: Measurement and decomposition, by Sudhir Anand. New York: Oxford University Press, 1983, 371 pp. Price: $27.50 S Anand 10.1002/pam.4050030242 Journal of Policy Analysis and Management J Policy Anal Manage 0276-8739 1520-6688 3 2 1983 Wiley
- Pareto and the upper tail of the income distribution in the UK: 1799 to the present A B Atkinson Economica 84 334 2017
- Income modeling with the Weibull mixtures S A A Bakar D Pathmanathan Communications in Statistics-Theory and Methods 2020
- A note on the asymptotic equivalence of jackknife and linearization variance estimation for the Gini Coefficient Y G Berger Journal of Official Statistics 24 4 2008
- Confidence intervals of Gini coefficient under unequal probability sampling Y Berger Δ° Gedik Balay Journal of Official Statistics 36 2 2020
- On estimating quantiles using auxiliary information Y G Berger J F MuΓ±oz Journal of Official Statistics 31 1 2015
- A jackknife variance estimator for unequal probability sampling Y G Berger C J Skinner Journal of the Royal Statistical Society: Series B (Statistical Methodology) 67 1 2005
- Working from home and income inequality: risks of a βnew normalβ with COVID-19 Luca Bonacini Giovanni Gallo Sergio Scicchitano 0000-0003-1015-7629 10.1007/s00148-020-00800-7 Journal of Population Economics J Popul Econ 0933-1433 1432-1475 34 1 2021 Springer Science and Business Media LLC
- A Method of measuring inequality within a selection process N Bulle Sociological Methods & Research 45 1 2016
- A different view of finite population estimation C Campbell Proceedings of the Survey Research Methods Section the Survey Research Methods Section ASA 1980. 1980
- Horizontal inequality and data challenges C Canelas R M Gisselquist Social Indicators Research 143 1 2019
- Pareto's law of income distribution: Evidence for Germany, the United Kingdom, and the United States F Clementi M Gallegati 10.1007/88-470-0389-X_1 Econophysics of Wealth Distributions. New Economic Windows A Chatterjee S Yarlagadda B K Chakrabarti Milano Springer 2005
- Mathematical Methods of Statistics H Cramer 1957 Princeton University Press Seventh Printing, Princeton
- Reliable inference for the Gini index R Davidson of Econometrics 150 1 2009
- bias of the Gini coefficient: results and implications for empirical research G Deltas Review of Economics and Statistics 85 1 2003
- Variance estimation for complex statistics and estimators: Linearization and residual techniques J C Deville Survey Methodology 25 1999
- On measuring skewness and kurtosis D Dorić E Nikolić-Dorić V Jevremović J Mališić Quality & Quantity 43 3 2009
- More efficient bootstrap computations B Efron Journal of the American Statistical Association 85 1990
- An introduction to the bootstrap B Efron R J Tibshirani 1993 Chapman and Hall New York, London
- Small area estimation of the Gini concentration coefficient E Fabrizi C Trivisano Computational Statistics & Data Analysis 99 2016
- Calculating a standard error for the Gini coefficient: some further results D E Giles Oxford Bulletin of Economics and Statistics 66 2004
- Variabilità e mutabilità C Gini 1912 Reprinted in Memorie di metodologica statistica E Pizetti
- The Gini concentration index: a review of the inference literature G M Giorgi C Gigliarano Journal of Economic Surveys 31 4 2017
- Testing the self-selection theory in high corruption environments: evidence from African SMEs E Gomes F Vendrell-Herrero K Mellahi D Angwin C M Sousa 2018 International Marketing Review
- On parametric bootstrap methods for small area prediction P Hall T Maiti Journal of the Royal Statistical Society: Series B (Statistical Methodology) 68 2 2006
- On Gini's mean difference and Gini's index of concentration G Jasso American Sociological Review 44 5 1979
- Linking input inequality and outcome inequality G Jasso Sociological Methods & Research 50 3 2021
- Bias correction with jackknife, bootstrap, and Taylor series J Jiao Y Han IEEE Transactions on Information Theory 66 7 2020
- Variance estimation of the Gini index: revisiting a result several times published M Langel Y Tillé Journal of the Royal Statistical Society: Series A (Statistics in Society) 176 2 2013
- Decomposing the Gini inequality index: An expanded solution with survey data applied to analyze gender income inequality B Larraz Sociological Methods & Research 44 3 2015
- Spatial aggregation and resampling expansion of big surveys: An analysis of wage inequality B Larraz J M Pavía M Herrera-Gómez Regional Science Policy & Practice 13 3 2020
- Beyond the gender pay gap B Larraz J M Pavía L E Vila Convergencia 81 2019
- Methods of measuring the concentration of wealth M O Lorenz Publications of the American Statistical Association 9 70 1905
- Statistical Inference for Measures of Inequality With a Cross-National Bootstrap Application Timothy P Moran 10.1177/0049124105283117 Sociological Methods & Research 0049-1241 1552-8294 34 3 2006 SAGE Publications
- Rescaled bootstrap confidence intervals for the population variance in the presence of outliers or spikes in the distribution of a variable of interest P J Moya J F Muñoz E Álvarez-Verdejo F J Blanco-Encomienda Communications in Statistics-Simulation and Computation 2020
- On estimating the poverty gap and the poverty severity indices with auxiliary information J F Muñoz E Álvarez-Verdejo R M García-Fernández Sociological Methods & Research 47 3 2018
- J F Muñoz P J Moya E Álvarez-Verdejo 10.17605/OSF.IO/4YNBS R codes for estimators of the Gini index 2023
- A convenient method of computing the Gini index and its standard error T Ogwang Oxford Bulletin of Economics and Statistics 62 1 2000
- Empirical likelihood A B Owen 2001 Chapman and Hall/CRC
- Using the Dagum model to explain changes in personal income distribution C G Pérez M P Alaiz Applied Economics 43 28 2011
- Empirical bootstrap bias correction and estimation of prediction mean square error in small area estimation D Pfeffermann S Correa Biometrika 99 2 2012
- About capital in the twenty-first century T Piketty American Economic Review 105 5 2015
- Empirical likelihood confidence intervals for the Gini measure of income inequality Y Qin J N K Rao C Wu Economic Modelling 27 6 2010
- The bootstrap method in survey sampling A Quatember Pseudo-Populations Cham Springer 2015
- Some recent work on resampling methods for complex surveys J N K Rao C F J Wu K Yue Survey Methodology 18 1992
- A convenient descriptive model of income distribution: the gamma density A B Salem T D Mount Econometrica: Journal of the Econometric Society 1974
- Model assisted sampling C E Särndal B Swensson J Wretman 2003 Springer Science & Business Media
- Poverty, inequality and unemployment: Some conceptual issues in measurement A Sen Economic and Political Weekly 1973
- Reducing socioeconomic inequalities in the European Union in the context of the 2030 Agenda for Sustainable Development A Szymańska Sustainability 13 13 7409 2021
- The determinants of income inequality in OECD countries Pasquale Tridico 10.1093/cje/bex069 Cambridge Journal of Economics 0309-166X 1464-3545 42 4 2018 Oxford University Press (OUP)
- A simple correction to remove the bias of the Gini coefficient due to grouping T Van Ourti P Clarke Review of Economics and Statistics 93 3 2011
- Home-market economic development as a moderator of the self-selection and learning-by-exporting effects F Vendrell-Herrero C K Darko E Gomes D W Lehman Journal of International Business Studies 2022
- Use of a Gini index to examine housing price heterogeneity: A quantile approach J G Villar J M Raya Journal of Housing Economics 29 2015
- Changes in regional inequality in rural China: decomposing the Gini index by income sources G H Wan Australian Journal of Agricultural and Resource Economics 45 3 2001
- Comparison of Ferguson's δ and the Gini coefficient used for measuring the inequality of data related to health quality of life outcomes H Y Wang W Chou Y Shao T W Chien Health and Quality of Life Outcomes 18 2020
- Jackknife empirical likelihood confidence interval for the Gini index D Wang Y Zhao D W Gilmore Statistics & Probability Letters 110 2016
- Introduction to variance estimation K Wolter 2007 Springer Science & Business Media
- Simple single-stage sampling methods C Wu M E Thompson Sampling Theory and Practice (pp. 17-31) Cham Springer 2020
- Improvements in ability to detect undiagnosed diabetes by using information on family history among adults in the United States Q Yang T Liu R Valdez R Moonesinghe M J Khoury American Journal of Epidemiology 171 10 2010
- More than a dozen alternative ways of spelling Gini S Yitzhaki Research on Economic Inequality 8 1998