Published online 5 April 2007
Published in Soil Sci Soc Am J 71:656-668 (2007)
DOI: 10.2136/sssaj2006.0173
© 2007 Soil Science Society of America
677 S. Segoe Rd., Madison, WI 53711 USA
SOIL PHYSICS
A Random-Path Markov Chain Algorithm for Simulating Categorical Soil Variables from Random Point Samples
Weidong Li* and
Chuanrong Zhang
Dep. of Geography, Kent State Univ., Kent, OH 44242
* Corresponding author (weidong6616{at}yahoo.com).
 |
ABSTRACT
|
|---|
Quantitative prediction and simulation of categorical soil variables from limited samples are crucial for cost-effectively acquiring exhaustive area-class maps. Conventional methods, however, usually cannot meet all of the requirements for class simulation in incorporating interclass relationships and generating polygon features. We developed a random-path sequential simulation algorithm based on the Markov chain random field theory. Our objective was to find a suitable method for predictive area-class soil mapping from irregularly distributed point samples, and thus extend Markov chains into practical nonlinear geostatistics. The algorithm was used to simulate soil type maps conditioned on three different sample data sets, and was compared with the widely used indicator kriging simulation algorithmsequential indicator simulation with ordinary indicator kriging (SISoik). Results show that the algorithm works well with both dense and sparse random samples in reproducing all classes and input statistics. Compared with SISoik, the algorithm has the following advantages: (i) it more effectively captures complex patterns of soil classes and obeys their interclass relationships; (ii) it generates lower spatial uncertainty and more accurate realizationsfor example, the relative increases in average percentage of correctly classified locations of realizations for the sparse, medium, and dense data sets are 5.0, 9.9, and 8.5%, respectively; (iii) it generates polygon features in realizations in accordance with the style of area-class soil maps; and (iv) it can generate classes missed in sampling but confirmed by experts. We concluded that the algorithm provides a practical spatial statistical tool for prediction and simulation of categorical soil spatial variables.
Abbreviations: AMP, average maximum probability CPD, conditional probability distribution IK, indicator kriging MCG, Markov chain geostatistics MCRF, Markov chain random field MCSS, Markov chain sequential simulation OIK, ordinary indicator kriging PCC, percentage of correctly classified locations SIK, simple indicator kriging SIS, sequential indicator simulation SISoik, sequential indicator simulation with ordinary kriging TPM, transition probability matrix
 |
INTRODUCTION
|
|---|
Soil properties (or variables) are usually classified into multiple classes (or intervals) and delineated as area-class (i.e., polygon) maps. Area-class soil maps are widely used in soil science and they are crucial data for natural resources management and environmental studies. Because exhaustive field survey and sampling for accurate soil mapping are impractical, soil area-class mapping has to be based on limited samples and expert interpretations. Thus, hand-delineated maps are inevitably subject to map creators' personal understanding of the studied areas and classes. Furthermore, a single soil map cannot reflect the inherent spatial uncertainty of soil occurrence at unsampled locations. Therefore, it is desirable to use quantitative stochastic simulation techniques for soil mapping because they may avoid subjectivity in map creation and reveal the spatial uncertainty associated with soil classes mapped from limited samples.
Recently, a triplex Markov chain conditional simulation method was proposed for simulating soil classes from survey line data in the horizontal two dimensions (Li et al., 2004). While its practicality is limited, the method has demonstrated some special capabilities of multidimensional Markov chain models in dealing with soil classes. For example, interclass relationships (i.e., cross-correlations, neighboring relationships, and directional asymmetry of class patterns) can be simply incorporated into simulations through cross transition probabilities (i.e., transition probabilities between different classes) and large-scale soil patterns may be captured effectively when a simulation is conditioned on relatively abundant survey line data. To further extend Markov chains into a Markov-chain-based practical geostatistical technique, several major issues have to be solved, as mentioned in Li et al. (2004, p. 14881490) and Li and Zhang (2005): (i) parameter estimation from point samples; (ii) underestimation of small classes in simulated realizations (here small classes refers to classes that have area proportions less than the average); and (iii) algorithm design for dealing with point samples, particularly randomly or irregularly distributed point samples.
With the proposition of the transiogram as a new spatial relationship measure for categorical data (Li, 2006), the parameter estimation problem in Markov chain conditional simulation was solved by using transiogram models as input parameters. Transiogram models can be estimated from various data types (Li and Zhang, 2005, 2006). Recently, Li (2007a) further proposed a Markov chain random field (MCRF) theory, which proved the small-class underestimation problem of the coupled Markov chain model and solved it in Markov chain conditional simulation. The MCRF theory proposes a single-chain-based Markov random field and estimates unsampled locations by considering the distant (or high-order) interactions of the unknown location being estimated with its nearest known neighbors in different directions. Thus it differed from earlier multidimensional Markov chain models (for linear models, see Lin and Harbaugh, 1984; Luo, 1996; Carle and Fogg, 1997; for nonlinear models, see Elfeki and Dekking, 2001; Li et al., 2004), which used multiple one-dimensional Markov chains in multidimensional simulations, and also differed from Markov mesh models (sometimes also called Markov chain models in image analysis; e.g., Qian and Titterington, 1991; Gray et al., 1994; Wu et al., 2004) and conventional Markov random field models (e.g., Geman and Geman, 1984; Norberg et al., 2002), which used cliques in simulation. Because a MCRF contains only one chain in a space, it does not involve unwanted transitions and therefore does not underestimate small classes in simulated realizations. Here the term unwanted transitions refers to transitions of multiple chains to the same location with unequal states. Because these transitions cannot get a common state for the unsampled location to be estimated, they have to be excluded in building multiple-chain-based multidimensional Markov chain models (see Elfeki and Dekking [2001, Fig. 3] for exclusion of unwanted transitions in the coupled Markov chain model). While the MCRF theory provides the theoretical backbone of Markov chain geostatistics (MCG), effective applications of this new geostatistical idea to real world issues actually lie with the development of practical simulation algorithms. Therefore, a major task for the development of MCG is the algorithm design for dealing effectively with randomly or irregularly distributed point sample data, because random sampling is the traditionally and widely used sampling scheme in soil survey, and most existing soil sample data sets are random or irregular samples.

View larger version (35K):
[in this window]
[in a new window]
|
Fig. 3. The reference soil map of seven soil classes and randomly distributed sample data sets (dense: 646 points, 2.9% of total pixels; medium: 179 points, 0.8% of total pixels; sparse: 50 points, 0.2% of total pixels). The seven soil classes (soil series) include: 1, loamy alluvium; 2, Sparta; 3, Plainfield; 4, Dakota; 5, Richwood; 6, others (e.g., water); and 7, peat and muck.
|
|
In this study, we designed a MCRF-based random-path sequential simulation algorithm for simulating soil classes from randomly distributed point samples, and such an algorithm is also applicable to regular samples. We used different random sample data sets to test the effectiveness of the algorithm. The objective of the Markov chain sequential simulation (MCSS) algorithm design process was to develop the MCG idea into a practical technique for predictive soil mapping and soil spatial uncertainty analysis so that MCG becomes a widely applicable new geostatistical approach. To demonstrate the advantages and special features of the algorithm, we used the widely used indicator kriging (IK) simulation algorithmthe sequential indicator simulation (SIS) with ordinary indicator kriging (OIK) (see introductions in Goovaerts, 1997, p. 284300, 376378), denoted as SISoik here, to conduct simulations on the same data sets for a comparison study.
 |
MATERIALS AND METHODS
|
|---|
Markov Chain Random Field
The current major contents of MCG include the MCRF, the transiogram, and related simulation algorithms for dealing with different data types and information. The MCRF provides theoretically sound Markov-chain-based estimators (i.e., models) for conditional simulations on sample data. The transiogram was designed as a spatial measure of MCRFs and provides a flexible way for estimating transition probabilities with continuous lags from samples, which are needed by Markov chain conditional simulation models.
A MCRF refers to a random field of a single Markov chain that moves (or jumps) in a one- or multidimensional space randomly or following a prescribed path (Li, 2007a). The state of such a Markov chain at an unknown location is decided by the interactions of the Markov chain with its nearest known neighbors in different directions. The use of a single chain, rather than conventional multiple chains, avoids the occurrence of unwanted transitions and consequently overcomes the small-class underestimation problem that occurred in simulated realizations of the triplex Markov model suggested by Li et al. (2004).
In a MCRF, a Markov chain may jump to anywhere and decide its current state by considering the influence of nearest known neighbors in different directions. The conditional probability distribution (CPD) of a random variable Z at an unknown location u in a MCRF was derived (Li, 2007a) as
 | [1] |
where pklii(hi) represents a transition probability in the ith direction from state k to state li with a lag hi; u1 represents the neighbor from or across which the Markov chain moves to the current location u; m represents the number of nearest known neighbors; k, li, and f all represent states in the state space S = (1, ..., n); hi is the distance from the current location to the nearest known neighbor ui. With the lag h increasing from zero to a large number, any pkl(h) makes a transiograma transition probability diagram. Note that as a single-chain random field, there is always a coming direction.
Equation [1] is the general solution of a MCRF. In real applications, it is not necessary to consider many nearest known neighbors in different directions (note that this is similar to the general solution of Markov random fieldsthe Gibbsian function, which is largely simplified in real application models). Since the influence of remotely located data to the unknown location to be estimated is normally screened by closer data within a certain angle, it is proper for MCRF-based specific Markov chain models to consider only the nearest known neighbors in several cardinal directions within a search radius. Li (2007a) proved that the nearest known neighbors in cardinal directions are conditionally independent of each other in a Pickard random field (Pickard, 1980) for the sparse-data situation. Thus, MCRF-based Markov chain models that only consider nearest known neighbors in cardinal directions are mathematically correct.
If we consider four cardinal directions, it implies that at the initial simulation it is possible that we find fewer than four nearest known neighbors from a sparse sample data set in cardinal directions within a search radius. This situation is always true for boundary locations. For different numbers (i.e., 04) of nearest known neighbors found in cardinal directions (see Fig. 1
), we can get all of those corresponding MCRF-based Markov chain models from the general solution as follows:
- If four nearest known neighbors in cardinal directions are found within the search radius, that means m = 4 in Eq. [1]. Then the Markov chain model can be drawn from Eq. [1] as
 | [2] |
where, h1, h2, h3, and h4 represent the distances from the unknown location u to its nearest known neighbors u1, u2, u3, and u4 in the four cardinal directions, respectively; and k, l, m, p, and o represent the states of the five locations u, u1, u2, u3, and u4, respectively. There is only one such case for four cardinal directions (see Fig. 1a).
- If three nearest known neighbors in cardinal directions are found within the search radius, that means m = 3 in Eq. [1]. Then the Markov chain model can be drawn from Eq. [1] as
 | [3] |
There are four such cases for four cardinal directions (see Fig. 1b).
- If two nearest known neighbors in cardinal directions are found within the search radius, that means m = 2 in Eq. [1]. Then the Markov chain model can be drawn from Eq. [1] as
 | [4] |
There are six such cases for four cardinal directions (see Fig. 1c).
- If one nearest known neighbor in cardinal directions is found within the search radius, that means m = 1 in Eq. [1]. Then the Markov chain model can be drawn from Eq. [1] as
 | [5] |
There are four such cases for four cardinal directions (see Fig. 1d).
- If no nearest known neighbor in cardinal directions can be found within the search radius (see Fig. 1e), that means m = 0 in Eq. [1]. Then no Markov chain model is needed and we can choose the state of an unknown location randomly based on the proportions of all classes, which may be estimated from sample data or expert knowledge. With a proper choice of the search radius, such a situation normally does not occur in a simulation.

View larger version (9K):
[in this window]
[in a new window]
|
Fig. 1. Possible neighborhoods within a search radius in a Markov chain random field if only considering nearest known neighbors in four cardinal directions: (a) four known neighbors; (b) three known neighbors; (c) two known neighbors; (d) one known neighbor; and (e) no known neighbors. The white cell represents the unknown one to be estimated and black cells represent known data locations in cardinal directions.
|
|
The Markov Chain Sequential Simulation Algorithm
For a randomly chosen unknown location in a simulation domain, its nearest known neighbors (sampled locations and previously simulated locations) may not lie exactly in the four orthogonal cardinal directions. Thus the so-called cardinal directions have to be distorted a little in this algorithm. Here we replace the four cardinal directions by four sectors of the search circle, each covering a quarter of the search circle, as shown in Fig. 2
. That is, in each sector we look for a nearest known neighbor. These nearest known neighbors in the four sectors constitute the Markov chain model for estimating the unknown location. In case there is no known location occurring in some sectors (e.g., at the beginning of simulation or on boundaries), the nearest known neighbors found will be less than four (e.g., Fig. 2, right).

View larger version (9K):
[in this window]
[in a new window]
|
Fig. 2. The search circle, search sectors, and the considered neighborhoods of nearest known neighbors. The search circle is divided into four equal search sectors. Left: four nearest known neighbors are found, one from each sector. Right: two nearest known neighbors are found because the other two sectors have no known data locations.
|
|
Based on the above principles, the MCSS algorithm consists of the following steps:- Step 1: Set a proper search radius for a simulation and split the entire nodes (or pixels) of the discretized simulation domain into a known-location set and an unknown-location set.
- Step 2: Randomly pick an unknown location from the unknown-location set.
- Step 3: Search for at most four nearest known locations within the search circle, one from each sector of the search circle if there are known locations in the sector, as shown in Fig. 2.
- Step 4: Estimate the CPD of the state of the unknown location using the corresponding Markov chain models provided in Eq. [25], depending on the number of nearest known neighbors found.
- Step 5: Draw a specific state for the unknown location from the CPD.
- Step 6: Add the newly simulated location into the known-location set for conditioning in subsequent simulations of other unknown locations and delete it from the unknown-location set.
- Step 7: Repeat Steps 2 to 6 until all unknown locations are visited and every unknown location has been assigned a simulated value.
To some extent, the above simulation algorithm is similar to the SIS algorithm of indicator kriging. The differences are that here we use MCRF models, not kriging, for estimating unknown locations and we also limit the nearest known neighbors to, at most, four and choose one, at most, from each of the four equal sectors (i.e., quarters) of the search circle. As to how to choose the search radius, it is a decision of users based on the density of their samples, and it should avoid using such a small search radius that the search circle frequently covers no nearest known neighbor at the initial stage of a simulation.
Transiogram Modeling
The transiogram concept was introduced in Li (2006) and a systematical introduction was provided in Li (2007b). Here a brief introduction is provided. Theoretically, a transiogram is defined as a two-point transition (or conditional) probability function over the lag h:
 | [6] |
where Z is a random variable and x represents one specific location. Assuming Z is spatially stationary, we have pij(h) dependent only on the lag h and not dependent on the specific location x. An autotransiogram pii(h) represents the self-dependence (i.e., auto-correlation) of a single class i and a cross-transiogram pij(h) (i
j) represents the cross-dependence of class j on class i. Here class i is called a head class and class j is called a tail class. Practically, a transiogram can be directly estimated from sample data by counting the transition frequency from a class to itself or another class with different lags (e.g., numbers of pixels for raster data) by the following equation:
 | [7] |
where Fij(h) represents the frequency of transitions from class i to class j at the lag h, and n is the total number of classes. Such a transiogram is called an experimental transiogram. For sparse samples, to acquire reliable experimental transiograms, the lag h has to consider a tolerance width
h around it, which may be decided by users according to the density of samples. If anisotropies are considered, experimental transiograms have to be estimated directionally with or without a tolerance angle, similar to estimation of variograms. In addition, if one-step transition probabilities are available, transiograms can also be calculated from one-step transition probabilities based on the stationary first-order Markovian assumption, and such transiograms are called idealized transiograms (Li, 2006).
Experimental transiograms estimated from limited samples are normally scattered points and therefore cannot be used directly in simulations. There are two methods to acquire continuous transiogram models. The first one uses mathematical models to jointly fit experimental transiograms (Li and Zhang, 2006). This method is time consuming when the number of classes is large, but it permits incorporation of expert knowledge in estimation of transiogram models. Here expert knowledge refers to the knowledge of experts in parameter estimation of transiogram models, which typically include sills, ranges, and model types (e.g., exponential, spherical). Therefore, this method is more flexible and widely applicable, particularly when samples are sparse and cannot provide accurate experimental transiograms. The second method interpolates experimental transiograms into continuous models (Li and Zhang, 2005). This method is efficient but eliminates the chance of incorporating expert knowledge. Therefore, this method is suitable only when samples are sufficient and experimental transiograms are reliable.
In this study, for the algorithm testing purpose, the interpolation method for acquiring transiogram models was applied to avoid parameter uncertainty caused by the introduction of subjective expert knowledge. The linear interpolation method is given as the following equation:
 | [8] |
where A and B are the values of two neighboring points in an experimental transiogram, DA and DB are the corresponding lags of the two neighboring points with DB > DA, and X is the value to be interpolated at a lag DX between DA and DB. Interpolated transiogram models using Eq. [8] normally can meet the constraint conditions for the Markov chain simulation.
Incorporation of Expert Knowledge in Transiogram Modeling
Expert knowledge is useful in improving the quality of transiogram models when mathematical models (see Li and Zhang, 2006) are used to fit unreliable experimental transiograms estimated from insufficient sample data. Besides directly related expert knowledge about parameters (e.g., sill, range, model type, etc.) of transiogram models, indirectly related expert knowledge about studied soil classes (e.g., proportions, polygon sizes, juxtapositions) in the study area and information from similar areas is also important. The expert knowledge may be finally embodied in the parameters of a set of transiogram models. The relationships between the expert knowledge and transiogram model estimation are briefly discussed as follows:
- Proportions. With decreasing numbers of samples, the proportion of a class in the sample data set deviates from the correct proportion of the class in the study area. In this situation, expert knowledge in the proportion of the class is crucial in setting the correct sills of transiogram models tailed by the class. As shown in Li and Zhang (2006), the sill of a transiogram model pij(h) may be set to the proportion pj of the tail class j; this is because the sill of pij(h) is theoretically equal to the proportion of tail class j (Carle and Fogg, 1997).
- Mean lengths. The mean length
i (i.e., mean polygon size) of parcels of a class is theoretically related to the slope of the idealized autotransiogram pii(h) at the beginning point [i.e., point (0, 1)] (Carle and Fogg, 1997; Li, 2006). This property is applicable to transiogram models. So if the mean length of parcels of a class can be assessed (e.g., estimated by expert knowledge or from a training image), one may determine the change gradient of the autotransiogram model of the class at the beginning point. If the approximate correlation range is also known, the autotransiogram model may be approximately inferred because autotransiograms usually tend to be exponential distribution curves.
- Correlation range. The relationship between autocorrelation range ai, mean length, and proportion can be generalized as
 | [9] |
where
= 1, 1.5, or 3 for linear, spherical, or exponential autotransiogram models (Ritzi, 2000). So if the parcel mean length, class proportion, and the transiogram curve shape (corresponding to transiogram model type) are known, one can obtain the auto-correlation range.
- Juxtaposition. If two soil types often occur as neighbors, their cross-transiograms should have a peak at the low lag section. On the contrary, if two soil types often appear distantly, their cross-transiograms should have low values at the low lag section (Li, 2006). This expert knowledge may also be incorporated into transiogram modeling with suitable mathematical models.
- Domain knowledge. For specific soil types, related transiograms may have similar characteristics in similar natural environments. By estimating transiograms from similar areas with exhaustive maps or detailed survey data, one can build a knowledge base about common characteristics of transiograms of specific soil types. This can also be used to guide the inference of transiogram models for applications.
In general, if the autocorrelation range, the parcel mean length, and the proportion of a class are known, one can define a basic autotransiogram model. Correlation ranges and transiogram model types may be approximately assessed from experimental transiograms. Because cross-transiograms are typically more complex than autotransiograms in curve shapes, it is particularly necessary to estimate cross-correlation ranges aij and general shapes of cross-transiograms from measured data, training maps, available abundant data in similar areas, and expert knowledge in juxtaposition situations of studied classes.
Because Markov chain models always consider interclass relationships through cross transition probabilities (or cross-transiogram models), it is possible to capture a minor class missed in sampling but confirmed by experts familiar with the study area. The requirement is that there must be sufficient available expert knowledge in proportion, parcel mean length, and spatial distribution characteristics (e.g., juxtaposition relationships with other classes) of the minor class, because with that expert knowledge a set of transiogram models related to the minor class may be inferred approximately. Correspondingly, because of the addition of the minor class, transiogram models (usually model sills) of some other classes should be adjusted to meet the summing-to-one condition in joint modeling of transiogram models (see Li and Zhang, 2006).
Sequential Indicator Simulation
Indicator kriging was proposed by Journel (1983). Sequential indicator simulation is an efficient IK simulation algorithm for indicator variables (cutoffs or thresholds of continuous variables and categorical variables). It has been widely used and well documented in literature (e.g., Goovaerts, 1997, p. 376378). Therefore, it is not necessary to introduce them repeatedly here. Sequential indicator simulation may use either simple indicator kriging (SIK) or OIK (Goovaerts, 1997, p. 293297) as its estimator. The difference between SIK and OIK is that SIK considers the indicator mean known and constant across the whole study area, while OIK allows one to account for local variation of the indicator mean by limiting the domain of stationarity of the mean to the local neighborhood centered on the location being estimated. Since OIK is generally superior to SIK, in this study we uses SIS with OIK (i.e., SISoik), rather than SIS with SIK, to conduct simulations and compare simulated results with those generated by the proposed MCSS algorithm.
Thus far cokriging has been mainly used for incorporating secondary variables (see Goovaerts, 1997, p. 203258; Deutsch, 2006). To our knowledge, existing SIS algorithms (and related software) do not support indicator cokriging simulation of multinomial classes (see Deutsch, 2006). This could be caused by several reasons: (i) the indicator cokriging system for categorical variables cannot be solved as it is typically expressed due to the difficulty in defining a valid linear model of coregionalization for auto- and cross-variograms of multiple classes and thus it reduces to an IK system (see Bogaert, 2002, p. 430); or (ii) solving a large indicator cokriging equation system is also difficult because of demanding computation time, order relation deviation, and numerical instability (Goovaerts, 1996). In SIS, therefore, classes are typically estimated one after another; that is, when one class is estimated, data belonging to the class are coded as 1 and other data are coded as 0. So data belonging to other classes do not make contributions to the estimate of the current class at an unknown location to be estimated.
While SIS deals with single classes one by one, MCG algorithms always manage all involved classes simultaneously with consideration of cross-relationships between classes. Similar to SIK, MCRF models used in this study consider the indicator mean of each class constant across the whole study area (note that the indicator mean of a class is reflected in the height of the transiograms tailed by the class). As a nonlinear approach, however, MCRF models determine the conditional probability distribution of a random variable at an unknown location based on transition probabilities between the unknown location and its nearest known neighbors in several cardinal directions (or sectors).
Data Sets and Parameters
A 9-km2 area in the floodplain in northeastern Iowa County, Wisconsin, was chosen for a case study. The study area has seven soil classes (i.e., soil series) with the following names: 1, loamy alluvium; 2, Sparta; 3, Plainfield; 4, Dakota; 5, Richwood; 6, others (e.g., water); and 7, peat and muck. The soil map of Iowa County (in which the map unit is the soil phase) can be downloaded from the NRCS Soil Data Mart website (SoilDataMart.nrcs.usda.gov/, verified 9 Jan. 2007). Soil phases were merged into soil series in this study because this classification level is determined by pedogenetic properties. The clipped soil series map of the study area, shown in Fig. 3
, served as a reference map to validate simulated results. This means that we assumed the reference map is correct (i.e., delineated on a detailed field survey), and it may be impossible to acquire sucha reference map in real world applications. The study area was discretized into a grid of 22400 (175 by 128) pixels, with a pixel size of 20 by 20 m.
From the reference soil map, we sampled three data sets randomly: a dense one of 646 points (2.9% of total pixels), a medium one of 179 points (0.8% of total pixels), and a sparse one of 50 points (0.2% of total pixels) (Fig. 3). In fact, all three data sets are "sparse" samples, and the sparse sample data set is even too sparse to estimate transiogram models from. The terms dense, medium, and sparse are used to differentiate the three data sets based on their relative densities. Using these three different data sets, we may examine the simulated results conditioned on different densities of samples and verify the practicality of the proposed algorithm.
Figure 4
shows three subsets of omnidirectional experimental transiograms headed by Class 1, estimated from each of the three data sets. It can be seen that the three subsets of experimental transiograms estimated from different data sets look similar in their shapes. Experimental transiograms estimated from the sparse and medium datasets are less reliable, however, because they are composed of fewer transition probability value points, which have to be estimated by using a larger tolerance width. In addition, because Soil Class 5 does not appear in the sparse sample data set, transiogram p15(h) cannot be estimated from the sparse data set. So in simulations conditioned on all three sample data sets, we used only the transiogram models estimated from the dense data set, which include Class 5. This means that we used more correct transiogram models in simulations than we can acquire through fitting the experimental transiograms estimated from the corresponding sample data sets, and this can be justified by assuming that we have sufficient expert knowledge to adjust the transiogram models acquired directly from sparser data sets, as mentioned above.

View larger version (23K):
[in this window]
[in a new window]
|
Fig. 4. Experimental transiograms headed by Soil Class 1, estimated from each of the three sample data sets: (a) from the dense data set; (b) from the medium data set; and (c) from the sparse data set. Note that estimated points for each experimental transiogram are connected as a line so that different experimental transiograms can be readily differentiated.
|
|
Soil Class 5 is the smallest class in the study area. In the medium sample data sets, the class has only one point, and in the sparse sample data set, this class completely disappears. This means that as a minor class, it is missed in sampling for the sparse sample data set. In real world applications, this missing minor class is usually ignored if its existence is unknown. But sometimes we may want to capture it in a simulation if its existence is known in the study area by the experience and domain knowledge of soil experts. This is possible in the MCSS algorithm by using a set of auto- and cross-transiogram models including the missing minor class, which of course have to be estimated by using expert knowledge to provide parameters for the transiogram models (e.g., sill, range, and model type; Li and Zhang, 2006). This means all transiogram models that involved the missing class have to be defined by expert knowledge, as discussed above. Because in this study we have accurate transiogram models estimated from the dense data set, which includes the minor class, we can use them directly in the conditional simulation on the sparse data set and test whether a minor class missed in sampling can be captured in simulated realizations using the MCSS algorithm. Thus, we need not infer related transiogram models from expert knowledge. In addition, for the purposes of algorithm testing and comparison, it is also preferable to use objective parameters.
In this study, we used omnidirectional transiograms for simplicity, which means we did not consider anisotropies of soil classes in the study area. Transiogram models were obtained using the interpolation approach. Figure 5
displays a subset of transiogram models headed by Class 1, acquired from the experimental transiograms of the dense data set.

View larger version (21K):
[in this window]
[in a new window]
|
Fig. 5. A subset of experimental transiograms (dots) headed by Class 1, estimated from the dense data set, and interpolated models (solid lines).
|
|
To conduct simulations on the same data sets using SISoik, omnidirectional indicator variogram models were estimated from the data sets. For the same reasons, simulations used the indicator variogram models estimated from the dense data sets. Because cross-variograms were not used in simulations, only experimental indicator autovariograms were model fitted. Table 1 provides parameters for the seven indicator autovariogram models.
View this table:
[in this window]
[in a new window]
|
Table 1. Indicator autovariogram models of the seven soil classes in the study area, estimated from the dense data set (646 random points).
|
|
Outputs and Indices
The search radii chosen in simulations were 30, 50, and 50 pixel lengths (i.e., 600, 1000, and 1000 m) for the dense, medium, and sparse data sets, respectively. A relatively larger radius is necessary for the medium and sparse data sets because of the sparseness of samples. One hundred realizations were generated for each of the simulations on the three data sets using MCSS and SISoik, and occurrence probability maps for each simulation were estimated from those realizations. Optimal prediction maps were obtained from the maximum occurrence probability maps. Some indices are necessary to test the suitability and advantages of the MCSS algorithm. Except for visual maps, the following indices were used in this study:- PCC: Percentage of correctly classified locations. Average PCC values were estimated from 100 simulated realizations against the reference map.
- AMP: Averaged maximum occurrence probability. The AMP values were estimated from maximum occurrence probability maps.
- TPM: One-step transition probability matrix. The TPMs were estimated from simulated realizations and input transiogram models.
- Simulated transiogram: Simulated transiograms were estimated from single realizations.
- Proportion: Averaged proportions of classes were estimated from 100 realizations for each simulation.
 |
RESULTS AND DISCUSSION
|
|---|
Simulated Results
Some simulated results using MCSS conditioned on the three different random sample data sets are shown in Fig. 6
, 7
, and 8
, respectively. Apparently, both the optimal prediction map (Fig. 6a) and simulated realizations (Fig. 6b and 6c) conditioned on the dense data set effectively captured the spatial patterns of the seven soil classes. Optimal prediction maps (Fig. 7a and 8a) and simulated realizations (Fig. 7b, 7c, 8b, and 8c) conditioned on the medium and sparse data sets generated relatively simplified soil patterns because samples were sparser. Optimal prediction maps normally have a smoothing effect, so they may not reflect the true spatial heterogeneity of simulated classes and small classes may be underrepresented because of their lower occurrence probabilities at most unsampled locations. A simulated realization may represent a possible "reality" of the spatial distribution of the studied classes, however, based on the input statistics and sample data.

View larger version (50K):
[in this window]
[in a new window]
|
Fig. 6. Simulated results by Markov chain sequential simulation, conditioned on the dense data set: (a) optimal prediction map; (b) and (c) two simulated realizations; (d) maximum occurrence probability map; (e) occurrence probability map of Class 1; (f) occurrence probability map of Class 4.
|
|

View larger version (55K):
[in this window]
[in a new window]
|
Fig. 7. Simulated results by Markov chain sequential simulation, conditioned on the medium data set: (a) optimal prediction map; (b) and (c) two simulated realizations; (d) maximum occurrence probability map; (e) occurrence probability map of Class 1; (f) occurrence probability map of Class 4.
|
|

View larger version (55K):
[in this window]
[in a new window]
|
Fig. 8. Simulated results by Markov chain sequential simulation, conditioned on the sparse data set: (a) optimal prediction map; (b) and (c) two simulated realizations; (d) maximum occurrence probability map; (e) occurrence probability map of Class 1; (f) occurrence probability map of Class 4.
|
|
From Fig. 9a
, it can be seen that all classes were fairly reproduced in simulated realizations conditioned on the dense data set. Clearly, the small-class underestimation problem and the large-class overestimation problem that occurred in simulated realizations from the multiple-chain-based triplex Markov chain model did not occur here. On the contrary, it seems that the smallest class was overestimated a little in this case study. In fact, it is impossible and unnecessary to exactly reproduce the class proportions of samples. This is because the class proportions in simulated realizations represent a reconciliation result of many factors. Both conditioning data and transiogram models play roles in determining the class proportions in realizations. In addition, the boundary effect also has its contributions to class proportions in simulated realizations. Here the boundary effect means that samples close to boundaries of the study area are less utilized than those in the center of the study area in both transiogram estimation and simulations.

View larger version (26K):
[in this window]
[in a new window]
|
Fig. 9. Proportions of classes in sample data sets and simulated realizations (averaged from 100 realizations) generated by Markov chain sequential simulation.
|
|
Transiograms estimated from simulated realizations (the first 10 realizations were selected for each simulation) conditioned on the dense and medium sample data sets are provided in Fig. 10
and 11
, respectively. It can be seen that all experimental auto- and cross-transiograms (and transiogram models) are well reproduced by simulated realizations, except for those related to the minor Class 5. It is normal that, when samples are sparser, simulated transiograms have more fluctuations around the input ones due to the existence of more spatial uncertainty. This is called ergodic fluctuation in stochastic process theories, referring to the discrepancy between realization statistics and corresponding model parameters (Deutsch and Journel, 1998, p. 128132). Therefore, it can be concluded that simulated realizations of MCSS effectively reflect the heterogeneity of the studied classes and the inherent uncertainty of their spatial distribution resulting from incomplete observations.

View larger version (32K):
[in this window]
[in a new window]
|
Fig. 10. Transiograms estimated from single realizations generated by Markov chain sequential simulation, conditioned on the dense data set.
|
|

View larger version (36K):
[in this window]
[in a new window]
|
Fig. 11. Transiograms estimated from single realizations generated by Markov chain sequential simulation, conditioned on the medium data set.
|
|
The occurrence probability maps of single classes indicate where and with how much certainty a class may occur, and the maximum occurrence probability maps demonstrate the spatial uncertainty associated with each optimal prediction map. From Fig. 6d, 7d, and 8d, one can see that the "transition zones" (the white-gray stripes), which indicate the locations where predicted results from the samples have lower confidence than at other places, are demonstrated clearly in the maximum occurrence probability maps. These transition zones represent the approximate locations of polygon boundaries. Obviously, spatial uncertainties increase with decreasing density of samples, as shown by the increasing fuzziness of probability maps in Fig. 6, 7, and 8. Consequently, with decreasing numbers of samples, the patterns of optimal prediction maps become simpler and simulated realizations are less imitative of the reference map and of each other (see realizations in Fig. 6, 7, and 8).
Checking the simulated realizations, one can find that the MCSS algorithm generated polygon features with smooth boundaries in simulated realizations, no matter how sparse the conditional data set was. The polygon features of these realizations are in accordance with those of human-made area-class soil maps. This is interesting since our simulations are pixel based, not object based.
Despite not being contained in the sparse sample data set, Class 5 was still generated as neighbors of Class 1 in realizations conditioned on the sparse data set (see Fig. 8c). This is because the transiogram models used in the simulation included Class 5. This means that MCSS can capture a class that is missed in sampling, but assured by expert knowledge, merely through parameter inputs (i.e., transiogram models). This characteristic is useful for simulating multinomial classes with crucial minor classes from sparse samples based on sufficient expert knowledge. This is impossible to achieve, however, in simulation methods that only consider autocorrelations.
In addition, some class pairs such as Class 7 and Class 4 have little or no chance to be neighbors in the reference map. This characteristic of Classes 7 and 4 is also reflected in the dense sample data set and further reflected on their cross-transiogram models (see Fig. 12
). Checking simulated realizations (and optimal prediction maps) conditioned on all three of the data sets (see Fig. 6, 7, and 8), it can be seen that Classes 7 and 4 indeed never occur as neighbors. This means that simulated realizations by MCSS obey the interclass relationship rules defined by cross-transiogram models. Conventional geostatistics that do not incorporate interclass relationships may not follow the rules, however.

View larger version (20K):
[in this window]
[in a new window]
|
Fig. 12. The interclass relationships between Class 4 and Class 7, estimated from the dense data set. They have no or little chances to be neighbors.
|
|
Comparison with Indicator Kriging
To verify the effectiveness and advantages of a new approach, an appropriate method is to compare it with an existing, widely accepted approach. Here we chose the popular IK simulation algorithm SISoik as a reference, which was often cited as a potential candidate for simulating categorical variables and has been used in some case studies (e.g., Bierkens and Burrough, 1993; Goovaerts, 1996).
Figures 13
, 14
, and 15
show simulated results using SISoik conditioned on the three different random data sets, respectively. Visually comparing the optimal prediction maps with those generated by MCSS, it is difficult to tell their differences. Based on the reference soil map, we calculated their PCC values (Table 2). Although the absolute differences (0.3, 1.3, and 2.3% for the dense, medium, and sparse data set cases) of PCC values of the optimal prediction maps from the two approaches are not obvious, PCC values of the optimal prediction maps from MCSS are generally higher than those from SISoik, which means that MCSS can predict more correctly. The absolute differences of PCC values of simulated realizations from the two approaches are indeed obvious, however. The PCC values averaged from 100 MCSS realizations are 6.0, 5.9, and 2.5% higher than those from SISoik realizations for the dense, medium, and sparse data set cases, respectively. The relative difference in PCC values between realizations of the two methods even reached 9.9% for the medium sample data set. These results indicate that MCSS may better capture soil classes at their approximate locations and generate more accurate realizations.

View larger version (59K):
[in this window]
[in a new window]
|
Fig. 13. Simulated results by sequential indicator simulation with ordinary kriging, conditioned on the dense dataset: (a) optimal prediction map; (b) and (c) two simulated realizations; (d) maximum occurrence probability map; (e) occurrence probability map of Class 1; (f) occurrence probability map of Class 4.
|
|

View larger version (61K):
[in this window]
[in a new window]
|
Fig. 14. Simulated results by sequential indicator simulation with ordinary kriging, conditioned on the medium dataset: (a) optimal prediction map; (b) and (c) two simulated realizations; (d) maximum occurrence probability map; (e) occurrence probability map of Class 1; (f) occurrence probability map of Class 4.
|
|

View larger version (61K):
[in this window]
[in a new window]
|
Fig. 15. Simulated results by sequential indicator simulation with ordinary kriging, conditioned on the sparse dataset: (a) optimal prediction map; (b) and (c) two simulated realizations; (d) maximum occurrence probability map; (e) occurrence probability map of Class 1; (f) occurrence probability map of Class 4.
|
|
View this table:
[in this window]
[in a new window]
|
Table 2. Percentages of correctly classified locations (PCC) in optimal prediction maps and simulated realizations (averaged from 100 realizations) from MCSS and SISoik conditioned on the three data sets. The PCC values are estimated relative to the reference soil map.
|
|
While optimal prediction maps from the two approaches have no significant differences, surprisingly it can be found that their corresponding maximum occurrence probability maps are very different. Comparing Fig. 6d, 7d, and 8d with Fig. 13d, 14d, and 15d, respectively, one can see that the maximum occurrence probability maps from SISoik are apparently fuzzier than those from MCSS. Figure 16
shows the AMP values generated by the two approaches. The AMP values from MCSS are obviously larger than those from SISoik. Checking the occurrence probability maps of single classes, we can reach a similar conclusion: MCSS generates less spatial uncertainty than SISoik.

View larger version (12K):
[in this window]
[in a new window]
|
Fig. 16. Average maximum probabilities estimated from maximum occurrence probability maps generated by Markov chain sequential simulation (MCSS) and sequential indicator simulation with ordinary kriging (SISoik).
|
|
An obvious difference between realizations generated by MCSS and SISoik is that the latter generates dispersed features while the former generates polygon features. While it may be arguable to say which kind of feature better represents soil type spatial distribution, the polygon feature is apparently more useful for area-class mapping and also in accordance with the soil mapping convention.
Looking at Classes 7 and 4 in simulated realizations (and optimal prediction maps) generated by SISoik, one can find that they can be neighbors (see Fig. 14 and 15). This obviously violates their non-neighbor rule defined by sample data and expert knowledge (here the reference map). One-step TPMs can reflect the neighboring relationships in one spatial step (i.e., one pixel) between classes by cross-transition probabilities and the polygon sizes of single classes by autotransition probabilities. Table 3 shows some estimated and simulated TPMs. The first TPM in Table 3 was extracted from the input transiogram models. Comparing it with those estimated from realizations simulated by MCSS and SISoik, one can find that MCSS approximately reproduced the one-step TPM, but SISoik generally decreased autotransition probabilities and exaggerated cross-transition probabilities. Obviously, the poor reproduction capability of SISoik in one-step TPMs is related to the dispersed patterns normally generated in realizations.
From the simulated realizations generated by SISoik conditioned on the sparse data set (see Fig. 15), one cannot find the minor Class 5. In fact, it is not possible for SIS to capture a class missed in sampling because it does not incorporate cross-correlations between classes in simulation.
As a MCRF-based random-path simulation algorithm, MCSS has demonstrated its practicality in simulating soil classes. MCSS obeyed interclass relationships and generated more accurate realizations than SISoik did in our simulation cases. This should be related to two reasons. First, MCSS considers both autocorrelations and interclass relationships in simulations. This means it incorporates more spatial heterogeneity information than SISoik does, because the latter considers only autocorrelations and does not consider interclass relationships. Second, the estimator of MCSS is generally nonlinear (i.e., the nearest known neighbors found in the four search sectors in a simulation process are usually more than one and thus make nonlinear models); however, the OIK estimator is still linear (Bogaert, 2002). Nonlinear estimators should be preferable for simulation of categorical variables because categorical data are nonlinear.
 |
CONCLUSIONS
|
|---|
A MCSS algorithm was developed for conditional simulation and spatial uncertainty analysis of categorical soil variables from random point samples. Its suitability was demonstrated by a case study on three randomly distributed sample data sets of soil types. The algorithm extended Markov chains into a widely applicable geostatistical approach for two-dimensional simulation of categorical variables.
By comparing the simulated results from MCSS and those from the popularly used indicator kriging simulation algorithm SISoik, we found that MCSS had the following advantages:
- The MCSS generated obviously fewer spatial uncertainties associated with simulated results and more accurate simulation results than SISoik did. Average PCCs of realizations from MCSS were 6% or so in absolute values (9% or so in relative values) higher than those from SISoik for normal sampling densities.
- The MCSS captured more effectively the complex patterns of soil classes and obeyed their complex interclass relationships. For example, non-neighboring classes did not occur as neighbors in simulated realizations.
- The MCSS directly generated polygon features in realizations, and this is in accordance with the style of area-class soil maps. The SISoik algorithm, however, generated relatively dispersed patterns in realizations. This makes MCSS a suitable approach for predictive area-class mapping and for spatial uncertainty (i.e., error) estimation of area-class maps.
- The MCSS could generate classes in realizations that were missed in sampling but confirmed by expert knowledge. This is because it considers cross-correlations in simulation. Methods that only consider autocorrelations in simulation do not have such capability.
Although interclass relationships are effectively incorporated, MCSS is still computationally efficient because it does not need to solve complex equation systems. Similar to SIS, MCSS generates realizations in a one-pass way. Since modern geostatistics emphasize stochastic simulation (i.e., getting multiple simulated realizations for uncertainty assessment) rather than interpolation (i.e., getting a single optimal map; Goovaerts, 1997, p. viii; Chiles and Delfiner, 1999), the advantage of MCSS in generating more accurate realizations and class patterns of multinomial classes is especially significant. It can be concluded, therefore, that MCSS provides a powerful spatial statistical tool for prediction and simulation of categorical variables and it is particularly suitable for simulating area-class maps of soil spatial variables.
One practical simulation algorithm based on MCRF was provided here, and there may be other ways of conducting conditional simulations using MCRF. For example, the simulation algorithm proposed in Li and Zhang (2006) for dealing with grid point samples, which suggested first interpolating sample points into a network and then filling in meshes one after another, may simply be applied to MCRF models and potentially may also work with random samples with an intensively developed software tool. Future expansion of MCG and the MCSS algorithm should consider three-dimensional simulation, incorporation of secondary data, nonstationarity, and the time dimension.
In addition, the MCRF theory is also flexible in cardinal directions; three or more than four cardinal directions may be considered in algorithm design. Thus, a more flexible MCSS algorithm with optimal numbers of search sectors for different densities of samples may be proposed in the future.
 |
ACKNOWLEDGMENTS
|
|---|
We thank Dr. Robert Schwartz for his constructive comments.
 |
NOTES
|
|---|
All rights reserved. No part of this periodical may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopying, recording, or any information storage and retrieval system, without permission in writing from the publisher. Permission for printing and for reprinting the material contained herein has been obtained by the publisher.
Received for publication April 27, 2006.
 |
REFERENCES
|
|---|
- Bierkens, M.F.P., and P.A. Burrough. 1993. The indicator approach to categorical soil data. I. Theory. J. Soil Sci. 44:361368.
- Bogaert, P. 2002. Spatial prediction of categorical variables: The Bayesian maximum entropy approach. Stochastic Environ. Res. Risk Assess. 16:425448.[CrossRef]
- Carle, S.F., and G.E. Fogg. 1997. Modeling spatial variability with one- and multi-dimensional continuous Markov chains. Math. Geol. 29:891918.[CrossRef]
- Chiles, J.-P., and P. Delfiner. 1999. Geostatistics: Modeling spatial uncertainty. John Wiley & Sons, New York.
- Deutsch, C.V. 2006. A sequential indicator simulation program for categorical variables with point and block data: BlockSIS. Comput. Geosci. 32:16691681.[CrossRef]
- Deutsch, C.V., and A.G. Journel. 1998. GSLIB: Geostatistics Software Library and user's guide. Oxford Univ. Press, New York.
- Elfeki, A.M., and F.M. Dekking. 2001. A Markov chain model for subsurface characterization: Theory and applications. Math. Geol. 33:569589.[CrossRef]
- Geman, S., and D. Geman. 1984. Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Trans. Pattern Anal. Mach. Intell. 6:721741.
- Goovaerts, P. 1996. Stochastic simulation of categorical variables using a classification algorithm and simulated annealing. Math. Geol. 28:909921.[CrossRef]
- Goovaerts, P. 1997. Geostatistics for natural resources evaluation. Oxford Univ. Press, New York.
- Gray, A.J., I.W. Kay, and D.M. Titterington. 1994. An empirical study of the simulation of various models used for images. IEEE Trans. Pattern Anal. Mach. Intell. 16:507513.[CrossRef]
- Journel, A.G. 1983. Nonparametric estimation of spatial distributions. Math. Geol. 15:445468.[CrossRef]
- Li, W. 2006. Transiogram: A spatial relationship measure for categorical data. Int. J. Geogr. Inf. Sci. 20:693699.
- Li, W. 2007a. Markov chain random fields for estimation of categorical variables. Math. Geol. 39: (in press).
- Li, W. 2007b. Transiograms for characterizing spatial variability of soil classes. Soil Sci. Soc. Am. J. (in press).
- Li, W., and C. Zhang. 2005. Application of transiograms to Markov chain simulation and spatial uncertainty assessment of land-cover classes. GISci. Remote Sensing 42:297319.
- Li, W., and C. Zhang. 2006. A generalized Markov chain approach for conditional simulation of categorical variables from grid samples. Trans. GIS 10:651669.[CrossRef]
- Li, W., C. Zhang, J.E. Burt, A.-X. Zhu, and J. Feyen. 2004. Two-dimensional Markov chain simulation of soil type spatial distribution. Soil Sci. Soc. Am. J. 68:14791490.[Abstract/Free Full Text]
- Lin, C., and J.W. Harbaugh. 1984. Graphic display of two- and three-dimensional Markov computer models in geology. Van Nostrand Reinhold Co., New York.
- Luo, J. 1996. Transition probability approach to statistical analysis of spatial qualitative variables in geology. p. 281299. In A. Foster and D.F. Marriam (ed.) Geologic modeling and mapping. Plenum Press, New York.
- Norberg, T., L. Rosen, A. Baran, and S. Baran. 2002. On modeling discrete geological structure as Markov random fields. Math. Geol. 34:6377.[CrossRef]
- Pickard, D.K. 1980. Unilateral Markov fields. Adv. Appl. Probab. 12:655671.[CrossRef]
- Qian, W., and D.M. Titterington. 1991. Multidimensional Markov chain models for image textures. J. R. Stat. Soc. Ser. B 53:661674.
- Ritzi, R.W. 2000. Behavior of indicator variograms and transition probabilities in relation to the variance in lengths of hydrofacies. Water Resour. Res. 36:33753381.[CrossRef]
- Wu, K., N. Nunan, J.W. Crawford, I.M. Young, and K. Ritz. 2004. An efficient Markov chain model for the simulation of heterogeneous soil structure. Soil Sci. Soc. Am. J. 68:346351.[Abstract/Free Full Text]