a Macaulay Land Use Research Institute, Craigiebuckler, Aberdeen AB15 8QH, Scotland, UK
b Univ. of Maryland, Dep. of Plant Science and Landscape Architecture, 2102 Plant Science Building, College Park, MD 20742
c USDA-ARS Crop Systems and Global Change Lab., 10300 Baltimore Ave., Bldg. 001, BARC-West, Beltsville, MD 20705
d USDA-ARS Hydrology and Remote Sensing Lab., 10300 Baltimore Ave., Bldg. 007, BARC-West, Beltsville, MD 20705
e USDA-ARS Environmental Microbial Safety Lab., Powder Mill Rd., Bldg. 173, BARC-East, Beltsville, MD 20705
* Corresponding author (a.lilly{at}macaulay.ac.uk).
| ABSTRACT |
|---|
Abbreviations: HOR, topsoil or subsoil distinction; OM, organic matter; PED, ped size information; PS, ped size class; PSD, particle-size distribution; PTF, pedotransfer function; RMSR, root mean squared residual; TXT, texture class.
| INTRODUCTION |
|---|
Šimůnek et al., 2005), the CoupModel (Jansson and Karlberg, 2004), and WAVE (Vanclooster et al., 1994). There is often insufficient soil hydrologic data available to parameterize these models, however, particularly for studies at the catchment scale or greater. Even simple water balance models (Dunn et al., 2004) require information about soil moisture retention that is not always readily available. As these soil hydrologic properties are often difficult, costly, and time consuming to collect, pedotransfer functions (PTFs) have been developed to overcome this lack of data. Many of these PTFs are statistical regression equations between the required soil hydrologic data and more readily measured properties such as particle-size distribution, bulk density (Db), and organic matter (OM) content (e.g., Wösten et al., 1999; Saxton and Rawls, 2006). Recently, other techniques, such as artificial neural networks, have been used (e.g., Minasny and McBratney, 2002; Schaap et al., 1998, 2001) to develop PTFs. Pachepsky and Rawls (2004) and Cornelis et al. (2001) included descriptions and evaluations of many of these techniques.
Pedotransfer functions that are based on the statistical relationship between soil hydrologic properties and soil texture alone are often not readily transferable to other climatic zones (Wösten et al., 2001; O'Connell and Ryan, 2002), as soils with similar textures do not necessarily develop the same soil structure under different moisture or thermal regimes. Wagner et al. (1998) reported that PTFs based only on soil texture gave poor results in structured soils. As soil structure (the organization of soil particles into aggregates or peds) is a function of the interaction between soil texture and climate, PTFs that incorporate soil structure in the prediction of soil hydrologic properties should have a greater degree of transferability than regional, texture-based models.
Soil morphology and soil structure have long been used to infer soil hydraulic properties (O'Neal, 1949; King and Franzmeier, 1981; McKeague et al., 1982; Coen and Wang, 1989; Boorman et al., 1995; Lin et al., 1999; Rawls and Pachepsky, 2002). Soil structure, as described by soil surveyors, provides information on macroporosity and the dominant pathways of water movement through the soil under saturated and near-saturated conditions. It is known to have a significant impact on soil hydraulic properties (Lilly and Lin, 2004), but, with some exceptions (Rawls and Pachepsky, 2002; Pachepsky and Rawls, 2003), is rarely represented directly in soil hydraulic PTFs. Most PTFs use input parameters that are indirectly related to soil structure, such as Db, OM content, and topsoil vs. subsoil distinctions. No clear recommendation exists on which structural indicators might be the most significant for estimating soil hydraulic properties. As morphological descriptions of soil structure are often routinely collected during field sampling, it is useful to explore the possibility of utilizing these data and to identify whether any of these categorical data can be used to improve PTFs. If this proves to be possible, it will capitalize on this vast store of data and greatly improve the value of these morphological databases.
The objective of this study was to use regression trees to examine which types of soil texture- and structure-related categorical data provide the most useful information for estimating saturated hydraulic conductivity (Ks) and thus could be used either to improve current PTFs or to allow PTFs to be developed in areas where measured particle size distribution, OM content, and Db are lacking.
| MATERIALS AND METHODS |
|---|
As ped size classes vary according to the structure type, the size range of peds (millimeters) was used instead, as Ks is expected to be influenced by the distance between macropores. Soil structure type was reclassified according to the orientation of the cracks between peds: horizontal, vertical, or both (for example, blocky or granular structures).
Table 1 shows the variables that were used in this study and the number of samples in each of these classes. Input variables were arranged into six groups. Group 1 (HOR) indicated whether the particular sample was from topsoil or subsoil. Group 2 (PED) indicated ped size, and those cases where ped size was not recorded but ped type was. Group 3 (CRK) indicated the orientation of structural cracks (if present) or the classification of apedal soils into three groups. Group 4 (TXT) indicated to which of the 12 USDA texture classes the particular sample belonged. Groups 5 (PSD) and 6 (BDO) represented the quantitative variables that are most commonly used in the estimation of soil hydraulic properties. Group 5 represented particle-size distribution according to the USDA and FAO system (<2, 2–50, and 50–2000 µm), while Group 6 contained Db and OM content values. Each categorical variable was coded as 1 for presence and 0 for absence before statistical analyses. Table 2 shows summary statistics of the data set used in this study in terms of the quantitative input and output attributes.
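The presence/absence coding of categorical variables described above can be illustrated with a minimal sketch. The class lists and column names below are hypothetical stand-ins (the study's actual classes are listed in its Table 1), and the helper names are ours, not the authors'.

```python
# Hypothetical class lists for illustration only; the real ones are in Table 1.
HOR_CLASSES = ["topsoil", "subsoil"]
TXT_CLASSES = ["sand", "loam", "clay"]  # illustrative subset of the 12 USDA classes


def indicator_columns(prefix, value, classes):
    """Return {column_name: 1 or 0}, marking presence (1) or absence (0) of each class."""
    return {f"{prefix}_{c}": int(value == c) for c in classes}


def encode_sample(hor, txt):
    """Encode one soil sample's categorical descriptors as 0/1 indicator variables."""
    row = {}
    row.update(indicator_columns("HOR", hor, HOR_CLASSES))
    row.update(indicator_columns("TXT", txt, TXT_CLASSES))
    return row


encoded = encode_sample("topsoil", "clay")
```

Each class becomes its own binary input, so a tree can split on a single class (for example, sand vs. everything else) rather than on an arbitrary ordering of classes.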
The regression tree algorithm used in our study was coded in MATLAB 5.2 (The MathWorks, 1996). Development of a tree model requires setting a criterion to halt further partitioning of the data. Optimal estimations are rarely reached using the full tree model; therefore, another criterion, a pruning factor, is usually set to avoid overfitting. While general recommendations for such settings of regression trees exist, we found that such settings are often specific to the data sets and instead used part of the data set to find the optimal settings for this task. The ratio of the development and test data set sizes, however, can influence the efficiency of the final model, so a balance has to be found between these data sets. We therefore simultaneously optimized (i) the ratio between development and test data set sizes, (ii) the maximum number of samples that remained unpartitioned, and (iii) the size of the (pruned) tree that gave the most accurate estimation for the test data set. We used a trial-and-error approach and developed regression trees using different combinations of these factors. The size of the development data set was varied between 21 and 491 in steps of 10. We used jackknife cross-validation (randomized subset selection without replacement or "bagging") to generate each of those data sets (Good, 1999). For each development data set size, 100 alternative subsets were generated, which subsequently allowed probability estimates of the partitioning and pruning process. The maximum allowed node size before mandatory partitioning was varied between 20 and 300, also in steps of 10. This provided 1392 different tree structures (48 development set sizes × 29 node sizes), each with 100 replicates. In the optimization procedure, the above data set and all available variables were used as input to estimate the log10-transformed Ks. Tree development was stopped when all terminal node sizes became smaller than the allocated maximum value.
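The settings grid and the random subset selection without replacement can be sketched as follows. This is an illustration in Python rather than the authors' MATLAB code; the total of 502 samples is inferred from the 411 + 91 split reported in the text, and the function names and random seed are our own assumptions.

```python
import random

N_SAMPLES = 502  # assumption: 411 development + 91 test samples, as reported in the text


def settings_grid():
    """All (development set size, maximum node size) pairs that were tried."""
    dev_sizes = range(21, 492, 10)        # 21, 31, ..., 491
    max_node_sizes = range(20, 301, 10)   # 20, 30, ..., 300
    return [(d, m) for d in dev_sizes for m in max_node_sizes]


def split_without_replacement(n_samples, n_dev, rng):
    """Randomly assign sample indices to development and test sets (no repeats)."""
    indices = list(range(n_samples))
    rng.shuffle(indices)
    return indices[:n_dev], indices[n_dev:]


grid = settings_grid()
rng = random.Random(0)  # fixed seed only so the sketch is reproducible
replicates = [split_without_replacement(N_SAMPLES, 411, rng) for _ in range(100)]
```

The grid length reproduces the 1392 setting combinations quoted above, and each replicate keeps the development and test samples disjoint.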
Estimations were made for all the samples in the test data set in each of the 100 replicates, first using the full tree models. Root mean squared residuals (RMSRs) were calculated over the 100 replicates, where RMSR is defined as
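$$\mathrm{RMSR} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(\hat{y}_i - y_i\right)^{2}} \qquad [1]$$
where $y_i$ and $\hat{y}_i$ are the measured and estimated values of log10(Ks) for sample $i$, and $N$ is the number of samples in the test data set.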
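As a small worked illustration, and assuming Eq. [1] has the conventional root-mean-square form applied to log10-transformed Ks, the RMSR of one replicate and its average over replicates could be computed as below. The function names and the toy numbers are hypothetical.

```python
import math


def rmsr(measured_log_ks, estimated_log_ks):
    """Root mean squared residual between measured and estimated log10(Ks)."""
    if len(measured_log_ks) != len(estimated_log_ks) or not measured_log_ks:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(e - m) ** 2 for m, e in zip(measured_log_ks, estimated_log_ks)]
    return math.sqrt(sum(squared) / len(squared))


def mean_rmsr(replicates):
    """Average RMSR over replicates, each given as a (measured, estimated) pair of lists."""
    values = [rmsr(m, e) for m, e in replicates]
    return sum(values) / len(values)


# Toy values (not from the study): one residual of 2 log units among three samples.
example = rmsr([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])
```

Because the residuals are in log10 units, an RMSR of 1 corresponds to an error of one order of magnitude in Ks.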
This analysis showed that using 411 samples in the development data set and the remaining 91 in the test data set (i.e., an 82/18% split), stopping tree development when the size of the largest node fell below 90, and pruning each resulting tree to five terminal nodes provided the optimal settings for regression tree development. We combined these settings with the jackknife cross-validation to once again generate 100 alternative data set realizations, which subsequently allowed probability estimates of the partitioning made by the input variables.
Comparison of Different Input Levels Using the Optimized Tree Settings
Six groups of input variables were derived (Table 1). Initially all these inputs were used to optimize settings for regression tree development and testing. We used both texture class and particle-size data despite their obvious overlap, and allowed the model to choose whichever variable it found most useful.
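To make concrete how a regression tree "chooses" among overlapping inputs, the sketch below picks the binary split that minimizes the summed squared residuals of the two child nodes. This least-squares criterion is the common CART convention and is an assumption here; the text does not state the criterion used in the authors' MATLAB code. The variable names and log10(Ks) values are invented for illustration.

```python
def sse(values):
    """Sum of squared deviations from the mean (0.0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_binary_split(rows, targets):
    """Pick the 0/1 variable whose split gives the lowest total SSE.

    rows: list of dicts mapping variable name -> 0 or 1
    targets: list of log10(Ks) values, one per row
    Returns (variable_name or None, resulting SSE).
    """
    best_name, best_cost = None, sse(targets)
    for name in sorted(rows[0]):
        absent = [t for r, t in zip(rows, targets) if r[name] == 0]
        present = [t for r, t in zip(rows, targets) if r[name] == 1]
        if not absent or not present:
            continue  # this variable does not separate the samples
        cost = sse(absent) + sse(present)
        if cost < best_cost:
            best_name, best_cost = name, cost
    return best_name, best_cost


# Hypothetical toy data: sandy samples have much larger log10(Ks).
rows = [
    {"TXT_sand": 1, "HOR_topsoil": 1},
    {"TXT_sand": 1, "HOR_topsoil": 0},
    {"TXT_sand": 0, "HOR_topsoil": 1},
    {"TXT_sand": 0, "HOR_topsoil": 0},
]
targets = [2.5, 2.4, 1.0, 0.9]
split_name, split_cost = best_binary_split(rows, targets)
```

In this toy case the sand indicator is selected because it separates the large and small values far better than the topsoil indicator does.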
To find the importance of each group of input variables in the estimation of Ks, we systematically eliminated groups of input variables and repeatedly developed the tree models. Typically, this would be done by first eliminating the group that gave the least improvement to the model. We did not follow this approach, however, as differences between models were so small (and insignificant) that eliminating one group and continuing from there could lead to a false local minimum on the error surface, which does not provide the best solution for the next step. Also, the aim was to identify as wide a range of input data as possible that could be used for the development of PTFs. Therefore, to find the importance of each group of input variables (Table 1), we tried all combinations of the input variable groups, from using only one group up to and including all six groups.
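Trying every combination of the six input groups amounts to enumerating all non-empty subsets. A minimal sketch, using the group abbreviations from the text and a helper name of our own:

```python
from itertools import combinations

GROUPS = ["HOR", "PED", "CRK", "TXT", "PSD", "BDO"]


def all_group_combinations(groups):
    """Every non-empty combination of input groups, from single groups up to all of them."""
    return [
        combo
        for size in range(1, len(groups) + 1)
        for combo in combinations(groups, size)
    ]


combos = all_group_combinations(GROUPS)
counts_by_size = {
    size: sum(1 for c in combos if len(c) == size) for size in range(1, 7)
}
```

With six groups this gives 63 candidate models (6, 15, 20, 15, 6, and 1 for one to six groups, respectively), each of which would then be developed and tested with the 100 replicates.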
Following the identification of the input variable combinations that probably provide the best estimation results, we analyzed the structure of the most likely tree models and the input variables that provided the most useful information toward estimating Ks. We also derived an estimate of Ks by a tree model that used popular and widely available input attributes.
| RESULTS AND DISCUSSION |
|---|
The outcome of this analysis is shown in Table 3. Each row shows the list of input groups used, the mean RMSR obtained by averaging 100 RMSR values (from the 100 replicates), the SD of the mean RMSR value, and the minimum and maximum RMSRs obtained for a single replicate. Table 3 shows the results in ascending order of the mean RMSR. The result using the full model (that is, all input data) is shown at Rank 20. The RMSR of a single replicate varied between 0.8338 and 1.1396 (average = 0.9696, SD = 0.0713). This implies that, on average, the estimation error was less than one order of magnitude when Ks was back-transformed to centimeters per day. This accuracy does not exceed that of existing PTFs; however, those that perform better are generally continuous PTFs, whereas regression tree models can be considered to be class PTFs. The estimation accuracy obtained, however, compares well with some continuous PTFs (e.g., Schaap and Leij, 2000; Schaap et al., 2001).
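The "less than one order of magnitude" statement can be checked with simple arithmetic: a residual expressed in log10 units corresponds to a multiplicative factor of 10 raised to that residual. The helper name below is ours; the three RMSR values are those quoted in the text.

```python
def error_factor(rmsr_log10):
    """Multiplicative error implied by a residual expressed in log10 units."""
    return 10.0 ** rmsr_log10


mean_factor = error_factor(0.9696)    # average RMSR quoted in the text
best_factor = error_factor(0.8338)    # smallest single-replicate RMSR quoted
worst_factor = error_factor(1.1396)   # largest single-replicate RMSR quoted
```

The average corresponds to a factor somewhat below 10, whereas the worst single replicate exceeds one order of magnitude.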
Considering the small differences in mean RMSR, it can be argued that those models in the top quarter of the list do not provide any significantly better results than those in the second quarter. Also, any of those models in the same upper portion of the table that use fewer input variables provide as good results as other models in the same portion or below that use more input variables. The best models within each group, based on the number of input variable groups used (one to six), are marked in Table 3. The best model that used five input variable groups (Rank 5) did not use Db or OM content, and provided better results than the model that used all available inputs. The best model that used only four input variable groups (Rank 3) omitted textural classification as input but used particle size class (proportion of sand, silt, and clay). The best model that used only three groups of inputs was the best model overall and used HOR, ped size information (PED), and textural classification (TXT). Using only two variable groups resulted in some loss of estimation accuracy, with the best of such models being ranked in 10th place and using only PED and TXT, that is, only qualitative type input information.
There were several combinations of three or four input variable groups that yielded as good results as the model that used all six groups of input variables. Most notable is the best model overall (Rank 1, Table 3). Other notable models are those ranked 6 and 8, which used four and three variable groups, respectively, but only used information that does not require laboratory measurements. These attributes include HOR, hand texture (TXT), and morphological descriptions of soil structure (such as those in groups PED and CRK) and can all be determined in the field. Such information is often collected during routine profile description and is therefore available in most existing soils databases. Another notable model is that ranked 30. This model used only HOR and TXT, and is among the best models that used only two input variable groups. The accuracy of estimations by this model is not significantly worse than those made by the overall best model (Table 3). The importance of this model is that all the necessary information can be collected with minimum effort in the field, for example, with the aid of a soil auger. There was no significant difference in mean RMSRs between any of the above-mentioned cases.
In summary, it was shown that using the regression tree technique, the estimation of Ks was possible using only qualitative texture- and structure-related soils information that generally resides in most national and regional soil survey databases, without losing estimation accuracy.
Identification of Tree Structures and Potentially Useful Input Attributes
Four of the models were selected for further discussion (underlined in Table 3). The model ranked 20, which used data from all available input variable groups, was used to identify the most important individual input fields among those that were available. The remaining models are: the model ranked 1 (that is, the best overall model and one that used only qualitative inputs), the model ranked 29 (the best model that used only structure-related inputs) and the model ranked 30, which is the model that used the most readily determined attributes (HOR and TXT) and still performed well.
As already indicated, the best performing pruned trees had around five terminal nodes. Therefore, we used five (or fewer) terminal nodes in the models that are discussed further. As the theoretical maximum number of terminal nodes at each level, m, is 2^m, five terminal nodes can be obtained using either three or four partition levels, depending on the position of the partitions. For this reason, we only discuss the branching of trees to a maximum of four levels.
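The three-or-four-level statement follows from two bounds on a binary tree: at most 2^m terminal nodes can exist after m levels of partitioning, and a tree with n terminal nodes needs n − 1 splits, which occupy at most n − 1 levels when each split sits on its own level. A short check, with helper names of our own choosing:

```python
def min_levels(n_terminal):
    """Fewest partition levels m such that 2**m >= n_terminal (balanced tree)."""
    m = 0
    while 2 ** m < n_terminal:
        m += 1
    return m


def max_levels(n_terminal):
    """Most partition levels: each of the n - 1 binary splits on its own level."""
    return n_terminal - 1


levels_for_five = (min_levels(5), max_levels(5))
```

For five terminal nodes the bounds are three and four levels, which is why branching is only discussed down to the fourth level.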
The probabilities of individual input fields appearing at a particular level of partitioning when using all possible inputs are shown in Table 4. The last column shows the probability of the particular input appearing at least once at this level in the tree model. The penultimate column shows the frequency with which input variables appeared at a particular level. As the same variable may appear more than once at the same level but on different branches, this frequency can exceed 1.
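The distinction between the two columns can be made concrete with a small sketch. It assumes that, for each replicate tree, we have a list of the variables used for splitting at one level (one entry per branch); the per-variable layout, the function name, and the toy replicates are our assumptions and may not match the exact layout of Table 4.

```python
from collections import Counter


def level_statistics(replicates):
    """Summarize variable usage at one tree level across replicate trees.

    replicates: list of lists; each inner list names the splitting variables
    used at this level in one replicate (a variable may repeat across branches).
    Returns (frequency, probability) dicts keyed by variable name:
      frequency   = mean number of appearances per replicate (can exceed 1)
      probability = share of replicates in which it appears at least once
    """
    n = len(replicates)
    total, present = Counter(), Counter()
    for used in replicates:
        total.update(used)         # counts every appearance
        present.update(set(used))  # counts each replicate at most once
    frequency = {v: total[v] / n for v in total}
    probability = {v: present[v] / n for v in present}
    return frequency, probability


# Hypothetical replicates: HOR splits two branches in some trees.
toy = [["HOR", "HOR"], ["HOR"], ["OM", "HOR"], ["HOR", "HOR"]]
frequency, probability = level_statistics(toy)
```

Here HOR has a frequency of 1.5 but a probability of 1.0, which shows how only the frequency can exceed 1.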
In 85% of cases, partitioning of all other samples at the second level was based on the largest ped size class (PS7). In all such cases, the primary partitioning variable was sand content. Topsoil or subsoil distinction (HOR) and sand content were other important variables at this level, with 8 and 6% probabilities, respectively.
Organic matter content, Db, and HOR appeared most frequently as partitioning variables at the tertiary level, with frequencies of 49, 21, and 14%, respectively. These properties appear to be closely related to each other. Topsoil would typically have greater OM content and lesser Db than subsoil. Greater biological activity by both plants and animals will generally lead to the development of more stable aggregates and smaller peds in topsoil and, therefore, to larger Ks values. In a total of 15% of all cases, the largest ped size class (PS7), silt and clay content, and cracking in both directions (BOTH) appeared as partitioning factors at this tertiary level. The dominance of a single variable diminished at the fourth partitioning level, with 13 variables involved in the partitioning of the data set. Variables that appear infrequently at a higher level typically become very frequent partitioning variables at the next level.
In summary, coarse-textured horizons and those with the largest ped sizes appear to be the most important factors in delineating groups of soils with similar Ks. The variability of Ks within the rest of the soil horizons appears to be so large that it is more difficult to isolate different groups from each other, as shown by the large variety of partitioning variables at lower levels. A second indication of large within-group variability is that fewer input variables are sufficient to provide the same estimation accuracy by the tree models (Table 3). Such models are discussed next.
Tree Structures and Potentially Useful Input Attributes Derived Using Fewer Input Variables
The best model used only HOR, PED, and TXT as inputs. Unsurprisingly, the most important partitioning variables in that model were identical to, or representative of, those in the model that used all six input variable groups. Measured sand content did not occur in this model, but the primary partitioning variable was the sand texture class in 77% of all replicates. The largest ped size class (PS7, peds >100 mm) was the primary partitioning variable in the remaining 23% of cases. These two variables were the most important secondary partitioning variables when the other was the primary partitioning variable. Topsoil or subsoil distinction (HOR) was the third most frequent variable that appeared at the secondary level (6% of cases). There was only one partition at the secondary level in all replicates, as the coarse (sand) textured soils and soils with the largest peds that were separated at the first level were not partitioned any further (Table 5). The horizon being topsoil or subsoil was by far the most important tertiary partitioning factor, appearing in 87% of models and providing 92% of all variables that appeared at this level. Three different texture classes, mostly delineating fine-textured soils, and four different ped size classes appeared at this level at a probability of 6% or less. Different texture classes as well as ped size classes appeared at the fourth level. The most dominant partitioning variables at this level were fine texture classes and the smallest ped size classes. The most likely tree model to appear, using these input groups with five terminal nodes, is similar to that shown in Fig. 1.
Two more models of interest are the models that use (i) HOR and PED only and (ii) HOR and TXT only. The first of these two models was the second best tree model in our study that used only two variable groups as input (Table 3). Those two groups consisted of only qualitative variables that could be determined directly in the field. This model was, on average, less accurate than the full model that used all groups of input variables, but the difference was not significant. Table 6 and Fig. 2 show the list, rank, and importance of input variables for this model. Three variables dominated the first two levels: the smallest and largest ped size classes (PS1 and PS7, respectively) and HOR. While we have seen HOR and PS7 as important partitioning variables in the models discussed above, PS1 has not appeared at the primary or secondary levels before. The largest ped-size class (PS7) was the main partitioning variable in all but one case, and the smallest ped size class was the dominant partitioning variable at the secondary level (98%). For all non-PS7 and -PS1 horizons, HOR became the dominant tertiary partition (97%). The variables PS1 and PS7 did not induce any further partitions in the data. All but two ped size classes (PS1 and PS7) appeared at the fourth level, as those two had already appeared at a higher level in all replicates.
Another model that performed well using only two input groups has similar estimation accuracy to that above. Its significance is that it used only HOR and TXT as input. Table 7 shows the list, ranking, and importance of input variables for this model and Fig. 3 shows the most likely tree model to appear using these inputs. The top two levels of partitioning occur due to differences in Ks among the sand texture class (first level) and the HOR variables (second level). In this case, however, no other variable appeared at the first or second levels in any of the replicates. At the third level, topsoil and subsoil were further partitioned in all replicates, meaning that in all cases three levels were enough to obtain five terminal nodes on the tree. Most frequently, silty clay texture was the partitioning variable for subsoil (89%) and clay loam or clay for topsoil (41 and 28% respectively), meaning that finer textured soils were separated from all others.
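The most frequent partitioning sequence described above (sand class first, then HOR, then a fine texture class within each horizon type) can be written as nested conditions. This is a sketch only: it uses the most frequent third-level variables reported in the text (silty clay for subsoil, clay loam for topsoil), the node labels are hypothetical, and no Ks values are assigned because the node means are given in the original figure rather than in the text.

```python
def terminal_node(hor, txt):
    """Assign a sample to one of five terminal nodes of the HOR + TXT tree sketch.

    hor: "topsoil" or "subsoil"
    txt: USDA texture class name, e.g. "sand", "silty clay", "clay loam"
    """
    if txt == "sand":                      # level 1: sand vs. all other textures
        return "sand"
    if hor == "subsoil":                   # level 2: topsoil vs. subsoil
        if txt == "silty clay":            # level 3 (subsoil branch)
            return "subsoil, silty clay"
        return "subsoil, other textures"
    if txt == "clay loam":                 # level 3 (topsoil branch)
        return "topsoil, clay loam"
    return "topsoil, other textures"


nodes = {
    terminal_node(h, t)
    for h in ("topsoil", "subsoil")
    for t in ("sand", "silty clay", "clay loam", "loam")
}
```

Three levels of conditions are enough to produce the five terminal nodes, in line with the observation that three levels sufficed in all replicates of this model.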
As expected, topsoils had greater Ks values than subsoils. Typically, more coarse-textured soils had larger Ks values, with the exception of clay vs. nonclay topsoils. The mean Ks value for the clay topsoil node is relatively large and significantly greater than for any other non-sand group in the same tree. While this appears unexpected based solely on the texture of those soils, it can be explained by considering the soil structure. Most clay-textured soils in this data set displayed both horizontal and vertical cracking (67%) and 75% had aggregates between 5 and 50 mm, allowing water to flow through the many structural cracks. Average RMSRs for the test data set were slightly less (0.9711 if clay loam and 0.9702 if clay) than the corresponding values shown in Table 3. This may be due to the fact that only the most dominant model structures were used; however, the difference was not significant.
| CONCLUSIONS |
|---|
We used regression trees to explore the potential for developing class PTFs to predict Ks. While the ability of PTFs to accurately estimate soil properties is often closely linked to the underlying data, the resampling technique used in this study enabled us to simulate the likely outcome had a number of different data sets been used, making the results more widely applicable.
The most accurate regression tree model used only horizon designation, ped size, and soil texture class as input variables, all of which require no laboratory measurement and are widely available in soil morphological databases. Interestingly, in the majority of cases, the major partition associated with these input variables was both textural (sand texture) and structural (massive structures or peds >100 mm), which separated those soils with fast conductivities in coarse textures from those with slower conductivities with dominantly macropore flow through few widely spaced pores.
Although there is large heterogeneity observed in Ks values even among soils with similar properties, it was found that five terminal nodes was the optimum number of partitions for this data set. Although further partitions were possible, the development of larger trees did not appear to improve the predictive capability of this technique. Instead, further partitioning may simply uncover data structures within the development data set that do not exist within any other test or application data sets, leading to increasing estimation errors. This may also highlight a limitation in using the fairly crude estimator of ped size. Estimates of biopore volume, for example, may help in the development of more robust partitioning. Also, this approach may benefit from the use of novel techniques and collection of new types of information related to pore structure. Until such new information becomes available, however, the performance of this modeling approach suggests that existing qualitative or structure-related input data can be successfully used to estimate Ks.
Despite the relatively few homogenous groups identified by the regression tree approach, there is a considerable reduction in the uncertainty of the estimations within each group. While using the data set mean would have an uncertainty of more than an order of magnitude for this data set, the uncertainty in the estimates for the tree nodes was reduced to less than 0.2 of an order of magnitude, and in most cases under 0.1. The practical benefit of developing pedotransfer functions to predict soil hydraulic conductivity is in providing input data for soil water and solute transport modeling. Modeling studies increasingly focus on estimating the uncertainty of the results along with the main output; therefore, large uncertainty in Ks would propagate in the simulation models to yield final outputs that have undesirably wide confidence intervals. Our models help reduce uncertainty, and they do so by using inexpensive, widely available, and often undervalued soil information.
Overall, we have shown that simple, routinely collected qualitative soil attributes such as texture and structure that are often underutilized can be used to estimate saturated hydraulic conductivity and thereby support simulation modeling.
| ACKNOWLEDGMENTS |
|---|
| NOTES |
|---|
Received for publication November 11, 2006.
| REFERENCES |
|---|
Šimůnek, J., M.Th. van Genuchten, and M. Šejna. 2005. The HYDRUS-1D software package for simulating the movement of water, heat, and multiple solutes in variably saturated media. Version 3.0. HYDRUS Software Ser. 1. Dep. of Environ. Sci., Univ. of California, Riverside.