SSSAJ Journal of Natural Resources and Life Sciences Education
Published online 11 January 2008
Published in Soil Sci Soc Am J 72:16-24 (2008)
DOI: 10.2136/sssaj2006.0391
© 2008 Soil Science Society of America
677 S. Segoe Rd., Madison, WI 53711 USA

SOIL PHYSICS

Probabilistic Approach to the Identification of Input Variables to Estimate Hydraulic Conductivity

A. Lilly (a,*), A. Nemes (b,c), W. J. Rawls (d), and Ya. A. Pachepsky (e)

(a) Macaulay Land Use Research Institute, Craigiebuckler, Aberdeen AB15 8QH, Scotland, UK
(b) Univ. of Maryland, Dep. of Plant Science and Landscape Architecture, 2102 Plant Science Building, College Park, MD 20742
(c) USDA-ARS Crop Systems and Global Change Lab., 10300 Baltimore Ave., Bldg. 001, BARC-West, Beltsville, MD 20705
(d) USDA-ARS Hydrology and Remote Sensing Lab., 10300 Baltimore Ave., Bldg. 007, BARC-West, Beltsville, MD 20705
(e) USDA-ARS Environmental Microbial Safety Lab., Powder Mill Rd., Bldg. 173, BARC-East, Beltsville, MD 20705

* Corresponding author (a.lilly{at}macaulay.ac.uk).


    ABSTRACT
 
Soil hydrologic data are required for catchment-scale modeling, but these data are often difficult and costly to obtain. Although pedotransfer functions (PTFs) have been used to generate these data, they are not easily transferable to other bioclimatic zones. As climate influences the development of soil structure, the incorporation of soil structure assessments may improve the effectiveness of pedotransfer functions. The objective of this study was to examine which types of categorical texture and structure data would be most useful in either improving current PTFs to estimate saturated hydraulic conductivity (Ks) or allowing PTFs to be developed in areas where measured particle-size distribution, organic matter (OM) content, and bulk density (Db) are lacking. Because soil structure data are categorical, regression trees were used to determine which input data derived from the HYPRES database would be most useful in deriving new PTFs. Jackknife cross-validation was used to generate randomized subsets of the data, and the optimal size of the developmental (n = 411) and test (n = 91) data sets was derived experimentally. The relative importance of input variables was evaluated by considering the probability that the data were partitioned by each variable. The best model utilized field-based information on soil horizon, soil structure (ped size), and soil textural class and, although its accuracy was no better than that of existing continuous PTFs, it has the added benefit of utility in data-poor environments.

Abbreviations: HOR, topsoil or subsoil distinction • OM, organic matter • PED, ped size information • PS, ped size class • PSD, particle-size distribution • PTF, pedotransfer function • RMSR, root mean squared residual • TXT, texture class


    INTRODUCTION
 
Soil hydrologic data such as soil moisture retention and hydraulic conductivity are basic requirements of many soil water flow models such as HYDRUS (Simunek et al., 2005), the CoupModel (Jansson and Karlberg, 2004), and WAVE (Vanclooster et al., 1994). There is often insufficient soil hydrologic data available to parameterize these models, however, particularly for studies at the catchment scale or greater. Even simple water balance models (Dunn et al., 2004) require information about soil moisture retention that is not always readily available. As these soil hydrologic properties are often difficult, costly, and time consuming to collect, pedotransfer functions (PTFs) have been developed to overcome this lack of data. Many of these PTFs are statistical regression equations between the required soil hydrologic data and more readily measured properties such as particle-size distribution, bulk density (Db), and organic matter (OM) content (e.g., Wösten et al., 1999; Saxton and Rawls, 2006). Recently, other techniques, such as artificial neural networks, have been used (e.g., Minasny and McBratney, 2002; Schaap et al., 1998, 2001) to develop PTFs. Pachepsky and Rawls (2004) and Cornelis et al. (2001) included descriptions and evaluations of many of these techniques.

Pedotransfer functions that are based on the statistical relationship between soil hydrologic properties and soil texture alone are often not readily transferable to other climatic zones (Wösten et al., 2001; O'Connell and Ryan, 2002), as soils with similar textures do not necessarily develop the same soil structure under different moisture or thermal regimes. Wagner et al. (1998) reported that PTFs based only on soil texture gave poor results in structured soils. As soil structure (the organization of soil particles into aggregates or peds) is a function of the interaction between soil texture and climate, PTFs that incorporate soil structure in the prediction of soil hydrologic properties should have a greater degree of transferability than regional, texture-based models.

Soil morphology and soil structure have long been used to infer soil hydraulic properties (O'Neal, 1949; King and Franzmeier, 1981; McKeague et al., 1982; Coen and Wang, 1989; Boorman et al., 1995; Lin et al., 1999; Rawls and Pachepsky, 2002). Soil structure, as described by soil surveyors, provides information on macroporosity and the dominant pathways of water movement through the soil under saturated and near-saturated conditions. It is known to have significant impact on soil hydraulic properties (Lilly and Lin, 2004), but is rarely represented directly in soil hydraulic PTFs with some exceptions (Rawls and Pachepsky, 2002; Pachepsky and Rawls, 2003). Most PTFs use input parameters that are indirectly related to soil structure, such as Db, OM content, and topsoil vs. subsoil distinctions. No clear recommendation exists on what structural indicators might be the most significant in relation to the estimation of soil hydraulic properties. As morphological descriptions of soil structure are often routinely collected during field sampling, it is useful to explore the possibility of utilizing these data and to identify if any of these categorical data can be used to improve PTFs. If this proves to be possible, it will capitalize on this vast store of data and greatly improve the value of these morphological databases.

The objective of this study was to use regression trees to examine which types of soil texture- and structure-related categorical data provide the most useful information for estimating saturated hydraulic conductivity (Ks) and thus could be used either to improve current PTFs or to allow PTFs to be developed in areas where measured particle size distribution, OM content, and Db are lacking.


    MATERIALS AND METHODS
 
Input Data
Data from the Hydraulic Properties of European Soils (HYPRES) database (Wösten et al., 1999) were used in this study. This database comprises data from 12 European Union member states and covers 95 different soil types (classification according to the Commission of the European Communities, 1985). There are 5521 samples (including replicates) from 4486 soil horizons of 1777 spatially referenced soil profiles. The database holds information on water retention, hydraulic conductivity, soil particle size distribution (PSD), soil OM content, morphological descriptions of soil structure, and horizon designation for a large number of soil horizons. These soil structure assessments comprise categorical information on the shape, size, and grade or distinctness of peds (Soil Survey Division Staff, 1993) and indicate the frequency of macropores within the horizon. With a few exceptions, only those horizons that had information on all of these characteristics and where the OM contents were <12% (that is, mineral or humose horizons only) were used in this study. The input data set also had 86 horizons where the ped size was not known but the structure type had been recorded, and 35 horizons that were described as being apedal but with no indication as to whether these were massive or single grain. The Ks values were directly measured or derived by methods that require fitting a parameter set to experimentally derived data (for example, the wind evaporation method). This selection procedure resulted in a data set with 502 records.

As ped size classes vary according to the structure type, the size range of peds (millimeters) was used instead, as Ks is expected to be influenced by the distance between macropores. Soil structure type was reclassified according to the orientation of the cracks between peds: horizontal, vertical, or both (for example, blocky or granular structures).

Table 1 shows the variables that were used in this study and the number of samples in each of these classes. Input variables were arranged in six groups. Group 1 (HOR) indicated whether the particular sample was from topsoil or subsoil. Group 2 (PED) indicated ped size, and those cases where ped size was not recorded but ped type was. Group 3 (CRK) indicated the orientation of structural cracks (if present) or the classification of apedal soils into three groups. Group 4 (TXT) indicated to which of the 12 USDA texture classes the particular sample belonged. Groups 5 (PSD) and 6 (BDO) represented the quantitative variables that are most commonly used in the estimation of soil hydraulic properties. Group 5 represented particle-size distribution according to the USDA and FAO system (<2, 2–50, and 50–2000 µm), while the last group contained Db and OM content values. Each categorical variable was quantified by applying 1 for presence and 0 for absence before statistical analyses. Table 2 shows summary statistics of the data set used in this study in terms of the quantitative input and output attributes.
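The 1/0 coding of categorical variables described above can be sketched as follows. This is a minimal illustration only: the class list is a shortened, hypothetical stand-in for the classes in Table 1, and the function names are ours, not part of the original MATLAB code.

```python
# Hypothetical, shortened class list; the real classes are listed in Table 1.
TEXTURE_CLASSES = ["sand", "loamy sand", "clay loam", "silty clay", "clay"]

def one_hot(value, classes):
    """Return a list of 0/1 indicators, one per class, with 1 marking `value`."""
    if value not in classes:
        raise ValueError(f"unknown class: {value!r}")
    return [1 if value == c else 0 for c in classes]

def encode_horizon(is_topsoil, texture):
    """Encode HOR as a single 0/1 flag followed by the texture indicators."""
    return [1 if is_topsoil else 0] + one_hot(texture, TEXTURE_CLASSES)

row = encode_horizon(True, "clay loam")
print(row)  # [1, 0, 0, 1, 0, 0]
```

Coding each class as its own 0/1 column lets a tree algorithm treat categorical and numerical predictors in the same way, since every split reduces to a threshold on one column.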



 
Table 1. Description and grouping of input variables and of the output variable.

 


 
Table 2. Summary statistics of the quantitative soils data used in this study.

 
Regression Tree Modeling
Regression tree modeling is an exploratory technique that uncovers structure in data by partitioning data first into two groups; each group is then further subdivided into two subgroups, providing groups as homogeneous as possible at each of the levels (Clark and Pregibon, 1992). Each partition can be viewed as a branching. Regression trees have been used in the estimation of soil properties by van Lanen et al. (1992), McKenzie and Jacquier (1997), McKenzie and Ryan (1999), Rawls and Pachepsky (2002), and Rawls et al. (2003) and can use both categorical and numerical variables as predictors (Breiman et al., 1993).
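A single partitioning step of this kind can be sketched as a search for the two-way split that leaves the least within-group deviance. The sketch below is a generic least-squares split under our own assumptions (sum of squared deviations as the deviance measure, midpoints as candidate thresholds, made-up data); it is not the authors' MATLAB implementation.

```python
def deviance(values):
    """Sum of squared deviations from the mean (0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def best_split(rows, targets):
    """Find the (column, threshold) whose two-way split leaves the least deviance.

    rows    -- list of equal-length lists of numeric predictors (0/1 indicators work too)
    targets -- list of response values, e.g. log10(Ks)
    Returns (column, threshold, remaining_deviance), or None if no split is possible.
    """
    best = None
    for col in range(len(rows[0])):
        levels = sorted({r[col] for r in rows})
        # Candidate thresholds lie midway between consecutive observed values.
        for lo, hi in zip(levels, levels[1:]):
            thr = (lo + hi) / 2
            left = [t for r, t in zip(rows, targets) if r[col] <= thr]
            right = [t for r, t in zip(rows, targets) if r[col] > thr]
            total = deviance(left) + deviance(right)
            if best is None or total < best[2]:
                best = (col, thr, total)
    return best

# Made-up data: column 0 = sand content (%), column 1 = topsoil flag (0/1).
rows = [[90, 1], [85, 0], [40, 1], [35, 0], [20, 1], [15, 0]]
log_ks = [2.4, 2.2, 1.1, 0.6, 0.9, 0.4]
col, thr, dev = best_split(rows, log_ks)
print(col, thr, round(dev, 3))  # 0 62.5 0.31
```

Applying the same search recursively to each resulting group would grow the full tree; a stopping rule on node size then decides when a group is left unpartitioned.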

The regression tree algorithm used in our study was coded in MATLAB 5.2 (The MathWorks, 1996). Developing a tree model requires setting a criterion that halts further partitioning of the data. Optimal estimations are rarely reached using the full tree model; therefore, another criterion, a pruning factor, is usually set to avoid overfitting. While general recommendations for such settings exist, we found that the best settings are often specific to the data set, so we used part of the data set to find the optimal settings for this task. The ratio of the development and test data set sizes, however, can influence the efficiency of the final model, so a balance has to be found between these data sets. We therefore simultaneously optimized (i) the ratio between development and test data set sizes, (ii) the maximum number of samples that remained unpartitioned, and (iii) the size of the (pruned) tree that gave the most accurate estimation for the test data set. We used a trial-and-error approach, developing regression trees with different combinations of these factors. The size of the development data set was varied between 21 and 491 in steps of 10. We used jackknife cross-validation (randomized subset selection without replacement or "bagging") to generate each of those data sets (Good, 1999). For each development data set size, 100 alternative subsets were generated, which subsequently allowed probability estimates of the partitioning and pruning process. The maximum allowed node size before mandatory partitioning varied between 20 and 300, also in steps of 10. This provided 1392 different tree structures (48 development set sizes × 29 node sizes), each with 100 replicates. In the optimization procedure, the above data set and all available variables were used as input to estimate the log10-transformed Ks. Tree development was stopped when all terminal node sizes became smaller than the allocated maximum value.
Estimations were made for all the samples in the test data set in each of the 100 replicates, first using the full tree models. Root mean squared residuals (RMSRs) were calculated over the 100 replicates, where RMSR is defined as

$$\mathrm{RMSR} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left[\log\left(K_{s\,i,\mathrm{meas}}\right) - \log\left(K_{s\,i,\mathrm{est}}\right)\right]^{2}} \qquad [1]$$
where $n$ is the number of samples in the data set, and $\log(K_{s\,i,\mathrm{meas}})$ and $\log(K_{s\,i,\mathrm{est}})$ are log10-transformed measured and estimated Ks values, respectively. Trees were then pruned: branch partitions (nonterminal nodes) were collapsed one by one in the order that added the least deviance (D), that is, the least nonhomogeneity, to the remaining tree model (based on the development data), and estimations were again made using the pruned trees. Pruning was continued until we obtained only one partition, that is, two terminal nodes. The RMSRs were then averaged over the 100 replicates for each development data set size, each maximum terminal node size, and each terminal node count. The size of the full tree varied from replicate to replicate, depending on the actual development data set. We then sorted the resulting matrix of information by the average RMSR plus one standard deviation and examined the running average of each of the three optimized variables of the best performing models. We used the mean RMSR + 1 SD as a criterion and the running average to eliminate outliers.
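Reading the name literally, the RMSR is the square root of the mean squared difference between measured and estimated log10(Ks), and the ranking criterion is its mean plus one standard deviation across replicates. A minimal sketch follows; the use of the sample (n − 1) standard deviation is our assumption, since the text does not say which form was used, and the numbers are invented.

```python
import math
import statistics

def rmsr(log_meas, log_est):
    """Root mean squared residual of log10-transformed values."""
    if len(log_meas) != len(log_est) or not log_meas:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(m - e) ** 2 for m, e in zip(log_meas, log_est)]
    return math.sqrt(sum(squared) / len(squared))

def ranking_score(rmsr_per_replicate):
    """Mean RMSR plus one sample standard deviation across replicates (assumed form)."""
    return statistics.mean(rmsr_per_replicate) + statistics.stdev(rmsr_per_replicate)

# Made-up numbers for illustration only.
example = rmsr([1.0, 2.0, 0.5], [1.5, 1.0, 0.5])
print(round(example, 4))  # 0.6455
print(round(ranking_score([0.9, 1.0, 1.1]), 4))  # 1.1
```

Adding one standard deviation to the mean penalizes settings whose accuracy varies strongly between replicates, which is one plausible reason for preferring it to the mean alone.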

This analysis showed that the optimal settings for regression tree development were to use 411 samples in the development data set and the remaining 91 in the test data set (i.e., an 82/18% split), to stop tree development when the size of the largest node fell below 90, and to prune each resulting tree to five terminal nodes. We combined these settings with the jackknife cross-validation to once again generate 100 alternative data set realizations, which subsequently allowed probability estimates of the partitioning made by the input variables.
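The size of the settings grid and the random development/test splits described in this section can be checked with a short sketch. The seed and function name are arbitrary choices of ours, and indices stand in for the 502 HYPRES records.

```python
import random

# Settings grid as described in the text.
dev_sizes = list(range(21, 492, 10))    # 21, 31, ..., 491
node_sizes = list(range(20, 301, 10))   # 20, 30, ..., 300
grid = [(d, s) for d in dev_sizes for s in node_sizes]
print(len(dev_sizes), len(node_sizes), len(grid))  # 48 29 1392

def random_split(n_total, n_dev, rng):
    """Draw n_dev development indices without replacement; the rest form the test set."""
    indices = list(range(n_total))
    rng.shuffle(indices)
    return sorted(indices[:n_dev]), sorted(indices[n_dev:])

rng = random.Random(0)  # fixed seed, chosen here only for repeatability
replicates = [random_split(502, 411, rng) for _ in range(100)]
dev, test = replicates[0]
print(len(dev), len(test))  # 411 91
```

Each replicate partitions the same 502 records differently, so statistics gathered across the 100 trees reflect sensitivity to the choice of development data.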

Comparison of Different Input Levels Using the Optimized Tree Settings
Six groups of input variables were derived (Table 1). Initially all these inputs were used to optimize settings for regression tree development and testing. We used both texture class and particle-size data despite their obvious overlap, and allowed the model to choose whichever variable it found most useful.

To find the importance of each group of input variables in the estimation of Ks, we systematically eliminated groups of input variables and repeatedly developed the tree models. Typically, this would be done by first eliminating the group that gave the least improvement to the model. We did not follow this approach, however, as differences between models were so small (and insignificant) that eliminating one group and continuing from there could lead to a false local minimum on the error surface, which does not provide the best solution for the next step. Also, the aim was to identify as wide a range of input data as possible that could be used for the development of PTFs. Therefore, to find the importance of each group of input variables (Table 1), we tried all combinations of the input variable groups, from using only one group up to and including all six groups.
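Trying every combination of the six input groups amounts to enumerating all non-empty subsets, which gives 2^6 − 1 = 63 cases. A minimal sketch, using the group labels from Table 1 (the function name is ours):

```python
from itertools import combinations

GROUPS = ["HOR", "PED", "CRK", "TXT", "PSD", "BDO"]

def all_group_combinations(groups):
    """Return every non-empty subset of `groups`, smallest subsets first."""
    return [combo
            for size in range(1, len(groups) + 1)
            for combo in combinations(groups, size)]

combos = all_group_combinations(GROUPS)
print(len(combos))  # 63, i.e. 2**6 - 1
```

An exhaustive search like this is feasible only because the number of groups is small; it avoids the order dependence of stepwise elimination that the text describes.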

Following the identification of the input variable combinations that probably provide the best estimation results, we analyzed the structure of the most likely tree models and the input variables that provided the most useful information toward estimating Ks. We also derived an estimate of Ks by a tree model that used popular and widely available input attributes.


    RESULTS AND DISCUSSION
 
Comparison of Different Input Levels Using the Optimized Tree Settings
To assess the importance of each input variable group (Table 1), we systematically eliminated groups of input variables, beginning with all six, and repeatedly redeveloped the tree models using the optimized settings. This method of successively removing one input variable group resulted in 63 unique combinations. We then redeveloped the tree models 100 times in each case, using 100 sets of development and test data sets generated by randomized subset selection. The tree models were subsequently pruned to five terminal nodes where necessary. If the model had fewer than five terminal nodes, we used the model that had the nearest number of terminal nodes.

The outcome of this analysis is shown in Table 3. Each row shows the list of input groups used, the mean RMSR obtained by averaging 100 RMSR values (from the 100 replicates), the SD of the mean RMSR value, and the minimum and maximum RMSRs obtained for a single replicate. Table 3 shows the results in ascending order of the mean RMSR. The result using the full model (that is, all input data) is shown at Rank 20. The RMSR of a single replicate varied between 0.8338 and 1.1396 (average = 0.9696, SD = 0.0713). This implies that, on average, the estimation error was less than one order of magnitude when Ks was back-transformed to centimeters per day. This accuracy does not exceed that of existing PTFs; however, those that perform better are generally continuous PTFs, whereas regression tree models can be considered to be class PTFs. The estimation accuracy obtained, however, compares well with some continuous PTFs (e.g., Schaap and Leij, 2000; Schaap et al., 2001).
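The "less than one order of magnitude" reading follows from back-transforming the log10 RMSR into a multiplicative factor. A short sketch of that arithmetic, using the three RMSR values quoted above (the function name is ours):

```python
def fold_error(rmsr_log10):
    """Multiplicative error factor implied by an RMSR expressed in log10 units."""
    return 10 ** rmsr_log10

for label, value in [("min", 0.8338), ("mean", 0.9696), ("max", 1.1396)]:
    print(f"{label}: RMSR = {value} -> about {fold_error(value):.1f}-fold")
```

An RMSR of exactly 1 corresponds to a factor of 10, so the mean value of 0.9696 sits just under one order of magnitude, while the largest single-replicate value of 1.1396 lies slightly above it.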



 
Table 3. Performance of tree models, in terms of estimation of the root mean squared residual (RMSR), using 63 different combinations of input variable groups. Models selected for further analysis are underlined.

 
The overall mean RMSRs did not differ greatly among many of the input combinations (Table 3). The difference between the best and worst models is 0.128 log10(cm d–1). The same gradual change can be seen in the minimum and maximum observed RMSRs per replicate. The model that utilized all the available input variable groups did not perform the best. It can be argued that the use of the PSD and TXT groups will lead to overparameterization of the model; however, even omitting one of those groups did not result in the best solution, with these models being ranked 14 and 18 (Table 3). All of the best models used the HOR qualifier and ped-size classes as input. Information about the presence of structural cracks and apedality appeared in about half of the best models and, in most cases, particle-size distribution and textural classification appeared interchangeably. Most interestingly, Db and OM content did not feature in the best six models.

Considering the small differences in mean RMSR, it can be argued that the models in the top quarter of the list do not provide significantly better results than those in the second quarter. Also, models in the upper portion of the table that use fewer input variables provide results as good as those of other models in the same portion or below that use more input variables. The best model within each group defined by the number of input variable groups (one to six) is marked with a dagger (†) in Table 3. The best model that used five input variable groups (Rank 5) did not use Db or OM content, and provided better results than the model that used all available inputs. The best model that used only four input variable groups (Rank 3) omitted textural classification as input but used particle size class (proportion of sand, silt, and clay). The best model that used only three groups of inputs was the best model overall and used HOR, ped size information (PED), and textural classification (TXT). Using only two variable groups resulted in some loss of estimation accuracy, with the best of such models being ranked in 10th place and using only PED and TXT, that is, only qualitative input information. There were several combinations of three or four input variable groups that yielded results as good as those of the model that used all six groups of input variables. Most notable is the best model overall (Rank 1, Table 3). Other notable models are those ranked 6 and 8, which used four and three variable groups, respectively, but only used information that does not require laboratory measurements. These attributes include HOR, hand texture (TXT), and morphological descriptions of soil structure (such as those in groups PED and CRK) and can all be determined in the field. Such information is often collected during routine profile description and is therefore available in most existing soils databases. Another notable model is that ranked 30. This model used only HOR and TXT, and is among the best models that used only two input variable groups. The accuracy of estimations by this model is not significantly worse than that of the overall best model (Table 3). The importance of this model is that all the necessary information can be collected with minimum effort in the field, for example, with the aid of a soil auger. There was no significant difference in mean RMSRs between any of the above-mentioned cases.

In summary, it was shown that using the regression tree technique, the estimation of Ks was possible using only qualitative texture- and structure-related soils information that generally resides in most national and regional soil survey databases, without losing estimation accuracy.

Identification of Tree Structures and Potentially Useful Input Attributes
Four of the models were selected for further discussion (underlined in Table 3). The model ranked 20, which used data from all available input variable groups, was used to identify the most important individual input fields among those that were available. The remaining models are: the model ranked 1 (that is, the best overall model and one that used only qualitative inputs), the model ranked 29 (the best model that used only structure-related inputs) and the model ranked 30, which is the model that used the most readily determined attributes (HOR and TXT) and still performed well.

As already indicated, the best performing pruned trees had around five terminal nodes. Therefore, we used five (or fewer) terminal nodes in the models that are discussed further. As the theoretical maximum number of terminal nodes at each level, m, is 2^m, five terminal nodes can be obtained using either three or four partition levels, depending on the position of the partitions. For this reason, we only discuss the branching of trees to a maximum of four levels.
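The bound on partition levels can be verified directly: a binary tree with m partition levels has at most 2^m terminal nodes, and a tree with five terminal nodes has four partitions, so it is at least three and at most four levels deep. A small sketch (the function names are ours):

```python
import math

def max_terminal_nodes(levels):
    """Upper bound on terminal nodes after `levels` rounds of binary partitioning."""
    return 2 ** levels

def depth_range(n_terminal):
    """Shallowest and deepest binary trees that have exactly n_terminal terminal nodes."""
    shallowest = math.ceil(math.log2(n_terminal))  # as balanced as possible
    deepest = n_terminal - 1                       # one split per level (a chain)
    return shallowest, deepest

print(max_terminal_nodes(2), max_terminal_nodes(3))  # 4 8
print(depth_range(5))  # (3, 4)
```

Two levels allow at most four terminal nodes, which is why five nodes need a third level, and a chain of four successive splits gives the four-level case.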

The probabilities of individual input fields appearing at a particular level of partitioning when using all possible inputs are shown in Table 4. The last column shows the probability of the particular input appearing at least once at this level in the tree model. The penultimate column shows the frequency with which input variables appeared at a particular level. As the same variable may appear more than once at the same level but on different branches, the frequency with which a variable partitions the data at a level can exceed 1.
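The two kinds of statistic described here, the frequency of a variable at a level and the probability that it appears at least once, can be tallied across replicates as in the sketch below. The tree representation and the replicate data are invented for illustration and do not reproduce Table 4.

```python
from collections import Counter

def level_statistics(replicates, level):
    """Summarize how often each variable partitions the data at a given level.

    replicates -- list of trees; each tree maps level -> list of variable names
                  used on the branches at that level (a variable may repeat)
    Returns {variable: (mean_count_per_tree, share_of_trees_with_at_least_one)}.
    """
    total = Counter()
    at_least_once = Counter()
    for tree in replicates:
        used = tree.get(level, [])
        total.update(used)
        at_least_once.update(set(used))
    n = len(replicates)
    return {v: (total[v] / n, at_least_once[v] / n) for v in total}

# Made-up replicates for illustration only.
trees = [
    {1: ["sand"], 2: ["PS7"], 3: ["OM", "OM"]},
    {1: ["sand"], 2: ["PS7"], 3: ["OM", "HOR"]},
    {1: ["PS7"], 2: ["sand"], 3: ["OM", "OM"]},
    {1: ["sand"], 2: ["HOR"], 3: ["OM", "OM"]},
]
stats = level_statistics(trees, 3)
print(stats["OM"])  # (1.75, 1.0)
```

In this toy case OM is used 1.75 times per tree at the third level, which shows how a per-level frequency can exceed 1 while the at-least-once probability cannot.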



 
Table 4. Input variables, partitioning values, and their probabilities in each partitioning level for the full regression tree model using all available input variables.

 
In 94 out of 100 replicates, measured sand content was the first partitioning factor, with a partition value around 80% (which includes all sand-textured soils and about half of the loamy sand soils), and appears to indicate those soils that are likely to have the greatest hydraulic conductivity. The other primary partitioning variable, in 6% of the cases, was PS7 (ped sizes >100 mm). In almost 80% of cases, these soil horizons had a massive structure, implying few macropores and low hydraulic conductivities. Those samples classified as coarse textured (sand content >80%) or representing the largest ped size class were not further partitioned in any of the replicates.

In 85% of cases, partitioning of all other samples at the second level was based on the largest ped size class (PS7). In all such cases, the primary partitioning variable was sand content. Topsoil or subsoil distinction (HOR) and sand content were other important variables at this level, with 8 and 6% probabilities, respectively.

Organic matter content, Db, and HOR appeared most frequently as partitioning variables at the tertiary level, with frequencies of 49, 21, and 14%, respectively. These properties appear to be closely related to each other. Topsoil would typically have greater OM content and lesser Db than subsoil. Greater biological activity by both plants and animals will generally lead to the development of more stable aggregates and smaller peds in topsoil and, therefore, to larger Ks values. In a total of 15% of all cases, the largest ped size class (PS7), silt and clay content, and cracking in both directions (BOTH) appeared as partitioning factors at this tertiary level. The dominance of a single variable diminished at the fourth partitioning level, with 13 variables involved in the partitioning of the data set. Variables that appear infrequently at a higher level typically become very frequent partitioning variables at the next level.

In summary, coarse-textured horizons and those with the largest ped sizes appear to be the most important factors in delineating groups of soils with similar Ks. The variability of Ks within the rest of the soil horizons appears to be so large that it is more difficult to isolate different groups from each other, as shown by the large variety of partitioning variables at lower levels. A second indication of large within-group variability is that fewer input variables are sufficient to provide the same estimation accuracy by the tree models (Table 3). Such models are discussed next.

Tree Structures and Potentially Useful Input Attributes Derived Using Fewer Input Variables
The best model used only HOR, PED, and TXT as inputs. Unsurprisingly, the most important partitioning variables in that model were identical to, or representative of, those in the model that used all six input variable groups. Measured sand content did not occur in this model, but the primary partitioning variable was the sand texture class in 77% of all replicates. The largest ped size class (PS7, peds >100 mm) was the primary partitioning variable in the remaining 23% of cases. These two variables were the most important secondary partitioning variables when the other was the primary partitioning variable. Topsoil or subsoil distinction (HOR) was the third most frequent variable that appeared at the secondary level (6% of cases). There was only one partition at the secondary level in all replicates, as the coarse (sand) textured soils and soils with the largest peds that were separated at the first level were not partitioned any further (Table 5). The horizon being topsoil or subsoil was by far the most important tertiary partitioning factor, appearing in 87% of models and providing 92% of all variables that appeared at this level. Three different texture classes, mostly delineating fine-textured soils, and four different ped size classes appeared at this level at a probability of 6% or less. Different texture classes as well as ped size classes appeared at the fourth level. The most dominant partitioning variables at this level were fine texture classes and the smallest ped size classes. The most likely tree model to appear, using these input groups with five terminal nodes, is similar to that shown in Fig. 1.



 
Table 5. Input variables and their probabilities in each partitioning level for the tree model using topsoil or subsoil distinction (HOR), textural class, and ped size classification.

 


 
Fig. 1. Tree structure of the most likely tree model using topsoil or subsoil distinction, texture, and ped size classification.

 
Although lab-based determination of soil texture is often more accurate, information on texture class appears more frequently in existing soil databases than information on detailed particle size distribution. This model provides estimates of Ks (Table 3) similar to models that require more sophisticated input variables or a more complex tree structure.

Two more models of interest are the models that use (i) HOR and PED only and (ii) HOR and TXT only. The first of these two models was the second best tree model in our study that used only two variable groups as input (Table 3). Those two groups consisted of only qualitative variables that could be determined directly in the field. This model was, on average, less accurate than the full model that used all groups of input variables, but the difference was not significant. Table 6 and Fig. 2 show the list, rank, and importance of input variables for this model. Three variables dominated the first two levels: the smallest and largest ped size classes (PS1 and PS7, respectively) and HOR. While we have seen HOR and PS7 as important partitioning variables in the models discussed above, PS1 has not appeared at the primary or secondary levels before. The largest ped-size class (PS7) was the main partitioning variable in all but one case, and the smallest ped size class was the dominant partitioning variable at the secondary level (98%). For all non-PS7 and -PS1 horizons, HOR became the dominant tertiary partition (97%). The variables PS1 and PS7 did not induce any further partitions in the data. All but two ped size classes (PS1 and PS7) appeared at the fourth level, as those two had already appeared at a higher level in all replicates.



 
Table 6. Input variables and their probabilities in each partitioning level for the tree model using topsoil or subsoil distinction (HOR) and ped size classification.

 


 
Fig. 2. Tree structure of the most likely tree model using topsoil or subsoil distinction and ped size classification.

 
It appears that PS1 (ped size 1–2 mm) simply replaced the sand texture class or sand content in the models where neither of those variables was used, as this ped size range is similar to the size range of coarse sand. This may, in part, be an artifact of the initial classification, in which apedal soil horizons described as single grain were allocated a ped size class of PS1. This grouping, however, also includes finer textured but aggregated soils.

Another model that performed well using only two input groups has similar estimation accuracy to that above. Its significance is that it used only HOR and TXT as input. Table 7 shows the list, ranking, and importance of input variables for this model and Fig. 3 shows the most likely tree model to appear using these inputs. The top two levels of partitioning occur due to differences in Ks among the sand texture class (first level) and the HOR variables (second level). In this case, however, no other variable appeared at the first or second levels in any of the replicates. At the third level, topsoil and subsoil were further partitioned in all replicates, meaning that in all cases three levels were enough to obtain five terminal nodes on the tree. Most frequently, silty clay texture was the partitioning variable for subsoil (89%) and clay loam or clay for topsoil (41 and 28% respectively), meaning that finer textured soils were separated from all others.



 
Table 7. Input variables and their probabilities in each partitioning level for the tree model using topsoil or subsoil distinction (HOR) and textural classification only.

 


 
Fig. 3. Tree structure, mean estimates, and their standard deviations for the most likely tree model using topsoil or subsoil distinction and textural classification.

 
So far, this study has concentrated on determining those variables that would be most useful in deriving PTFs. After the most probable tree structure and the most probable inputs are determined, Ks can be estimated along with its uncertainty. To illustrate this approach, the model based on the simplest and most readily available data (HOR and TXT variables only) was used (Table 7, Fig. 3). Once again, 100 replicates were derived, but the same model structure was fixed in each of the 100 cases using two of the most likely model structures. The log10(Ks) was estimated using 411 samples to develop the regression tree model and node response values and tested against measured Ks values from 91 soil horizons. The values shown in Fig. 3 reflect the outcome of such estimations. We only developed models that used silty clay (ZC) as the tertiary partition for non-sand subsoils, and clay loam or clay (CL or C, respectively), the two most frequent partitions, for non-sand topsoils.
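As an illustration of the procedure just described, the sketch below fixes the five-node HOR and TXT tree and recomputes the node means of log10(Ks) over repeated development subsets. It is a sketch under stated assumptions rather than the procedure used in the study: the texture code "S" for sand, the draw without replacement, the function names, and the small data set are all hypothetical.

```python
import random
import statistics


def node_for(texture, horizon, topsoil_split="C"):
    """Assign a horizon to one of the five terminal nodes of the fixed tree.

    Level 1: sand vs. non-sand; level 2: topsoil vs. subsoil;
    level 3: ZC for subsoils, and C (or CL) for topsoils.
    """
    if texture == "S":  # "S" as the sand class code is an assumption
        return "sand"
    if horizon == "subsoil":
        return "subsoil ZC" if texture == "ZC" else "subsoil other"
    if texture == topsoil_split:
        return f"topsoil {topsoil_split}"
    return "topsoil other"


def node_estimates(samples, n_dev, n_rep=100, seed=0):
    """Mean and SD, across replicates, of each node's mean log10(Ks)."""
    rng = random.Random(seed)
    per_node = {}
    for _ in range(n_rep):
        # Development subset for this replicate (drawn without replacement).
        dev = rng.sample(samples, n_dev)
        grouped = {}
        for texture, horizon, log_ks in dev:
            grouped.setdefault(node_for(texture, horizon), []).append(log_ks)
        for node, values in grouped.items():
            per_node.setdefault(node, []).append(statistics.mean(values))
    return {
        node: (statistics.mean(means), statistics.pstdev(means))
        for node, means in per_node.items()
    }


# Hypothetical (texture, horizon, log10 Ks) records, not the study's data.
samples = [
    ("S", "topsoil", 2.1), ("S", "subsoil", 1.9), ("S", "subsoil", 2.3),
    ("ZC", "subsoil", 0.2), ("ZC", "subsoil", 0.4),
    ("CL", "subsoil", 0.9), ("SL", "subsoil", 1.1),
    ("C", "topsoil", 1.8), ("C", "topsoil", 1.6),
    ("CL", "topsoil", 1.2), ("SL", "topsoil", 1.4), ("ZL", "topsoil", 1.0),
]
estimates = node_estimates(samples, n_dev=9, n_rep=100, seed=1)
```

Passing `topsoil_split="CL"` to `node_for` would give the alternative structure in which clay loam, rather than clay, is the tertiary partition for non-sand topsoils.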

As expected, topsoils had greater Ks values than subsoils. Typically, coarser-textured soils had larger Ks values, with the exception of clay vs. nonclay topsoils. The mean Ks value for the clay topsoil node is relatively large and significantly greater than for any other non-sand group in the same tree. While this appears unexpected based solely on the texture of those soils, it can be explained by considering the soil structure. Most clay-textured soils in this data set displayed both horizontal and vertical cracking (67%) and 75% had aggregates between 5 and 50 mm, allowing water to flow through the many structural cracks. Average RMSRs for the test data set were slightly smaller (0.9711 with clay loam and 0.9702 with clay as the topsoil partition) than the corresponding values shown in Table 3. This may be because only the most dominant model structures were used; however, the difference was not significant.
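For readers who wish to reproduce a comparison of this kind, a minimal sketch of the accuracy measure follows. It assumes that RMSR denotes the root mean squared residual between measured and estimated log10(Ks) with a divisor of n; the function name and the example values are hypothetical.

```python
import math


def rmsr(measured_log_ks, estimated_log_ks):
    """Root mean squared residual between measured and estimated log10(Ks)."""
    if len(measured_log_ks) != len(estimated_log_ks) or not measured_log_ks:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(m - e) ** 2 for m, e in zip(measured_log_ks, estimated_log_ks)]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical test horizons: each estimate would be the mean log10(Ks)
# of the terminal node to which the horizon is assigned.
measured = [1.0, 2.0]
estimated = [2.0, 3.0]
error = rmsr(measured, estimated)
```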


    CONCLUSIONS
 
The aim of this study was to identify which, if any, of the more qualitative soil properties that are routinely determined as part of soil profile characterization, such as soil structure assessments, can be used to either improve pedotransfer predictions of Ks or to derive PTFs in data-poor environments. Generally, there is considerably more information available in national and regional soil survey databases on categorical attributes such as texture and structure than on laboratory-measured properties.

We used regression trees to explore the potential for developing class PTFs to predict Ks. While the ability of PTFs to accurately estimate soil properties is often closely linked to the underlying data, the resampling technique used in this study enabled us to simulate the likely outcome had a number of different data sets been used, making the results more widely applicable.

The most accurate regression tree model used only horizon designation, ped size, and soil texture class as input variables, all of which require no laboratory measurement and are widely available in soil morphological databases. Interestingly, in the majority of cases, the major partition associated with these input variables was both textural (sand texture) and structural (massive structures or peds >100 mm), which separated those soils with fast conductivities in coarse textures from those with slower conductivities with dominantly macropore flow through few widely spaced pores.

Although there is large heterogeneity observed in Ks values even among soils with similar properties, it was found that five terminal nodes was the optimum number of partitions for this data set. Although further partitions were possible, the development of larger trees did not appear to improve the predictive capability of this technique. Instead, further partitioning may simply uncover data structures within the development data set that do not exist within any other test or application data sets, leading to increasing estimation errors. This may also highlight a limitation in using the fairly crude estimator of ped size. Estimates of biopore volume, for example, may help in the development of more robust partitioning. Also, this approach may benefit from the use of novel techniques and collection of new types of information related to pore structure. Until such new information becomes available, however, the performance of this modeling approach suggests that existing qualitative or structure-related input data can be successfully used to estimate Ks.

Despite the relatively few homogeneous groups identified by the regression tree approach, there is a considerable reduction in the uncertainty of the estimations within each group. While using the data set mean would have an uncertainty of more than an order of magnitude for this data set, the uncertainty in the estimates for the tree nodes was reduced to less than 0.2 of an order of magnitude, and in most cases under 0.1. The practical benefit of developing pedotransfer functions to predict soil hydraulic conductivity is in providing input data for soil water and solute transport modeling. Modeling studies increasingly focus on estimating the uncertainty of the results along with the main output; therefore, large uncertainty in Ks would propagate in the simulation models to yield final outputs that have undesirably wide confidence intervals. Our models help reduce uncertainty, and they do so by using inexpensive, widely available, and often undervalued soil information.
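To put these figures on the original Ks scale, the short sketch below converts an uncertainty expressed in log10 units into a multiplicative factor. It assumes the uncertainties are symmetric intervals on log10(Ks); the function name is hypothetical.

```python
def spread_factor(log10_uncertainty):
    """Multiplicative factor in Ks implied by an uncertainty in log10(Ks)."""
    return 10.0 ** log10_uncertainty


for u in (0.1, 0.2, 1.0):
    # 0.1 -> 1.26, 0.2 -> 1.58, 1.0 -> 10.00
    print(f"+/-{u} log10 units -> Ks within a factor of {spread_factor(u):.2f}")
```

Under this reading, an uncertainty of 0.1 of an order of magnitude corresponds to Ks being known to within a factor of about 1.26, compared with a factor of 10 for a full order of magnitude.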

Overall, we have shown that simple, routinely collected qualitative soil attributes such as texture and structure that are often underutilized can be used to estimate saturated hydraulic conductivity and thereby support simulation modeling.


    ACKNOWLEDGMENTS
 
A. Lilly would like to acknowledge the financial support of the Scottish Executive Environment and Rural Affairs Department and thank the USDA-ARS for kindly providing support and financial assistance for a sabbatical visit at the inception stage of this work. During completion of this study, A. Nemes was affiliated with the University of California, Riverside, and the USDA-ARS Hydrology and Remote Sensing Laboratory, in Beltsville, MD.


    NOTES
 
All rights reserved. No part of this periodical may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopying, recording, or any information storage and retrieval system, without permission in writing from the publisher. Permission for printing and for reprinting the material contained herein has been obtained by the publisher.

Received for publication November 11, 2006.


