Journal of Central South University (English Edition)

J. Cent. South Univ. (2016) 23: 1040-1051

DOI: 10.1007/s11771-016-0353-z

Application of multivariate statistical techniques in assessment of surface water quality in Second Songhua River basin, China

ZHENG Li-yan(郑力燕), YU Hong-bing(于宏兵), WANG Qi-shan(王启山)

College of Environmental Science and Engineering, Nankai University, Tianjin 300071, China

Central South University Press and Springer-Verlag Berlin Heidelberg 2016

Abstract:

Multivariate statistical techniques, namely cluster analysis (CA), discriminant analysis (DA), principal component analysis (PCA) and factor analysis (FA), were applied to evaluate and interpret the surface water quality data sets of the Second Songhua River (SSHR) basin in China, obtained during two years (2012-2013) of monitoring of 10 physicochemical parameters at 15 different sites. The results showed that most physicochemical parameters varied significantly among the sampling sites. Hierarchical agglomerative CA grouped the sampling sites into three significant groups, highly polluted (HP), moderately polluted (MP) and less polluted (LP), on the basis of similarity of water quality characteristics. DA identified pH, F, DO, NH3-N, COD and VPhs as the most important parameters contributing to spatial variations of surface water quality. However, DA did not give a considerable data reduction (40% reduction). PCA/FA resulted in three, three and four latent factors explaining 70%, 62% and 71% of the total variance in the water quality data sets of the HP, MP and LP regions, respectively. FA revealed that the SSHR water chemistry was strongly affected by anthropogenic activities (point sources: industrial effluents and wastewater treatment plants; non-point sources: domestic sewage, livestock operations and agricultural activities) and natural processes (seasonal effects and natural inputs). PCA/FA for the whole basin showed the best results for data reduction because it used only two parameters (about 80% reduction) as the most important parameters to explain 72% of the data variation. Thus, this work illustrates the utility of multivariate statistical techniques for the analysis and interpretation of data sets, the identification of pollution sources/factors and the understanding of spatial variations in water quality for effective stream water quality management.

Key words:

Second Songhua River basin; water quality; multivariate statistical techniques; cluster analysis; discriminant analysis; principal component analysis; factor analysis

1 Introduction

Rivers and their catchments have been utilized by mankind for thousands of years to the extent that few of them are now in their natural condition [1]. Changes in water quality can be potentially catastrophic for aquatic ecosystems as species are threatened by conditions which are no longer suitable for their survival. These changes in water quality also pose threats to humans through changes in water utilized for recreation, fishing and industry [2]. It is imperative to have reliable information on water quality for effective and efficient water management [3].

The development of a surface water monitoring network is one of the critical efforts needed to restore river water quality and prevent further pollution [4]. Reductions in water quality can be caused by increases in the concentration of pollutants or contaminants such as oil, heavy metals and organic compounds [5], increases in turbidity [6], and changes in dissolved oxygen [7], all of which have implications for the well-being of aquatic ecosystems and humans. Therefore, the effective, long-term management of rivers requires monitoring of a wide range of physical, chemical and biological parameters at many monitoring stations. This results in huge and complex data sets, which are often difficult to interpret and from which it is hard to draw meaningful conclusions [8]. Furthermore, river water quality assessment frequently has to determine whether a variation in the concentration of measured parameters should be attributed to anthropogenic activities or to natural changes. It should also determine which parameters are the most significant in describing such spatial variations, the pollution sources, etc.

The usual practice in water quality assessment is to compare measured physicochemical parameters with threshold values recommended by national or international bodies. However, this method does not provide evidence on the pollution sources. Another approach, which can deal with multidimensional data and enables an overall evaluation of the water quality and ecological status of the study region, is the use of multivariate statistical techniques [9]. These techniques, such as cluster analysis (CA), discriminant analysis (DA), principal component analysis (PCA) and factor analysis (FA), also permit identification of the possible factors/sources that are responsible for variations in water quality and that influence the water system, and they support source apportionment. They thus offer a valuable tool for developing appropriate strategies for effective management of water resources [10-11].

In recent years, many studies using these methods have been carried out. Most commonly, CA is applied to discrete samples, relying on their distribution throughout a study region to infer spatial patterns based upon multiple water quality parameters [9, 12-17]. DA is applied to determine the variables responsible for the variation in water quality in different seasons and at different locations throughout freshwater river systems [13]. However, studies applying DA to water quality are limited and less common than those applying CA. PCA, which includes FA, is one of the most powerful and common techniques used for reducing the dimensionality of large sets of data with minimum loss of information. FA helps in identifying the possible factors/sources that influence water quality [18]. Thus, PCA/FA appears to be more widely used in water quality research than either CA or DA [13, 19].

The Second Songhua River (SSHR) basin is the industrial and agricultural base of Jilin Province, China. It is well known for the spill of an estimated 100 t of toxic substances, a mixture of benzene, aniline and nitrobenzene, from Jilin Chemical Company into the river in November 2005. Many monitoring sections have been established around big cities in recent years, and a huge monitoring database, covering organic properties, physical and chemical properties, nutrients, inorganic constituents and heavy metals, has been put in place through these programs [20]. Despite this, there is a lack of knowledge and understanding regarding the water quality of the SSHR and its variation among locations. Therefore, this work attempts to facilitate the understanding of system behavior with respect to water quality issues and management within the basin.

In this work, different multivariate statistical techniques were performed for the large data sets obtained during two years (2012-2013) of monitoring of 10 physicochemical parameters at 15 different sites in the SSHR basin, aiming to (1) examine the spatial variations of river water quality parameters and identify several zones with different water quality, (2) determine the discriminant variables that are most important in assessing and monitoring water quality, and (3) identify factors/sources influencing the chemistry of the river water. The overall aim of the present work is to provide useful information for water resources management at the watershed scale and help managers understand the main sources of pollution in the SSHR basin.

2 Materials and methods

2.1 Study area

The Second Songhua River (SSHR) basin is located in the far northeast of China (124°36′ to 128°50′E and 41°44′ to 45°24′N) and is a major sub-basin of the Songhua River basin (Fig. 1). The SSHR originates from Tianchi Lake in the Changbai Mountain and travels 795 km from southeast to northwest. It drains a catchment area of about 78182 km², with an annual mean discharge of 14.8 km³.

The topographic relief of the basin is high in the east and low in the west [21]. The precipitation is mainly concentrated in the months from June to September, with an annual total of about 668 mm. The monthly average temperature fluctuates from -19 °C in January to 22 °C in July, with an annual average of about 2.5 °C.

Generally speaking, the upper reaches of the SSHR basin are mountainous, with rich forest resources and large water conservancy projects; the middle and lower reaches of the basin are morphologically dominated by hills and plains, with rich mineral resources, fertile soil, a dense population and developed industry and agriculture (Fig. 2). The SSHR receives pollution loads from both point and non-point sources. It receives agricultural run-off from its vast catchment area, directly or through its tributaries and wastewater drains. In particular, Jilin City, an important petrochemical city of China, is located on the middle reach of the Second Songhua River. Historically, raw or primary-treated industrial effluent from Jilin City was usually discharged into the river, resulting in serious mercury pollution of the river from the 1960s to the 1970s [22]. The SSHR has been identified as one of the most polluted rivers in the Songhua River basin.

2.2 Data

Based on the hydrologic features of the river and on our previous results, 15 sampling sites located in the SSHR basin (Fig. 1) and 10 physicochemical parameters (water temperature (T), pH, fluoride (F), dissolved oxygen (DO), ammonia nitrogen (NH3-N), chemical oxygen demand (COD), permanganate index (CODMn), 5-day biological oxygen demand (BOD5), volatile phenols (VPhs), and Escherichia coli (E. coli)) measured at each site were used for analysis. The sampling sites were selected according to the national and provincial monitoring network of the SSHR, whose selection criteria were related to the geomorphology of the main channel, the hydrological regime, the location of the urban and industrial discharges, and the land-use types of the riverside. The selected sites were sampled every month for two years (2012-2013).

Fig. 1 Map of study area and surface water quality sampling sites in SSHR basin, China

Fig. 2 Land use and land cover in SSHR basin in 2007

2.3 Data treatment

The Kolmogorov-Smirnov (K-S) statistic was used to test the goodness-of-fit of the data to the log-normal distribution, because most multivariate statistical methods require parameters to conform to the log-normal distribution. Kaiser-Meyer-Olkin (KMO) and Bartlett’s sphericity tests were performed to examine the suitability of the data for PCA/FA [16]. CA, PCA and FA were applied to experimental data standardized through z-scale transformation (mean=0, variance=1) in order to avoid misclassification due to wide differences in data dimensionality [9, 13, 23].
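The z-scale transformation can be illustrated with a minimal Python sketch. The function name `z_scale` and the COD readings are hypothetical and are not taken from the monitoring data; the population standard deviation is assumed, since the text does not state which form was used.

```python
from statistics import mean, pstdev


def z_scale(values):
    """Standardize one parameter to mean 0 and variance 1 (z-scores)."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation (assumption)
    if sigma == 0:
        raise ValueError("a constant column cannot be standardized")
    return [(v - mu) / sigma for v in values]


# Hypothetical COD readings (mg/L) at five sites; not data from this study.
cod = [12.0, 18.0, 25.0, 40.0, 55.0]
cod_z = z_scale(cod)
print([round(z, 3) for z in cod_z])
```

After the transformation every parameter carries the same weight in a distance calculation, regardless of its original unit or magnitude.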

2.4 Cluster analysis

Cluster analysis (CA) groups objects (cases) into classes (clusters/groups) so that objects within a class are similar to each other but different from those in other classes [2, 12]. Hierarchical agglomerative cluster analysis (HACA) is the most general approach, which starts with the most similar pair of objects and forms higher clusters step by step. The similarity between two samples is usually given by the Euclidean distance, and a “distance” can be represented by the “difference” between the analytical values of the two samples. The process of forming and joining clusters is repeated until a single cluster containing all samples is obtained, and the result can be displayed as a dendrogram or tree diagram [24]. The dendrogram provides a visual summary of the clustering process, presenting a picture of the groups and their proximity with a dramatic reduction in the dimensionality of the original data [12].

In this work, HACA is performed on the normalized data set using Ward’s method as the agglomeration technique and the squared Euclidean distance as the measure of similarity. Ward’s method uses an analysis of variance approach to evaluate the distances between clusters, attempting to minimize the sum of squares of any two clusters that can be formed at each step. In general, Ward’s method yields the most meaningful clusters and has proved to be an extremely powerful grouping mechanism. The linkage distance is reported as Dlink/Dmax, the quotient of the linkage distance for a particular case and the maximal linkage distance. The quotient is usually multiplied by 100 as a way to standardize the linkage distance represented on the y-axis [10, 13, 16].
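The agglomeration procedure and the (Dlink/Dmax)×100 rescaling can be sketched in plain Python as follows. This is a naive illustration under stated assumptions, not the software used in the study: the merge cost is taken as the increase in within-cluster sum of squares, statistical packages may report a differently scaled linkage distance, and the four standardized "sites" are hypothetical.

```python
def centroid(points):
    """Mean vector of a list of equally long points."""
    n = len(points)
    return [sum(p[k] for p in points) / n for k in range(len(points[0]))]


def ward_cost(a, b):
    """Increase in within-cluster sum of squares if clusters a and b merge."""
    ca, cb = centroid(a), centroid(b)
    sq_dist = sum((x - y) ** 2 for x, y in zip(ca, cb))  # squared Euclidean
    return len(a) * len(b) / (len(a) + len(b)) * sq_dist


def ward_linkage(data):
    """Merge the cheapest pair until one cluster remains; return the history."""
    clusters = {i: [i] for i in range(len(data))}
    history = []
    next_id = len(data)
    while len(clusters) > 1:
        ids = sorted(clusters)
        best = None
        for i, a in enumerate(ids):
            for b in ids[i + 1:]:
                cost = ward_cost([data[m] for m in clusters[a]],
                                 [data[m] for m in clusters[b]])
                if best is None or cost < best[0]:
                    best = (cost, a, b)
        cost, a, b = best
        history.append((sorted(clusters[a]), sorted(clusters[b]), cost))
        clusters[next_id] = clusters.pop(a) + clusters.pop(b)
        next_id += 1
    return history


def rescale(history):
    """Express each linkage distance as (Dlink / Dmax) * 100."""
    d_max = max(cost for _, _, cost in history)
    return [cost / d_max * 100 for _, _, cost in history]


# Hypothetical standardized values (two parameters) for four sites.
sites = [[-1.0, -1.1], [-0.9, -1.0], [1.0, 0.8], [1.2, 1.0]]
history = ward_linkage(sites)
scaled = rescale(history)
for (left, right, _), d in zip(history, scaled):
    print(left, right, round(d, 2))
```

Cutting the resulting tree at a chosen value of (Dlink/Dmax)×100 gives the site groups; the cut level itself remains a judgment call.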

2.5 Discriminant analysis

Discriminant analysis (DA) can address a number of research questions including, but not limited to, determining whether statistically significant differences exist between two or more known groups, determining which independent variables account for the majority of the differences between groups, and establishing procedures for classifying objects into groups [2, 13]. DA builds up a discriminant function (DF) for each group. The DF can be calculated via a simultaneous method, where all independent variables are considered at once, or a stepwise method, where variables are considered sequentially. The simultaneous method is used when the analyst wants to include all the independent variables and is not interested in intermediate results to determine the most discriminating variables. The stepwise method can be either forward, starting with the best discriminating variable and adding variables until all those that prove useful in discriminating between groups are chosen, or backward, whereby variables are eliminated starting with the least significant. The accuracy of the DF may be assessed using independent data, such as a subset of the original data, or a new data set (different from that used for function construction) [2, 25].

In this work, the sampling sites were grouped for spatial evaluation. DA was applied to the raw data using the standard, forward stepwise and backward stepwise modes to construct DFs for evaluating the spatial variations in river water quality. The sites were the grouping variables, while all the measured parameters constituted the independent variables.

2.6 Principal component analysis/factor analysis

PCA/FA can be used to analyze interrelationships among a large number of variables and to explain these variables in terms of their common underlying dimensions by providing empirical estimates of the structure of the variables with minimum loss of information [2, 9, 23]. PCA starts with the covariance matrix describing the dispersion of the original variables (measured parameters) and extracts its eigenvalues and eigenvectors. An eigenvector is a list of coefficients (loadings or weightings) by which we multiply the original correlated variables to obtain new uncorrelated (orthogonal) variables, called principal components (PCs), which are weighted linear combinations of the original variables. There are as many PCs as original variables; however, the first few PCs provide information on the most meaningful parameters, which describe the whole data set and afford data reduction with minimal loss of information [9, 23].
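The eigenvalue extraction described above can be sketched for the first principal component using power iteration in plain Python. This is only an illustration: the correlation matrix is used because it equals the covariance matrix of standardized variables, power iteration is a substitute for the full eigendecomposition a statistics package would perform, and the three parameter columns are hypothetical.

```python
from math import sqrt


def correlation_matrix(columns):
    """Pearson correlation matrix for a list of equally long columns."""
    n = len(columns[0])
    unit = []
    for col in columns:
        mu = sum(col) / n
        dev = [v - mu for v in col]
        norm = sqrt(sum(d * d for d in dev))
        unit.append([d / norm for d in dev])
    return [[sum(a * b for a, b in zip(ci, cj)) for cj in unit] for ci in unit]


def first_principal_component(matrix, iterations=200):
    """Power iteration: dominant eigenvalue and its unit eigenvector."""
    # Start vector with unequal entries; a start exactly orthogonal to the
    # dominant eigenvector would fail, which this sketch does not guard against.
    vec = [1.0 / (k + 1) for k in range(len(matrix))]
    for _ in range(iterations):
        prod = [sum(m * v for m, v in zip(row, vec)) for row in matrix]
        norm = sqrt(sum(p * p for p in prod))
        vec = [p / norm for p in prod]
    eigenvalue = sum(v * sum(m * w for m, w in zip(row, vec))
                     for v, row in zip(vec, matrix))
    return eigenvalue, vec


# Hypothetical readings at five sites; not data from this study.
cod = [10.0, 20.0, 30.0, 40.0, 50.0]
bod5 = [2.0, 4.0, 5.0, 8.0, 11.0]
do = [9.0, 8.0, 6.0, 5.0, 3.0]

corr = correlation_matrix([cod, bod5, do])
eigenvalue, weights = first_principal_component(corr)
share = 100 * eigenvalue / len(corr)  # % of total variance on the first PC
print(round(eigenvalue, 3), [round(w, 3) for w in weights], round(share, 1))
```

For standardized variables the eigenvalues sum to the number of variables, so dividing an eigenvalue by that number gives the share of total variance carried by the corresponding PC.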

In practice, FA follows PCA. The main purpose of FA is to reduce the contribution of less significant variables in order to simplify even further the data structure coming from PCA. This purpose can be achieved by rotating the axes defined by PCA, according to well-established rules, and constructing new groups of variables, also called varifactors (VFs). It should be noted that a PC is a linear combination of observable water quality variables, while a VF can include unobservable, hypothetical, “latent” variables [9, 23]. In this work, we performed a varimax rotation (raw) of the PCs coming from the original standardized variables, in order to reduce the contribution of variables with minor significance.

3 Results and discussion

3.1 Descriptive statistics

Table 1 briefly summarizes the maximum, minimum and mean values and the standard deviation of the 10 measured parameters in the river water samples from the 15 sampling sites in the SSHR basin. The coefficient of variation (CV) showed that most of the parameters (E. coli, COD, BOD5, NH3-N, T, CODMn and DO) fluctuated significantly along the SSHR basin. This indicates variability in chemical composition between samples, which points to the presence of spatial variations likely caused by polluting sources throughout the study region.
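As a brief illustration of the CV used above, the sketch below computes it as the standard deviation divided by the mean, expressed as a percentage. The sample standard deviation is assumed, since the text does not state which form was used, and the readings are hypothetical.

```python
from statistics import mean, stdev


def coefficient_of_variation(values):
    """CV in percent: sample standard deviation divided by the mean."""
    return 100 * stdev(values) / mean(values)


# Hypothetical NH3-N readings (mg/L); not data from this study.
nh3_n = [0.2, 0.4, 0.4, 0.4, 0.5, 0.5, 0.7, 0.9]
print(round(coefficient_of_variation(nh3_n), 1))
```

Because the CV is dimensionless, it allows the variability of parameters measured in different units to be compared directly.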

Recommended guide levels of these variables under the Environmental Quality Standards for Surface Water (EQSSW, GB 3838-2002) (National Environmental Protection Agency of China, 2002) and the maximum permissible limits recommended by the World Health Organization (WHO, 2004) are included in Table 1. It must be emphasized that the average concentrations of some variables, such as DO, NH3-N, COD, CODMn, BOD5 and E. coli, are 1-3 times the Grade III standard and the international (WHO, 2004) standards; therefore, this water resource is not adequate for human consumption or industrial purposes and needs to be purified.

Significant variation in water temperature was observed in all sites. S1 with the lowest water temperature was located in mountain areas where vegetation coverage is high and population density is low. The highest value of T is recorded at sampling sites S13 and S14. Besides, water temperature also shows a very characteristic annual cycle, with higher average values during the summer (24.23 °C), and lower average values in the winter season (7.67 °C). Water temperature influences the chemical and biological activity besides growth of aquatic organisms.

The pH values of the collected water samples vary significantly among the sampling sites, ranging from 6.80 to 11.26, and the highest pH is found at S1. A pH value between 6.0 and 8.5 indicates the productive nature of a water body. The high pH value at S1 can be ascribed to the discharge of volcanic lava, which puts large amounts of alkaline matter into the river system.

Variation in F is significant within and between the sampling sites. Higher values of F are recorded at S13, S15, S14 and S8 than at the other sampling sites, mainly due to the large inflow of industrial effluents. However, the high concentration of F at S2 may result from natural inputs, such as volcanic ash.

Noticeable depletion of DO and alarming level of COD concentrations are recorded in this work, indicative of potential ecological and environmental risk. The sharp decline in DO may have resulted from introduction of organic matter in the water which consumes oxygen during decomposition [26]. The lower concentrations of DO are recorded in the plain streams. The lowest DO level is found in S14, and the DO values in S13 and S15 are also significantly lower than those in other sites. Such low DO values may be caused by discharge of domestic and industrial wastes into the river due to dense population and intensive industries in the area [1]. BOD5 and COD have similar distribution with the high values in plain streams (such as the sampling sites of S13, S14, and S15) corresponding to a high density of residents in cities and towns. This suggests that the discharge of industry and domestic wastewater induces serious organic pollution in these sites. Moreover, extremely low DO and high COD contents usually indicate the degradation of an aquatic system [27].

NH3-N shows significant variation among the sampling sites. The sampling sites in the downstream reaches of the SSHR mainstream and its major tributaries (Yinma River, Yitong River and Huifa River) show higher NH3-N contents than the others. The concentrations of NH3-N tend to increase in areas with a higher density of urban and agricultural land uses, indicating that these high levels may be ascribed to the input of wastewater in industrial areas and the application of nitrogen fertilizers in arable areas [11]. NH3-N is much lower in mountainous and vegetated areas than in urban land use areas.

Table 1 Descriptive statistics of water quality in SSHR basin, China

VPhs exhibit significant spatial differences in the basin during the study periods. The highest value of VPhs is found in S15, while S9 and S10 show relatively higher VPhs contents than the others. This is attributed to the high degree of industrial effluents.

E. coli level in sampling sites varies significantly. The highest values of E. coli are found in S13 and S8, while S14, S15, S11, S6 and S7 show relatively higher E. coli contents than the others. E. coli is used to indicate fecal pollution. The fecal pollution mainly represents the contribution of non-point sources, such as wastewater from aquaculture and animal husbandry in some livestock farms.

3.2 Spatial similarity and site grouping

HACA was performed on the water quality data sets to evaluate spatial variation among the sampling sites. It yielded a dendrogram (Fig. 3), grouping all 15 sampling sites of the basin into three statistically significant clusters. Since we used HACA, the number of clusters was also decided by practicality of the results as there is ample information (land use, location of wastewater treatment plants, etc.) available at the study sites.

Cluster A is formed by the sampling sites S1, S3, S9, S10 and S12 and corresponds to relatively less polluted (LP) sites. In cluster A, site S1 is situated in the upstream mountain area of the Second Songhua River. This area is characterized by high vegetation coverage, low population density, scarce agricultural and industrial activities, and a low volume of sewage discharge (Fig. 2). Thus, the water quality there is very good. Three sites (S3, S9 and S12) are located in the drinking water source protection region, where water of particularly high quality is needed for drinking water supplies.

Fig. 3 Dendrogram based on agglomerative hierarchical clustering (Ward’s method)

Cluster B is formed by the sampling sites S13, S14 and S15 and corresponds to highly polluted (HP) sites. All of these sites are situated in the middle and downstream area of the Yitong River (a tributary of the SSHR). The Yitong River is a seasonal river with a small annual mean discharge of 540000000 m³, which is associated with a low dilution capacity and a less active biological, chemical and physical purification capacity of the water body. S13 is situated in the Changchun City region. Changchun, the capital of Jilin Province, is nearly the most developed region in the SSHR basin, with a population of about 7.67 million (Jilin Statistical Yearbook 2010). There are many chemical, steel and other heavy industries. Although a large amount of wastewater and sewage is treated in sewage plants, the effluent flows directly into the stream without any buffering. S14 in Nongan County and S15 in Yitong County are located outside Changchun City, on the downstream reach of the Yitong River. As the survey shows, aggregated settlements and farmlands appear on both sides of the river. This reach is a major agricultural economic zone with intensive agricultural activity. There are many ditch networks for irrigation and agricultural runoff. In most cases, the sources and concentrations of non-point source pollutants are the result of land use interactions with the transport system. These areas are also marked by forest loss and frequent mining activities, resulting in channel habitat damage and riparian ecological degradation. Therefore, these sites receive pollution mostly from domestic wastewater, wastewater treatment plants, industrial effluents and agricultural non-point runoff, and the maximum loads of COD, BOD5, CODMn, NH3-N and E. coli are found in cluster B.

Cluster C is formed by the sampling sites S2, S4, S5, S6, S7, S8 and S11 and corresponds to moderately polluted (MP) sites. Sites S4 and S5 on the SSHR mainstream are sections in Jilin City, which receive most of its urban sewage and industrial effluent. Jilin City is the second-largest city of Jilin Province and a base of heavy and chemical industries. The urban river sections are worst affected by industrial pollution from chemical plants with insufficient wastewater treatment facilities. S11 is situated on the Yinma River (a major tributary of the SSHR), near Changchun City. S8 is situated on the Huifa River (a major tributary of the SSHR), near Tonghua City. A certain amount of industrial and agricultural wastewater is discharged into the rivers with runoff. S6 and S7 are distributed along plain streams in the most downstream part of the SSHR, in Songyuan City. Although some small factories, domestic wastewater and surface runoff from villages pollute the river, the results show that these sites have water quality characteristics similar to those of the other sites in cluster C, which suggests that the self-purification and assimilative capacities of the river are relatively strong. The sampling site S2, however, located in the extreme upstream part of the SSHR, receives large amounts of pollution from natural inputs (i.e., volcanic ash) and shows a higher average concentration of F.

The water quality shows a strong gradient of degradation from the mountains to the plains, and the CA classification leads to the same conclusion. This indicates that the CA technique is useful in evaluating spatial differences in water quality. However, CA fails to give details of such differences.

3.3 Spatial variations in river water quality

Further evaluation of the spatial differences in river water quality throughout the SSHR basin is performed using DA. The objectives of DA are to test the significance of the DFs and to determine the most significant parameters associated with the differences between the clusters. Spatial DA is performed on the raw data sets comprising 10 parameters after grouping into the three major classes of HP, MP and LP obtained through CA (Table 2). The site (clustered) is the grouping (dependent) variable, while all the measured parameters constitute the independent variables. Discriminant functions (DFs) and classification matrices (CMs) obtained from the standard, forward stepwise and backward stepwise modes of DA are given in Tables 3 and 4.

The standard DA mode constructs DFs including 10 parameters (Table 3), and the E. coli group coefficients are 0.000. These standard DFs render a CM where 91.4% of the cases are correctly assigned (Table 4), but using all 10 parameters. The forward stepwise mode, which includes variables step by step beginning with the more significant ones until no significant changes are obtained, gives a CM with 91.7% right assignations (Table 4) using 7 discriminant parameters (Table 3). The backward stepwise mode, which removes variables step by step beginning with the less significant ones until no significant changes are obtained, gives a CM with 91.7% right assignations using only 6 parameters (Table 3), with little difference in the match for each region compared with the standard and forward stepwise modes (Table 4). These spatial DFs are validated using a different data set, obtaining at least 73% right assignations. Such a percentage is reasonable considering that the validation set is different from the data set used to construct the original DFs. Thus, the DA results suggest that pH, F, DO, NH3-N, COD and VPhs are the most significant parameters to discriminate among the HP, MP and LP regions, which means that these six parameters account for most of the expected changes (Table 3, backward stepwise).
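How a classification matrix and its percentage of right assignations are obtained can be sketched in Python as follows. The function names and the eight labelled cases are hypothetical and do not reproduce Table 4; the sketch only covers the tallying step, not the construction of the DFs themselves.

```python
def classification_matrix(actual, predicted, groups):
    """Count cases by (actual group, group assigned by the DFs)."""
    matrix = {a: {p: 0 for p in groups} for a in groups}
    for a, p in zip(actual, predicted):
        matrix[a][p] += 1
    return matrix


def percent_correct(matrix):
    """Share of cases on the diagonal of the classification matrix, in percent."""
    total = sum(sum(row.values()) for row in matrix.values())
    correct = sum(matrix[g][g] for g in matrix)
    return 100 * correct / total


groups = ["HP", "MP", "LP"]
# Hypothetical cluster membership and DF assignments for eight cases.
actual = ["HP", "HP", "MP", "MP", "MP", "LP", "LP", "LP"]
predicted = ["HP", "MP", "MP", "MP", "MP", "LP", "LP", "MP"]

cm = classification_matrix(actual, predicted, groups)
print(cm)
print(percent_correct(cm))  # 75.0
```

Applying the same tally to cases that were not used to build the DFs gives the validation percentage.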

Table 2 Wilks’ lambda and Chi-square test of DA of spatial variation

Table 3 Discriminant functions (Eq. (1)) for discriminant analysis of spatial variations in SSHR basin

Table 4 Classification matrices (CMs) for discriminant analysis of spatial variations in SSHR basin

DA affords the best results for spatial analysis. The correct assignations (91.7%) by DA for three different site clusters (HP, MP and LP) further confirm the adequacy of DA. DA shows that there are significant differences among these three regions, which are expressed in terms of six discriminating parameters. However, the data reduction from DA is not as great as expected, mainly because we need six parameters (60% of the 10 measured) to explain the spatial variance in the SSHR water quality.

3.4 Identification of important spatial water quality parameter and pollution source

Further evaluation of the spatial changes is performed using PCA/FA. PCA/FA is performed on the normalized data sets containing 10 variables, separately for the three different regions (LP, MP and HP) and the whole basin, as delineated by CA techniques, to compare the compositional pattern between analyzed water samples and identify the source influencing each one.

Preliminary analysis prior to PCA/FA was conducted; a quick check at the Kaiser-Meyer-Olkin (KMO) test and the Bartlett test of sphericity table was done to ensure no violation of the assumption of factor analysis. The KMO results for the HP, MP and LP regions and the whole basin were 0.729, 0.671, 0.631 and 0.815, and Bartlett’s sphericity test was significant (0.001, p<0.05), showing that PCA/FA could be considered appropriate and useful to provide significant reduction in data dimensionality.

PCA of the four data sets yields three PCs with eigenvalues >1 for each of the HP sites, the MP sites and the whole basin, and four PCs with eigenvalues >1 for the LP sites; these explain 70%, 62%, 72% and 71% of the total variance in the respective water quality data sets. Equal numbers of VFs are obtained through FA performed on the PCs. Results of FA, including factor loadings, eigenvalues and total and cumulative variance values, are presented in Table 5. The factor loadings express the correlation between the original variables and the newly formed varifactors. The VF loadings can be used to determine the relative importance of a variable compared with the other variables in a factor and do not reflect the importance of the factor itself [5]. Factor loadings are classified as “strong”, “moderate” and “weak”, corresponding to absolute loading values of >0.75, 0.75-0.50 and 0.50-0.30, respectively [28].
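The eigenvalue >1 retention rule and the loading classification quoted above can be expressed as a short Python sketch. The eigenvalues and loadings are hypothetical, and the label "negligible" for absolute loadings below 0.30 is an assumption, since the text does not name that band.

```python
def classify_loading(loading):
    """Label a factor loading by its absolute value, using the bands in [28]."""
    size = abs(loading)
    if size > 0.75:
        return "strong"
    if size >= 0.50:
        return "moderate"
    if size >= 0.30:
        return "weak"
    return "negligible"  # assumed label; the text does not name this band


def retained_components(eigenvalues):
    """Keep components with eigenvalue > 1 and report % variance explained."""
    kept = [ev for ev in eigenvalues if ev > 1]
    explained = 100 * sum(kept) / sum(eigenvalues)
    return len(kept), explained


# Hypothetical eigenvalues for 10 standardized variables (they sum to 10).
eigenvalues = [3.0, 2.5, 1.5, 0.9, 0.7, 0.5, 0.4, 0.3, 0.1, 0.1]
n_kept, explained = retained_components(eigenvalues)
print(n_kept, round(explained, 1))

# Hypothetical loadings of three parameters on one varifactor.
for name, value in [("F", 0.82), ("DO", -0.61), ("pH", 0.35)]:
    print(name, classify_loading(value))
```

The sign of a loading is ignored when classifying its strength; it only indicates whether the parameter rises or falls with the factor.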

For the data set representing the HP sites, among the three VFs, VF1, explaining 26.85% of the total variance, has strong positive loadings on F, strong negative loadings on E. coli and moderate negative loadings on DO. In the HP region, the fluoride pollution comes mostly from anthropogenic inputs, such as discharges from factories and the burning of a mixture of clay and coal. In addition, the presence of E. coli in the water indicates a strong effect of organic pollution from the fecal wastes of humans and other warm-blooded animals. The high levels of fecal wastes mainly represent the contribution of non-point sources, such as domestic sewage and wastewater from aquaculture and animal husbandry on some livestock farms. The negative loading of DO on this factor suggests the consumption of DO by bacteria decomposing organic matter. VF2, explaining 25.50% of the total variance, has strong positive loadings on NH3-N, BOD5 and COD and a strong negative loading on pH. This varifactor can be explained by high levels of organic matter: organic matter in urban wastewater consists mainly of carbohydrates, proteins and lipids which, as the amount of available dissolved oxygen decreases, undergo anaerobic fermentation processes leading to ammonia and organic acids. Hydrolysis of these acidic materials causes a decrease in water pH values. The high loading on organic compounds in the water body indicates that the river is heavily polluted by anthropogenic activities through point and non-point sources near the study area (Table 6). VF3, explaining 17.71% of the total variance, has moderate positive loadings on VPhs and CODMn and a strong negative loading on T; these parameters are indicators of toxic organic pollution. Volatile phenols in the water reflect point source pollution in this region, such as wastewater from refineries, oil production and coking factories.

Table 5 Factor loadings matrix and explained variance of water quality parameters in three regions

For the dataset pertaining to water quality in the MP sites, among the three VFs, VF1 is the most important with 30.35% of the total variance and has strong positive loadings on COD and BOD5, and moderate positive loadings on CODMn and NH3-N. This factor can be explained as oxygen-consuming organic and inorganic nutrient pollution. ZHOU et al [29] correlated positive loadings on BOD5, COD and CODMn with organic pollution due to waste disposal activities. This oxygen-consuming organic factor mainly indicates the point sources of wastewater treatment plants and industrial operations and the non-point sources of domestic sewage. The inorganic nutrients may be related to anthropogenic non-point pollution from domestic sewage and agricultural activities. VF2, explaining 15.85% of the total variance, has a strong positive loading on F and a moderate positive loading on E. coli; these parameters are indicators of toxic inorganic and fecal pollution. The inorganic fluorine pollution in the middle and lower reaches of the SSHR represents pollution sources from industrial activities, whereas that in the upper reaches mostly comes from natural inputs (e.g., volcanic ash). Meanwhile, the fecal pollution mainly represents the contribution of non-point sources, such as wastewater from domestic sewage and some livestock farms. VF3, explaining 15.64% of the total variance, has a high positive loading on temperature and a strong negative loading on DO. The inverse relationship between temperature and dissolved oxygen is a natural process because the solubility of oxygen in water decreases with increasing temperature. The discharge of industrial and domestic wastewater induces serious organic pollution at these sites, since the decrease of DO is mainly caused by the decomposition of organic compounds.

Table 6 List of pollution sources into Yitong River (2010)

Lastly, for the dataset pertaining to water quality in the LP sites, among the four VFs, VF1 is the most important with 23.99% of the total variance and has strong positive loadings on CODMn and NH3-N and moderate negative loadings on BOD5 and COD. In the LP region, most sites are adjacent to agricultural land and so are influenced by agricultural runoff carrying a great number of pollutants. This may be attributable to land use change, unreasonable tillage and over-fertilization. The other sites are mainly influenced by mixed pollution sources, chiefly domestic sewage and agricultural activities. VF2, explaining 19.04% of the total variance, has a strong negative loading on T, and moderate positive loadings on DO and VPhs. The negative loading of temperature is associated with seasonal variation. The inverse relationship between temperature and dissolved oxygen is a natural process. In the LP region, S9 in the Huifa River and S10 in the Jiaohe River show relatively higher VPhs contents than the others because of industrial wastewaters. Thus, this factor represents seasonal variation and organic pollution from industrial wastewaters. VF3, explaining 15.04% of the total variance, has a strong positive loading on F and a moderate positive loading on E. coli. This factor represents the contribution of point and non-point sources from industrial and agricultural areas. VF4, explaining 13.06% of the total variance, has a strong positive loading on pH. The positive loading of pH is associated with natural sources. High pH at S1 comes from natural inputs (e.g., volcanic ash). At the other sites, high pH is mainly attributed to high rates of photosynthesis by autotrophs, where greater consumption of carbon dioxide results in increased pH [30].
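As a consistency check, the variance shares quoted for the individual VFs of the HP, MP and LP regions can be summed and compared with the cumulative totals reported for each region. The sketch below uses only the percentages given in the text; the variable names are illustrative.

```python
# Variance explained (%) by each VF, as quoted in the text for each region.
vf_variance = {
    "HP": [26.85, 25.50, 17.71],
    "MP": [30.35, 15.85, 15.64],
    "LP": [23.99, 19.04, 15.04, 13.06],
}

# Cumulative variance per region, rounded to two decimals to hide
# floating-point noise from the summation.
cumulative = {region: round(sum(shares), 2) for region, shares in vf_variance.items()}

for region, total in cumulative.items():
    print(f"{region}: {total:.2f}% of total variance")
```

The sums are 70.06%, 61.84% and 71.13%, which round to the 70%, 62% and 71% reported for the HP, MP and LP regions.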

Many potential pollution sources have also been identified by using PCA/FA in the three regions of the SSHR (Table 7). Thus, different measures can be carried out to control the water pollution sources in different regions. The point sources of pollution in the HP region, which are also significant latent pollution sources in the MP region, should be paid more attention. As mentioned above, the pollution sources of the LP region lie in the upstream and drinking water source protection region, so controlling the point sources of pollution in the LP region is of prime importance. Non-point sources of pollution are a serious environmental problem in the SSHR basin. Agricultural activities such as cultivation or aquatic breeding without advanced techniques bring large amounts of nutrients and organic pollution. Thus, the priority is to develop advanced techniques for decreasing non-point sources of pollution in these three regions.

Table 7 Significant latent pollution sources for HP, MP and LP regions

For the dataset pertaining to water quality in the whole basin, among the three VFs, VF1 is the most important with 45.19% of the total variance and has strong positive loadings on COD, CODMn, BOD5, NH3-N and F, and a moderate negative loading on DO. VF2, explaining 15.21% of the total variance, has a strong negative loading on pH. VF3, explaining 11.60% of the total variance, has a high positive loading on temperature. According to the results from spatial PCA/FA in the whole basin (Table 6), any water quality parameter with an absolute factor loading value >95% is considered to be an important parameter contributing to spatial variations of the SSHR water quality [4]. In the SSHR basin, only two parameters (i.e., COD and T) are identified as the most important parameters contributing to water quality variations.
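The screening rule described above can be sketched as a simple filter on absolute loadings. The example is hedged in two ways: the threshold is read as 0.95 on the absolute loading (an assumption about how ">95%" is meant), and the loading values are invented placeholders, not the values of Table 5, which is not reproduced here. The function name is also illustrative.

```python
def screen_parameters(loadings, threshold=0.95):
    """Return parameters whose largest absolute loading on any VF exceeds threshold."""
    return [
        name
        for name, values in loadings.items()
        if max(abs(v) for v in values) > threshold
    ]


# Placeholder loadings on (VF1, VF2, VF3); NOT the values reported in Table 5.
example_loadings = {
    "COD": [0.96, 0.05, 0.10],
    "DO": [-0.62, 0.20, -0.31],
    "pH": [0.12, -0.81, 0.07],
    "T": [0.08, 0.11, 0.97],
}

selected = screen_parameters(example_loadings)
print(selected)
```

With a cut-off this strict, a parameter is retained only if it is almost perfectly correlated with one varifactor, which is why so few parameters pass the screen.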

In general, some of these results from spatial PCA/FA agree with those obtained from DA. However, pH, DO, NH3-N and VPhs are not as significant in FA as in DA. Considering the results from FA, we need two parameters (about 20% of the 10 measured) to explain 72% of the data variation in three catchment regions. Therefore, as a means to identify those parameters having the greatest contribution to spatial changes in the water quality, FA gives a considerable data reduction in this work.
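The data reduction percentages compared here follow from simple arithmetic on the number of parameters retained out of the 10 measured. The sketch below is illustrative; the function name is an assumption, and the parameter counts (two for FA, six for DA) are those reported in this paper.

```python
def data_reduction(n_selected: int, n_total: int) -> float:
    """Percentage of measured parameters that are dropped by a selection."""
    if n_total <= 0 or not 0 <= n_selected <= n_total:
        raise ValueError("need 0 <= n_selected <= n_total and n_total > 0")
    return 100.0 * (n_total - n_selected) / n_total


fa_reduction = data_reduction(2, 10)  # FA retains COD and T
da_reduction = data_reduction(6, 10)  # DA retains six parameters

print(f"FA: {fa_reduction:.0f}% reduction, DA: {da_reduction:.0f}% reduction")
```

Retaining 2 of 10 parameters gives an 80% reduction, whereas retaining 6 of 10 gives only 40%, which is the basis for describing the FA reduction as considerable.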

4 Conclusions

Various multivariate statistical techniques are utilized to evaluate spatial variations in surface water quality of the Second Songhua River basin (China) and render different features.

1) Univariate statistical results show that most of the physicochemical parameters vary significantly among the sampling sites. Some variables such as DO, NH3-N, COD, CODMn, BOD5 and E. coli are 1-3 times the Grade III standard under the Environmental Quality Standards for Surface Water (GB 3838-2002) and international (WHO, 2004) standards.

2) Hierarchical agglomerative CA groups 15 different sampling sites into three clusters, i.e., relatively less polluted (LP), moderately polluted (MP) and highly polluted (HP) sites, based on the similarity of water quality characteristics. The extracted grouping information can be used in reducing the number of sampling sites without missing much information. CA renders good results to evaluate spatial differences. However, CA fails to give details of such difference.

3) DA recognizes the most significant parameters contributing to spatial variations of surface water quality. However, DA does not give a considerable data reduction, because it selects six parameters (pH, F, DO, NH3-N, COD and VPhs) out of 10 measured parameters (40% reduction) to differentiate samples from spatial changes with 91.7% correct assignations.

4) PCA and FA assist in extracting and recognizing the factors/sources responsible for river water quality variations in the three regions obtained from CA. FA reveals that the SSHR water chemistry is strongly affected by anthropogenic activities (point sources: industrial effluents and wastewater treatment plants; non-point sources: domestic sewage, livestock operations and agricultural activities) and natural processes (seasonal effect, and natural inputs). Based on the selection criterion of an absolute factor loading value >95%, VFs obtained from FA indicate that the most important parameters contributing to spatial variations of the SSHR water quality are COD and T. PCA/FA gives an important data reduction because it uses only two parameters (about 80% reduction) as the most important parameters to explain 72% of the data variation.

Thus, this work demonstrates the usefulness of multivariate statistical techniques for analysis and interpretation of data sets and, in water quality assessment, for identification of pollution sources/factors with a view to getting better information about the water quality and designing a monitoring network/strategy for effective stream water quality management.

References

[1] NGOYE E, MACHIWA J F. The influence of land-use patterns in the Ruvu river watershed on water quality in the river system [J]. Physics and Chemistry of the Earth, 2004, 29(15/16/17/18): 1161-1166.

[2] BIERMAN P, LEWIS M, OSTENDORF B, TANNER J. A review of methods for analysing spatial and temporal patterns in coastal water quality [J]. Ecological Indicators, 2011, 11: 103-114.

[3] BHUIYAN M A H, RAKIB M A, DAMPARE S B, GANYAGLO S, SUZUKI S. Surface water quality assessment in the central part of Bangladesh using multivariate analysis [J]. KSCE Journal of Civil Engineering, 2011, 15(6): 995-1003.

[4] OUYANG Y, NKEDI-KIZZA P, WU Q T, SHINDE D, HUANG C H. Assessment of seasonal variations in surface water quality [J]. Water Research, 2006, 40: 3800-3810.

[5] SHAHIDUL I M, TANAKA M. Impacts of pollution on coastal and marine ecosystems including coastal and marine fisheries and approach for management: A review and synthesis [J]. Marine Pollution Bulletin, 2004, 48(7/8): 624-649.

[6] ORPIN A R, RIDD P V, THOMAS S, ANTHONY K R N, MARSHALL P, OLIVER J. Natural turbidity variability and weather forecasts in risk management of anthropogenic sediment discharge near sensitive environments [J]. Marine Pollution Bulletin, 2004, 49 (7/8): 602-612.

[7] SANCHEZ E, COLMENAREJO M F, VICENTE J, RUBIO A, GARCIA M G, TRAVIESO L, BORJA R. Use of the water quality index and dissolved oxygen deficit as simple indicators of watersheds pollution [J]. Ecological Indicators, 2007, 7(2): 315-328.

[8] DIXON W, CHISWELL B. Review of aquatic monitoring program design [J]. Water Research, 1996, 30: 1935-1948.

[9] VEGA M, PARDO R, BARRADO E, DEBAN L. Assessment of seasonal and polluting effects on the quality of river water by exploratory data analysis [J]. Water Research, 1998, 32(12): 3581-3592.

[10] SINGH K P, MALIK A, SINHA S. Water quality assessment and apportionment of pollution sources of Gomti River (India) using multivariate statistical techniques: A case study [J]. Analytica Chimica Acta, 2005, 35: 3581-3592.

[11] BU H M, TAN X, LI S Y, ZHANG Q F. Water quality assessment of the Jinshui River (China) using multivariate statistical techniques [J]. Environmental Earth Sciences, 2010, 60(8): 1631-1639.

[12] ALBERTO W D, PILAR D M D, VALERIA A M, FABIANA P S, CECILIA H A, ANGELES B M D L. Pattern recognition techniques for the evaluation of spatial and temporal variations in water quality. A case study: Squia River Basin (Cordoba-Argentina) [J]. Water Research, 2000, 35: 2881-2894.

[13] SINGH K P, MALIK A, MOHAN D, SINHA S. Multivariate statistical techniques for the evaluation of spatial and temporal variations in water quality of Gomti River (India)—A case study [J]. Water Research, 2004, 38: 3980-3992.

[14] MCNEIL V H, COX M E, PREDA M. Assessment of chemical water types and their spatial variation using multi-stage cluster analysis, Queensland, Australia [J]. Journal of Hydrology, 2005, 310 (1/2/3/4): 181-200.

[15] PANDA U C, SUNDARAY S K, RATH P, NAYAK B B, BHATTA D. Application of factor and cluster analysis for characterization of river and estuarine water systems—A case study: Mahanadi River (India) [J]. Journal of Hydrology, 2006, 331 (3/4): 434-445.

[16] SHRESTHA S, KAZAMA F. Assessment of surface water quality using multivariate statistical techniques: A case study of the Fuji river basin, Japan [J]. Environmental Modelling & Software, 2007, 22: 464-475.

[17] NAJAR I A, KHAN A B. Assessment of water quality and identification of pollution sources of three lakes in Kashmir, India, using multivariate analysis [J]. Environmental Earth Sciences, 2012, 66: 2367-2378.

[18] ZARE G A, SHEIKH V, SADODDIN A. Assessment of seasonal variations of chemical characteristics in surface water using multivariate statistical methods [J]. International Journal of Environmental Science and Technology, 2011, 8(3): 581-592.

[19] REGHUNATH R, MURTHY T R S, RAGHAVAN B R. The utility of multivariate statistical techniques in hydrogeochemical studies: An example from Karnataka, India [J]. Water Research, 2002, 36 (10): 2437-2442.

[20] WANG Y, WANG P, BAI Y J, TIAN Z X, LI J W, SHAO X, MUSTAVICH L F, LI B L. Assessment of surface water quality via multivariate statistical techniques: A case study of the Songhua River Harbin region, China [J]. Journal of Hydro-environment Research, 2013, 7(1): 30-40.

[21] YAN D H, DENG W, HE Y. The responses of hydro-environment system in the Second Songhua Basin to melt water [J]. Journal of Geographical Sciences, 2002, 12(3): 289-294.

[22] JIANG G B, SHI J B, FENG X B. Mercury pollution in China [J]. Environmental Science & Technology, 2006, 40(12): 3672-3678.

[23] HELENA B, PARDO R, VEGA M, BARRADO E, FERNANDEZ J M, FERNANDEZ L. Temporal evaluation of groundwater composition in an alluvial aquifer (Pisuerga river, Spain) by principal component analysis [J]. Water Research, 2000, 34: 807-816.

[24] MCKENNA Jr J E. An enhanced cluster analysis program with bootstrap significance testing for ecological community analysis [J]. Environmental Modelling & Software, 2003, 18(2): 205-220.

[25] WOLFGANG K H, S. Applied multivariate statistical analysis [M]. Third edition. Berlin Heidelberg: Springer, 2012: 367-382.

[26] MASAMBA W R L, MAZVIMAVI D. Impact on water quality of land uses along Thamalakane-Boteti River: An outlet of the Okavango Delta [J]. Physics and Chemistry of the Earth, 2008, 33(8/9/10/11/12/13): 687-694.

[27] WANG X L, LU Y L, HAN J Y, HE G Z, WANG T Y. Identification of anthropogenic influences on water quality of rivers in Taihu watershed [J]. Journal of Environmental Sciences-China, 2007, 19(4): 475-481.

[28] LIU C W, LIN K H, KUO Y M. Application of factor analysis in the assessment of groundwater quality in a Blackfoot disease area in Taiwan [J]. The Science of the Total Environment, 2003, 313: 77-89.

[29] ZHOU F, LIU Y, GUO H C. Application of multivariate statistical methods to water quality assessment of the watercourses in Northwestern New Territories, Hong Kong [J]. Environmental Monitoring and Assessment, 2007, 132(1/2/3): 1-13.

[30] BINI L M, THOMAZ S M, CARVALHO P. Limnological effects of Egeria najas Planchon (Hydrocharitaceae) in the arms of Itaipu Reservoir (Brazil, Paraguay) [J]. Limnology, 2010, 11(1): 39-47.

(Edited by YANG Bing)

Foundation item: Project(2012ZX07501002-001) supported by the Ministry of Science and Technology of China

Received date: 2015-03-02; Accepted date: 2015-06-05

Corresponding author: YU Hong-bing, Professor; E-mail: hongbingyunk@126.com

Abstract: Multivariate statistical techniques, such as cluster analysis (CA), discriminant analysis (DA), principal component analysis (PCA) and factor analysis (FA), were applied to evaluate and interpret the surface water quality data sets of the Second Songhua River (SSHR) basin in China, obtained during two years (2012-2013) of monitoring of 10 physicochemical parameters at 15 different sites. The results showed that most of physicochemical parameters varied significantly among the sampling sites. Three significant groups, highly polluted (HP), moderately polluted (MP) and less polluted (LP), of sampling sites were obtained through Hierarchical agglomerative CA on the basis of similarity of water quality characteristics. DA identified pH, F, DO, NH3-N, COD and VPhs were the most important parameters contributing to spatial variations of surface water quality. However, DA did not give a considerable data reduction (40% reduction). PCA/FA resulted in three, three and four latent factors explaining 70%, 62% and 71% of the total variance in water quality data sets of HP, MP and LP regions, respectively. FA revealed that the SSHR water chemistry was strongly affected by anthropogenic activities (point sources: industrial effluents and wastewater treatment plants; non-point sources: domestic sewage, livestock operations and agricultural activities) and natural processes (seasonal effect, and natural inputs). PCA/FA in the whole basin showed the best results for data reduction because it used only two parameters (about 80% reduction) as the most important parameters to explain 72% of the data variation. Thus, this work illustrated the utility of multivariate statistical techniques for analysis and interpretation of datasets and, in water quality assessment, identification of pollution sources/factors and understanding spatial variations in water quality for effective stream water quality management.

[1] NGOYE E, MACHIWA J F. The influence of land-use patterns in the Ruvu river watershed on water quality in the river system [J]. Physics and Chemistry of the Earth, 2004, 29(15/16/17/18): 1161-1166.

[2] BIERMAN P, LEWIS M, OSTENDORF B, TANNER J. A review of methods for analysing spatial and temporal patterns in coastal water quality [J]. Ecological Indicators, 2011, 11: 103-114.

[3] BHUIYAN M A H, RAKIB M A, DAMPARE S B, GANYAGLO S, SUZUKI S. Surface water quality assessment in the central part of Bangladesh using multivariate analysis [J]. KSCE Journal of Civil Engineering, 2011, 15(6): 995-1003.

[4] OUYANG Y, NKEDI-KIZZA P, WU Q T, SHINDE D, HUANG C H. Assessment of seasonal variations in surface water quality [J]. Water Research, 2006, 40: 3800-3810.

[5] SHAHIDUL I M, TANAKA M. Impacts of pollution on coastal and marine ecosystems including coastal and marine fisheries and approach for management: A review and synthesis [J]. Marine Pollution Bulletin, 2004, 48(7/8): 624-649.

[6] ORPIN A R, RIDD P V, THOMAS S, ANTHONY K R N, MARSHALL P, OLIVER J. Natural turbidity variability and weather forecasts in risk management of anthropogenic sediment discharge near sensitive environments [J]. Marine Pollution Bulletin, 2004, 49 (7/8): 602-612.

[7] SANCHEZ E, COLMENAREJO M F, VICENTE J, RUBIO A, GARCIA M G, TRAVIESO L, BORJA R. Use of the water quality index and dissolved oxygen deficit as simple indicators of watersheds pollution [J]. Ecological Indicators, 2007, 7(2): 315-328.

[8] DIXON W, CHISWELL B. Review of aquatic monitoring program design [J]. Water Research, 1996, 30: 1935-1948.

[9] VEGA M, PARDO R, BARRADO E, DEBAN L. Assessment of seasonal and polluting effects on the quality of river water by exploratory data analysis [J]. Water Research, 1998, 32(12): 3581-3592.

[10] SINGH K P, MALIK A, SINHA S. Water quality assessment and apportionment of pollution sources of Gomti River (India) using multivariate statistical techniques: A case study [J]. Analytica Chimica Acta, 2005, 35: 3581-3592.

[11] BU H M, TAN X, LI S Y, ZHANG Q F. Water quality assessment of the Jinshui River (China) using multivariate statistical techniques [J]. Environmental Earth Sciences, 2010, 60(8): 1631-1639.

[12] ALBERTO W D, PILAR D M D, VALERIA A M, FABIANA P S, CECILIA H A, ANGELES B M D L. Pattern recognition techniques for the evaluation of spatial and temporal variations in water quality. A case study: Squia River Basin (Cordoba-Argentina) [J]. Water Research, 2000, 35: 2881-2894.

[13] SINGH K P, MALIK A, MOHAN D, SINHA S. Multivariate statistical techniques for the evaluation of spatial and temporal variations in water quality of Gomti River (India)—A case study [J]. Water Research, 2004, 38: 3980-3992.

[14] MCNEIL V H, COX M E, PREDA M. Assessment of chemical water types and their spatial variation using multi-stage cluster analysis, Queensland, Australia [J]. Journal of Hydrology, 2005, 310 (1/2/3/4): 181-200.

[15] PANDA U C, SUNDARAY S K, RATH P, NAYAK B B, BHATTA D. Application of factor and cluster analysis for characterization of river and estuarine water systems—A case study: Mahanadi River (India) [J]. Journal of Hydrology, 2006, 331 (3/4): 434-445.

[16] SHRESTHA S, KAZAMA F. Assessment of surface water quality using multivariate statistical techniques: A case study of the Fuji river basin, Japan [J]. Environmental Modelling & Software, 2007, 22: 464-475.

[17] NAJAR I A, KHAN A B. Assessment of water quality and identification of pollution sources of three lakes in Kashmir, India, using multivariate analysis [J]. Environmental Earth Sciences, 2012, 66: 2367-2378.

[18] ZARE G A, SHEIKH V, SADODDIN A. Assessment of seasonal variations of chemical characteristics in surface water using multivariate statistical methods [J]. International Journal of Environmental Science and Technology, 2011, 8(3): 581-592.

[19] REGHUNATH R, MURTHY T R S, RAGHAVAN B R. The utility of multivariate statistical techniques in hydrogeochemical studies: An example from Karnataka, India [J]. Water Research, 2002, 36 (10): 2437-2442.

[20] WANG Y, WANG P, BAI Y J, TIAN Z X, LI J W, SHAO X, MUSTAVICH L F, LI B L. Assessment of surface water quality via multivariate statistical techniques: A case study of the Songhua River Harbin region, China [J]. Journal of Hydro-environment Research, 2013, 7(1): 30-40.

[21] YAN D H, DENG W, HE Y. The responses of hydro-environment system in the Second Songhua Basin to melt water [J]. Journal of Geographical Sciences, 2002, 12(3): 289-294.

[22] JIANG G B, SHI J B, FENG X B. Mercury pollution in China [J]. Environmental Science & Technology, 2006, 40(12): 3672-3678.

[23] HELENA B, PARDO R, VEGA M, BARRADO E, FERNANDEZ J M, FERNANDEZ L. Temporal evaluation of groundwater composition in an alluvial aquifer (Pisuerga river, Spain) by principal component analysis [J]. Water Research, 2000, 34: 807-816.

[24] MCKENNA Jr J E. An enhanced cluster analysis program with bootstrap significance testing for ecological community analysis [J]. Environmental Modelling & Software, 2003, 18(2): 205-220.

[25] WOLFGANG K H, S. Applied multivariate statistical analysis [M]. Third edition. Belin Heidelberg: Springer, 2012: 367-382.

[26] MASAMBA W R L, MAZVIMAVI D. Impact on water quality of land uses along Thamalakane-Boteti River: An outlet of the Okavango Delta [J]. Physics and Chemistry of the Earth, 2008, 33(8/9/10/11/12/13): 687-694.

[27] WANG X L, LU Y L, HAN J Y, HE G Z, WANG T Y. Identification of anthropogenic influences on water quality of rivers in Taihu watershed [J]. Journal of Environmental Sciences-China, 2007, 19(4): 475-481.

[28] LIU C W, LIN K H, KUO Y M. Application of factor analysis in the assessment of groundwater quality in a Blackfoot disease area in Taiwan [J]. The Science of the Total Environment, 2003, 313: 77-89.

[29] ZHOU F, LIU Y, GUO H C. Application of multivariate statistical methods to water quality assessment of the watercourses in Northwestern New Territories, Hong Kong [J]. Environmental Monitoring Assessment, 2007, 132(1/2/3): 1-13.

[30] BINI L M, THOMAZ S M, CARVALHO P. Limnological effects of Egeria najas Planchon (Hydrocharitaceae) in the arms of Itaipu Reservoir (Brazil, Paraguay) [J]. Limnology, 2010, 11(1): 39-47.