Multi-core optimization for conjugate gradient benchmark on heterogeneous processors
J. Cent. South Univ. Technol. (2011) 18: 490-498
DOI: 10.1007/s11771-011-0722-6
DENG Lin(邓林), DOU Yong(窦勇)
National Laboratory for Parallel and Distributed Processing, National University of Defense Technology,
Changsha 410073, China
© Central South University Press and Springer-Verlag Berlin Heidelberg 2011
Abstract: Developing parallel applications on heterogeneous processors faces the challenge of the 'memory wall', due to the limited capacity of local storage, limited bandwidth and long latency of memory access. Aiming at this problem, a parallelization approach with six memory optimization schemes was proposed for CG, four of the schemes targeting the sparse matrix-vector multiplication (SPMV) operation. Conducted on an IBM QS20, the parallelization approach reaches up to 21 and 133 times speedups with sizes A and B, respectively, compared with a single power processor element. Finally, it is concluded that the peak bandwidth of memory access on the Cell BE can be obtained in SPMV, that simple computation is more efficient on heterogeneous processors, and that loop-unrolling can hide local storage access latency when executing scalar operations on SIMD cores.
Key words: multi-core processor; NAS parallelization; CG; memory optimization
1 Introduction
Multi-core architecture has become a promising approach for high performance computing. The Cell BE is a heterogeneous multi-core processor composed of a power processor element (PPE) and eight synergistic processor elements (SPEs). The theoretical peak performance of the Cell processor reaches 200 GFLOPS (single precision) at a clock frequency of 3.2 GHz, and it supports peak bandwidths of 204.8 GB/s for intra-chip and 25.6 GB/s for outer-chip data transfers. As the host core, the PPE can access main memory directly, runs the operating system and manages the applications. In contrast, each SPE, as a slave core, operates directly only on its local storage (LS) and works as an application accelerator in SIMD mode. The Cell BE has attracted much attention in various research areas, such as application performance optimization, programming models and compilation. In terms of application optimization, PETRINI et al [1] presented several dimensions of parallelism for sweep3D. BADER et al [2] built a complexity model for the Cell BE and optimized the list ranking algorithm. HACKENBERG ported RAxML to the Cell BE by separating the original matrix into 64×64 small matrices and computing all of them on the SPEs. SARANTA et al [3] proposed an independent task scheduling method with GA heuristics on heterogeneous systems.
CG, one of the NAS benchmarks, uses a conjugate gradient method to compute an approximation to the smallest eigenvalue of a large, sparse, symmetric positive definite matrix. The key portion of CG is sparse matrix-vector multiplication (SPMV), which has been studied for several decades. SPMV has been accelerated with field programmable gate arrays (FPGAs) [7], but most of these designs assume that the FPGA memory capacity is large enough to hold the whole vector, which limits the scalability of SPMV. On the other hand, software optimization can accelerate SPMV in two ways. Firstly, some researchers improve the SPMV algorithm with new storage formats and compression technology for efficient memory access. PINAR and HEATH [8] proposed a new data structure to store the dense vector, which consumes more memory and requires rewriting the SPMV algorithm. STATHIS et al [9] presented a new storage format of sparse matrix for vector processors. AZEVEDO et al [10] exploited a time-space tradeoff via compression technology, at a large cost in computing resources for compression and decompression. All the optimization methods above aim at reducing the memory requirement and improving performance by lessening memory accesses. These methods work efficiently on general-purpose processors with transparent caches, but not on heterogeneous processors with explicitly controlled memory, like the Cell BE, where new storage formats and compression technology increase the control complexity. Secondly, SPMV has been optimized on platforms with different system architectures, but most of that work focuses on computational optimization instead of memory optimization. For example, COTOFANA et al [11] optimized SPMV on a vector processor, and clusters or SMP systems were also used for SPMV [12-14]. In the last few years, with multi-core processors coming into vogue, WILLIAMS et al [15], CHRISTEN et al [16] and WIGGERS et al [17] ported SPMV to several multi-core processor platforms.
In this work, a parallelization approach with six memory optimization schemes is proposed for CG, four of which aim at improving the SPMV computation on heterogeneous processors.
2 Optimization schemes for multi-core NAS CG benchmark
The subroutine conj_grad is the most time-consuming part of CG, consuming about 99.6% of the execution time for size A. As a result, the conj_grad subroutine should be given priority in optimization. The flow chart of conj_grad is shown in Fig.1(a), where A is a sparse matrix and p, q, z and r are vectors. The size of A is N×N and the size of all vectors is N.
On the Cell BE, single-precision instructions are fully pipelined, so the matrix element type is converted to single precision.
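For reference, one conj_grad iteration can be sketched as below. This is a minimal reconstruction based on the public NAS CG algorithm, not the authors' code; the step numbering is an assumption chosen to be consistent with the step references used in Section 2.2, and convergence checking is omitted.

```c
/* One conj_grad iteration, reconstructed from the NAS CG algorithm.
 * Step numbers are assumptions matching the references in Section 2.2. */
static void conj_grad_iter(int n, const float *Vval, const int *Vcolidx,
                           const int *Vrowstr, float *p, float *q,
                           float *r, float *z, float *rho)
{
    float d = 0.0f, rho0 = *rho, alpha, beta;
    for (int i = 0; i < n; i++) {                 /* step 1: q = A*p (SPMV) */
        float s = 0.0f;
        for (int k = Vrowstr[i]; k < Vrowstr[i + 1]; k++)
            s += Vval[k] * p[Vcolidx[k]];
        q[i] = s;
    }
    for (int i = 0; i < n; i++) d += p[i] * q[i];       /* step 2: d = p.q  */
    alpha = rho0 / d;                                   /* step 3           */
    for (int i = 0; i < n; i++) z[i] += alpha * p[i];   /* step 4           */
    for (int i = 0; i < n; i++) r[i] -= alpha * q[i];   /* step 5           */
    *rho = 0.0f;
    for (int i = 0; i < n; i++) *rho += r[i] * r[i];    /* step 6: rho = r.r */
    beta = *rho / rho0;                                 /* step 7           */
    for (int i = 0; i < n; i++) p[i] = r[i] + beta * p[i]; /* step 8        */
}
```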
2.1 Parallelizing sparse matrix vector multiplication
2.1.1 Scheme 1: Blocking sparse matrix
Because SPEs can only access their own LS, and data must be placed in LS via DMA transfer before calculation, the sparse matrix A is separated into several blocks to increase the efficiency of DMA transfer. The number of blocks should equal the number of SPEs participating in the calculation. There are two blocking methods, as follows.
1) Row-oriented blocking method
Fig.1 Flow chart of conj_grad routine: (a) Serial version; (b) Parallel version
The row-oriented blocking method separates the sparse matrix by rows. Every SPE is responsible for [N/nspe] rows, where N is the number of rows of A and nspe is the number of participating SPEs. In other words, A is separated into nspe blocks A0, …, Anspe−1, and SPEi (0≤i≤nspe−1) computes qi=Ai×p. As shown in Fig.2(a), qi is obtained after every row of Ai has been calculated. Unfortunately, every SPE needs to access all the elements of vector p, which cannot be held in LS. There are two ways to solve this problem. The first is to import all elements of p for each row calculation; as a result, vector p must be imported [N/nspe] times by each SPE, that is, about nspe×[N/nspe]×4N≈4N² bytes are imported from memory in total. The second is to split vector p into nspe blocks p0, …, pnspe−1, where pi is kept in the LS of SPEi and the blocks are circulated among SPEs via the intra-chip network. In this way, p is imported from memory only once (4N bytes), while in total p is transferred among SPEs (N−[N/nspe]) times, i.e. about 4N²(nspe−1)/nspe bytes. Compared with the first method, the second replaces almost all outer-chip communication with intra-chip communication, which has about 10 times the bandwidth of the former. So, the second method is better.
Both of the methods described above need to transfer p many times, either between the SPEs and memory or among the SPEs, with DMA. Although the second method exploits the faster intra-chip transfer instead of outer-chip transfer, it complicates the control flow.
In order to further reduce the data transfers of p, Ai is separated into m blocks by column, Ai0, …, Ai(m−1), as shown in Fig.2(b), and p is split into m blocks p0, …, pm−1 accordingly. All SPEs can calculate qij=Aij×pj simultaneously, since at any moment they only need the elements of pj. Each pj is imported from memory to the LS of SPE0 and then transferred to the other SPEs via intra-chip transfer. Thus, only 4N bytes are transferred from memory and (nspe−1)×4N bytes among SPEs, which means that the data transfers of p can be reduced from O(N²) to O(N) by rearranging the elements of Ai, and at most 4×[N/nspe] bytes of space are used to store temporary results.
2) Column-oriented blocking method
In contrast, the column-oriented blocking method separates the sparse matrix A by column into A0, …, Anspe−1, as shown in Fig.2(c). Vector p is split into nspe blocks too. SPEi is in charge of computing q′i=Ai×pi, where q′i is the temporary (partial) result of q, i.e. q=Σ_{i=0}^{nspe−1}q′i. Block pi is held in the LS of SPEi during the whole SPE execution, and no further transfers of pi are needed because the calculation of SPEi depends only on pi and Ai. Consequently, the 4N bytes of p are imported from memory only once for a sparse matrix-vector multiplication.
For efficient DMA transfer, the non-zero elements of sparse matrix A must be rearranged, as shown in Fig.2(c). The rearranging procedure increases the execution time of the program, but fortunately it is done only once, because A is unchanged during the whole process. Compared with the best case of the row-oriented blocking method, this method reduces the data transfer of p by (nspe−2)×4N bytes.
After the calculation of the temporary results q′i is finished, the summation q=Σ_{i=0}^{nspe−1}q′i must be executed to get the final result. This job is assigned to the PPE, which would otherwise be idle while the SPEs are working. On the basis of the DMA transfer analysis presented in Ref.[18], nq elements of the temporary result are packed as a group and one group is exported each time for the sake of DMA efficiency, where 32≤nq≤4 096. Once the calculation of nq elements is finished, a group of temporary results is exported, and the SPE notifies the PPE to perform the summation immediately after the export completes. As a result, the PPE and SPEs work in parallel, as sketched below.
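A minimal sketch of this producer-consumer pattern follows. The names q_part, ready and MAX_GROUPS are hypothetical; the real code synchronizes via DMA completion and SPE notification rather than a flag array.

```c
#define MAX_GROUPS 1024                /* hypothetical upper bound */

/* PPE-side summation of the partial results q'_i, one group of nq
 * elements at a time. ready[i][g] is a hypothetical flag standing in
 * for SPEi's notification that group g of q'_i reached main memory. */
void ppe_sum_groups(int nspe, int ngroups, int nq, float *q,
                    float **q_part, volatile int ready[][MAX_GROUPS])
{
    for (int g = 0; g < ngroups; g++)
        for (int i = 0; i < nspe; i++) {
            while (!ready[i][g])
                ;                      /* wait for SPEi to export group g */
            for (int k = g * nq; k < (g + 1) * nq; k++)
                q[k] += q_part[i][k];  /* q = sum of q'_i, group by group */
        }
}
```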
In summary, the column-oriented blocking method is better at reducing DMA transfers and exposes more parallelism.
2.1.2 Scheme 2: Changing storage format of sparse matrix
In the original code, the sparse matrix is stored in compressed sparse row (CSR) format. The CSR storage scheme uses three arrays to describe a sparse matrix: the array Vval holds the values of the non-zero elements of the sparse matrix; the array Vcolidx holds the column indices corresponding to the elements in Vval; the array Vrowstr holds the starting index of each row, i.e. the index of the first non-zero element of each row.
The CSR format in the original code consumes Nnnz×4+Nnnz×4+N×4 bytes of memory, where Nnnz is the number of non-zero elements of the sparse matrix A. On a general-purpose CPU with a transparent cache, this format requires less memory capacity and fewer memory accesses. But it does not work efficiently on a heterogeneous multi-core processor with explicitly controlled memory, because many penalty cycles are caused by mispredicted control instructions.
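As a concrete illustration, the CSR arrays for a small hypothetical 4×4 matrix (not taken from the benchmark) look as follows, using zero-based indices; the Fortran original uses one-based indices.

```c
/*        | 5 0 0 1 |
 *    A = | 0 2 0 0 |
 *        | 0 0 3 0 |
 *        | 4 0 0 6 |
 */
float Vval[]    = {5, 1, 2, 3, 4, 6}; /* non-zero values, row by row     */
int   Vcolidx[] = {0, 3, 1, 2, 0, 3}; /* column index of each value      */
int   Vrowstr[] = {0, 2, 3, 4, 6};    /* index of first non-zero per row */
```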
According to the column-oriented blocking scheme, every SPEi has its own array Vrowstr, and the temporary result is

q′[i]=Σ_{j=Vrowstr[i]}^{Vrowstr[i+1]−1}Vval[j]×p(Vcolidx[j])    (1)
Let rnow=Vrowstr[i] and rnext=Vrowstr[i+1]. Array Vrowstr is separated into [N/bm] blocks brow[i], 0≤i≤[N/bm]−1, where bm is the number of Vrowstr elements transferred to LS via DMA each time. At the same time, Vval and Vcolidx are split into [Nnnz/bvcn] blocks, bval[i] and bcol[i], where 0≤i≤[Nnnz/bvcn]−1 and bvcn is the number of elements of Vval and Vcolidx transferred each time.
The values of rnow and rnext are obtained differently under the following conditions.
1) If i mod bm<bm−1, i.e. Vrowstr[i] is not the last element of its block brow[j], then rnow and rnext are both available in brow[j], as shown in Fig.3(a).
2) If i mod bm=bm−1, in other words, Vrowstr[i] is the last element of brow[j], as shown in Fig.3(b), then the following steps should be done:
(1) read rnow from the last element of the current block brow[j];
(2) import the next block, brow[j+1];
(3) read rnext from the first element of the newly imported block.
The computation of Eq.(1) depends on the values of rnow and rnext:
1) If rnow=rnext, row i contains no non-zero element, so q′[i]=0 and i=i+1.
2) If rnow≠rnext, then:
(1) When [rnow/bvcn]=[rnext/bvcn], as shown in Fig.4(a), Vval[rnow] and Vval[rnext] belong to the same block bval[j], and Vcolidx[rnow] and Vcolidx[rnext] belong to the same block bcol[j], so all the required elements are already in LS and there is

q′[i]=Σ_{t=rnow}^{rnext−1}Vval[t]×p(Vcolidx[t])    (2)
(2) When [rnow/bvcn]<[rnext/bvcn], as shown in Fig.4(b), Vval[rnow] and Vcolidx[rnow] belong to bval[j] and bcol[j], respectively, while Vval[rnext] and Vcolidx[rnext] belong to bval[j+k] and bcol[j+k], respectively. Therefore, bval[j+1], …, bval[j+k] and bcol[j+1], …, bcol[j+k] have to be transferred one after another. As a result, there are several steps for calculating q′[i], accumulating the partial sums of Eq.(2) after each import (see the sketch below):
Import bval[j+1] and bcol[j+1]
…
Import bval[j+k] and bcol[j+k]
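The control flow described above can be summarized in the following sketch, where import_brow() and import_valcol() are hypothetical stand-ins for the DMA transfers. It illustrates how many data-dependent branches are needed just to locate rnow and rnext.

```c
extern void import_brow(int *brow, int jb);               /* DMA: fetch brow[jb]  */
extern void import_valcol(float *bval, int *bcol, int j); /* DMA: bval/bcol[j]    */

/* Blocked-CSR SPMV on the SPE; assumes brow[0] and bval/bcol block 0
 * are already in LS, and qp is the partial result q'. */
void spmv_csr_blocked(int rows, int bm, int bvcn, int *brow,
                      float *bval, int *bcol, const float *p, float *qp)
{
    int j = 0, jb = 0;
    for (int i = 0; i < rows; i++) {
        int ii = i % bm;
        int rnow = brow[ii], rnext;
        if (ii == bm - 1) {                /* condition 2): block boundary */
            import_brow(brow, ++jb);
            rnext = brow[0];
        } else {
            rnext = brow[ii + 1];
        }
        if (rnow == rnext) { qp[i] = 0.0f; continue; }   /* empty row */
        float s = 0.0f;
        for (int t = rnow; t < rnext; t++) {
            if (t / bvcn != j) {           /* crossed into the next block */
                j = t / bvcn;
                import_valcol(bval, bcol, j);
            }
            s += bval[t % bvcn] * p[bcol[t % bvcn]];
        }
        qp[i] = s;
    }
}
```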
Fig.2 Blocking operation chart: (a) Row-oriented; (b) Improved row-oriented; (c) Column-oriented
Fig.3 Diagrams of different values of rnow and rnext according to i: (a) Values of rnow and rnext while i mod bm<bm−1; (b) Values of rnow and rnext while i mod bm=bm−1
Fig.4 Different conditions of computational procedures according to values of rnow and rnext: (a) [rnow/bvcn]=[rnext/bvcn]; (b) [rnow/bvcn]<[rnext/bvcn]
From the discussion above, it can safely be concluded that many control instructions are needed to distinguish the different conditions. In order to lower the control complexity, the MM (coordinate) format is used to store the sparse matrix instead of the CSR format.
In MM format, the row-index array Vrowstr is expanded to Nnnz elements, holding the row index of every non-zero element, and it is split into [Nnnz/bvcn] blocks brow[i], where 0≤i≤[Nnnz/bvcn]−1. As a result, the computation in MM format is
q′(brow[i])=q′(brow[i])+bval[i]×p(bcol[i])    (3)
In short, it is more efficient to compute SPMV on the SPE with MM format, due to its lower control complexity: while the CSR format requires several different computational procedures, the MM format needs only one. Substituting MM format for CSR format is a space-time tradeoff, which costs (4Nnnz−4N) more bytes of memory and DMA transfer.
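By contrast with the blocked-CSR sketch above, the MM-format inner loop, implementing Eq.(3) for one block of elements already residing in LS, needs no branches at all. A minimal sketch, with identifiers following the text:

```c
/* SPE inner loop for MM (coordinate) format, Eq.(3): every non-zero
 * element carries its own row and column index, so one uniform,
 * branch-free statement suffices. */
void spmv_mm_block(int n, const int *brow, const int *bcol,
                   const float *bval, const float *p, float *qp)
{
    for (int t = 0; t < n; t++)
        qp[brow[t]] += bval[t] * p[bcol[t]];
}
```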
2.1.3 Scheme 3: Overlapping computation with data transfer by 2-buffer
Computation can be overlapped with data transfer by the 2-buffer scheme: while one buffer is being computed on, the other is being filled by DMA, so data transfer latency is hidden behind computation. The 2-buffer method is employed both to import Ai and to export q′i during computation. Since the indirect accesses of the SPMV occupy the majority of the execution time, the DMA transfer latency can be hidden by computation.
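A minimal sketch of the 2-buffer pattern on the SPE is shown below, using the mfc_get/tag-status intrinsics from the Cell SDK. The names compute(), ea and CHUNK are hypothetical, and error handling is omitted.

```c
#include <spu_mfcio.h>

#define CHUNK 16384                       /* one DMA transfer, 16 kB */
static volatile float buf[2][CHUNK / sizeof(float)]
    __attribute__((aligned(128)));

extern void compute(volatile float *data, int n);  /* hypothetical kernel */

void stream_and_compute(unsigned long long ea, int nchunks)
{
    int cur = 0;
    mfc_get(buf[cur], ea, CHUNK, cur, 0, 0);       /* prefetch chunk 0 */
    for (int i = 0; i < nchunks; i++) {
        int nxt = cur ^ 1;
        if (i + 1 < nchunks)                       /* start next DMA early */
            mfc_get(buf[nxt], ea + (unsigned long long)(i + 1) * CHUNK,
                    CHUNK, nxt, 0, 0);
        mfc_write_tag_mask(1 << cur);              /* wait for current DMA */
        mfc_read_tag_status_all();
        compute(buf[cur], (int)(CHUNK / sizeof(float))); /* overlaps DMA   */
        cur = nxt;
    }
}
```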
2.1.4 Scheme 4: Overlapping LS access latency with loop-unrolling
The SPE must load data from LS into registers before calculating, and then write the results back to LS. The time of computation is

tcalc=tLS+tregister

where tLS denotes the time of LS accesses by load/store instructions, and tregister denotes the execution time of calculations on registers.
In the SPE, all registers are 128 bits wide and all operations are SIMD. To operate on scalar data, the data must first be rotated into the preferred scalar slot, the left-most word of a register, by a rotate instruction. Similarly, scalar data must be shuffled to its storage location by a shuffle instruction before the write-back operation.
The latency of each instruction is shown in Table 1.
Table 1 Characteristics of scalar operation instructions
Due to the indirect accesses of q and p, SIMD technology cannot be employed for the SPMV computation on the SPE. The SPE can issue an LS access instruction and a register instruction simultaneously when the instructions are data-independent. According to the discussion above, the SPMV computational procedure on the SPE is shown in Fig.5 in an assembly-like language.
The 'Start time' is the earliest time at which an instruction can be issued. Obviously, instructions cannot be issued back-to-back, because of instruction execution latency and data dependence. For example, the instruction 'Rotate brow' cannot be issued until the instruction 'Load brow' has finished, due to the data dependence on brow.
Fig.5 Computational procedure of SPMV on SPE
In Fig.5, it takes about 40 cycles to compute Eq.(3), and there are about 18 stall cycles among LS instructions caused by data dependence, i.e. 9 cycles between 'Load bval' and 'Load q' and 9 cycles between 'Load p' and 'Store q'. In other words, the stalling time is about 45% of the total time. Loop-unrolling is utilized to overlap the stalling time with instructions that belong to other iterations. As a result, by unrolling the loop 4 times, the computation time can be reduced to about 25 cycles on average.
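In C, the effect of the unrolling can be sketched as follows; the real code in Fig.5 is scheduled at the instruction level, and spu-gcc performs the final scheduling.

```c
/* 4-way unrolled MM-format inner loop: the four load pairs are
 * independent, so they can issue while earlier multiply-adds are still
 * in flight, hiding LS access latency. Assumes n is a multiple of 4. */
void spmv_mm_unrolled(int n, const int *brow, const int *bcol,
                      const float *bval, const float *p, float *qp)
{
    for (int t = 0; t < n; t += 4) {
        float v0 = bval[t],     x0 = p[bcol[t]];
        float v1 = bval[t + 1], x1 = p[bcol[t + 1]];
        float v2 = bval[t + 2], x2 = p[bcol[t + 2]];
        float v3 = bval[t + 3], x3 = p[bcol[t + 3]];
        qp[brow[t]]     += v0 * x0;
        qp[brow[t + 1]] += v1 * x1;
        qp[brow[t + 2]] += v2 * x2;
        qp[brow[t + 3]] += v3 * x3;
    }
}
```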
2.2 Prioritized storage model for reusable data
The definition of symbols used in the model is shown in Table 2.
Table 2 Definition of symbols used in model
In a heterogeneous processor, the execution time of a program is composed of calculation time, data transfer time and control instruction execution time:
t=tcalc+tdata+tcontrol (4)
Data transfer can be reduced by storing reusable data in LS. As shown in the flow chart, several data are reused among iterations, such as the vectors p, q, r, z and the sparse matrix A. Because of the limited capacity of LS, these data have to be imported from memory before calculation and exported to memory after calculation via DMA transfer. So, the data transfer time becomes
tdata=tml+tlm (5)
where tml is the time of importing from memory to LS and tlm vice versa.
tml=(Tml/bml)×tdma(bml, nspe)    (6)

where tdma is the time of one DMA transfer of size bml, which is a function of bml and nspe [18]; tlm is modeled in the same way.
Tml is the total size of data to be imported and, similarly, Tlm is the total size of data to be exported. As a result, the total size of data transfer is

T=Σ_{i=1}^{ti}(Tml,i+Tlm,i)    (7)

where Tml,i is the import data size in the i-th iteration, Tlm,i is the export data size, and ti is the number of iterations. It is found that Tml,i differs between the first iteration and the others.
1) In the first iteration, Tml,1 (Eq.(8)) is composed of the following imports: the whole sparse matrix A at step 1; the whole vector p at step 1, plus the transferred part of p at steps 2, 4 and 8; the transferred part of q at steps 2 and 5; the whole vector r at step 5, plus the transferred part of r at steps 6 and 8; and the whole vector z at step 4. Summing these contributions gives Tml,1 in closed form (Eq.(9)).
2) In the other iterations (2≤i≤ti), only the data that cannot stay resident in LS need to be re-imported, which gives Tml,i (Eq.(10)).
According to the blocking scheme described above, q′i, the temporary result of qi, must be exported in every iteration, so Tlm,i is identical for all iterations (Eq.(11)). Substituting Eqs.(8)-(11) into Eq.(7) gives the total transfer size (Eq.(12)).
Suppose ti>>1, and let D′=DA+Dp+Dr+Dz+nspe×Dq; then Eq.(12) simplifies to Eq.(13).
According to Eq.(13), the storage priority for the reusable data is defined as p:q:r:z:A=5:2:4:2:1. The capacity of LS is then allocated to the data with higher storage priority first. There are seven allocation conditions:
(1) If LS is large enough to store all the reusable data, then T=D′: every reusable datum is transferred only once.
(2) If all the vectors fit in LS but the sparse matrix A does not, A has to be transferred in every iteration.
(3) If, in addition, z cannot stay resident, z can share the storage of A, because A and z are not required simultaneously.
(4)-(7) The remaining conditions cover progressively smaller LS capacities: the reusable data are evicted in the reverse order of their storage priority, and each evicted datum has to be transferred in every iteration. A sketch of this allocation policy follows the list.
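The allocation policy can be sketched as a simple greedy pass over the data sorted by priority. This is a hypothetical illustration of the model with made-up per-SPE sizes, not the authors' code, and it ignores the storage sharing between A and z noted in condition (3).

```c
#include <stdio.h>

typedef struct { const char *name; long bytes; int resident; } Reusable;

/* Grant LS capacity in descending storage priority (p, r, q, z, A);
 * whatever does not fit is re-transferred in every iteration. */
long allocate_ls(Reusable d[], int n, long capacity)
{
    long per_iter = 0;
    for (int i = 0; i < n; i++) {
        if (d[i].bytes <= capacity) { d[i].resident = 1; capacity -= d[i].bytes; }
        else                        { d[i].resident = 0; per_iter += d[i].bytes; }
    }
    return per_iter;   /* DMA bytes per iteration for the evicted data */
}

int main(void)
{
    /* hypothetical per-SPE block sizes: 4 vector blocks and the matrix */
    Reusable d[] = { {"p", 37500, 0}, {"r", 37500, 0}, {"q", 37500, 0},
                     {"z", 37500, 0}, {"A", 700000, 0} };
    printf("%ld bytes/iteration\n", allocate_ls(d, 5, 200000)); /* cond.(2) */
    return 0;
}
```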
According to the results of the DMA transfer analysis [18], the size of a DMA transfer should be a multiple of 128 bytes and larger than 2 kB to obtain the peak performance of DMA transfer. Let 16 kB be the upper limit of a DMA transfer; then 2 kB≤s≤16 kB and s=128x are defined for a vector buffer, where x is a positive integer. To employ the 2-buffer scheme, 4 kB≤s≤32 kB and s=128x are defined.
Three arrays are used to store the sparse matrix A, with 12 bytes per element, so the buffer size for A (denoted bA here) should also be a multiple of 12, that is, a multiple of 384 bytes. By considering the DMA transfer characteristics and the 2-buffer scheme, 4.5 kB≤bA≤96 kB and bA=384x are defined.
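The two constraints combine into a multiple of lcm(128, 12)=384 bytes; a small hypothetical helper makes this concrete:

```c
/* Clamp the matrix buffer size bA to [4.5 kB, 96 kB] and round it down
 * to a multiple of 384 bytes (= lcm of the 128-byte DMA granularity and
 * the 12 bytes occupied by one matrix element across the three arrays). */
long round_bA(long want)
{
    long bA = (want / 384) * 384;
    if (bA < 4608)  bA = 4608;    /* 4.5 kB lower bound */
    if (bA > 98304) bA = 98304;   /* 96 kB upper bound  */
    return bA;
}
```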
2.3 Summation by PPE
Step 2 and step 6 of subroutine conj_grad are assigned to the PPE, which would otherwise be idle while the SPEs are working.
In step 2, the reduction d=d+p×q is computed; the scalar d depends on the SPMV result q, which is summed by the PPE part by part, as described in the blocking scheme. As a result, the contribution of each partial result of qi to d can be computed as soon as that part is completed. Likewise, step 6, the computation of rho=rho+r×r, can be processed in the same way.
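A sketch of the incremental reduction on the PPE (a hypothetical helper; the same shape serves step 6 with r in place of both p and q):

```c
/* Step 2 on the PPE: accumulate d = p.q one finished group at a time,
 * so the PPE runs in parallel with the SPEs that produce q. */
float dot_group(const float *p, const float *q, int g, int nq, float d)
{
    for (int k = g * nq; k < (g + 1) * nq; k++)
        d += p[k] * q[k];
    return d;
}
```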
After the optimizations above, the optimized parallelization strategy of conj_grad is obtained, as shown in Fig.1(b).
3 Performance evaluation
Experiments are performed on an IBM QS20 server that contains two Cell BE chips at 3.2 GHz and 512 MB of Rambus XDR memory. Fedora Core 7 Linux with the 2.6.22 kernel and Cell SDK 3.0 are installed. Because of the memory limitation, only configurations A and B are considered here. In configuration A, the size of the sparse matrix is 1.4×10⁴×1.4×10⁴, the number of iterations is 15 and the number of non-zero elements in each row is 11. For configuration B, the parameters are 7.5×10⁴×7.5×10⁴, 75 and 13, respectively. When running on a single PPE, the execution time of CG is 25.23 s for size A and 3.160 22×10³ s for size B.
3.1 Comparison of CSR and MM storage format
Table 3 shows the performance comparison between the CSR and MM formats. 'Memory access time' includes the time of importing vector p and sparse matrix A and exporting the result. 'Control time' represents the time consumed by control instructions during execution. 'Computation time' stands for the time of SPE computation. 'SPMV time' is the time of one SPMV procedure.
As shown in Table 3, the data import time with CSR format is about 2/3 of that with MM format, since the memory consumption of CSR is correspondingly about 2/3 of that of MM. The computational complexities of the two formats are basically the same, so they have the same computation time. But the control flow of CSR is more complex, and the more complex the control flow, the more conditional branch instructions are needed. Because of the 18-cycle penalty for a branch misprediction, the control instruction time with CSR format is about 5.35 times that with MM format. Therefore, the MM format is better than the CSR format.
3.2 Performance evaluation with various bA
According to the prioritized storage model and considering size B, 146.5 kB of each LS is assigned to store all elements of the vectors p, q, r and z. The code size is about 20 kB based on experience. At last, 96 kB can be assigned to bA, the buffer for the sparse matrix; bA should be a multiple of 384 bytes and bA≥4.5 kB to achieve the peak bandwidth of DMA transfer.
As shown in Fig.6, the execution time of CG decreases as bA increases. When bA≥4.5 kB, the same performance is obtained as with bA=96 kB. In other words, bA=4.5 kB is enough for the storage model.
Table 3 Performance comparison of CSR and MM storage format
Fig.6 Performance evaluation for various sizes of bA
3.3 Performance evaluation for each scheme
Fig.7 and Fig.8 show the performance evaluation of each optimization scheme.
In Figs.7(a) and (b), the total time of CG is reduced along with the SPMV time by the blocking sparse matrix, changing storage format, 2-buffer and loop-unrolling schemes. Through these four schemes, the SPMV time is reduced from 6.5×10⁻² to 2.81×10⁻³ s for size A and from 1.64 to 1.9×10⁻² s for size B. At the same time, the total time is reduced from 25.23 to 1.160 s and from 3.089 19×10³ to 23.09 s, respectively. The SPMV time cannot be reduced by the last two schemes, but they still improve the performance of CG, because they reduce data transfers and exploit the parallelism between the PPE and SPEs. Finally, with the latter two schemes, the total time of CG is reduced from 1.745 to 1.160 s and from 45.18 to 23.09 s for sizes A and B, respectively.
With the 2-buffer scheme, the time is reduced by about 9.1×10⁻⁴ and 1.02×10⁻² s, respectively, which is basically equal to the sum of the import and export times in Fig.7. With the loop-unrolling scheme, the computation time is reduced to 2.82×10⁻³ and 1.89×10⁻² s for sizes A and B, respectively.
As shown in Fig.8, even though executing on the PPE only, the blocking sparse matrix scheme brings about 1.31- and 2.46-times speedups for sizes A and B, respectively, compared with the non-optimized code, which indicates that the blocking scheme increases the cache hit ratio and reduces memory accesses. The changing storage format scheme brings about 10.07- and 38.48-times speedups, respectively. As a result, 21.75- and 133.79-times speedups are obtained in total, respectively, compared with the non-optimized code.
Fig.7 SPMV time and total time for each scheme: (a) Size A; (b) Size B
Fig.8 Speedup for each scheme: (a) Size A; (b) Size B
4 Conclusions
1) A multi-core version of the CG program is developed for a heterogeneous multi-core processor, with six types of memory optimization schemes. Four of the schemes target the SPMV operation on heterogeneous processors, and the peak bandwidth of memory access on the Cell BE is reached.
2) The influence of control instructions on performance is also analyzed. It is found that a simple computation procedure is more efficient than saving memory accesses at the cost of a complicated control procedure.
3) Loop-unrolling is used to hide LS access latency, and it works efficiently when running scalar code on a SIMD processor.
4) In order to reduce reusable data transfer and take full advantage of memory access bandwidth, a prioritized storage model for reusable data among iterations is built.
5) Compared with single PPE, the speedups of 21.75 and 133.79 times can be achieved for sizes A and B, respectively.
References
[1] PETRINI F, FOSSUM G, FERNANDEZ J, KISTLER M, PERRONE M. Multicore surprise: Lessons learned from optimizing sweep3D on the cell broadband engine [C]// IPDPS. California, 2007: 62.
[2] BADER D, AGARWAL V, MADDURI K. On the design and analysis of irregular algorithms on the cell processor: A case study of list ranking [C]// IPDPS. California, 2007: 76.
[5] CARTER J, HSIEH W, SWANSON M, ZHANG Li-xin. Impulse: Building a smarter memory controller [C]// HPCA. Orlando, 1999: 70-79.
[8] PINAR A, HEATH M T. Improving performance of sparse matrix-vector multiplication [C]// ACM/IEEE SC Conference. Portland, 1999: 30.
[9] STATHIS P, VASSILIADIS S, COTOFANA S. A hierarchical sparse matrix storage format for vector processors [C]// IPDPS. Santa Fe, 2004: 61a.
[10] AZEVEDO E, FAHEY M, MILLS R. Vectorized sparse matrix multiply for compressed row storage format [C]// ICCS. Atlanta, 2005: 99-106.
[16] CHRISTEN M, SCHENK O, BURKHART H. General-purpose sparse matrix building blocks using the NVIDIA CUDA technology platform [C]// GPGPU. Boston, 2007: 38-46.
[17] WIGGERS W, BAKKER V, KOKKELER A, SMIT G. Implementing the conjugate gradient algorithm on multi-core systems [C]// SOC. Tampere, 2007: 1-4.
[18] DOU Yong, DENG Lin. DMA performance analysis and multicore memory optimization for SWIM benchmark on the Cell Processor [C]// ISPA. Sydney, 2008: 170-179.
(Edited by YANG Bing)
Foundation item: Project(2008AA01A201) supported by the National High-tech Research and Development Program of China; Projects(60833004, 60633050) supported by the National Natural Science Foundation of China
Received date: 2010-07-07; Accepted date: 2010-12-27
Corresponding author: DENG Lin, PhD Candidate; Tel: +86-13975113732; E-mail: denglin@nudt.edu.cn