Multi-core optimization for conjugate gradient benchmark on heterogeneous processors
J. Cent. South Univ. Technol. (2011) 18: 490-498
DOI: 10.1007/s11771-011-0722-6
DENG Lin(邓林), DOU Yong(窦勇)
National Laboratory for Parallel and Distributed Processing, National University of Defense Technology,
Changsha 410073, China
© Central South University Press and Springer-Verlag Berlin Heidelberg 2011
Abstract: Developing parallel applications on heterogeneous processors faces the challenge of the 'memory wall', due to the limited capacity of local storage, limited bandwidth and long latency of memory access. Aiming at this problem, a parallelization approach with six memory optimization schemes was proposed for CG, four of the schemes targeting the sparse matrix-vector multiplication (SPMV) operation. Conducted on an IBM QS20, the parallelization approach reaches up to 21 and 133 times speedups with sizes A and B, respectively, compared with a single power processor element. Finally, it is concluded that the peak bandwidth of memory access on the Cell BE can be obtained in SPMV, that simple computation is more efficient on heterogeneous processors, and that loop-unrolling can hide local storage access latency when executing scalar operations on SIMD cores.
Key words: multi-core processor; NAS parallelization; CG; memory optimization
1 Introduction
Multi-core architecture has become a promising approach for high performance computing. The Cell BE is a heterogeneous multi-core processor composed of a power processor element (PPE) and eight synergistic processor elements (SPEs). The theoretical peak performance of the Cell processor reaches 200 GFLOPS (single precision) at a clock frequency of 3.2 GHz, and it supports peak bandwidths of 204.8 GB/s for intra-chip and 25.6 GB/s for outer-chip data transfers. As the host core, the PPE can access main memory directly, runs the operating system and manages the applications. In contrast, each SPE, as a slave core, operates directly only on its local storage (LS) and works as an application accelerator in SIMD mode. The Cell BE has attracted much attention in various research areas, such as application performance optimization, programming models and compilation. In terms of application optimization, PETRINI et al [1] presented several dimensions of parallelism for sweep3D. BADER et al [2] built a complexity model for the Cell BE and optimized the list ranking algorithm. HACKENBERG ported RAxML to the Cell BE by separating the original matrix into 64×64 small matrices and computing all of them on the SPEs. SARANTA et al [3] proposed an independent task scheduling method with GA heuristics on heterogeneous systems.
CG, one of the NAS benchmarks, uses a conjugate gradient method to compute an approximation to the smallest eigenvalue of a large, sparse, symmetric positive definite matrix. The key portion of CG is sparse matrix-vector multiplication (SPMV), which has been studied for several decades. SPMV has been accelerated with field programmable gate arrays (FPGAs) [7], but most of these designs assume that the FPGA memory capacity is large enough to hold the whole vector, which limits the scalability of SPMV. On the other hand, software optimization can accelerate SPMV in two ways. Firstly, some researchers improve the SPMV algorithm with new storage formats and compression technology for efficient memory access. PINAR and HEATH [8] proposed a new data structure to store the dense vector, which consumes more memory and requires rewriting the SPMV algorithm. STATHIS et al [9] presented a new storage format of sparse matrix for vector processors. AZEVEDO et al [10] exploited a time-space tradeoff via compression technology, at a large cost in computing resources for compression and decompression. All the optimization methods above aim at reducing the memory requirement and improving performance by lessening memory accesses. These methods work efficiently on general-purpose processors with transparent caches, but not on heterogeneous processors with explicitly controlled memory, like the Cell BE, where new storage formats and compression technology increase the control complexity. Secondly, SPMV has been optimized on platforms with different system architectures, but most of that work focuses on computational optimization instead of memory optimization. For example, COTOFANA et al [11] optimized SPMV on a vector processor, and clusters or SMP systems were also used for SPMV [12-14]. In the last few years, with multi-core processors coming into vogue, WILLIAMS et al [15], CHRISTEN et al [16] and WIGGERS et al [17] ported SPMV to several multi-core processor platforms.
In this work, a parallelization approach with six memory optimization schemes is proposed for CG, four of which aim at improving the SPMV computation on heterogeneous processors.
2 Optimization schemes for multi-core NAS CG benchmark
The subroutine conj_grad is the most time-consuming part of CG, consuming about 99.6% of the execution time for size A. As a result, the conj_grad subroutine should be given priority in optimization. The flow chart of conj_grad is shown in Fig.1(a), where A is a sparse matrix and p, q, z and r are vectors. The size of A is N×N and the size of all vectors is N.
On the Cell BE, single-precision instructions are fully pipelined, so the matrix element type is converted to single precision.
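For reference, one conj_grad iteration can be sketched as below. This is a minimal reconstruction based on the public NAS CG algorithm, not the authors' code; the step numbering is an assumption chosen to be consistent with the step references used in Section 2.2, and convergence checking is omitted.

```c
/* One conj_grad iteration, reconstructed from the NAS CG algorithm.
 * Step numbers are assumptions matching the references in Section 2.2. */
static void conj_grad_iter(int n, const float *Vval, const int *Vcolidx,
                           const int *Vrowstr, float *p, float *q,
                           float *r, float *z, float *rho)
{
    float d = 0.0f, rho0 = *rho, alpha, beta;
    for (int i = 0; i < n; i++) {                 /* step 1: q = A*p (SPMV) */
        float s = 0.0f;
        for (int k = Vrowstr[i]; k < Vrowstr[i + 1]; k++)
            s += Vval[k] * p[Vcolidx[k]];
        q[i] = s;
    }
    for (int i = 0; i < n; i++) d += p[i] * q[i];       /* step 2: d = p.q  */
    alpha = rho0 / d;                                   /* step 3           */
    for (int i = 0; i < n; i++) z[i] += alpha * p[i];   /* step 4           */
    for (int i = 0; i < n; i++) r[i] -= alpha * q[i];   /* step 5           */
    *rho = 0.0f;
    for (int i = 0; i < n; i++) *rho += r[i] * r[i];    /* step 6: rho = r.r */
    beta = *rho / rho0;                                 /* step 7           */
    for (int i = 0; i < n; i++) p[i] = r[i] + beta * p[i]; /* step 8        */
}
```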
2.1 Parallelizing sparse matrix vector multiplication
2.1.1 Scheme 1: Blocking sparse matrix
Because SPEs can only access their own LS, and data must be placed in LS via DMA transfer before calculation, the sparse matrix A is separated into several blocks to increase the efficiency of DMA transfer. The number of blocks should equal the number of SPEs participating in the calculation. There are two blocking methods, as follows.
1) Row-oriented blocking method
Fig.1 Flow chart of conj_grad routine: (a) Serial version; (b) Parallel version
The row-oriented blocking method separates the sparse matrix by rows. Every SPE is responsible for [N/nspe] rows, where N is the number of rows of A and nspe is the number of participating SPEs. In other words, A is separated into nspe blocks A0, …, Anspe−1, and SPEi (0≤i≤nspe−1) computes qi=Ai×p. As shown in Fig.2(a), qi is obtained after every row of Ai has been calculated. Unfortunately, every SPE needs to access all the elements of vector p, which cannot be held in LS. There are two ways to solve this problem. The first is to import all elements of p for each row calculation; as a result, vector p must be imported [N/nspe] times by each SPE, that is, about nspe×[N/nspe]×4N≈4N² bytes are imported from memory in total. The second is to split vector p into nspe blocks p0, …, pnspe−1, where pi is kept in the LS of SPEi and the blocks are circulated among SPEs via the intra-chip network. In this way, p is imported from memory only once (4N bytes), while in total p is transferred among SPEs (N−[N/nspe]) times, i.e. about 4N²(nspe−1)/nspe bytes. Compared with the first method, the second replaces almost all outer-chip communication with intra-chip communication, which has about 10 times the bandwidth of the former. So, the second method is better.
Both of the methods described above need to transfer p many times, either between the SPEs and memory or among the SPEs, with DMA. Although the second method exploits the faster intra-chip transfer instead of outer-chip transfer, it complicates the control flow.
In order to further reduce the data transfers of p, Ai is separated into m blocks by column, Ai0, …, Ai(m−1), as shown in Fig.2(b), and p is split into m blocks p0, …, pm−1 accordingly. All SPEs can calculate qij=Aij×pj simultaneously, since at any moment they only need the elements of pj. Each pj is imported from memory to the LS of SPE0 and then transferred to the other SPEs via intra-chip transfer. Thus, only 4N bytes are transferred from memory and (nspe−1)×4N bytes among SPEs, which means that the data transfers of p can be reduced from O(N²) to O(N) by rearranging the elements of Ai, and at most 4×[N/nspe] bytes of space are used to store temporary results.
2) Column-oriented blocking method
In contrast, the column-oriented blocking method separates the sparse matrix A by column into A0, …, Anspe−1, as shown in Fig.2(c). Vector p is split into nspe blocks too. SPEi is in charge of computing q′i=Ai×pi, where q′i is the temporary (partial) result of q, i.e. q=Σ_{i=0}^{nspe−1}q′i. Block pi is held in the LS of SPEi during the whole SPE execution, and no further transfers of pi are needed because the calculation of SPEi depends only on pi and Ai. Consequently, the 4N bytes of p are imported from memory only once for a sparse matrix-vector multiplication.
For efficient DMA transfer, the non-zero elements of sparse matrix A must be rearranged, as shown in Fig.2(c). The rearranging procedure increases the execution time of the program, but fortunately it is done only once, because A is unchanged during the whole process. Compared with the best case of the row-oriented blocking method, this method reduces the data transfer of p by (nspe−2)×4N bytes.
After the calculation of the temporary results q′i is finished, the summation q=Σ_{i=0}^{nspe−1}q′i must be executed to get the final result. This job is assigned to the PPE, which would otherwise be idle while the SPEs are working. On the basis of the DMA transfer analysis presented in Ref.[18], nq elements of the temporary result are packed as a group and one group is exported each time for the sake of DMA efficiency, where 32≤nq≤4 096. Once the calculation of nq elements is finished, a group of temporary results is exported, and the SPE notifies the PPE to perform the summation immediately after the export completes. As a result, the PPE and SPEs work in parallel, as sketched below.
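A minimal sketch of this producer-consumer pattern follows. The names q_part, ready and MAX_GROUPS are hypothetical; the real code synchronizes via DMA completion and SPE notification rather than a flag array.

```c
#define MAX_GROUPS 1024                /* hypothetical upper bound */

/* PPE-side summation of the partial results q'_i, one group of nq
 * elements at a time. ready[i][g] is a hypothetical flag standing in
 * for SPEi's notification that group g of q'_i reached main memory. */
void ppe_sum_groups(int nspe, int ngroups, int nq, float *q,
                    float **q_part, volatile int ready[][MAX_GROUPS])
{
    for (int g = 0; g < ngroups; g++)
        for (int i = 0; i < nspe; i++) {
            while (!ready[i][g])
                ;                      /* wait for SPEi to export group g */
            for (int k = g * nq; k < (g + 1) * nq; k++)
                q[k] += q_part[i][k];  /* q = sum of q'_i, group by group */
        }
}
```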
In summary, the column-oriented blocking method is better at reducing DMA transfers and exposes more parallelism.
2.1.2 Scheme 2: Changing storage format of sparse matrix
In the original code, the sparse matrix is stored in compressed sparse row (CSR) format. The CSR storage scheme uses three arrays to describe a sparse matrix: the array Vval holds the values of the non-zero elements of the sparse matrix; the array Vcolidx holds the column indices corresponding to the elements in Vval; the array Vrowstr holds the starting index of each row, i.e. the index of the first non-zero element of each row.
The CSR format in the original code consumes Nnnz×4+Nnnz×4+N×4 bytes of memory, where Nnnz is the number of non-zero elements of the sparse matrix A. On a general-purpose CPU with a transparent cache, this format requires less memory capacity and fewer memory accesses. But it does not work efficiently on a heterogeneous multi-core processor with explicitly controlled memory, because many penalty cycles are caused by mispredicted control instructions.
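As a concrete illustration, the CSR arrays for a small hypothetical 4×4 matrix (not taken from the benchmark) look as follows, using zero-based indices; the Fortran original uses one-based indices.

```c
/*        | 5 0 0 1 |
 *    A = | 0 2 0 0 |
 *        | 0 0 3 0 |
 *        | 4 0 0 6 |
 */
float Vval[]    = {5, 1, 2, 3, 4, 6}; /* non-zero values, row by row     */
int   Vcolidx[] = {0, 3, 1, 2, 0, 3}; /* column index of each value      */
int   Vrowstr[] = {0, 2, 3, 4, 6};    /* index of first non-zero per row */
```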
According to the column-oriented blocking scheme, every SPEi has its own array Vrowstr, and the temporary result is

q′[i]=Σ_{j=Vrowstr[i]}^{Vrowstr[i+1]−1}Vval[j]×p(Vcolidx[j])    (1)
Let rnow=Vrowstr[i] and rnext=Vrowstr[i+1]. Array Vrowstr is separated into [N/bm] blocks brow[i], 0≤i≤[N/bm]−1, where bm is the number of Vrowstr elements transferred to LS via DMA each time. At the same time, Vval and Vcolidx are split into [Nnnz/bvcn] blocks, bval[i] and bcol[i], where 0≤i≤[Nnnz/bvcn]−1 and bvcn is the number of elements of Vval and Vcolidx transferred each time.
The values of rnow and rnext are obtained differently under the following conditions.
1) If i mod bm<bm−1, i.e. Vrowstr[i] is not the last element of its block brow[j], then rnow and rnext are both available in brow[j], as shown in Fig.3(a).
2) If i mod bm=bm−1, in other words, Vrowstr[i] is the last element of brow[j], as shown in Fig.3(b), then the following steps should be done:
(1) read rnow from the last element of the current block brow[j];
(2) import the next block, brow[j+1];
(3) read rnext from the first element of the newly imported block.
The computation of Eq.(1) depends on the values of rnow and rnext:
1) If rnow=rnext, row i contains no non-zero element, so q′[i]=0 and i=i+1.
2) If rnow≠rnext, then:
(1) When [rnow/bvcn]=[rnext/bvcn], as shown in Fig.4(a), Vval[rnow] and Vval[rnext] belong to the same block bval[j], and Vcolidx[rnow] and Vcolidx[rnext] belong to the same block bcol[j], so all the required elements are already in LS and there is

q′[i]=Σ_{t=rnow}^{rnext−1}Vval[t]×p(Vcolidx[t])    (2)
(2) When [rnow/bvcn]<[rnext/bvcn], as shown in Fig.4(b), Vval[rnow] and Vcolidx[rnow] belong to bval[j] and bcol[j], respectively, while Vval[rnext] and Vcolidx[rnext] belong to bval[j+k] and bcol[j+k], respectively. Therefore, bval[j+1], …, bval[j+k] and bcol[j+1], …, bcol[j+k] have to be transferred one after another. As a result, there are several steps for calculating q′[i], accumulating the partial sums of Eq.(2) after each import (see the sketch below):
Import bval[j+1] and bcol[j+1]
…
Import bval[j+k] and bcol[j+k]
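The control flow described above can be summarized in the following sketch, where import_brow() and import_valcol() are hypothetical stand-ins for the DMA transfers. It illustrates how many data-dependent branches are needed just to locate rnow and rnext.

```c
extern void import_brow(int *brow, int jb);               /* DMA: fetch brow[jb]  */
extern void import_valcol(float *bval, int *bcol, int j); /* DMA: bval/bcol[j]    */

/* Blocked-CSR SPMV on the SPE; assumes brow[0] and bval/bcol block 0
 * are already in LS, and qp is the partial result q'. */
void spmv_csr_blocked(int rows, int bm, int bvcn, int *brow,
                      float *bval, int *bcol, const float *p, float *qp)
{
    int j = 0, jb = 0;
    for (int i = 0; i < rows; i++) {
        int ii = i % bm;
        int rnow = brow[ii], rnext;
        if (ii == bm - 1) {                /* condition 2): block boundary */
            import_brow(brow, ++jb);
            rnext = brow[0];
        } else {
            rnext = brow[ii + 1];
        }
        if (rnow == rnext) { qp[i] = 0.0f; continue; }   /* empty row */
        float s = 0.0f;
        for (int t = rnow; t < rnext; t++) {
            if (t / bvcn != j) {           /* crossed into the next block */
                j = t / bvcn;
                import_valcol(bval, bcol, j);
            }
            s += bval[t % bvcn] * p[bcol[t % bvcn]];
        }
        qp[i] = s;
    }
}
```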
Fig.2 Blocking operation chart: (a) Row-oriented; (b) Improved row-oriented; (c) Column-oriented
Fig.3 Diagrams of different values of rnow and rnext according to i: (a) Values of rnow and rnext while i mod bm<bm−1; (b) Values of rnow and rnext while i mod bm=bm−1
Fig.4 Different conditions of computational procedures according to values of rnow and rnext: (a) [rnow/bvcn]=[rnext/bvcn]; (b) [rnow/bvcn]<[rnext/bvcn]
From the discussion above, it can safely be concluded that many control instructions are needed to distinguish the different conditions. In order to lower the control complexity, the MM (coordinate) format is used to store the sparse matrix instead of the CSR format.
In MM format, the row-index array Vrowstr is expanded to Nnnz elements, holding the row index of every non-zero element, and it is split into [Nnnz/bvcn] blocks brow[i], where 0≤i≤[Nnnz/bvcn]−1. As a result, the computation in MM format is
q′(brow[i])=q′(brow[i])+bval[i]×p(bcol[i])    (3)
In short, it is more efficient to compute SPMV on the SPE with MM format, due to its lower control complexity: while the CSR format requires several different computational procedures, the MM format needs only one. Substituting MM format for CSR format is a space-time tradeoff, which costs (4Nnnz−4N) more bytes of memory and DMA transfer.
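By contrast with the blocked-CSR sketch above, the MM-format inner loop, implementing Eq.(3) for one block of elements already residing in LS, needs no branches at all. A minimal sketch, with identifiers following the text:

```c
/* SPE inner loop for MM (coordinate) format, Eq.(3): every non-zero
 * element carries its own row and column index, so one uniform,
 * branch-free statement suffices. */
void spmv_mm_block(int n, const int *brow, const int *bcol,
                   const float *bval, const float *p, float *qp)
{
    for (int t = 0; t < n; t++)
        qp[brow[t]] += bval[t] * p[bcol[t]];
}
```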
2.1.3 Scheme 3: Overlapping computation with data transfer by 2-buffer
Computation can be overlapped with data transfer by the 2-buffer scheme: while one buffer is being computed on, the other is being filled by DMA, so data transfer latency is hidden behind computation. The 2-buffer method is employed both to import Ai and to export q′i during computation. Since the indirect accesses of the SPMV occupy the majority of the execution time, the DMA transfer latency can be hidden by computation.
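A minimal sketch of the 2-buffer pattern on the SPE is shown below, using the mfc_get/tag-status intrinsics from the Cell SDK. The names compute(), ea and CHUNK are hypothetical, and error handling is omitted.

```c
#include <spu_mfcio.h>

#define CHUNK 16384                       /* one DMA transfer, 16 kB */
static volatile float buf[2][CHUNK / sizeof(float)]
    __attribute__((aligned(128)));

extern void compute(volatile float *data, int n);  /* hypothetical kernel */

void stream_and_compute(unsigned long long ea, int nchunks)
{
    int cur = 0;
    mfc_get(buf[cur], ea, CHUNK, cur, 0, 0);       /* prefetch chunk 0 */
    for (int i = 0; i < nchunks; i++) {
        int nxt = cur ^ 1;
        if (i + 1 < nchunks)                       /* start next DMA early */
            mfc_get(buf[nxt], ea + (unsigned long long)(i + 1) * CHUNK,
                    CHUNK, nxt, 0, 0);
        mfc_write_tag_mask(1 << cur);              /* wait for current DMA */
        mfc_read_tag_status_all();
        compute(buf[cur], (int)(CHUNK / sizeof(float))); /* overlaps DMA   */
        cur = nxt;
    }
}
```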
2.1.4 Scheme 4: Overlapping LS access latency with loop-unrolling
The SPE must load data from LS into registers before calculating, and then write the results back to LS. The time of computation is

tcalc=tLS+tregister

where tLS denotes the time of LS accesses by load/store instructions, and tregister denotes the execution time of calculations on registers.
In the SPE, all registers are 128 bits wide and all operations are SIMD. To operate on scalar data, the data must first be rotated into the preferred scalar slot, the left-most word of a register, by a rotate instruction. Similarly, scalar data must be shuffled to its storage location by a shuffle instruction before the write-back operation.
The latency of each instruction is shown in Table 1.
Table 1 Characteristics of scalar operation instructions
Due to the indirect accesses of q and p, SIMD technology cannot be employed for the SPMV computation on the SPE. The SPE can issue an LS access instruction and a register instruction simultaneously when the instructions are data-independent. According to the discussion above, the SPMV computational procedure on the SPE is shown in Fig.5 in an assembly-like language.
The 'Start time' is the earliest time at which an instruction can be issued. Obviously, instructions cannot be issued back-to-back, because of instruction execution latency and data dependence. For example, the instruction 'Rotate brow' cannot be issued until the instruction 'Load brow' has finished, due to the data dependence on brow.
Fig.5 Computational procedure of SPMV on SPE
In Fig.5, it takes about 40 cycles to compute Eq.(3), and there are about 18 stall cycles among LS instructions caused by data dependence, i.e. 9 cycles between 'Load bval' and 'Load q' and 9 cycles between 'Load p' and 'Store q'. In other words, the stalling time is about 45% of the total time. Loop-unrolling is utilized to overlap the stalling time with instructions that belong to other iterations. As a result, by unrolling the loop 4 times, the computation time can be reduced to about 25 cycles on average.
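In C, the effect of the unrolling can be sketched as follows; the real code in Fig.5 is scheduled at the instruction level, and spu-gcc performs the final scheduling.

```c
/* 4-way unrolled MM-format inner loop: the four load pairs are
 * independent, so they can issue while earlier multiply-adds are still
 * in flight, hiding LS access latency. Assumes n is a multiple of 4. */
void spmv_mm_unrolled(int n, const int *brow, const int *bcol,
                      const float *bval, const float *p, float *qp)
{
    for (int t = 0; t < n; t += 4) {
        float v0 = bval[t],     x0 = p[bcol[t]];
        float v1 = bval[t + 1], x1 = p[bcol[t + 1]];
        float v2 = bval[t + 2], x2 = p[bcol[t + 2]];
        float v3 = bval[t + 3], x3 = p[bcol[t + 3]];
        qp[brow[t]]     += v0 * x0;
        qp[brow[t + 1]] += v1 * x1;
        qp[brow[t + 2]] += v2 * x2;
        qp[brow[t + 3]] += v3 * x3;
    }
}
```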
2.2 Prioritized storage model for reusable data
The definition of symbols used in the model is shown in Table 2.
Table 2 Definition of symbols used in model
In a heterogeneous processor, the execution time of a program is composed of calculation time, data transfer time and control instruction execution time:
t=tcalc+tdata+tcontrol (4)
Data transfer can be reduced by storing reusable data in LS. As shown in the flow chart, several data are reused among iterations, such as the vectors p, q, r, z and the sparse matrix A. Because of the limited capacity of LS, these data have to be imported from memory before calculation and exported to memory after calculation via DMA transfer. So, the data transfer time becomes
tdata=tml+tlm (5)
where tml is the time of importing from memory to LS and tlm vice versa.
tml=(Tml/bml)×tdma(bml, nspe)    (6)

where tdma is the time of one DMA transfer of size bml, which is a function of bml and nspe [18]; tlm is modeled in the same way.
Tml is the total size of data to be imported and, similarly, Tlm is the total size of data to be exported. As a result, the total size of data transfer is

T=Σ_{i=1}^{ti}(Tml,i+Tlm,i)    (7)

where Tml,i is the import data size in the i-th iteration, Tlm,i is the export data size, and ti is the number of iterations. It is found that Tml,i differs between the first iteration and the others.
1) In the first iteration, Tml,1 (Eq.(8)) is composed of the following imports: the whole sparse matrix A at step 1; the whole vector p at step 1, plus the transferred part of p at steps 2, 4 and 8; the transferred part of q at steps 2 and 5; the whole vector r at step 5, plus the transferred part of r at steps 6 and 8; and the whole vector z at step 4. Summing these contributions gives Tml,1 in closed form (Eq.(9)).
2) In the other iterations (2≤i≤ti), only the data that cannot stay resident in LS need to be re-imported, which gives Tml,i (Eq.(10)).
According to the blocking scheme described above, q′i, the temporary result of qi, must be exported in every iteration, so Tlm,i is identical for all iterations (Eq.(11)). Substituting Eqs.(8)-(11) into Eq.(7) gives the total transfer size (Eq.(12)).
Suppose ti>>1, and let D′=DA+Dp+Dr+Dz+nspe×Dq; then Eq.(12) simplifies to Eq.(13).
According to Eq.(13), the storage priority for the reusable data is defined as p:q:r:z:A=5:2:4:2:1. The capacity of LS is then allocated to the data with higher storage priority first. There are seven allocation conditions:
(1) If LS is large enough to store all the reusable data, then T=D′: every reusable datum is transferred only once.
(2) If all the vectors fit in LS but the sparse matrix A does not, A has to be transferred in every iteration.
(3) If, in addition, z cannot stay resident, z can share the storage of A, because A and z are not required simultaneously.
(4)-(7) The remaining conditions cover progressively smaller LS capacities: the reusable data are evicted in the reverse order of their storage priority, and each evicted datum has to be transferred in every iteration. A sketch of this allocation policy follows the list.
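The allocation policy can be sketched as a simple greedy pass over the data sorted by priority. This is a hypothetical illustration of the model with made-up per-SPE sizes, not the authors' code, and it ignores the storage sharing between A and z noted in condition (3).

```c
#include <stdio.h>

typedef struct { const char *name; long bytes; int resident; } Reusable;

/* Grant LS capacity in descending storage priority (p, r, q, z, A);
 * whatever does not fit is re-transferred in every iteration. */
long allocate_ls(Reusable d[], int n, long capacity)
{
    long per_iter = 0;
    for (int i = 0; i < n; i++) {
        if (d[i].bytes <= capacity) { d[i].resident = 1; capacity -= d[i].bytes; }
        else                        { d[i].resident = 0; per_iter += d[i].bytes; }
    }
    return per_iter;   /* DMA bytes per iteration for the evicted data */
}

int main(void)
{
    /* hypothetical per-SPE block sizes: 4 vector blocks and the matrix */
    Reusable d[] = { {"p", 37500, 0}, {"r", 37500, 0}, {"q", 37500, 0},
                     {"z", 37500, 0}, {"A", 700000, 0} };
    printf("%ld bytes/iteration\n", allocate_ls(d, 5, 200000)); /* cond.(2) */
    return 0;
}
```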
According to the results of the DMA transfer analysis [18], the size of a DMA transfer should be a multiple of 128 bytes and larger than 2 kB to obtain the peak performance of DMA transfer. Let 16 kB be the upper limit of a DMA transfer; then 2 kB≤s≤16 kB and s=128x are defined for a vector buffer, where x is a positive integer. To employ the 2-buffer scheme, 4 kB≤s≤32 kB and s=128x are defined.
Three arrays are used to store the sparse matrix A, with 12 bytes per element, so the buffer size for A (denoted bA here) should also be a multiple of 12, that is, a multiple of 384 bytes. By considering the DMA transfer characteristics and the 2-buffer scheme, 4.5 kB≤bA≤96 kB and bA=384x are defined.
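The two constraints combine into a multiple of lcm(128, 12)=384 bytes; a small hypothetical helper makes this concrete:

```c
/* Clamp the matrix buffer size bA to [4.5 kB, 96 kB] and round it down
 * to a multiple of 384 bytes (= lcm of the 128-byte DMA granularity and
 * the 12 bytes occupied by one matrix element across the three arrays). */
long round_bA(long want)
{
    long bA = (want / 384) * 384;
    if (bA < 4608)  bA = 4608;    /* 4.5 kB lower bound */
    if (bA > 98304) bA = 98304;   /* 96 kB upper bound  */
    return bA;
}
```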
2.3 Summation by PPE
Step 2 and step 6 of subroutine conj_grad are assigned to the PPE, which would otherwise be idle while the SPEs are working.
In step 2, the reduction d=d+p×q is computed; the scalar d depends on the SPMV result q, which is summed by the PPE part by part, as described in the blocking scheme. As a result, the contribution of each partial result of qi to d can be computed as soon as that part is completed. Likewise, step 6, the computation of rho=rho+r×r, can be processed in the same way.
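A sketch of the incremental reduction on the PPE (a hypothetical helper; the same shape serves step 6 with r in place of both p and q):

```c
/* Step 2 on the PPE: accumulate d = p.q one finished group at a time,
 * so the PPE runs in parallel with the SPEs that produce q. */
float dot_group(const float *p, const float *q, int g, int nq, float d)
{
    for (int k = g * nq; k < (g + 1) * nq; k++)
        d += p[k] * q[k];
    return d;
}
```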
After the optimizations above, the optimized parallelization strategy of conj_grad is obtained, as shown in Fig.1(b).
3 Performance evaluation
Experiments are performed on an IBM QS20 server that contains two Cell BE chips at 3.2 GHz and 512 MB of Rambus XDR memory. Fedora Core 7 Linux with the 2.6.22 kernel and Cell SDK 3.0 are installed. Because of the memory limitation, only configurations A and B are considered here. In configuration A, the size of the sparse matrix is 1.4×10⁴×1.4×10⁴, the number of iterations is 15 and the number of non-zero elements in each row is 11. For configuration B, the parameters are 7.5×10⁴×7.5×10⁴, 75 and 13, respectively. When running on a single PPE, the execution time of CG is 25.23 s for size A and 3.160 22×10³ s for size B.
3.1 Comparison of CSR and MM storage format
Table 3 shows the performance comparison between the CSR and MM formats. 'Memory access time' includes the time of importing vector p and sparse matrix A and exporting the result. 'Control time' represents the time consumed by control instructions during execution. 'Computation time' stands for the time of SPE computation. 'SPMV time' is the time of one SPMV procedure.
As shown in Table 3, the data import time with CSR format is about 2/3 of that with MM format, since the memory consumption of CSR is correspondingly about 2/3 of that of MM. The computational complexities of the two formats are basically the same, so they have the same computation time. But the control flow of CSR is more complex, and the more complex the control flow, the more conditional branch instructions are needed. Because of the 18-cycle penalty for a branch misprediction, the control instruction time with CSR format is about 5.35 times that with MM format. Therefore, the MM format is better than the CSR format.
3.2 Performance evaluation with various bA
According to the prioritized storage model and considering size B, 146.5 kB of each LS is assigned to store all elements of the vectors p, q, r and z. The code size is about 20 kB based on experience. At last, 96 kB can be assigned to bA, the buffer for the sparse matrix; bA should be a multiple of 384 bytes and bA≥4.5 kB to achieve the peak bandwidth of DMA transfer.
As shown in Fig.6, the execution time of CG decreases as bA increases. When bA≥4.5 kB, the same performance is obtained as with bA=96 kB. In other words, bA=4.5 kB is enough for the storage model.
Table 3 Performance comparison of CSR and MM storage format
Fig.6 Performance evaluation for various sizes of bA
3.3 Performance evaluation for each scheme
Fig.7 and Fig.8 show the performance evaluation of each optimization scheme.
In Figs.7(a) and (b), the total time of CG is reduced along with the SPMV time by the blocking sparse matrix, changing storage format, 2-buffer and loop-unrolling schemes. Through these four schemes, the SPMV time is reduced from 6.5×10⁻² to 2.81×10⁻³ s for size A and from 1.64 to 1.9×10⁻² s for size B. At the same time, the total time is reduced from 25.23 to 1.160 s and from 3.089 19×10³ to 23.09 s, respectively. The SPMV time cannot be reduced by the last two schemes, but they still improve the performance of CG, because they reduce data transfers and exploit the parallelism between the PPE and SPEs. Finally, with the latter two schemes, the total time of CG is reduced from 1.745 to 1.160 s and from 45.18 to 23.09 s for sizes A and B, respectively.
With the 2-buffer scheme, the time is reduced by about 9.1×10⁻⁴ and 1.02×10⁻² s, respectively, which is basically equal to the sum of the import and export times in Fig.7. With the loop-unrolling scheme, the computation time is reduced to 2.82×10⁻³ and 1.89×10⁻² s for sizes A and B, respectively.
As shown in Fig.8, even though executing on the PPE only, the blocking sparse matrix scheme brings about 1.31- and 2.46-times speedups for sizes A and B, respectively, compared with the non-optimized code, which indicates that the blocking scheme increases the cache hit ratio and reduces memory accesses. The changing storage format scheme brings about 10.07- and 38.48-times speedups, respectively. As a result, 21.75- and 133.79-times speedups are obtained in total, respectively, compared with the non-optimized code.
Fig.7 SPMV time and total time for each scheme: (a) Size A; (b) Size B
Fig.8 Speedup for each scheme: (a) Size A; (b) Size B
4 Conclusions
1) A multi-core version of the CG program is developed for a heterogeneous multi-core processor, with six types of memory optimization schemes. Four of the schemes target the SPMV operation on heterogeneous processors, and the peak bandwidth of memory access on the Cell BE is reached.
2) The influence of control instructions on performance is also analyzed. It is found that a simple computation procedure is more efficient than saving memory accesses at the cost of a complicated control procedure.
3) Loop-unrolling is used to hide LS access latency, and it works efficiently when running scalar code on a SIMD processor.
4) In order to reduce reusable data transfer and take full advantage of memory access bandwidth, a prioritized storage model for reusable data among iterations is built.
5) Compared with single PPE, the speedups of 21.75 and 133.79 times can be achieved for sizes A and B, respectively.
References
[1] PETRINI F, FOSSUM G, FERNANDEZ J, KISTLER M, PERRONE M. Multicore surprise: Lessons learned from optimizing sweep3D on the cell broadband engine [C]// IPDPS. California, 2007: 62.
[2] BADER D, AGARWAL V, MADDURI K. On the design and analysis of irregular algorithms on the cell processor: A case study of list ranking [C]// IPDPS. California, 2007: 76.
[5] CARTER J, HSIEH W, SWANSON M, ZHANG Li-xin. Impulse: Building a smarter memory controller [C]// HPCA. Orlando, 1999: 70-79.
[8] PINAR A, HEATH M T. Improving performance of sparse matrix-vector multiplication [C]// ACM/IEEE SC Conference. Portland, 1999: 30.
[9] STATHIS P, VASSILIADIS S, COTOFANA S. A hierarchical sparse matrix storage format for vector processors [C]// IPDPS. Santa Fe, 2004: 61a.
[10] AZEVEDO E, FAHEY M, MILLS R. Vectorized sparse matrix multiply for compressed row storage format [C]// ICCS. Atlanta, 2005: 99-106.
[16] CHRISTEN M, SCHENK O, BURKHART H. General-purpose sparse matrix building blocks using the NVIDIA CUDA technology platform [C]// GPGPU. Boston, 2007: 38-46.
[17] WIGGERS W, BAKKER V, KOKKELER A, SMIT G. Implementing the conjugate gradient algorithm on multi-core systems [C]// SOC. Tampere, 2007: 1-4.
[18] DOU Yong, DENG Lin. DMA performance analysis and multicore memory optimization for SWIM benchmark on the Cell Processor [C]// ISPA. Sydney, 2008: 170-179.
(Edited by YANG Bing)
Foundation item: Project(2008AA01A201) supported by the National High-tech Research and Development Program of China; Projects(60833004, 60633050) supported by the National Natural Science Foundation of China
Received date: 2010-07-07; Accepted date: 2010-12-27
Corresponding author: DENG Lin, PhD Candidate; Tel: +86-13975113732; E-mail: denglin@nudt.edu.cn