  • Volume 40, Issue Z1, 2013 Table of Contents
    • An Ultra-low Temperature Coefficient Bandgap Voltage Reference

      2013, 40(Z1):1-5.

      Abstract: To improve the temperature characteristics of a bandgap voltage reference, this paper used Buck's voltage transfer cell, which generates a positive temperature coefficient, to provide high-order curvature compensation of VBE. A cascode structure was used to improve the power supply rejection ratio (PSRR). The circuit was simulated in a 0.5 μm CMOS process. The output voltage of the bandgap reference is 996.72 mV under a 5 V supply, and a temperature coefficient of 1.514 ppm/℃ is achieved over the temperature range from -25 to 125 ℃. The PSRR reaches 59.35 dB, and the average line regulation is 0.4 mV/V when the power supply changes from 2.5 to 5.5 V.
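
      As a rough illustration of what the quoted temperature coefficient means, the sketch below assumes the common box-method definition, TC = (Vmax - Vmin) / (Vnom x (Tmax - Tmin)) x 10^6. The abstract does not say which definition the authors used, so the function names and the derived voltage variation are illustrative assumptions only.

```python
def temp_coefficient_ppm(v_max, v_min, v_nom, t_min, t_max):
    """Box-method temperature coefficient in ppm/degC (assumed definition)."""
    return (v_max - v_min) / (v_nom * (t_max - t_min)) * 1e6


def implied_variation_mv(tc_ppm, v_nom_mv, t_min, t_max):
    """Total output variation (mV) implied by a box-method TC over a range."""
    return tc_ppm * 1e-6 * v_nom_mv * (t_max - t_min)


# Figures quoted in the abstract: 1.514 ppm/degC, 996.72 mV, -25 to 125 degC.
delta_mv = implied_variation_mv(1.514, 996.72, -25, 125)
print(f"implied output variation: {delta_mv:.4f} mV")
```

      Under that assumption, the reported figures correspond to a total output variation of roughly a quarter of a millivolt across the 150 ℃ range.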

    • A Method of Controlling Chip Layout Density Based on Grid Division

      2013, 40(Z1):6-11.

      Abstract: In physical design, local high density always results in placement and routing congestion problems. This paper presented a method of controlling chip density based on grid division to alleviate the drawbacks of the congestion-driven optimizations performed by EDA tools. Using Synopsys IC Compiler as the main experimental tool, the design block was divided into several grids, and the layout-density information within each grid was analyzed to control and optimize potential congestion areas. The design timing was also improved. The effectiveness and feasibility of this strategy have been verified with actual project examples.
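
      The grid-division idea can be sketched independently of any EDA tool. The example below is a minimal, tool-agnostic illustration and not the authors' IC Compiler flow: the rectangle format `(x, y, w, h)`, the function names, and the density threshold are all assumptions made for the sketch.

```python
def grid_density(cells, block_w, block_h, nx, ny):
    """Return an ny-by-nx matrix of cell-area utilization per grid.

    cells: list of (x, y, w, h) placed rectangles (hypothetical format).
    """
    gw, gh = block_w / nx, block_h / ny
    area = [[0.0] * nx for _ in range(ny)]
    for x, y, w, h in cells:
        for j in range(ny):
            for i in range(nx):
                # Overlap of the cell with grid (i, j) along each axis.
                ox = max(0.0, min(x + w, (i + 1) * gw) - max(x, i * gw))
                oy = max(0.0, min(y + h, (j + 1) * gh) - max(y, j * gh))
                area[j][i] += ox * oy
    return [[a / (gw * gh) for a in row] for row in area]


def hot_grids(density, threshold):
    """List (column, row) indices of grids whose density exceeds threshold."""
    return [(i, j)
            for j, row in enumerate(density)
            for i, d in enumerate(row)
            if d > threshold]


# A 10x10 block split into 2x2 grids, with two placed cells.
density = grid_density([(0, 0, 5, 4), (4, 4, 2, 2)], 10, 10, 2, 2)
print(hot_grids(density, 0.8))
```

      Grids flagged this way would be the candidates for density constraints or cell spreading before routing.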

    • Customizing and Application of RC Corner

      2013, 40(Z1):12-17.

      Abstract: This paper designed a flow and method for customizing RC corners, customized a new RC corner, and estimated the coverage of the custom corner. The results show that the coverage of the other corners by the custom RC corner can reach 99%. Finally, the traditional MMMC analysis flow was improved with the custom RC corner. An engineering application case shows that tool runtime was reduced greatly at the cost of buffer count and buffer area: after timing closure, the buffer count increased by 22.07% and the buffer area increased by 21.65%, whereas tool runtime decreased by 84%.

    • A FPGA Design and Implementation of Low-complexity Decoder for LDPC Code

      2013, 40(Z1):18-22.

      Abstract: Taking advantage of the good approximation performance of Chebyshev polynomials, this paper proposed a BP algorithm based on Chebyshev polynomial fitting. The method transforms the complicated exponential formula into a polynomial, which reduces the consumption of memory resources. In addition, a Chebyshev structure using shift operations was proposed to reduce the complexity introduced by multipliers, and a semi-parallel architecture with a pipelined design was proposed to reduce the complexity of the BP decoder. The experimental results show that this structure effectively reduces hardware resource usage.

    • Research on Reconfigurable Clustered Architecture Model and Task Mapping Targeted at Block Cipher Processing

      2013, 40(Z1):23-29.

      Abstract: A reconfigurable clustered architecture model named RCCPA and a task mapping method were proposed. Based on a task ready list, the technique avoids the deadlock problem in task division. By exploiting the parallelism between blocks and the pipeline characteristics within a block of block ciphers, the unit utilization and cryptographic processing performance of the RCCPA architecture were improved. Cryptographic algorithms such as AES, DES, and IDEA were mapped onto the RCCPA architecture with an automated mapping method, and the results show that the proposed model and method effectively improve the processing performance of block cipher algorithms.

    • A Capacity Sharing Mechanism Based on Fine-grained Pseudo-partitioning between Private Caches for Chip Multiprocessors

      2013, 40(Z1):30-36.

      Abstract: A cache capacity sharing mechanism based on fine-grained pseudo-partitioning (CSFP) was proposed to address the capacity miss problem faced by private caches in chip multiprocessors (CMP). Each cache bank was equipped with a weighted saturation counter array, designed to collect and predict the diversity of memory demand among different threads at a fine granularity. The private region and shared region of each cache set were adjusted adaptively, and the partition decision was used not only to guide the replacement of the victim block, but also to control the cooperation of spilling and receiving dynamically. An intelligent capacity sharing mechanism was adopted to correct the memory imbalance between different cores, which effectively mitigated the capacity misses in CMP private caches. Experimental results based on a cycle-accurate architecture simulator show that the CSFP mechanism can significantly reduce the capacity misses of private caches in CMP, so the average memory access latency of different programs can be reduced to some extent.

    • The Design and Optimization of BSU in YHFT-DX

      2013, 40(Z1):37-43.

      Abstract: Based on the features and performance requirements of the BSU (Branch & Shifting arithmetic logic Unit) in YHFT-DX, a new structure partition and an implementation strategy were proposed, and the critical path and the corresponding design method were determined. The arithmetic operation module and the shift operation module, both of which have tight timing, were designed and optimized with a manual semi-custom design method. Timing verification and analysis show that the timing is improved by 6.86%, the area is decreased by 10.64%, and the 1.0 GHz frequency is achieved.

    • Design and Implementation of Multi-thread Processor's Instruction Dual-issue Structure

      2013, 40(Z1):44-50.

      Abstract: Single-thread performance is an important element in processor design. In this paper, the multi-thread dual-issue structure in T2 was modified to support single-thread dual-issue, in order to improve the performance of a single thread. The results show that the designed structure achieves the expected functions and is able to improve single-thread performance.

    • Key Techniques of Design and Simulation of Cache Access Time in Deep Sub-micron and 3-Dimension Era

      2013, 40(Z1):51-60.

      Abstract: This paper studied the key techniques of designing and simulating cache access time in the deep sub-micron and 3-dimension era, and simulated caches with different capacities, associativities, and storage technologies. The results show that, in 40 nm technology, the interconnect network is a main source of the access time (up to 61.1%), and the tag comparator can affect the cache access time by about 9.5%. This paper improved the existing cache access time model, in which the tag comparator receives insufficient attention. Given the trend toward larger last-level cache (L3C) capacity in multi-core processors, the power and area advantages of eDRAM make it more attractive. The simulation shows that, for L3C with large capacities (1MB, 4MB and larger than 16MB), the access time of the eDRAM cache is lower than that of the SRAM cache by 8.1% (1 MB) to 53.5% (512 MB), supporting the view that eDRAM is a better choice for the LLC in future 3D multi-core processors.

    • A Fast and Hierarchical Early Z-Test for 3D Graphics Processors

      2013, 40(Z1):61-67.

      Abstract: A Fast and Hierarchical Early Z-Test (FH-EZT) was proposed to reject pixels that do not need to be drawn as early as possible, at both the tile level and the pixel level, by combining the Z_max and Z_min algorithms. Redundant per-pixel operations, including Z reads/writes, color reads/writes, and texture reads, were avoided efficiently to decrease the rendering time. A shared tile cache (TileZcache) with a high hit rate cuts down the testing cycles, and the tile values can be updated dynamically at low cost. Experiments show that the proposed algorithm can reduce rendering cycles by 12.5% to 25.6% for each randomly tested frame and improve the ratio of bandwidth reduction to storage area per pixel by 4% to 43.8%, which makes it suitable for embedded 3D engines.
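
      The general idea of a tile-level test that combines Z_max and Z_min can be sketched as follows. This is a generic hierarchical Z illustration, not the FH-EZT design itself: it assumes a "less than" depth test (smaller Z is closer), recomputes the tile bounds on each call where hardware would cache them, and uses hypothetical function names.

```python
def classify_tile(prim_zmin, prim_zmax, tile_zmin, tile_zmax):
    """Tile-level decision, assuming smaller depth means closer to the viewer."""
    if prim_zmin >= tile_zmax:
        return "reject"      # every fragment is behind everything stored
    if prim_zmax < tile_zmin:
        return "accept"      # every fragment is in front of everything stored
    return "per_pixel"       # ranges overlap: fall back to per-pixel tests


def early_z_tile(tile_depths, frag_depths):
    """Apply the tile test, then per-pixel tests only if needed.

    tile_depths: list of stored depths for one tile (updated in place).
    frag_depths: dict mapping pixel index -> incoming fragment depth.
    Returns (verdict, list of pixel indices that were written).
    """
    verdict = classify_tile(min(frag_depths.values()), max(frag_depths.values()),
                            min(tile_depths), max(tile_depths))
    if verdict == "reject":
        return verdict, []   # no Z, color, or texture access for this tile
    written = []
    for idx, z in frag_depths.items():
        if verdict == "accept" or z < tile_depths[idx]:
            tile_depths[idx] = z
            written.append(idx)
    return verdict, written
```

      For example, with stored depths `[0.5, 0.6, 0.7, 0.8]`, incoming fragments at depths 0.9 and 0.95 are rejected at the tile level without reading any per-pixel depth.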

    • Structure and Method for Hardware Acceleration of Variable Data Set Management

      2013, 40(Z1):68-73.

      Abstract: A general hardware structure was proposed to accelerate variable data set management. It was designed to accept instructions flexibly and to perform both the commonly used functions and some more complicated functions of the linked-list data structure. The structure can access data through both pointer and address mechanisms. To fully utilize the limited memory resources, a memory recycling scheme was proposed to reuse the memory space of deleted data. Experimental results on an FPGA show that the proposal accelerates variable data set management while using few hardware resources and consuming little power. Compared with a software linked-list structure on a PC, the FPGA implementation achieved high speedups.
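
      A software analogue of a linked list kept in fixed memory with recycling of deleted entries is sketched below. It only illustrates the general free-list technique; the class name, the use of -1 as a null address, and the requirement that the capacity be at least 1 are assumptions, and the paper's hardware structure may differ.

```python
class PooledLinkedList:
    """Singly linked list stored in a fixed-size pool (capacity >= 1 assumed).

    Freed slots are chained on a free list so their addresses are reused.
    """

    def __init__(self, capacity):
        self.data = [None] * capacity
        # Initially every slot is free and points to the next slot.
        self.next = list(range(1, capacity)) + [-1]
        self.free_head = 0
        self.head = -1
        self.tail = -1

    def _alloc(self):
        if self.free_head == -1:
            raise MemoryError("pool exhausted")
        addr = self.free_head
        self.free_head = self.next[addr]
        return addr

    def append(self, value):
        """Store value at the tail and return the slot address used."""
        addr = self._alloc()
        self.data[addr] = value
        self.next[addr] = -1
        if self.tail == -1:
            self.head = addr
        else:
            self.next[self.tail] = addr
        self.tail = addr
        return addr

    def delete(self, value):
        """Unlink the first node holding value; return its address or -1."""
        prev, cur = -1, self.head
        while cur != -1:
            if self.data[cur] == value:
                nxt = self.next[cur]
                if prev == -1:
                    self.head = nxt
                else:
                    self.next[prev] = nxt
                if cur == self.tail:
                    self.tail = prev
                # Recycle the slot by pushing it onto the free list.
                self.data[cur] = None
                self.next[cur] = self.free_head
                self.free_head = cur
                return cur
            prev, cur = cur, self.next[cur]
        return -1

    def to_list(self):
        out, cur = [], self.head
        while cur != -1:
            out.append(self.data[cur])
            cur = self.next[cur]
        return out
```

      Returning the slot address from `append` mirrors the idea that data can be reached by address as well as by following pointers.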

    • Analysis on Coupling Characteristic in On-chip Resonant Clock Array

      2013, 40(Z1):74-78.

      Abstract: This paper proposed a coupled clocking array architecture based on a hierarchical bufferless resonant clock, which can distribute the global clock signal effectively and lock frequency and phase among local clock networks. Based on the theory of coupled oscillators, the voltage amplitude, frequency locking, and network bandwidth characteristics of the coupling network were analyzed. Using SPICE simulation, the key factors influencing the coupling characteristics of the resonant clocking array were studied, including clock load difference, the energy compensating cell, and the coupling network. Simulation results show that the resonant clocking array has a wide frequency locking range and that, when the coupling characteristics change, the maximum clock skew in the global clocking network is 21 ps, less than 2% of the whole clock cycle.

    • Straight-forwarding Route Preconfiguration Mechanism for Latency Optimization in NoCs

      2013, 40(Z1):79-87.

      Abstract: We proposed a straight-forwarding route pre-configuration (SFRP) router architecture that exploits the spatial locality of communication when packets traverse the network under dimension-ordered routing, in order to optimize the latency of straight-forwarding packet traversal. In the SFRP router, a straight-forwarding route is preconfigured at each input port, connecting the input port with its corresponding straight-forwarding output port. By combining appropriate route reuse with a termination mechanism, subsequent packets that satisfy the comparison conditions are expected to be forwarded directly to the crossbar without the SA stage, reducing the average latency of packet traversal. Our evaluation with synthetic workload traffic shows that, before the packet injection rate saturates, the SFRP router improves performance by up to 59%, 46%, 25.6% and 9.5% compared with the BASE, BASE_LR, BASE_LR_SPC and PSEUDO_CIRCUIT routers, respectively. Under real application traffic workloads, the performance improvement of the SFRP router is similar to that of the PSEUDO_CIRCUIT router. Compared with the other three kinds of routers, the SFRP router improves performance by up to 57%, 45% and 21%, respectively.

    • A 2 GHz Network-on-chip Communication Unit for Multi-core Microprocessors

      2013, 40(Z1):88-95.

      Abstract: A 2 GHz network-on-chip communication unit for multi-core microprocessors was proposed. The unit reaches a frequency of 2 GHz in a 45 nm process technology with a two-stage pipeline. It has eight bidirectional communication ports in total, and the peak bandwidth of each port is 32 GBps. A network-on-chip test environment supporting 16 high-performance processor cores was built. The test results show that a network-on-chip constructed from the proposed communication unit can meet the network bandwidth requirements of the storage system of a 16-core processor. When memory access is optimized, the aggregate bandwidth increases linearly as the number of processors and threads increases. In addition, the communication unit is reusable and, once optimized and expanded, can continue to be used for the network-on-chip in future many-core processors. The idea of this paper has been used successfully in a self-designed 16-core high-performance microprocessor, in which the network-on-chip logic has reached a frequency of 2 GHz.

    • An Unblocking Descriptor Injection Method Based on Multi-VP Shared Buffer

      2013, 40(Z1):96-104.

      Abstract: This paper proposed a novel multi-VP shared buffer named DAMQ-PD, with mixed PIO and DMA, for descriptor injection on a NIC chip, in order to decrease the memory and area requirements of statically allocating buffers among multiple virtual ports (VP). An address queue was used to record the address of every data item in the shared buffer. By combining each VP's head pointer and tail pointer, each VP's data can be linked in the shared buffer according to the input sequence, which enables pipelined reading and writing of the shared buffer. A heuristic credit management method was also proposed to distribute credits according to the needs of each VP. It can automatically switch the descriptor injection method from PIO to DMA or vice versa, thus avoiding blocking of the user process when no credit is available. Analyses and simulations show that DAMQ-PD achieves high buffer utilization, pipelined write and read, and a high message issue rate, thus satisfying the low-latency and large-capacity performance requirements of descriptor injection on a NIC.

    • A Multi-VC Shared Prefetch Structure for Input-buffered Switch

      2013, 40(Z1):105-111.

      Abstract: At present, the read latency of the input buffer in switches is increasing, which greatly decreases the throughput of the crossbar. To address this issue, a multi-VC shared prefetch structure (SPB) was proposed to hide the read latency of a data buffer implemented with SRAM with registered output. Several critical functions of the SPB were designed, such as bypass write control, management of write and read addresses, and prefetch control. Moreover, the SPB structure was implemented in Verilog, and its performance was tested with a cycle-accurate simulator. The simulation results and analysis show that an input buffer with the SPB structure not only decreases the read and write latency but also increases the throughput of the input buffer. The proposed SPB structure can be used in combination with either a SAMQ or a DAMQ buffer to speed up the read and write operations of the buffer and further improve the throughput of the whole switch.

    • Dynamic Injecting Strategy for Congestion Avoidance for NoC

      2013, 40(Z1):112-116.

      Abstract: Network-on-chip congestion greatly limits the effective performance of the router and directly affects the performance of the processor chip. This paper classified the network's congestion state using different thresholds and divided congestion avoidance into two distinct phases: congestion prevention and congestion relieving. A dynamic injection strategy was then proposed. According to the real-time congestion state, the packet injection rate is adjusted dynamically and communication traffic is kept at a reasonable level to reduce load pressure, avoiding congestion effectively. The simulation results show that the performance of congestion prevention is almost at the "Cliff" point, while that of congestion relieving is almost at the "Knee" point; the injection rate can reach 0.05, and network performance is traded off against congestion avoidance.
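
      A two-threshold injection controller of this general kind might look like the sketch below. Everything specific in it is an assumption for illustration: the occupancy metric, the threshold values, the step size, the halving rule, and the use of the 0.05 figure from the abstract as an upper bound. The paper's actual policy is not described in the abstract.

```python
def congestion_state(occupancy, prevent_th, relieve_th):
    """Map a buffer occupancy in [0, 1] to one of three states."""
    if occupancy >= relieve_th:
        return "relieving"
    if occupancy >= prevent_th:
        return "prevention"
    return "normal"


def next_injection_rate(rate, occupancy, prevent_th=0.5, relieve_th=0.8,
                        step=0.005, max_rate=0.05):
    """Adjust the packet injection rate from the current congestion state.

    All default parameter values are hypothetical.
    """
    state = congestion_state(occupancy, prevent_th, relieve_th)
    if state == "relieving":
        new_rate = rate / 2          # back off sharply to drain the network
    elif state == "prevention":
        new_rate = rate - step       # ease off before congestion sets in
    else:
        new_rate = rate + step       # probe for more throughput
    return min(max_rate, max(0.0, new_rate))
```

      Calling `next_injection_rate` once per monitoring interval with the latest occupancy gives a simple closed-loop throttle.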

    • Research on FC-Switch and Its Routing Algorithm in TH-1A Interconnect Network

      2013, 40(Z1):117-124.

      Abstract: This paper proposed FC-Switch, a novel combined switch, defined its switch-level connection pattern, and gave a preliminary analysis of its performance. Moreover, four routing algorithms for the FC-Switch were discussed, and experiments were carried out on the TH-1A network testing platform. The experimental results show that the FC-Switch can achieve good performance when the switch-level connection pattern and the routing algorithm are chosen correctly.

    • IP Lookup Architecture and Algorithm Based on Distributed Storage and Forwarding of Routing Table

      2013, 40(Z1):125-129.

      Abstract: This paper proposed a parallel multi-level pipelined IP lookup architecture based on distributed storage of the routing table. It consists of multiple stages of lower-speed nodes, called FSN, that perform IP lookups and switching independently. An IPv6 binary lookup algorithm based on prefix scope, called PSB-BS (prefix scope based binary search), was proposed to put IPv6 longest prefix matching into practice. The IPv6 route table was partitioned into multiple levels, each representing a specified range of prefixes. By performing binary search over these subtrie levels, and especially by constructing an asymmetric binary tree, the solution implements distributed storage of the forwarding table, thus reducing both the storage overhead and the complexity of IP lookup. The experimental results demonstrate that the PSB-BS algorithm reduces the storage and memory access overhead considerably compared with the tree bitmap algorithm widely used in Cisco commercial routers.

    • Research on Partial-connected Crossbar for Full-distributed VLIW Architecture

      2013, 40(Z1):130-135.

      Abstract: With the development of VLSI technology, the communication overhead among functional units in the full-connected network of a full-distributed VLIW has become a bottleneck restricting increases in processor frequency and scale. Based on an analysis of application characteristics and five defined communication models, a variety of partial-connected architectures for full-distributed VLIW were presented. The differences between partial-connected and full-connected architectures were analyzed, and the related compilation adjustments were made, especially in functional unit assignment and communication scheduling. Model analysis and experimental data show that, compared with the full-connected architecture, the partial-connected architecture can substantially reduce area, power consumption, and resource overhead and gain considerable scalability, while program performance is slightly lower.

    • NIC Based Hardware Offload of MPI Barrier for Exascale Super Computer

      2013, 40(Z1):136-141.

      Abstract: Barrier synchronization is an important communication pattern for high-performance supercomputers. This paper proposed a new NIC-based barrier communication offload method. The new method improves the traditional dissemination barrier algorithm to support parallel sending and receiving of barrier messages, which greatly reduces the communication delay. Based on the new barrier algorithm, this paper proposed a new descriptor-based hardware-software interface and its hardware implementation. Compared with the traditional barrier implementation, the performance was greatly improved.
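
      For reference, the traditional dissemination barrier that the paper builds on can be sketched as below: in round k, process i signals process (i + 2^k) mod N, and the barrier completes after ceil(log2 N) rounds. This shows only the classic algorithm, simulated in software; the paper's improved, NIC-offloaded variant is not reproduced here, and the function names are illustrative.

```python
def dissemination_schedule(n):
    """Return, per round, the list of (sender, receiver) pairs for n processes."""
    rounds = (n - 1).bit_length()    # equals ceil(log2(n)) for n >= 1
    return [[(i, (i + (1 << k)) % n) for i in range(n)]
            for k in range(rounds)]


def all_synchronized(n):
    """Check that after all rounds every process has heard from every other.

    known[i] is the set of processes whose arrival process i has learned of,
    directly or transitively.
    """
    known = [{i} for i in range(n)]
    for pairs in dissemination_schedule(n):
        snapshot = [set(s) for s in known]   # messages carry start-of-round state
        for src, dst in pairs:
            known[dst] |= snapshot[src]
    return all(len(s) == n for s in known)


print(len(dissemination_schedule(16)), all_synchronized(16))
```

      Within a round, every process sends one message and receives one, which is the kind of pattern that can be handled in parallel by the network interface.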

    • An Effective Memory System Verification Method Based on ASIC Emulation Acceleration System

      2013, 40(Z1):142-147.

      Abstract: With the development of microprocessors, emulation-accelerator-based verification has become the most effective system verification method, and system frequency is one of the most important metrics of an emulation acceleration system. Based on the engineering application of system verification of a homemade high-performance microprocessor, FT-X, on an ASIC emulator, research was carried out by tuning the compile parameters with the help of compile result analysis. The acceleration mechanism of the ASIC accelerator was analyzed. Then, the effects of the domain number, the mapping method for normal register designs, and the mapping method for special register designs on the system emulation frequency were studied. The results show that it is not a good idea to increase the domain number as much as possible, because there is a sound range of memory size when the design under test is fixed. In addition, the system emulation frequency was increased sharply by applying the forcible mapping method to some special blocks.

    • A Peak Performance Model for Matrix Multiplication on General-Purpose DSP

      2013, 40(Z1):148-152.

      Abstract: DSP processors, which offer high computing performance and low power consumption, can be used to solve high-performance computing problems. Matrix multiplication is the kernel of many scientific and engineering computations, so it is of both theoretical and practical importance. Based on a general-purpose DSP (GPDSP), a new parallel algorithm for matrix multiplication was proposed, and a peak performance model for matrix multiplication was built. From the peak performance model, a GPDSP architecture was set up, and the parameters of a Tflops-class GPDSP were given, including the number of pipelines, the number of SIMD registers, and the bandwidth and latency of the hierarchical memories.

    • Empirical Analysis on Human Behavior Dynamics in Online Forum

      2013, 40(Z1):153-160.

      Abstract: This paper reported an empirical analysis of user behavior dynamics in an online forum. The analysis of posts shows that both the distribution of the number of replies per post and the distribution of the number of distinct users per post follow power-law distributions with heavy tails, while the distribution of the number of views per post follows no clear law. We observed a positive correlation between the number of views and the number of replies of posts, and their ratio is greater than 10. The statistics of user actions show that the number of posts, the number of replies, and the number of root posts to which a user has replied all follow power-law distributions. This means that user behavior in online forums is heterogeneous: most users post or reply rarely while a few users post or reply frequently, and most users have a small range of concerns while a few users have a large range of concerns. We also observed that both the distribution of the number of replies per user per day and the distribution of the number of replies per user per post follow power-law distributions, which means that some people submit a large number of replies on a few days or on a few posts. The findings of this paper may not only guide online user behavior modeling but may also be applied to online public opinion monitoring and to detecting the online "water army".

    • The Multiplicative Model in Time Series and GARCH Error Amending Model and Its Application

      2013, 40(Z1):161-164.

      Abstract: ANN and SVM forecasting models need large sample data, and the traditional time series forecasting model cannot fit the peak load sufficiently because of random factors. To overcome these shortcomings, this paper applied the seasonal multiplicative time series model to forecast the monthly peak load of a region and adopted the GARCH model to correct the forecasting error. The application of the proposed model to a regional power grid shows that the forecast is precise, with an error rate of only 2%. Compared with the unmodified model, the new model's error rate decreased by 0.5%.
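
      The seasonal multiplicative part of such a model can be illustrated with a minimal sketch in which each observation is treated as level x seasonal index. The estimation choices here (ratio-to-annual-mean indices, a linear extrapolation of annual means, and a series length that is a whole number of periods) are simplifying assumptions, and the GARCH error correction described in the abstract is deliberately left out.

```python
def split_periods(series, period):
    """Split a series whose length is a multiple of period into full periods."""
    return [series[i:i + period] for i in range(0, len(series), period)]


def seasonal_indices(series, period=12):
    """Average ratio of each position's value to its period mean."""
    years = split_periods(series, period)
    ratios = [[v / (sum(y) / period) for v in y] for y in years]
    return [sum(r[m] for r in ratios) / len(ratios) for m in range(period)]


def forecast_next_period(series, period=12):
    """Forecast one full period as extrapolated level times seasonal index."""
    means = [sum(y) / period for y in split_periods(series, period)]
    # Linear growth of the annual mean (zero if only one period is available).
    growth = (means[-1] - means[0]) / (len(means) - 1) if len(means) > 1 else 0.0
    level = means[-1] + growth
    return [level * s for s in seasonal_indices(series, period)]


# Toy data with a period of 4: same seasonal shape, level rising from 100 to 110.
history = [80, 100, 120, 100, 88, 110, 132, 110]
print([round(v, 2) for v in forecast_next_period(history, period=4)])
```

      With monthly peak-load data, `period=12` would be used, and the residuals of this forecast would be the input to whatever error-correction model is applied afterwards.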
