
Stream processing

From Wikipedia, the free encyclopedia

Stream processing is a new parallel computing (parallel processing) paradigm which, compared to conventional architectures, delivers roughly 20 times the performance for the same power consumption and die size.

In this paradigm, given an input data set (the input stream), a series of operations (a kernel) is defined and applied to each element of the stream. While it is theoretically possible to have multiple kernels, in the uniform streaming approach a single kernel is applied to every element of the stream. Under uniform streaming, and with SIMD in particular, chaining streams becomes simple and large performance gains are achieved. Moreover, the simple programming model allows development in C (no assembler required) while still reaching optimal performance on the hardware.

Stream processing is essentially a compromise: by sacrificing the flexibility of the model, faster execution becomes easy to obtain. Depending on the situation, a processor design leans toward maximum efficiency or trades some of it away for flexibility.


Comparison to conventional parallel paradigms

Early computers started from a sequential execution model. Early CPUs were based on SISD, performing one operation at a time. As computing needs grew, the amount of data to be handled increased very quickly, and it became clear that the sequential programming model could not keep up with the demand for processing power. Considerable effort went into finding alternative ways to perform massive amounts of computation, and the only solution was to exploit some level of parallel execution. One result was SIMD, a programming paradigm in which a single instruction is applied to multiple pieces of data. Most of the time, SIMD was used in a SWAR (SIMD Within A Register) environment. Later, by adopting more complex structures, MIMD parallelism became available as well.

Although these two paradigms worked reasonably well, real-world implementations suffered from a variety of limitations, ranging from memory-alignment problems to synchronization issues and limited parallelism. Few SIMD processors survived as stand-alone components; most ended up embedded inside standard CPUs.

Consider a simple program that adds two arrays, each containing 100 four-component vectors (i.e., 400 numbers per array in total).

Conventional, sequential paradigm

for(int i = 0; i < 100 * 4; i++)
    result[i] = source0[i] + source1[i];

This is the elementary approach most computer-science students would come up with. Variations exist (such as inner loops or the use of structures), but they all boil down to the code above.

Parallel SIMD paradigm, packed registers (SWAR)

for(int el = 0; el < 100; el++) // for each vector
    vector_sum(result[el], source0[el], source1[el]);

This example is oversimplified, and it assumes the existence of a vector_sum instruction. Although the actual operation is encapsulated in the instruction, details such as the number of vector components and their data format are not spelled out; this is what makes the notation clearer.
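
As an illustration only: the vector_sum above is a fictional primitive, but a minimal sketch of one, assuming SSE and four-float elements, could look like this:

// A minimal sketch of a vector_sum built from SSE intrinsics; this is an
// illustrative assumption, not the fictional instruction used above.
#include <xmmintrin.h>

static inline void vector_sum(float result[4], const float a[4], const float b[4])
{
    __m128 va = _mm_loadu_ps(a);               // load four packed floats
    __m128 vb = _mm_loadu_ps(b);
    _mm_storeu_ps(result, _mm_add_ps(va, vb)); // one instruction adds all four
}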

One can see how the number of instruction decodes drops from numElements * componentsPerElement to numElements (in this example, from 400 to 100). Many jump instructions are eliminated as well. On top of that, executing the four arithmetic operations in parallel brings a large speedup.

In this paradigm, however, the data is stored packed in SIMD registers, so no further parallelism can be extracted: the speedup remains bounded by the assumption of four parallel operations (this is common to both AltiVec and SSE).

Parallel Stream paradigm (SIMD/MIMD)

// This is a fictional language for demonstration purposes.
streamElements 100
streamElementFormat 4 numbers
elementKernel "@arg0+@arg1"
result = kernel(source0, source1)

Applying the kernel to the sources yields the result. The language becomes simpler, while the kernel becomes more complex: fully unrolling the loop would require a kernel spanning 100 groups of 4 ALUs.
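
As a rough, sequential C analogue of the fictional program above (names are illustrative; on real stream hardware the per-element kernel calls are dispatched in parallel):

// A sequential C sketch of the stream program: the kernel is an
// element-wise function, and the loop stands in for parallel dispatch.
typedef struct { float v[4]; } element_t;

static element_t kernel(element_t a, element_t b)  // elementKernel "@arg0+@arg1"
{
    element_t r;
    for (int c = 0; c < 4; c++)
        r.v[c] = a.v[c] + b.v[c];
    return r;
}

static void run_stream(element_t *result, const element_t *source0,
                       const element_t *source1, int n)
{
    for (int i = 0; i < n; i++)   // conceptually one parallel dispatch
        result[i] = kernel(source0[i], source1[i]);
}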

This SIMD/MIMD model is highly flexible, but stream processors impose some restrictions on kernels or on stream sizes. For example, on consumer hardware, high-precision arithmetic and complex indirection are limited.

Stream processing considerations

As of this writing (September 12, 2005), available documentation on stream processing is very scarce, and only a handful of institutions have understood the potential power of the model. Stanford University has historically been involved in a variety of related projects, beginning with the Stanford Shading Language and continuing with Imagine, a flexible stand-alone stream processor. Those projects revealed just how much potential the paradigm has, and a much larger-scale project, Merrimac, was started. Merrimac is a stream-based supercomputer and is still under active research. AT&T also recognized the wide adoption of stream-enhanced processors, as GPUs rapidly evolved in both speed and functionality[1].

Data dependencies and parallelism

A great advantage of the stream programming model lies in the kernel defining independent and local data usage.

Kernel operations define the basic data unit, both as input and output. This allows the hardware to better allocate resources and schedule global I/O. Although usually not exposed in the programming model, the I/O operations seem to be much more advanced on stream processors (at least on GPUs). I/O operations are also usually pipelined by themselves, while the chip structure can help hide latencies. The definition of the data unit is usually explicit in the kernel, which is expected to have well-defined inputs (possibly using structures, which is encouraged) and outputs. In some environments, output values are fixed (in GPUs, for example, there is a fixed set of output attributes, unless this is relaxed). Having each computing block clearly independent and defined allows bulk read or write operations to be scheduled, greatly increasing cache and memory bus efficiency.

Data locality is also explicit in the kernel. This concept is usually referred to as kernel locality, identifying all the values which are short-lived within a single kernel invocation. All the temporaries are simply assumed to be local to each kernel invocation, so hardware or software can easily allocate them in fast registers. This is strictly related to the degree of parallelism that can be exploited.
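
A minimal sketch of kernel locality, written here as a plain C function (the kernel name is illustrative):

// Every temporary lives only within one kernel invocation, so a compiler
// can keep it entirely in fast registers; nothing escapes to memory.
static float saxpy_kernel(float x, float y, float a)
{
    float t = a * x;   // short-lived temporary, local to this invocation
    return t + y;      // single, well-defined output
}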

Inside each kernel, producer-consumer relationships can be identified by the usual means, while, when kernels are chained one after another, this relationship is given by the model itself. This allows easier scheduling decisions: if kernel B requires output from kernel A, it is obvious that A must be completed before B can run (at least on the data unit being used). The Imagine chip's on-board stream controller module manages kernel loads and execution in hardware at runtime, keeping a scoreboard of kernel dependencies (as told by the compiler), and can allow out-of-order execution to minimize stalls. This is another major new paradigm for high-performance processing. There are also hints that the Cell processor allows this by routing data between the various SPEs, for example. In comparison, since Imagine is a pure SIMD machine, inter-cluster communication and kernel execution are always explicit, with much lower silicon overhead than in a MIMD machine such as Cell.
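
A sketch of such a chain in plain C (the kernels and the staging buffer are hypothetical; on a stream processor the dependency is tracked by a scheduler rather than by call order):

// Kernel B consumes kernel A's output stream, so the dependency
// "A before B" is visible directly from the program structure.
static void kernel_A(float *out, const float *in, int n)
{
    for (int i = 0; i < n; i++) out[i] = in[i] * 2.0f;  // producer
}

static void kernel_B(float *out, const float *in, int n)
{
    for (int i = 0; i < n; i++) out[i] = in[i] + 1.0f;  // consumer
}

static void pipeline(float *out, float *tmp, const float *in, int n)
{
    kernel_A(tmp, in, n);   // must complete (per data unit) before B
    kernel_B(out, tmp, n);
}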

Recently, CPU vendors have been pushing multi-core and multi-threading. While this trend is going to be useful for the average user, standard CPUs stand no chance of reaching a stream processor's performance. The parallelism between two kernel instances is similar to thread-level parallelism. Each kernel instance exploits data parallelism. Inside each kernel, it is still possible to use instruction-level parallelism. Task parallelism (such as overlapped I/O) can still happen as well. It is easy to have thousands of kernel instances, but simply impossible to have the same number of threads. This is the power of the stream.

Programming model notes

One of the drawbacks of SIMD programming was the issue of Array-of-Structures (AoS) versus Structure-of-Arrays (SoA). Programmers often wanted to build data structures with a 'real' meaning, for example:

// A particle in a three-dimensional space.
struct particle_t {
    float x, y, z;          // position: not even an array!
    unsigned char color[3]; // 8 bits per channel; say we care about RGB only
    float size;
    // ... and many other attributes may follow...
};

What happened is that those structures were then assembled into arrays to keep things nicely organized. This is AoS. When the structure is laid out in memory, the compiler produces interleaved data, in the sense that all the structures will be contiguous, but there will be a constant offset between, say, the "size" attribute of one structure instance and the same element of the following instance. The offset depends on the structure definition (and possibly on other things not considered here, such as the compiler's padding policies). There are other problems as well. For example, the three position variables cannot be SIMD-ized that way, because there is no guarantee they will be allocated in contiguous memory. To make sure SIMD operations can work on them, they must be grouped in a 'packed memory location', or at least in an array. Another problem is that both "color" and "xyz" are three-component vector quantities, while SIMD processors usually support 4-component operations only (with some exceptions, however).
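
To make that constant offset concrete, here is a small check (it reuses the particle_t above; the printed stride is simply sizeof(struct particle_t), padding included):

#include <stddef.h>
#include <stdio.h>

int main(void)
{
    struct particle_t particles[2];
    // byte distance between the "size" field of two consecutive elements
    ptrdiff_t stride = (char *)&particles[1].size - (char *)&particles[0].size;
    printf("offset of size: %zu, stride: %td\n",
           offsetof(struct particle_t, size), stride);
    return 0;
}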

These kinds of problems and limitations made SIMD acceleration on standard CPUs quite nasty. The proposed solution, SoA (Structure of Arrays), follows:

struct particle_t {
    float *x, *y, *z;
    unsigned char *colorRed, *colorBlue, *colorGreen;
    float *size;
};

For readers not experienced with C, the '*' before each identifier denotes a pointer, used here as a dynamically allocated array; for Java programmers, this is roughly equivalent to "[]". The drawback here is that the various attributes may be spread out in memory. To avoid cache misses, we have to update all the "reds", then all the "greens" and "blues". Although this is not so bad after all, it is simply overkill compared to what most stream processors offer.
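
For instance, the per-channel update just described looks like this (a sketch using the SoA particle_t defined above; the function name is illustrative):

// Each pass walks one contiguous array: cache-friendly and trivially
// SIMD-izable, but it takes three passes over the particle set.
static void fade_colors(struct particle_t *p, int n)
{
    for (int i = 0; i < n; i++) p->colorRed[i]   /= 2;   // all the reds...
    for (int i = 0; i < n; i++) p->colorGreen[i] /= 2;   // ...then the greens
    for (int i = 0; i < n; i++) p->colorBlue[i]  /= 2;   // ...then the blues
}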

For stream processors, the usage of structures is encouraged. From an application point of view, all the attributes can be defined with some flexibility. Taking GPUs as a reference, there is a set of attributes (at least 16) available. For each attribute, the application can state the number of components and their format (only primitive data types are supported for now). The various attributes are then attached to a memory block, possibly with a defined stride between 'consecutive' elements of the same attribute, effectively allowing interleaved data. When the GPU begins stream processing, it gathers all the various attributes into a single set of parameters (usually this looks like a structure or a "magic global variable"), performs the operations, and scatters the results to some memory area for later processing (or retrieval).
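
OpenGL's generic vertex attributes are one concrete instance of this model. A sketch (buffer setup and attribute enabling are omitted; the 20-byte stride assumes the AoS particle_t above is padded to 20 bytes, which is typical but compiler-dependent):

#include <GL/gl.h>

// Each attribute gets a component count, a component format and a stride
// into a shared memory block, which allows interleaved (AoS-like) data.
static void bind_particle_attributes(const char *base)
{
    glVertexAttribPointer(0, 3, GL_FLOAT,         GL_FALSE, 20, base);      // x, y, z
    glVertexAttribPointer(1, 3, GL_UNSIGNED_BYTE, GL_TRUE,  20, base + 12); // color
    glVertexAttribPointer(2, 1, GL_FLOAT,         GL_FALSE, 20, base + 16); // size
}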

Summing up, there is more flexibility on the application's side, yet everything looks very organized on the stream processor's side.

Generic processor architecture

Historically, CPUs began implementing various tiers of memory access optimizations because their performance kept increasing relative to the comparatively slow-growing external memory bandwidth. As this gap widened, large amounts of die area were dedicated to hiding memory latencies. Since fetching information and opcodes for those few ALUs is expensive, very little die area is dedicated to the actual mathematical machinery (as a rough estimate, consider it to be less than 10%).

A similar architecture exists on stream processors, but thanks to the new programming model, the number of transistors dedicated to management is actually very small.

Looking from a whole-system point of view, stream processors usually exist in a controlled environment. GPUs exist on an add-in board (this seems to also apply to Imagine), while CPUs do the dirty job of managing system resources, running applications and so on.

The stream processor is usually equipped with a fast, efficient, proprietary memory bus (crossbar switches are now common; multi-buses have been employed in the past). The exact number of memory lanes depends on the market range. As this is written, there are still 64-bit wide interconnections around (entry-level). Most mid-range models use a fast 128-bit crossbar switch matrix (4 or 2 segments), while high-end models deploy huge amounts of memory (actually up to 512 MB) with a slightly slower, 256-bit wide crossbar. By contrast, standard processors from the Intel Pentium to some Athlon 64 models have only a single 64-bit wide data bus.

Memory access patterns are much more predictable. While arrays do exist, their dimensions are fixed at kernel invocation. The closest thing to multiple pointer indirection is an indirection chain, which is, however, guaranteed to finally read or write to a specific memory area (inside a stream).
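
For illustration, an indirection chain in plain C might look like this (hypothetical names; on a stream processor the bounds of each index stream are fixed at kernel invocation):

// Indices select indices, but each level is bounded by its own stream, so
// the final access is guaranteed to land inside a known memory area.
static float gather(const float *stream, const int *outer, const int *inner, int i)
{
    return stream[outer[inner[i]]];
}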

Because of the SIMD nature of the stream processor's execution units (ALU clusters), read/write operations are expected to happen in bulk, so memories are optimized for high bandwidth rather than low latency (this is a difference from Rambus and DDR SDRAM, for example). This also allows for efficient memory bus negotiations.

Most (90%) of a stream processor's work is done on-chip, requiring only 1% of the global data to be stored to memory. This is where knowing the kernel temporaries and dependencies pays off.

Internally, a stream processor features some communication and management circuits, but the most interesting component is the Stream Register File (SRF). Conceptually, this is a large cache in which stream data is stored to be transferred to external memory in bulk. As a cache-like, software-controlled structure serving the various ALUs, the SRF is shared among all the ALU clusters. The key concept and innovation, introduced with Stanford's Imagine chip, is that the compiler is able to automate and allocate memory in an optimal way, fully transparently to the programmer. The dependencies between kernel functions and data are known through the programming model, which enables the compiler to perform advanced flow analysis, optimally pack the SRFs, and automate DMA. The hardware can also perform runtime synchronization to allow out-of-order execution of kernel functions. Commonly, this cache and DMA management can take up the majority of a project's schedule, something the stream processor (or at least Imagine) totally automates. Tests done at Stanford showed that the compiler did as well or better at scheduling memory than a laboriously hand-tuned implementation.

There is evidence that the number of clusters can be made large, because inter-cluster communication is assumed to be rare. Internally, however, each cluster can only exploit a much smaller number of ALUs efficiently, because intra-cluster communication is common and thus needs to be highly efficient.

To keep those ALUs fed with data, each ALU is equipped with Local Register Files (LRFs), which are basically its usable registers.

This three-tiered data access pattern (LRFs, then the SRF, then external memory) makes it easy to keep temporary data away from slow memories, making the silicon implementation highly efficient and power-saving.

Hardware-in-the-loop issues

Although an order-of-magnitude speedup can easily be expected (even from mainstream GPUs when computing in a streaming manner), not all applications benefit from it. Communication latencies are actually the biggest problem. Although PCI Express improved this with full-duplex communications, getting a GPU (and possibly a generic stream processor) to work can still take a long time, which means it is usually counterproductive to use them for small data sets. The stream architecture also incurs penalties for small streams, a behaviour officially identified as the short stream effect. This basically happens because changing the kernel is a rather expensive operation.
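
As a toy illustration only (every constant below is made up; real overheads depend on the bus, driver and chip), a fixed kernel-switch and transfer overhead dominates until the stream is long enough:

// Toy cost model with assumed constants: a fixed 500 us setup cost plus
// 1 ns per element for the stream processor, versus 20 ns per element on
// a CPU with no setup cost. The stream side wins only when
// 500e-6 + n*1e-9 < n*20e-9, i.e. for n above roughly 26,000 elements.
static double stream_time(int n) { return 500e-6 + n * 1e-9; }
static double cpu_time(int n)    { return n * 20e-9; }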

Pipelining is a deeply rooted practice on stream processors, with GPUs featuring pipelines exceeding 200 stages. The cost of switching settings depends on the setting being modified, but it is now considered to be always expensive. Although efforts are being spent on lowering the cost of switching, this is unlikely to happen any time soon. To avoid such problems at various levels of the pipeline, many techniques have been deployed, such as "über shaders" and "texture atlases". These techniques are game-oriented because of the nature of GPUs, but the concepts are interesting for generic stream processing as well.

Interesting Stream Processors

  • Imagine, from Stanford University, is a very flexible architecture which has proven to be both fast and energy efficient. Built using 0.15 micrometre technology, it features 8 clusters of 6 ALUs each. In total, the chip features 48 pipelined ALUs with 16.9 GB/s of off-chip memory bandwidth. When clocked at 96 MHz, up to 3.5 GFLOPS can be reached (MPEG-2 encoding) while dissipating only 1.6 W of core power at 12 V. Imagine is, proportionally, an order of magnitude faster than a Pentium 4 and still more power efficient than DSPs[2].
  • GPUs are recognized as widespread, consumer-grade stream processors. Although they are usually limited in hardware functionality, a great deal of effort is spent on optimizing algorithms for this family of processors, which usually have very high horsepower. Several generations are worth noting from a stream processing point of view.
    1. Pre-NV2x: no explicit support for stream processing. Kernel operations are usually hidden in the API and provide too little flexibility for general use.
    2. NV2x: kernel stream operations are now explicitly under the programmer's control, but only for vertex processing (fragments still use the old paradigms). The lack of branching support severely hampers flexibility, but some algorithms can be run (notably, low-precision fluid simulation).
    3. R3xx: increased performance and precision with limited support for branching/looping in both vertex and fragment processing. The model is now flexible enough to cover many purposes.
    4. NV4x: currently (September 25, 2005) the state of the art. Very flexible branching support, although some limitations still exist on the number of operations to be executed and on recursion depth. Performance is estimated at 20 to 44 GFLOPS.
  • Although not a true stream processor, the Cell processor is actually a generic CPU (the PPE) with tightly coupled SPEs, acting more like a MIMD architecture. The streaming-like architecture is tuned for flexibility rather than raw performance, but from a high-level point of view the model still scales with the number of SPEs. Note that scaling to a different number of SPEs may require recompiling the application (whereas GPUs handle this automatically). Various components can be seen as fitting a streaming paradigm, although the programming model and memory management differ and are more conventional, without the stream advantages of higher productivity.

References

  1. ^ Khailany, Dally, Rixner, Kapasi, Owens and Towles, "Exploring VLSI Scalability of Stream Processors", Stanford and Rice University.
  2. ^ Gummaraju and Rosenblum, "Stream Processing in General-Purpose Processors", Stanford University.
  3. ^ Venkatasubramanian, "The Graphics Card as a Stream Computer", AT&T Labs Research.
  4. ^ Khailany, Dally, Rixner, Kapasi, Owens, Ho Ahn and Mattson, "Programmable Stream Processors", Universities of Stanford, Rice, California (Davis) and Reservoir Labs.