1.4.3.1. Huffman Code (Tree) Generation
It is observed that the frequency computation is the major bottleneck but not the tree generation from the codes. Following is the code snippet that computes the frequency:
unsigned int freq[256]
for (int i = 0; i < length; ++i) {
freq[data[i]]++;
}
- Improving worst case scenarios
- Improving threads
Worst Case Scenarios
With homogeneous data, throughput of the frequency computation is always one fourth of the best case. It is noticed that each frequency computation involves a load, add, and store operation (freq[ch]++). This means that if the adjacent characters in the data stream are similar or closer in the ASCII table (for example, A and B), the compiler and CPU cannot vectorize and pipeline it efficiently. Vectorization does multiple load, add, and store in parallel. This is because CPUs access caches in blocks (usually 64-bytes), and if same blocks are updated in adjacent instructions, compiler and CPU have to add barriers to make updates on adjacent locations sequentially. For more information, refer to the Prefetching section in Cache Optimizations topic.
You can solve this issue by adding a padding between the frequency table entries for each character. In case of the Huffman frequency counter, the loop was modified and updated to have 16 different tables. The data is accumulated after the loop ends.
unsigned int freq[256]
unsigned int freq_tab[16][256];
for (int i = 0; i < length/16; ++i) {
for (int j = 0; j < 16; ++j) {
freq_tab[j][data[i]]++;
}
}
for (int i = 0; i < 16; ++i) {
for (int j = 0; j < 256; ++j) {
freq[j] += freq_tab[i][j];
}
}
With the above distributed table update, the frequency computation throughput became identical for all cases.
Threads in Huffman Frequency Counter
Data is divided into multiple blocks and the frequency table of each block is computed independently in separate threads. The frequency tables are then merged at the end to generate the final table.
While experimenting, it was observed that as the number of threads increased, the frequency computation did not scale. This issue can be improved with CPU affinity of each thread by assigning threads to specific CPUs, to keep caches valid for longer than usual. By setting CPU affinity to threads, you can improve the best performance of the frequency computation (up to 15 GB/s).