SSE2 SIMD implementation of Huffman encoding

Full-color compression speedups relative to libjpeg-turbo 1.4.2:

2.8 GHz Intel Xeon W3530, Linux, 64-bit:  2.2-18% (avg. 9.5%)
2.8 GHz Intel Xeon W3530, Linux, 32-bit:  10-25% (avg. 17%)

2.3 GHz AMD A10-4600M APU, Linux, 64-bit:  4.9-17% (avg. 11%)
2.3 GHz AMD A10-4600M APU, Linux, 32-bit:  8.8-19% (avg. 15%)

3.0 GHz Intel Core i7, OS X, 64-bit:  3.5-16% (avg. 10%)
3.0 GHz Intel Core i7, OS X, 32-bit:  4.8-14% (avg. 11%)

2.6 GHz AMD Athlon 64 X2 5050e:
Performance-neutral (give or take a few percent)

Full-color compression speedups relative to IPP:

2.8 GHz Intel Xeon W3530, Linux, 64-bit:  4.8-34% (avg. 19%)
2.8 GHz Intel Xeon W3530, Linux, 32-bit:  -19% to 7.0% (avg. -7.0%)
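
For reading the percentages above: a minimal sketch, assuming each figure is
a relative throughput gain (new throughput divided by baseline throughput,
minus one).  This message does not state how the figures were computed, and
speedup_percent() is a hypothetical helper, not part of the library.

```c
#include <assert.h>

/* Hypothetical helper (not a libjpeg-turbo function): percent speedup of
   `candidate` over `baseline`, both expressed in the same throughput unit,
   e.g. megapixels/sec.  A negative result means the candidate is slower. */
static double speedup_percent(double baseline, double candidate)
{
  assert(baseline > 0.0);  /* a zero or negative baseline is meaningless */
  return (candidate / baseline - 1.0) * 100.0;
}
```

Under this reading, a negative entry such as the 32-bit IPP average means the
new code was slower than the baseline in that configuration.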

Refer to #42 for discussion.  Numerous other approaches were attempted,
but this one proved to be the most performant across all platforms.

This commit also fixes #3.  More precisely, it works around the issue: the
clang-compiled version of jchuff.c still performs 20% worse than its
GCC-compiled counterpart, but that code is now bypassed by the new SSE2
Huffman algorithm.
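
The bypass described above can be pictured as a one-time choice between two
block encoders.  This is only a sketch under stated assumptions: the type and
function names are hypothetical, the encoder bodies are stand-ins that write
a tag byte rather than real Huffman output, and the actual dispatch in this
commit may be structured differently.

```c
#include <stddef.h>

/* Hypothetical type: encodes one 8x8 block of quantized coefficients into
   `out` and returns the number of bytes written. */
typedef size_t (*encode_block_fn)(const short block[64], unsigned char *out);

/* Stand-in for the portable C encoder, i.e. the path that compiles poorly
   under clang.  Writes a one-byte tag instead of real Huffman output. */
static size_t encode_block_c(const short block[64], unsigned char *out)
{
  (void)block;
  out[0] = 'C';
  return 1;
}

/* Stand-in for the SSE2 encoder. */
static size_t encode_block_sse2(const short block[64], unsigned char *out)
{
  (void)block;
  out[0] = 'S';
  return 1;
}

/* Pick the encoder once, up front.  `have_sse2` would come from a CPU
   feature check; here it is simply passed in (an assumption made for
   illustration). */
static encode_block_fn select_encoder(int have_sse2)
{
  return have_sse2 ? encode_block_sse2 : encode_block_c;
}
```

When SSE2 is available, the C path is never called, so its compiler-dependent
performance no longer affects throughput.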

Based on:
https://github.com/mayeut/libjpeg-turbo/commit/2cb4d41330e1edc4469f6b97ba73b73abfbeb02f
https://github.com/mayeut/libjpeg-turbo/commit/36c94e050d117912adbff9fbcc6fe307df240168
18 files changed
tree: 6d3a1b20ccd56bc503233385e9ddc8faba6771d3