Convert DOS line endings to Unix line endings

Should fix issues some of us have with `misc/dist/uwp_template/AppxManifest.xml` always showing up as modified. Might cause issues on Windows due to the removal of BOMs or the change of line endings in some of the Mono, UWP or gradlew.bat files; we will test and adapt if need be.
author: Rémi Verschelde <rverschelde@gmail.com> 2017-11-05 11:37:59 +0100
committer: Rémi Verschelde <rverschelde@gmail.com> 2017-11-05 11:37:59 +0100
commit: 5bc2cf257b46b7ba52c95e43c9b0f91f6e06998e (patch)
tree: fe226ce29e8cef979492b4778c65bab6109191e5 /thirdparty
parent: a89fa34c21103430b1d140ee04c3ae6a433d77ce (diff)
6 files changed, 2063 insertions, 2063 deletions
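
For readers who want to reproduce this kind of conversion, here is a minimal sketch. The commit does not say which tool was actually used, so this is only an illustration under assumptions: the helper names `to_unix` and `convert_file` are hypothetical, only a leading UTF-8 BOM is stripped, only CRLF pairs are rewritten (lone CR bytes are left alone), and binary files must be excluded by the caller.

```python
import codecs


def to_unix(data: bytes) -> bytes:
    """Return data without a leading UTF-8 BOM and with CRLF turned into LF."""
    # Assumption: only the UTF-8 BOM matters here; other encodings are ignored.
    if data.startswith(codecs.BOM_UTF8):
        data = data[len(codecs.BOM_UTF8):]
    # Replace DOS line endings (CR LF) with Unix line endings (LF).
    return data.replace(b"\r\n", b"\n")


def convert_file(path: str) -> bool:
    """Rewrite the file at path in place; return True if its bytes changed."""
    # Hypothetical helper: the caller is responsible for skipping binary files.
    with open(path, "rb") as f:
        original = f.read()
    converted = to_unix(original)
    if converted == original:
        return False
    with open(path, "wb") as f:
        f.write(converted)
    return True
```

Working on bytes rather than decoded text avoids guessing each file's encoding, which is why every converted line shows up in the diff as removed and re-added with otherwise identical content.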
diff --git a/thirdparty/etc2comp/AUTHORS b/thirdparty/etc2comp/AUTHORS
index 32daca27fe..e78a7f4d21 100644
--- a/thirdparty/etc2comp/AUTHORS
+++ b/thirdparty/etc2comp/AUTHORS
@@ -1,7 +1,7 @@
-# This is the list of Etc2Comp authors for copyright purposes.
-#
-# This does not necessarily list everyone who has contributed code, since in
-# some cases, their employer may be the copyright holder.  To see the full list
-# of contributors, see the revision history in source control.
-Google Inc.
-Blue Shift Inc.
+# This is the list of Etc2Comp authors for copyright purposes.
+#
+# This does not necessarily list everyone who has contributed code, since in
+# some cases, their employer may be the copyright holder.  To see the full list
+# of contributors, see the revision history in source control.
+Google Inc.
+Blue Shift Inc.
diff --git a/thirdparty/etc2comp/LICENSE b/thirdparty/etc2comp/LICENSE
index 75b52484ea..d645695673 100644
--- a/thirdparty/etc2comp/LICENSE
+++ b/thirdparty/etc2comp/LICENSE
@@ -1,202 +1,202 @@
-
-                                 Apache License
-                           Version 2.0, January 2004
-                        http://www.apache.org/licenses/
-
-   TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
-
-   1. Definitions.
-
-      "License" shall mean the terms and conditions for use, reproduction,
-      and distribution as defined by Sections 1 through 9 of this document.
-
-      "Licensor" shall mean the copyright owner or entity authorized by
-      the copyright owner that is granting the License.
-
-      "Legal Entity" shall mean the union of the acting entity and all
-      other entities that control, are controlled by, or are under common
-      control with that entity. For the purposes of this definition,
-      "control" means (i) the power, direct or indirect, to cause the
-      direction or management of such entity, whether by contract or
-      otherwise, or (ii) ownership of fifty percent (50%) or more of the
-      outstanding shares, or (iii) beneficial ownership of such entity.
-
-      "You" (or "Your") shall mean an individual or Legal Entity
-      exercising permissions granted by this License.
-
-      "Source" form shall mean the preferred form for making modifications,
-      including but not limited to software source code, documentation
-      source, and configuration files.
-
-      "Object" form shall mean any form resulting from mechanical
-      transformation or translation of a Source form, including but
-      not limited to compiled object code, generated documentation,
-      and conversions to other media types.
-
-      "Work" shall mean the work of authorship, whether in Source or
-      Object form, made available under the License, as indicated by a
-      copyright notice that is included in or attached to the work
-      (an example is provided in the Appendix below).
-
-      "Derivative Works" shall mean any work, whether in Source or Object
-      form, that is based on (or derived from) the Work and for which the
-      editorial revisions, annotations, elaborations, or other modifications
-      represent, as a whole, an original work of authorship. For the purposes
-      of this License, Derivative Works shall not include works that remain
-      separable from, or merely link (or bind by name) to the interfaces of,
-      the Work and Derivative Works thereof.
-
-      "Contribution" shall mean any work of authorship, including
-      the original version of the Work and any modifications or additions
-      to that Work or Derivative Works thereof, that is intentionally
-      submitted to Licensor for inclusion in the Work by the copyright owner
-      or by an individual or Legal Entity authorized to submit on behalf of
-      the copyright owner. For the purposes of this definition, "submitted"
-      means any form of electronic, verbal, or written communication sent
-      to the Licensor or its representatives, including but not limited to
-      communication on electronic mailing lists, source code control systems,
-      and issue tracking systems that are managed by, or on behalf of, the
-      Licensor for the purpose of discussing and improving the Work, but
-      excluding communication that is conspicuously marked or otherwise
-      designated in writing by the copyright owner as "Not a Contribution."
-
-      "Contributor" shall mean Licensor and any individual or Legal Entity
-      on behalf of whom a Contribution has been received by Licensor and
-      subsequently incorporated within the Work.
-
-   2. Grant of Copyright License. Subject to the terms and conditions of
-      this License, each Contributor hereby grants to You a perpetual,
-      worldwide, non-exclusive, no-charge, royalty-free, irrevocable
-      copyright license to reproduce, prepare Derivative Works of,
-      publicly display, publicly perform, sublicense, and distribute the
-      Work and such Derivative Works in Source or Object form.
-
-   3. Grant of Patent License. Subject to the terms and conditions of
-      this License, each Contributor hereby grants to You a perpetual,
-      worldwide, non-exclusive, no-charge, royalty-free, irrevocable
-      (except as stated in this section) patent license to make, have made,
-      use, offer to sell, sell, import, and otherwise transfer the Work,
-      where such license applies only to those patent claims licensable
-      by such Contributor that are necessarily infringed by their
-      Contribution(s) alone or by combination of their Contribution(s)
-      with the Work to which such Contribution(s) was submitted. If You
-      institute patent litigation against any entity (including a
-      cross-claim or counterclaim in a lawsuit) alleging that the Work
-      or a Contribution incorporated within the Work constitutes direct
-      or contributory patent infringement, then any patent licenses
-      granted to You under this License for that Work shall terminate
-      as of the date such litigation is filed.
-
-   4. Redistribution. You may reproduce and distribute copies of the
-      Work or Derivative Works thereof in any medium, with or without
-      modifications, and in Source or Object form, provided that You
-      meet the following conditions:
-
-      (a) You must give any other recipients of the Work or
-          Derivative Works a copy of this License; and
-
-      (b) You must cause any modified files to carry prominent notices
-          stating that You changed the files; and
-
-      (c) You must retain, in the Source form of any Derivative Works
-          that You distribute, all copyright, patent, trademark, and
-          attribution notices from the Source form of the Work,
-          excluding those notices that do not pertain to any part of
-          the Derivative Works; and
-
-      (d) If the Work includes a "NOTICE" text file as part of its
-          distribution, then any Derivative Works that You distribute must
-          include a readable copy of the attribution notices contained
-          within such NOTICE file, excluding those notices that do not
-          pertain to any part of the Derivative Works, in at least one
-          of the following places: within a NOTICE text file distributed
-          as part of the Derivative Works; within the Source form or
-          documentation, if provided along with the Derivative Works; or,
-          within a display generated by the Derivative Works, if and
-          wherever such third-party notices normally appear. The contents
-          of the NOTICE file are for informational purposes only and
-          do not modify the License. You may add Your own attribution
-          notices within Derivative Works that You distribute, alongside
-          or as an addendum to the NOTICE text from the Work, provided
-          that such additional attribution notices cannot be construed
-          as modifying the License.
-
-      You may add Your own copyright statement to Your modifications and
-      may provide additional or different license terms and conditions
-      for use, reproduction, or distribution of Your modifications, or
-      for any such Derivative Works as a whole, provided Your use,
-      reproduction, and distribution of the Work otherwise complies with
-      the conditions stated in this License.
-
-   5. Submission of Contributions. Unless You explicitly state otherwise,
-      any Contribution intentionally submitted for inclusion in the Work
-      by You to the Licensor shall be under the terms and conditions of
-      this License, without any additional terms or conditions.
-      Notwithstanding the above, nothing herein shall supersede or modify
-      the terms of any separate license agreement you may have executed
-      with Licensor regarding such Contributions.
-
-   6. Trademarks. This License does not grant permission to use the trade
-      names, trademarks, service marks, or product names of the Licensor,
-      except as required for reasonable and customary use in describing the
-      origin of the Work and reproducing the content of the NOTICE file.
-
-   7. Disclaimer of Warranty. Unless required by applicable law or
-      agreed to in writing, Licensor provides the Work (and each
-      Contributor provides its Contributions) on an "AS IS" BASIS,
-      WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
-      implied, including, without limitation, any warranties or conditions
-      of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
-      PARTICULAR PURPOSE. You are solely responsible for determining the
-      appropriateness of using or redistributing the Work and assume any
-      risks associated with Your exercise of permissions under this License.
-
-   8. Limitation of Liability. In no event and under no legal theory,
-      whether in tort (including negligence), contract, or otherwise,
-      unless required by applicable law (such as deliberate and grossly
-      negligent acts) or agreed to in writing, shall any Contributor be
-      liable to You for damages, including any direct, indirect, special,
-      incidental, or consequential damages of any character arising as a
-      result of this License or out of the use or inability to use the
-      Work (including but not limited to damages for loss of goodwill,
-      work stoppage, computer failure or malfunction, or any and all
-      other commercial damages or losses), even if such Contributor
-      has been advised of the possibility of such damages.
-
-   9. Accepting Warranty or Additional Liability. While redistributing
-      the Work or Derivative Works thereof, You may choose to offer,
-      and charge a fee for, acceptance of support, warranty, indemnity,
-      or other liability obligations and/or rights consistent with this
-      License. However, in accepting such obligations, You may act only
-      on Your own behalf and on Your sole responsibility, not on behalf
-      of any other Contributor, and only if You agree to indemnify,
-      defend, and hold each Contributor harmless for any liability
-      incurred by, or claims asserted against, such Contributor by reason
-      of your accepting any such warranty or additional liability.
-
-   END OF TERMS AND CONDITIONS
-
-   APPENDIX: How to apply the Apache License to your work.
-
-      To apply the Apache License to your work, attach the following
-      boilerplate notice, with the fields enclosed by brackets "[]"
-      replaced with your own identifying information. (Don't include
-      the brackets!)  The text should be enclosed in the appropriate
-      comment syntax for the file format. We also recommend that a
-      file or class name and description of purpose be included on the
-      same "printed page" as the copyright notice for easier
-      identification within third-party archives.
-
-   Copyright [yyyy] [name of copyright owner]
-
-   Licensed under the Apache License, Version 2.0 (the "License");
-   you may not use this file except in compliance with the License.
-   You may obtain a copy of the License at
-
-       http://www.apache.org/licenses/LICENSE-2.0
-
-   Unless required by applicable law or agreed to in writing, software
-   distributed under the License is distributed on an "AS IS" BASIS,
-   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-   See the License for the specific language governing permissions and
-   limitations under the License.
+
+                                 Apache License
+                           Version 2.0, January 2004
+                        http://www.apache.org/licenses/
+
+   TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
+
+   1. Definitions.
+
+      "License" shall mean the terms and conditions for use, reproduction,
+      and distribution as defined by Sections 1 through 9 of this document.
+
+      "Licensor" shall mean the copyright owner or entity authorized by
+      the copyright owner that is granting the License.
+
+      "Legal Entity" shall mean the union of the acting entity and all
+      other entities that control, are controlled by, or are under common
+      control with that entity. For the purposes of this definition,
+      "control" means (i) the power, direct or indirect, to cause the
+      direction or management of such entity, whether by contract or
+      otherwise, or (ii) ownership of fifty percent (50%) or more of the
+      outstanding shares, or (iii) beneficial ownership of such entity.
+
+      "You" (or "Your") shall mean an individual or Legal Entity
+      exercising permissions granted by this License.
+
+      "Source" form shall mean the preferred form for making modifications,
+      including but not limited to software source code, documentation
+      source, and configuration files.
+
+      "Object" form shall mean any form resulting from mechanical
+      transformation or translation of a Source form, including but
+      not limited to compiled object code, generated documentation,
+      and conversions to other media types.
+
+      "Work" shall mean the work of authorship, whether in Source or
+      Object form, made available under the License, as indicated by a
+      copyright notice that is included in or attached to the work
+      (an example is provided in the Appendix below).
+
+      "Derivative Works" shall mean any work, whether in Source or Object
+      form, that is based on (or derived from) the Work and for which the
+      editorial revisions, annotations, elaborations, or other modifications
+      represent, as a whole, an original work of authorship. For the purposes
+      of this License, Derivative Works shall not include works that remain
+      separable from, or merely link (or bind by name) to the interfaces of,
+      the Work and Derivative Works thereof.
+
+      "Contribution" shall mean any work of authorship, including
+      the original version of the Work and any modifications or additions
+      to that Work or Derivative Works thereof, that is intentionally
+      submitted to Licensor for inclusion in the Work by the copyright owner
+      or by an individual or Legal Entity authorized to submit on behalf of
+      the copyright owner. For the purposes of this definition, "submitted"
+      means any form of electronic, verbal, or written communication sent
+      to the Licensor or its representatives, including but not limited to
+      communication on electronic mailing lists, source code control systems,
+      and issue tracking systems that are managed by, or on behalf of, the
+      Licensor for the purpose of discussing and improving the Work, but
+      excluding communication that is conspicuously marked or otherwise
+      designated in writing by the copyright owner as "Not a Contribution."
+
+      "Contributor" shall mean Licensor and any individual or Legal Entity
+      on behalf of whom a Contribution has been received by Licensor and
+      subsequently incorporated within the Work.
+
+   2. Grant of Copyright License. Subject to the terms and conditions of
+      this License, each Contributor hereby grants to You a perpetual,
+      worldwide, non-exclusive, no-charge, royalty-free, irrevocable
+      copyright license to reproduce, prepare Derivative Works of,
+      publicly display, publicly perform, sublicense, and distribute the
+      Work and such Derivative Works in Source or Object form.
+
+   3. Grant of Patent License. Subject to the terms and conditions of
+      this License, each Contributor hereby grants to You a perpetual,
+      worldwide, non-exclusive, no-charge, royalty-free, irrevocable
+      (except as stated in this section) patent license to make, have made,
+      use, offer to sell, sell, import, and otherwise transfer the Work,
+      where such license applies only to those patent claims licensable
+      by such Contributor that are necessarily infringed by their
+      Contribution(s) alone or by combination of their Contribution(s)
+      with the Work to which such Contribution(s) was submitted. If You
+      institute patent litigation against any entity (including a
+      cross-claim or counterclaim in a lawsuit) alleging that the Work
+      or a Contribution incorporated within the Work constitutes direct
+      or contributory patent infringement, then any patent licenses
+      granted to You under this License for that Work shall terminate
+      as of the date such litigation is filed.
+
+   4. Redistribution. You may reproduce and distribute copies of the
+      Work or Derivative Works thereof in any medium, with or without
+      modifications, and in Source or Object form, provided that You
+      meet the following conditions:
+
+      (a) You must give any other recipients of the Work or
+          Derivative Works a copy of this License; and
+
+      (b) You must cause any modified files to carry prominent notices
+          stating that You changed the files; and
+
+      (c) You must retain, in the Source form of any Derivative Works
+          that You distribute, all copyright, patent, trademark, and
+          attribution notices from the Source form of the Work,
+          excluding those notices that do not pertain to any part of
+          the Derivative Works; and
+
+      (d) If the Work includes a "NOTICE" text file as part of its
+          distribution, then any Derivative Works that You distribute must
+          include a readable copy of the attribution notices contained
+          within such NOTICE file, excluding those notices that do not
+          pertain to any part of the Derivative Works, in at least one
+          of the following places: within a NOTICE text file distributed
+          as part of the Derivative Works; within the Source form or
+          documentation, if provided along with the Derivative Works; or,
+          within a display generated by the Derivative Works, if and
+          wherever such third-party notices normally appear. The contents
+          of the NOTICE file are for informational purposes only and
+          do not modify the License. You may add Your own attribution
+          notices within Derivative Works that You distribute, alongside
+          or as an addendum to the NOTICE text from the Work, provided
+          that such additional attribution notices cannot be construed
+          as modifying the License.
+
+      You may add Your own copyright statement to Your modifications and
+      may provide additional or different license terms and conditions
+      for use, reproduction, or distribution of Your modifications, or
+      for any such Derivative Works as a whole, provided Your use,
+      reproduction, and distribution of the Work otherwise complies with
+      the conditions stated in this License.
+
+   5. Submission of Contributions. Unless You explicitly state otherwise,
+      any Contribution intentionally submitted for inclusion in the Work
+      by You to the Licensor shall be under the terms and conditions of
+      this License, without any additional terms or conditions.
+      Notwithstanding the above, nothing herein shall supersede or modify
+      the terms of any separate license agreement you may have executed
+      with Licensor regarding such Contributions.
+
+   6. Trademarks. This License does not grant permission to use the trade
+      names, trademarks, service marks, or product names of the Licensor,
+      except as required for reasonable and customary use in describing the
+      origin of the Work and reproducing the content of the NOTICE file.
+
+   7. Disclaimer of Warranty. Unless required by applicable law or
+      agreed to in writing, Licensor provides the Work (and each
+      Contributor provides its Contributions) on an "AS IS" BASIS,
+      WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
+      implied, including, without limitation, any warranties or conditions
+      of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
+      PARTICULAR PURPOSE. You are solely responsible for determining the
+      appropriateness of using or redistributing the Work and assume any
+      risks associated with Your exercise of permissions under this License.
+
+   8. Limitation of Liability. In no event and under no legal theory,
+      whether in tort (including negligence), contract, or otherwise,
+      unless required by applicable law (such as deliberate and grossly
+      negligent acts) or agreed to in writing, shall any Contributor be
+      liable to You for damages, including any direct, indirect, special,
+      incidental, or consequential damages of any character arising as a
+      result of this License or out of the use or inability to use the
+      Work (including but not limited to damages for loss of goodwill,
+      work stoppage, computer failure or malfunction, or any and all
+      other commercial damages or losses), even if such Contributor
+      has been advised of the possibility of such damages.
+
+   9. Accepting Warranty or Additional Liability. While redistributing
+      the Work or Derivative Works thereof, You may choose to offer,
+      and charge a fee for, acceptance of support, warranty, indemnity,
+      or other liability obligations and/or rights consistent with this
+      License. However, in accepting such obligations, You may act only
+      on Your own behalf and on Your sole responsibility, not on behalf
+      of any other Contributor, and only if You agree to indemnify,
+      defend, and hold each Contributor harmless for any liability
+      incurred by, or claims asserted against, such Contributor by reason
+      of your accepting any such warranty or additional liability.
+
+   END OF TERMS AND CONDITIONS
+
+   APPENDIX: How to apply the Apache License to your work.
+
+      To apply the Apache License to your work, attach the following
+      boilerplate notice, with the fields enclosed by brackets "[]"
+      replaced with your own identifying information. (Don't include
+      the brackets!)  The text should be enclosed in the appropriate
+      comment syntax for the file format. We also recommend that a
+      file or class name and description of purpose be included on the
+      same "printed page" as the copyright notice for easier
+      identification within third-party archives.
+
+   Copyright [yyyy] [name of copyright owner]
+
+   Licensed under the Apache License, Version 2.0 (the "License");
+   you may not use this file except in compliance with the License.
+   You may obtain a copy of the License at
+
+       http://www.apache.org/licenses/LICENSE-2.0
+
+   Unless required by applicable law or agreed to in writing, software
+   distributed under the License is distributed on an "AS IS" BASIS,
+   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+   See the License for the specific language governing permissions and
+   limitations under the License.
diff --git a/thirdparty/etc2comp/README.md b/thirdparty/etc2comp/README.md
index 1c70ae9f4e..2f4363d042 100644
--- a/thirdparty/etc2comp/README.md
+++ b/thirdparty/etc2comp/README.md
@@ -1,197 +1,197 @@
-# Etc2Comp - Texture to ETC2 compressor
-
-Etc2Comp is a command line tool that converts textures (e.g. bitmaps)
-into the [ETC2](https://en.wikipedia.org/wiki/Ericsson_Texture_Compression)
-format. The tool is built with a focus on encoding performance
-to reduce the amount of time required to compile asset heavy applications as
-well as reduce overall application size.
-
-This repo provides source code that can be compiled into a binary. The
-binary can then be used to convert textures to the ETC2 format.
-
-Important: This is not an official Google product. It is an experimental
-library published as-is. Please see the CONTRIBUTORS.md file for information
-about questions or issues.
-
-## Setup
-This project uses [CMake](https://cmake.org/) to generate platform-specific
-build files:
- - Linux: make files
- - OS X: Xcode workspace files
- - Microsoft Windows: Visual Studio solution files
- - Note: CMake supports other formats, but this doc only provides steps for
- one of each platform for brevity.
-
-Refer to each platform's setup section to setup your environment and build
-an Etc2Comp binary. Then skip to the usage section of this page for examples
-of how to use the library.
-
-### Setup for OS X
- build tested on this config:
-  OS X 10.9.5 i7 16GB RAM
-  Xcode 5.1.1
-  cmake 3.2.3
-  
-Start by downloading and installing the following components if they are not
-already installed on your development machine.
- - *Xcode* version 5.1.1, or greater
- - [CMake](https://cmake.org/download/) version 3.2.3, or greater
-
-To build the Etc2Comp binary:
- 1. Open a *Terminal* window and navigate to the project directory.
- 1. Run `mkdir build_xcode`
- 1. Run `cd build_xcode`
- 1. Run `cmake -G Xcode ../`
- 1. Open *Xcode* and import the `build_xcode/EtcTest.xcodeproj` file.
- 1. Open the Product menu and choose Build For -> Running.
- 1. Once the build succeeds the binary located at `build_xcode/EtcTool/Debug/EtcTool`
-can be executed.
-
-Optional
-Xcode EtcTool ‘Run’ preferences
-note: if the build_xcode/EtcTest.xcodeproj is manually deleted then some Xcode preferences 
-will need to be set by hand after cmake is run (these prefs are retained across 
-cmake updates if the .xcodeproj is not deleted/removed)
-
-1. Set the active scheme to ‘EtcTool’
-1. Edit the scheme
-1. Select option ‘Run EtcTool’, then tab ‘Arguments’. 
-Add this launch argument: ‘-argfile ../../EtcTool/args.txt’
-1. Select tab ‘Options’ and set a custom working directory to: ‘$(SRCROOT)/Build_Xcode/EtcTool’
-
-### SetUp for Windows
-
-1. Open a *Terminal* window and navigate to the project directory.
-1. Run `mkdir build_vs`
-1. Run `cd build_vs`
-1. Run CMAKE, noting what build version you need, and pointing to the parent directory as the source root; 
-  For VS 2013 : `cmake -G "Visual Studio 12 2013 Win64" ../`
-  For VS 2015 : `cmake -G "Visual Studio 14 2015 Win64" ../`
-  NOTE: To see what supported Visual Studio outputs there are, run `cmake -G`
-1. open the 'EtcTest' solution
-1. make the 'EtcTool' project the start up project 
-1. (optional) in the project properties, under 'Debugging ->command arguments' 
-add the argfile textfile thats included in the EtcTool directory. 
-example: -argfile C:\etc2\EtcTool\Args.txt
-
-### Setup For Linux
-The Linux build was tested on this config:
-  Ubuntu desktop 14.04
-  gcc/g++ 4.8
-  cmake 2.8.12.2
-
-1. Verify linux has cmake and C++-11 capable g++ installed
-1. Open shell
-1. Run `mkdir build_linux`
-1. Run `cd build_linux`
-1. Run `cmake ../`
-1. Run `make`
-1. navigate to the newly created EtcTool directory `cd EtcTool`
-1. run the executable: `./EtcTool -argfile ../../EtcTool/args.txt`
-
-Skip to the <a href="#usage">Usage</a> section for more information about using the
-tool.
-
-## Usage
-
-### Command Line Usage
-EtcTool can be run from the command line with the following usage:
-    etctool.exe source_image [options ...] -output encoded_image
-
-The encoder will use an array of RGBA floats read from the source_image to create 
-an ETC1 or ETC2 encoded image in encoded_image.  The RGBA floats should be in the 
-range [0:1].
-
-Options:
-
-    -analyze <analysis_folder>
-    -argfile <arg_file>           additional command line arguments read from a file
-    -blockAtHV <H V>              encodes a single block that contains the
-                                  pixel specified by the H V coordinates
-    -compare <comparison_image>   compares source_image to comparison_image
-    -effort <amount>              number between 0 and 100 to specify the encoding quality 
-                                  (100 is the highest quality)
-    -errormetric <error_metric>   specify the error metric, the options are
-                                  rgba, rgbx, rec709, numeric and normalxyz
-    -format <etc_format>          ETC1, RGB8, SRGB8, RGBA8, SRGB8, RGB8A1,
-                                  SRGB8A1 or R11
-    -help                         prints this message
-    -jobs or -j <thread_count>    specifies the number of threads (default=1)
-    -normalizexyz                 normalize RGB to have a length of 1
-    -verbose or -v                shows status information during the encoding
-                                  process
-	-mipmaps or -m <mip_count>    sets the maximum number of mipaps to generate (default=1)
-	-mipwrap or -w <x|y|xy>       sets the mipmap filter wrap mode (default=clamp)
-
-* -analyze will run an analysis of the encoding and place it in folder 
-"analysis_folder" (e.g. ../analysis/kodim05).  within the analysis_folder, a folder 
-will be created with a name of the current date/time (e.g. 20151204_153306).  this 
-date/time folder is used to compare encodings of the same texture over time.  
-within the date/time folder is a text file with several encoding stats and a 2x png 
-image showing the encoding mode for each 4x4 block.
-
-* -argfile allows additional command line arguments to be placed in a text file
-
-* -blockAtHV selects the 4x4 pixel subset of the source image at position (H,V).  
-This is mainly used for debugging
-
-* -compare compares the source image to the created encoded image. The encoding
-will dictate what error analysis is used in the comparison.
-
-* -effort uses an "amount" between 0 and 100 to determine how much additional effort 
-to apply during the encoding.
-
-* -errormetric selects the fitting algorithm used by the encoder.  "rgba" calculates 
-RMS error using RGB components that are weighted by A.  "rgbx" calculates RMS error 
-using RGBA components, where A is treated as an additional data channel, instead of 
-as alpha.  "rec709" is similar to "rgba", except the RGB components are also weighted 
-according to Rec709.  "numeric" calculates RMS error using unweighted RGBA components.  
-"normalize" calculates error based on dot product and vector length for RGB and RMS 
-error for A.
-
-* -help prints out the usage message
-
-* -jobs enables multi-threading to speed up image encoding
-
-* -normalizexyz normalizes the source RGB to have a length of 1.
-
-* -verbose shows information on the current encoding process. It will then display the 
-PSNR and time time it took to encode the image.
-
-* -mipmaps takes an argument that specifies how many mipmaps to generate from the 
-source image.  The mipmaps are generated with a lanczos3 filter using edge clamping.
-If the mipmaps option is not specified no mipmaps are created.
-
-* -mipwrap takes an argument that specifies the mipmap filter wrap mode.  The options 
-are "x", "y" and "xy" which specify wrapping in x only, y only or x and y respectively.
-The default options are clamping in both x and y.
-
-Note: Path names can use slashes or backslashes.  The tool will convert the 
-slashes to the appropriate polarity for the current platform.
-
-
-## API
-
-The library supports two different APIs - a C-like API that is not heavily 
-class-based and a class-based API.
-
-main() in EtcTool.cpp contains an example of both APIs.
-
-The Encode() method now returns an EncodingStatus that contains bit flags for
-reporting various warnings and flags encountered when encoding.
-
-
-## Copyright
-Copyright 2015 Etc2Comp Authors.
-
-Licensed under the Apache License, Version 2.0 (the "License");
-you may not use this file except in compliance with the License.
-You may obtain a copy of the License at
-
-    http://www.apache.org/licenses/LICENSE-2.0
-
-Unless required by applicable law or agreed to in writing, software
-distributed under the License is distributed on an "AS IS" BASIS,
-WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-See the License for the specific language governing permissions and
-limitations under the License.
+# Etc2Comp - Texture to ETC2 compressor
+
+Etc2Comp is a command line tool that converts textures (e.g. bitmaps)
+into the [ETC2](https://en.wikipedia.org/wiki/Ericsson_Texture_Compression)
+format. The tool is built with a focus on encoding performance
+to reduce the amount of time required to compile asset heavy applications as
+well as reduce overall application size.
+
+This repo provides source code that can be compiled into a binary. The
+binary can then be used to convert textures to the ETC2 format.
+
+Important: This is not an official Google product. It is an experimental
+library published as-is. Please see the CONTRIBUTORS.md file for information
+about questions or issues.
+
+## Setup
+This project uses [CMake](https://cmake.org/) to generate platform-specific
+build files:
+ - Linux: make files
+ - OS X: Xcode workspace files
+ - Microsoft Windows: Visual Studio solution files
+ - Note: CMake supports other formats, but this doc only provides steps for
+ one of each platform for brevity.
+
+Refer to each platform's setup section to setup your environment and build
+an Etc2Comp binary. Then skip to the usage section of this page for examples
+of how to use the library.
+
+### Setup for OS X
+ build tested on this config:
+  OS X 10.9.5 i7 16GB RAM
+  Xcode 5.1.1
+  cmake 3.2.3
+  
+Start by downloading and installing the following components if they are not
+already installed on your development machine.
+ - *Xcode* version 5.1.1, or greater
+ - [CMake](https://cmake.org/download/) version 3.2.3, or greater
+
+To build the Etc2Comp binary:
+ 1. Open a *Terminal* window and navigate to the project directory.
+ 1. Run `mkdir build_xcode`
+ 1. Run `cd build_xcode`
+ 1. Run `cmake -G Xcode ../`
+ 1. Open *Xcode* and import the `build_xcode/EtcTest.xcodeproj` file.
+ 1. Open the Product menu and choose Build For -> Running.
+ 1. Once the build succeeds the binary located at `build_xcode/EtcTool/Debug/EtcTool`
+can be executed.
+
+Optional
+Xcode EtcTool ‘Run’ preferences
+note: if the build_xcode/EtcTest.xcodeproj is manually deleted then some Xcode preferences 
+will need to be set by hand after cmake is run (these prefs are retained across 
+cmake updates if the .xcodeproj is not deleted/removed)
+
+1. Set the active scheme to ‘EtcTool’
+1. Edit the scheme
+1. Select option ‘Run EtcTool’, then tab ‘Arguments’. 
+Add this launch argument: ‘-argfile ../../EtcTool/args.txt’
+1. Select tab ‘Options’ and set a custom working directory to: ‘$(SRCROOT)/Build_Xcode/EtcTool’
+
+### SetUp for Windows
+
+1. Open a *Terminal* window and navigate to the project directory.
+1. Run `mkdir build_vs`
+1. Run `cd build_vs`
+1. Run CMAKE, noting what build version you need, and pointing to the parent directory as the source root; 
+  For VS 2013 : `cmake -G "Visual Studio 12 2013 Win64" ../`
+  For VS 2015 : `cmake -G "Visual Studio 14 2015 Win64" ../`
+  NOTE: To see what supported Visual Studio outputs there are, run `cmake -G`
+1. open the 'EtcTest' solution
+1. make the 'EtcTool' project the start up project 
+1. (optional) in the project properties, under 'Debugging ->command arguments' 
+add the argfile textfile thats included in the EtcTool directory. 
+example: -argfile C:\etc2\EtcTool\Args.txt
+
+### Setup For Linux
+The Linux build was tested on this config:
+  Ubuntu desktop 14.04
+  gcc/g++ 4.8
+  cmake 2.8.12.2
+
+1. Verify linux has cmake and C++-11 capable g++ installed
+1. Open shell
+1. Run `mkdir build_linux`
+1. Run `cd build_linux`
+1. Run `cmake ../`
+1. Run `make`
+1. navigate to the newly created EtcTool directory `cd EtcTool`
+1. run the executable: `./EtcTool -argfile ../../EtcTool/args.txt`
+
+Skip to the <a href="#usage">Usage</a> section for more information about using the
+tool.
+
+## Usage
+
+### Command Line Usage
+EtcTool can be run from the command line with the following usage:
+    etctool.exe source_image [options ...] -output encoded_image
+
+The encoder will use an array of RGBA floats read from the source_image to create 
+an ETC1 or ETC2 encoded image in encoded_image.  The RGBA floats should be in the 
+range [0:1].
+
+Options:
+
+    -analyze <analysis_folder>
+    -argfile <arg_file>           additional command line arguments read from a file
+    -blockAtHV <H V>              encodes a single block that contains the
+                                  pixel specified by the H V coordinates
+    -compare <comparison_image>   compares source_image to comparison_image
+    -effort <amount>              number between 0 and 100 to specify the encoding quality 
+                                  (100 is the highest quality)
+    -errormetric <error_metric>   specify the error metric, the options are
+                                  rgba, rgbx, rec709, numeric and normalxyz
+    -format <etc_format>          ETC1, RGB8, SRGB8, RGBA8, SRGB8, RGB8A1,
+                                  SRGB8A1 or R11
+    -help                         prints this message
+    -jobs or -j <thread_count>    specifies the number of threads (default=1)
+    -normalizexyz                 normalize RGB to have a length of 1
+    -verbose or -v                shows status information during the encoding
+                                  process
+	-mipmaps or -m <mip_count>    sets the maximum number of mipaps to generate (default=1)
+	-mipwrap or -w <x|y|xy>       sets the mipmap filter wrap mode (default=clamp)
+
+* -analyze will run an analysis of the encoding and place it in folder 
+"analysis_folder" (e.g. ../analysis/kodim05).  within the analysis_folder, a folder 
+will be created with a name of the current date/time (e.g. 20151204_153306).  this 
+date/time folder is used to compare encodings of the same texture over time.  
+within the date/time folder is a text file with several encoding stats and a 2x png 
+image showing the encoding mode for each 4x4 block.
+
+* -argfile allows additional command line arguments to be placed in a text file
+
+* -blockAtHV selects the 4x4 pixel subset of the source image at position (H,V).  
+This is mainly used for debugging
+
+* -compare compares the source image to the created encoded image. The encoding
+will dictate what error analysis is used in the comparison.
+
+* -effort uses an "amount" between 0 and 100 to determine how much additional effort 
+to apply during the encoding.
+
+* -errormetric selects the fitting algorithm used by the encoder.  "rgba" calculates 
+RMS error using RGB components that are weighted by A.  "rgbx" calculates RMS error 
+using RGBA components, where A is treated as an additional data channel, instead of 
+as alpha.  "rec709" is similar to "rgba", except the RGB components are also weighted 
+according to Rec709.  "numeric" calculates RMS error using unweighted RGBA components.  
+"normalize" calculates error based on dot product and vector length for RGB and RMS 
+error for A.
+
+* -help prints out the usage message
+
+* -jobs enables multi-threading to speed up image encoding
+
+* -normalizexyz normalizes the source RGB to have a length of 1.
+
+* -verbose shows information on the current encoding process. It will then display the 
+PSNR and time time it took to encode the image.
+
+* -mipmaps takes an argument that specifies how many mipmaps to generate from the 
+source image.  The mipmaps are generated with a lanczos3 filter using edge clamping.
+If the mipmaps option is not specified no mipmaps are created.
+
+* -mipwrap takes an argument that specifies the mipmap filter wrap mode.  The options 
+are "x", "y" and "xy" which specify wrapping in x only, y only or x and y respectively.
+The default options are clamping in both x and y.
+
+Note: Path names can use slashes or backslashes.  The tool will convert the 
+slashes to the appropriate polarity for the current platform.
+
+
+## API
+
+The library supports two different APIs - a C-like API that is not heavily 
+class-based and a class-based API.
+
+main() in EtcTool.cpp contains an example of both APIs.
+
+The Encode() method now returns an EncodingStatus that contains bit flags for
+reporting various warnings and flags encountered when encoding.
+
+
+## Copyright
+Copyright 2015 Etc2Comp Authors.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
diff --git a/thirdparty/libtheora/x86_vc/mmxencfrag.c b/thirdparty/libtheora/x86_vc/mmxencfrag.c
index ac9dacf377..94f1d06513 100644
--- a/thirdparty/libtheora/x86_vc/mmxencfrag.c
+++ b/thirdparty/libtheora/x86_vc/mmxencfrag.c
@@ -1,969 +1,969 @@
-/********************************************************************
- *                                                                  *
- * THIS FILE IS PART OF THE OggTheora SOFTWARE CODEC SOURCE CODE.   *
- * USE, DISTRIBUTION AND REPRODUCTION OF THIS LIBRARY SOURCE IS     *
- * GOVERNED BY A BSD-STYLE SOURCE LICENSE INCLUDED WITH THIS SOURCE *
- * IN 'COPYING'. PLEASE READ THESE TERMS BEFORE DISTRIBUTING.       *
- *                                                                  *
- * THE Theora SOURCE CODE IS COPYRIGHT (C) 2002-2009                *
- * by the Xiph.Org Foundation http://www.xiph.org/                  *
- *                                                                  *
- ********************************************************************
-
-  function:
-  last mod: $Id: dsp_mmx.c 14579 2008-03-12 06:42:40Z xiphmont $
-
- ********************************************************************/
-#include <stddef.h>
-#include "x86enc.h"
-
-#if defined(OC_X86_ASM)
-
-unsigned oc_enc_frag_sad_mmxext(const unsigned char *_src,
- const unsigned char *_ref,int _ystride){
-  ptrdiff_t ret;
-  __asm{
-#define SRC esi
-#define REF edx
-#define YSTRIDE ecx
-#define YSTRIDE3 edi
-    mov YSTRIDE,_ystride
-    mov SRC,_src
-    mov REF,_ref
-    /*Load the first 4 rows of each block.*/
-    movq mm0,[SRC]
-    movq mm1,[REF]
-    movq mm2,[SRC][YSTRIDE]
-    movq mm3,[REF][YSTRIDE]
-    lea YSTRIDE3,[YSTRIDE+YSTRIDE*2]
-    movq mm4,[SRC+YSTRIDE*2]
-    movq mm5,[REF+YSTRIDE*2]
-    movq mm6,[SRC+YSTRIDE3]
-    movq mm7,[REF+YSTRIDE3]
-    /*Compute their SADs and add them in mm0*/
-    psadbw mm0,mm1
-    psadbw mm2,mm3
-    lea SRC,[SRC+YSTRIDE*4]
-    paddw mm0,mm2
-    lea REF,[REF+YSTRIDE*4]
-    /*Load the next 3 rows as registers become available.*/
-    movq mm2,[SRC]
-    movq mm3,[REF]
-    psadbw mm4,mm5
-    psadbw mm6,mm7
-    paddw mm0,mm4
-    movq mm5,[REF+YSTRIDE]
-    movq mm4,[SRC+YSTRIDE]
-    paddw mm0,mm6
-    movq mm7,[REF+YSTRIDE*2]
-    movq mm6,[SRC+YSTRIDE*2]
-    /*Start adding their SADs to mm0*/
-    psadbw mm2,mm3
-    psadbw mm4,mm5
-    paddw mm0,mm2
-    psadbw mm6,mm7
-    /*Load last row as registers become available.*/
-    movq mm2,[SRC+YSTRIDE3]
-    movq mm3,[REF+YSTRIDE3]
-    /*And finish adding up their SADs.*/
-    paddw mm0,mm4
-    psadbw mm2,mm3
-    paddw mm0,mm6
-    paddw mm0,mm2
-    movd [ret],mm0
-#undef SRC
-#undef REF
-#undef YSTRIDE
-#undef YSTRIDE3
-  }
-  return (unsigned)ret;
-}
-
-unsigned oc_enc_frag_sad_thresh_mmxext(const unsigned char *_src,
- const unsigned char *_ref,int _ystride,unsigned _thresh){
-  /*Early termination is for suckers.*/
-  return oc_enc_frag_sad_mmxext(_src,_ref,_ystride);
-}
-
-#define OC_SAD2_LOOP __asm{ \
-  /*We want to compute (mm0+mm1>>1) on unsigned bytes without overflow, but \
-     pavgb computes (mm0+mm1+1>>1). \
-   The latter is exactly 1 too large when the low bit of two corresponding \
-    bytes is only set in one of them. \
-   Therefore we pxor the operands, pand to mask out the low bits, and psubb to \
-    correct the output of pavgb.*/ \
-  __asm  movq mm6,mm0 \
-  __asm  lea REF1,[REF1+YSTRIDE*2] \
-  __asm  pxor mm0,mm1 \
-  __asm  pavgb mm6,mm1 \
-  __asm  lea REF2,[REF2+YSTRIDE*2] \
-  __asm  movq mm1,mm2 \
-  __asm  pand mm0,mm7 \
-  __asm  pavgb mm2,mm3 \
-  __asm  pxor mm1,mm3 \
-  __asm  movq mm3,[REF2+YSTRIDE] \
-  __asm  psubb mm6,mm0 \
-  __asm  movq mm0,[REF1] \
-  __asm  pand mm1,mm7 \
-  __asm  psadbw mm4,mm6 \
-  __asm  movd mm6,RET \
-  __asm  psubb mm2,mm1 \
-  __asm  movq mm1,[REF2] \
-  __asm  lea SRC,[SRC+YSTRIDE*2] \
-  __asm  psadbw mm5,mm2 \
-  __asm  movq mm2,[REF1+YSTRIDE] \
-  __asm  paddw mm5,mm4 \
-  __asm  movq mm4,[SRC] \
-  __asm  paddw mm6,mm5 \
-  __asm  movq mm5,[SRC+YSTRIDE] \
-  __asm  movd RET,mm6 \
-}
-
-/*Same as above, but does not pre-load the next two rows.*/
-#define OC_SAD2_TAIL __asm{ \
-  __asm  movq mm6,mm0 \
-  __asm  pavgb mm0,mm1 \
-  __asm  pxor mm6,mm1 \
-  __asm  movq mm1,mm2 \
-  __asm  pand mm6,mm7 \
-  __asm  pavgb mm2,mm3 \
-  __asm  pxor mm1,mm3 \
-  __asm  psubb mm0,mm6 \
-  __asm  pand mm1,mm7 \
-  __asm  psadbw mm4,mm0 \
-  __asm  psubb mm2,mm1 \
-  __asm  movd mm6,RET \
-  __asm  psadbw mm5,mm2 \
-  __asm  paddw mm5,mm4 \
-  __asm  paddw mm6,mm5 \
-  __asm  movd RET,mm6 \
-}
-
-unsigned oc_enc_frag_sad2_thresh_mmxext(const unsigned char *_src,
- const unsigned char *_ref1,const unsigned char *_ref2,int _ystride,
- unsigned _thresh){
-  ptrdiff_t ret;
-  __asm{
-#define REF1 ecx
-#define REF2 edi
-#define YSTRIDE esi
-#define SRC edx
-#define RET eax
-    mov YSTRIDE,_ystride
-    mov SRC,_src
-    mov REF1,_ref1
-    mov REF2,_ref2
-    movq mm0,[REF1]
-    movq mm1,[REF2]
-    movq mm2,[REF1+YSTRIDE]
-    movq mm3,[REF2+YSTRIDE]
-    xor RET,RET
-    movq mm4,[SRC]
-    pxor mm7,mm7
-    pcmpeqb mm6,mm6
-    movq mm5,[SRC+YSTRIDE]
-    psubb mm7,mm6
-    OC_SAD2_LOOP
-    OC_SAD2_LOOP
-    OC_SAD2_LOOP
-    OC_SAD2_TAIL
-    mov [ret],RET
-#undef REF1
-#undef REF2
-#undef YSTRIDE
-#undef SRC
-#undef RET
-  }
-  return (unsigned)ret;
-}
-
-/*Load an 8x4 array of pixel values from %[src] and %[ref] and compute their
-  16-bit difference in mm0...mm7.*/
-#define OC_LOAD_SUB_8x4(_off) __asm{ \
-  __asm  movd mm0,[_off+SRC] \
-  __asm  movd mm4,[_off+REF] \
-  __asm  movd mm1,[_off+SRC+SRC_YSTRIDE] \
-  __asm  lea SRC,[SRC+SRC_YSTRIDE*2] \
-  __asm  movd mm5,[_off+REF+REF_YSTRIDE] \
-  __asm  lea REF,[REF+REF_YSTRIDE*2] \
-  __asm  movd mm2,[_off+SRC] \
-  __asm  movd mm7,[_off+REF] \
-  __asm  movd mm3,[_off+SRC+SRC_YSTRIDE] \
-  __asm  movd mm6,[_off+REF+REF_YSTRIDE] \
-  __asm  punpcklbw mm0,mm4 \
-  __asm  lea SRC,[SRC+SRC_YSTRIDE*2] \
-  __asm  punpcklbw mm4,mm4 \
-  __asm  lea REF,[REF+REF_YSTRIDE*2] \
-  __asm  psubw mm0,mm4 \
-  __asm  movd mm4,[_off+SRC] \
-  __asm  movq [_off*2+BUF],mm0 \
-  __asm  movd mm0,[_off+REF] \
-  __asm  punpcklbw mm1,mm5 \
-  __asm  punpcklbw mm5,mm5 \
-  __asm  psubw mm1,mm5 \
-  __asm  movd mm5,[_off+SRC+SRC_YSTRIDE] \
-  __asm  punpcklbw mm2,mm7 \
-  __asm  punpcklbw mm7,mm7 \
-  __asm  psubw mm2,mm7 \
-  __asm  movd mm7,[_off+REF+REF_YSTRIDE] \
-  __asm  punpcklbw mm3,mm6 \
-  __asm  lea SRC,[SRC+SRC_YSTRIDE*2] \
-  __asm  punpcklbw mm6,mm6 \
-  __asm  psubw mm3,mm6 \
-  __asm  movd mm6,[_off+SRC] \
-  __asm  punpcklbw mm4,mm0 \
-  __asm  lea REF,[REF+REF_YSTRIDE*2] \
-  __asm  punpcklbw mm0,mm0 \
-  __asm  lea SRC,[SRC+SRC_YSTRIDE*2] \
-  __asm  psubw mm4,mm0 \
-  __asm  movd mm0,[_off+REF] \
-  __asm  punpcklbw mm5,mm7 \
-  __asm  neg SRC_YSTRIDE \
-  __asm  punpcklbw mm7,mm7 \
-  __asm  psubw mm5,mm7 \
-  __asm  movd mm7,[_off+SRC+SRC_YSTRIDE] \
-  __asm  punpcklbw mm6,mm0 \
-  __asm  lea REF,[REF+REF_YSTRIDE*2] \
-  __asm  punpcklbw mm0,mm0 \
-  __asm  neg REF_YSTRIDE \
-  __asm  psubw mm6,mm0 \
-  __asm  movd mm0,[_off+REF+REF_YSTRIDE] \
-  __asm  lea SRC,[SRC+SRC_YSTRIDE*8] \
-  __asm  punpcklbw mm7,mm0 \
-  __asm  neg SRC_YSTRIDE \
-  __asm  punpcklbw mm0,mm0 \
-  __asm  lea REF,[REF+REF_YSTRIDE*8] \
-  __asm  psubw mm7,mm0 \
-  __asm  neg REF_YSTRIDE \
-  __asm  movq mm0,[_off*2+BUF] \
-}
-
-/*Load an 8x4 array of pixel values from %[src] into %%mm0...%%mm7.*/
-#define OC_LOAD_8x4(_off) __asm{ \
-  __asm  movd mm0,[_off+SRC] \
-  __asm  movd mm1,[_off+SRC+YSTRIDE] \
-  __asm  movd mm2,[_off+SRC+YSTRIDE*2] \
-  __asm  pxor mm7,mm7 \
-  __asm  movd mm3,[_off+SRC+YSTRIDE3] \
-  __asm  punpcklbw mm0,mm7 \
-  __asm  movd mm4,[_off+SRC4] \
-  __asm  punpcklbw mm1,mm7 \
-  __asm  movd mm5,[_off+SRC4+YSTRIDE] \
-  __asm  punpcklbw mm2,mm7 \
-  __asm  movd mm6,[_off+SRC4+YSTRIDE*2] \
-  __asm  punpcklbw mm3,mm7 \
-  __asm  movd mm7,[_off+SRC4+YSTRIDE3] \
-  __asm  punpcklbw mm4,mm4 \
-  __asm  punpcklbw mm5,mm5 \
-  __asm  psrlw mm4,8 \
-  __asm  psrlw mm5,8 \
-  __asm  punpcklbw mm6,mm6 \
-  __asm  punpcklbw mm7,mm7 \
-  __asm  psrlw mm6,8 \
-  __asm  psrlw mm7,8 \
-}
-
-/*Performs the first two stages of an 8-point 1-D Hadamard transform.
-  The transform is performed in place, except that outputs 0-3 are swapped with
-   outputs 4-7.
-  Outputs 2, 3, 6 and 7 from the second stage are negated (which allows us to
-   perform this stage in place with no temporary registers).*/
-#define OC_HADAMARD_AB_8x4 __asm{ \
-  /*Stage A: \
-    Outputs 0-3 are swapped with 4-7 here.*/ \
-  __asm  paddw mm5,mm1 \
-  __asm  paddw mm6,mm2 \
-  __asm  paddw mm1,mm1 \
-  __asm  paddw mm2,mm2 \
-  __asm  psubw mm1,mm5 \
-  __asm  psubw mm2,mm6 \
-  __asm  paddw mm7,mm3 \
-  __asm  paddw mm4,mm0 \
-  __asm  paddw mm3,mm3 \
-  __asm  paddw mm0,mm0 \
-  __asm  psubw mm3,mm7 \
-  __asm  psubw mm0,mm4 \
-   /*Stage B:*/ \
-  __asm  paddw mm0,mm2 \
-  __asm  paddw mm1,mm3 \
-  __asm  paddw mm4,mm6 \
-  __asm  paddw mm5,mm7 \
-  __asm  paddw mm2,mm2 \
-  __asm  paddw mm3,mm3 \
-  __asm  paddw mm6,mm6 \
-  __asm  paddw mm7,mm7 \
-  __asm  psubw mm2,mm0 \
-  __asm  psubw mm3,mm1 \
-  __asm  psubw mm6,mm4 \
-  __asm  psubw mm7,mm5 \
-}
-
-/*Performs the last stage of an 8-point 1-D Hadamard transform in place.
-  Ouputs 1, 3, 5, and 7 are negated (which allows us to perform this stage in
-   place with no temporary registers).*/
-#define OC_HADAMARD_C_8x4 __asm{ \
-  /*Stage C:*/ \
-  __asm  paddw mm0,mm1 \
-  __asm  paddw mm2,mm3 \
-  __asm  paddw mm4,mm5 \
-  __asm  paddw mm6,mm7 \
-  __asm  paddw mm1,mm1 \
-  __asm  paddw mm3,mm3 \
-  __asm  paddw mm5,mm5 \
-  __asm  paddw mm7,mm7 \
-  __asm  psubw mm1,mm0 \
-  __asm  psubw mm3,mm2 \
-  __asm  psubw mm5,mm4 \
-  __asm  psubw mm7,mm6 \
-}
-
-/*Performs an 8-point 1-D Hadamard transform.
-  The transform is performed in place, except that outputs 0-3 are swapped with
-   outputs 4-7.
-  Outputs 1, 2, 5 and 6 are negated (which allows us to perform the transform
-   in place with no temporary registers).*/
-#define OC_HADAMARD_8x4 __asm{ \
-  OC_HADAMARD_AB_8x4 \
-  OC_HADAMARD_C_8x4 \
-}
-
-/*Performs the first part of the final stage of the Hadamard transform and
-   summing of absolute values.
-  At the end of this part, mm1 will contain the DC coefficient of the
-   transform.*/
-#define OC_HADAMARD_C_ABS_ACCUM_A_8x4(_r6,_r7) __asm{ \
-  /*We use the fact that \
-      (abs(a+b)+abs(a-b))/2=max(abs(a),abs(b)) \
-     to merge the final butterfly with the abs and the first stage of \
-     accumulation. \
-    Thus we can avoid using pabsw, which is not available until SSSE3. \
-    Emulating pabsw takes 3 instructions, so the straightforward MMXEXT \
-     implementation would be (3+3)*8+7=55 instructions (+4 for spilling \
-     registers). \
-    Even with pabsw, it would be (3+1)*8+7=39 instructions (with no spills). \
-    This implementation is only 26 (+4 for spilling registers).*/ \
-  __asm  movq [_r7+BUF],mm7 \
-  __asm  movq [_r6+BUF],mm6 \
-  /*mm7={0x7FFF}x4 \
-    mm0=max(abs(mm0),abs(mm1))-0x7FFF*/ \
-  __asm  pcmpeqb mm7,mm7 \
-  __asm  movq mm6,mm0 \
-  __asm  psrlw mm7,1 \
-  __asm  paddw mm6,mm1 \
-  __asm  pmaxsw mm0,mm1 \
-  __asm  paddsw mm6,mm7 \
-  __asm  psubw mm0,mm6 \
-  /*mm2=max(abs(mm2),abs(mm3))-0x7FFF \
-    mm4=max(abs(mm4),abs(mm5))-0x7FFF*/ \
-  __asm  movq mm6,mm2 \
-  __asm  movq mm1,mm4 \
-  __asm  pmaxsw mm2,mm3 \
-  __asm  pmaxsw mm4,mm5 \
-  __asm  paddw mm6,mm3 \
-  __asm  paddw mm1,mm5 \
-  __asm  movq mm3,[_r7+BUF] \
-}
-
-/*Performs the second part of the final stage of the Hadamard transform and
-   summing of absolute values.*/
-#define OC_HADAMARD_C_ABS_ACCUM_B_8x4(_r6,_r7) __asm{ \
-  __asm  paddsw mm6,mm7 \
-  __asm  movq mm5,[_r6+BUF] \
-  __asm  paddsw mm1,mm7 \
-  __asm  psubw mm2,mm6 \
-  __asm  psubw mm4,mm1 \
-  /*mm7={1}x4 (needed for the horizontal add that follows) \
-    mm0+=mm2+mm4+max(abs(mm3),abs(mm5))-0x7FFF*/ \
-  __asm  movq mm6,mm3 \
-  __asm  pmaxsw mm3,mm5 \
-  __asm  paddw mm0,mm2 \
-  __asm  paddw mm6,mm5 \
-  __asm  paddw mm0,mm4 \
-  __asm  paddsw mm6,mm7 \
-  __asm  paddw mm0,mm3 \
-  __asm  psrlw mm7,14 \
-  __asm  psubw mm0,mm6 \
-}
-
-/*Performs the last stage of an 8-point 1-D Hadamard transform, takes the
-   absolute value of each component, and accumulates everything into mm0.
-  This is the only portion of SATD which requires MMXEXT (we could use plain
-   MMX, but it takes 4 instructions and an extra register to work around the
-   lack of a pmaxsw, which is a pretty serious penalty).*/
-#define OC_HADAMARD_C_ABS_ACCUM_8x4(_r6,_r7) __asm{ \
-  OC_HADAMARD_C_ABS_ACCUM_A_8x4(_r6,_r7) \
-  OC_HADAMARD_C_ABS_ACCUM_B_8x4(_r6,_r7) \
-}
-
-/*Performs an 8-point 1-D Hadamard transform, takes the absolute value of each
-   component, and accumulates everything into mm0.
-  Note that mm0 will have an extra 4 added to each column, and that after
-   removing this value, the remainder will be half the conventional value.*/
-#define OC_HADAMARD_ABS_ACCUM_8x4(_r6,_r7) __asm{ \
-  OC_HADAMARD_AB_8x4 \
-  OC_HADAMARD_C_ABS_ACCUM_8x4(_r6,_r7) \
-}
-
-/*Performs two 4x4 transposes (mostly) in place.
-  On input, {mm0,mm1,mm2,mm3} contains rows {e,f,g,h}, and {mm4,mm5,mm6,mm7}
-   contains rows {a,b,c,d}.
-  On output, {0x40,0x50,0x60,0x70}+_off+BUF contains {e,f,g,h}^T, and
-   {mm4,mm5,mm6,mm7} contains the transposed rows {a,b,c,d}^T.*/
-#define OC_TRANSPOSE_4x4x2(_off) __asm{ \
-  /*First 4x4 transpose:*/ \
-  __asm  movq [0x10+_off+BUF],mm5 \
-  /*mm0 = e3 e2 e1 e0 \
-    mm1 = f3 f2 f1 f0 \
-    mm2 = g3 g2 g1 g0 \
-    mm3 = h3 h2 h1 h0*/ \
-  __asm  movq mm5,mm2 \
-  __asm  punpcklwd mm2,mm3 \
-  __asm  punpckhwd mm5,mm3 \
-  __asm  movq mm3,mm0 \
-  __asm  punpcklwd mm0,mm1 \
-  __asm  punpckhwd mm3,mm1 \
-  /*mm0 = f1 e1 f0 e0 \
-    mm3 = f3 e3 f2 e2 \
-    mm2 = h1 g1 h0 g0 \
-    mm5 = h3 g3 h2 g2*/ \
-  __asm  movq mm1,mm0 \
-  __asm  punpckldq mm0,mm2 \
-  __asm  punpckhdq mm1,mm2 \
-  __asm  movq mm2,mm3 \
-  __asm  punpckhdq mm3,mm5 \
-  __asm  movq [0x40+_off+BUF],mm0 \
-  __asm  punpckldq mm2,mm5 \
-  /*mm0 = h0 g0 f0 e0 \
-    mm1 = h1 g1 f1 e1 \
-    mm2 = h2 g2 f2 e2 \
-    mm3 = h3 g3 f3 e3*/ \
-  __asm  movq mm5,[0x10+_off+BUF] \
-  /*Second 4x4 transpose:*/ \
-  /*mm4 = a3 a2 a1 a0 \
-    mm5 = b3 b2 b1 b0 \
-    mm6 = c3 c2 c1 c0 \
-    mm7 = d3 d2 d1 d0*/ \
-  __asm  movq mm0,mm6 \
-  __asm  punpcklwd mm6,mm7 \
-  __asm  movq [0x50+_off+BUF],mm1 \
-  __asm  punpckhwd mm0,mm7 \
-  __asm  movq mm7,mm4 \
-  __asm  punpcklwd mm4,mm5 \
-  __asm  movq [0x60+_off+BUF],mm2 \
-  __asm  punpckhwd mm7,mm5 \
-  /*mm4 = b1 a1 b0 a0 \
-    mm7 = b3 a3 b2 a2 \
-    mm6 = d1 c1 d0 c0 \
-    mm0 = d3 c3 d2 c2*/ \
-  __asm  movq mm5,mm4 \
-  __asm  punpckldq mm4,mm6 \
-  __asm  movq [0x70+_off+BUF],mm3 \
-  __asm  punpckhdq mm5,mm6 \
-  __asm  movq mm6,mm7 \
-  __asm  punpckhdq mm7,mm0 \
-  __asm  punpckldq mm6,mm0 \
-  /*mm4 = d0 c0 b0 a0 \
-    mm5 = d1 c1 b1 a1 \
-    mm6 = d2 c2 b2 a2 \
-    mm7 = d3 c3 b3 a3*/ \
-}
-
-static unsigned oc_int_frag_satd_thresh_mmxext(const unsigned char *_src,
- int _src_ystride,const unsigned char *_ref,int _ref_ystride,unsigned _thresh){
-  OC_ALIGN8(ogg_int16_t  buf[64]);
-  ogg_int16_t           *bufp;
-  unsigned               ret1;
-  unsigned               ret2;
-  bufp=buf;
-  __asm{
-#define SRC esi
-#define REF eax
-#define SRC_YSTRIDE ecx
-#define REF_YSTRIDE edx
-#define BUF edi
-#define RET eax
-#define RET2 edx
-    mov SRC,_src
-    mov SRC_YSTRIDE,_src_ystride
-    mov REF,_ref
-    mov REF_YSTRIDE,_ref_ystride
-    mov BUF,bufp
-    OC_LOAD_SUB_8x4(0x00)
-    OC_HADAMARD_8x4
-    OC_TRANSPOSE_4x4x2(0x00)
-    /*Finish swapping out this 8x4 block to make room for the next one.
-      mm0...mm3 have been swapped out already.*/
-    movq [0x00+BUF],mm4
-    movq [0x10+BUF],mm5
-    movq [0x20+BUF],mm6
-    movq [0x30+BUF],mm7
-    OC_LOAD_SUB_8x4(0x04)
-    OC_HADAMARD_8x4
-    OC_TRANSPOSE_4x4x2(0x08)
-    /*Here the first 4x4 block of output from the last transpose is the second
-       4x4 block of input for the next transform.
-      We have cleverly arranged that it already be in the appropriate place, so
-       we only have to do half the loads.*/
-    movq mm1,[0x10+BUF]
-    movq mm2,[0x20+BUF]
-    movq mm3,[0x30+BUF]
-    movq mm0,[0x00+BUF]
-    OC_HADAMARD_ABS_ACCUM_8x4(0x28,0x38)
-    /*Up to this point, everything fit in 16 bits (8 input + 1 for the
-       difference + 2*3 for the two 8-point 1-D Hadamards - 1 for the abs - 1
-       for the factor of two we dropped + 3 for the vertical accumulation).
-      Now we finally have to promote things to dwords.
-      We break this part out of OC_HADAMARD_ABS_ACCUM_8x4 to hide the long
-       latency of pmaddwd by starting the next series of loads now.*/
-    mov RET2,_thresh
-    pmaddwd mm0,mm7
-    movq mm1,[0x50+BUF]
-    movq mm5,[0x58+BUF]
-    movq mm4,mm0
-    movq mm2,[0x60+BUF]
-    punpckhdq mm0,mm0
-    movq mm6,[0x68+BUF]
-    paddd mm4,mm0
-    movq mm3,[0x70+BUF]
-    movd RET,mm4
-    movq mm7,[0x78+BUF]
-    /*The sums produced by OC_HADAMARD_ABS_ACCUM_8x4 each have an extra 4
-       added to them, and a factor of two removed; correct the final sum here.*/
-    lea RET,[RET+RET-32]
-    movq mm0,[0x40+BUF]
-    cmp RET,RET2
-    movq mm4,[0x48+BUF]
-    jae at_end
-    OC_HADAMARD_ABS_ACCUM_8x4(0x68,0x78)
-    pmaddwd mm0,mm7
-    /*There isn't much to stick in here to hide the latency this time, but the
-       alternative to pmaddwd is movq->punpcklwd->punpckhwd->paddd, whose
-       latency is even worse.*/
-    sub RET,32
-    movq mm4,mm0
-    punpckhdq mm0,mm0
-    paddd mm4,mm0
-    movd RET2,mm4
-    lea RET,[RET+RET2*2]
-    align 16
-at_end:
-    mov ret1,RET
-#undef SRC
-#undef REF
-#undef SRC_YSTRIDE
-#undef REF_YSTRIDE
-#undef BUF
-#undef RET
-#undef RET2
-  }
-  return ret1;
-}
-
-unsigned oc_enc_frag_satd_thresh_mmxext(const unsigned char *_src,
- const unsigned char *_ref,int _ystride,unsigned _thresh){
-  return oc_int_frag_satd_thresh_mmxext(_src,_ystride,_ref,_ystride,_thresh);
-}
-
-
-/*Our internal implementation of frag_copy2 takes an extra stride parameter so
-   we can share code with oc_enc_frag_satd2_thresh_mmxext().*/
-static void oc_int_frag_copy2_mmxext(unsigned char *_dst,int _dst_ystride,
- const unsigned char *_src1,const unsigned char *_src2,int _src_ystride){
-  __asm{
-    /*Load the first 3 rows.*/
-#define DST_YSTRIDE edi
-#define SRC_YSTRIDE esi
-#define DST eax
-#define SRC1 edx
-#define SRC2 ecx
-    mov DST_YSTRIDE,_dst_ystride
-    mov SRC_YSTRIDE,_src_ystride
-    mov DST,_dst
-    mov SRC1,_src1
-    mov SRC2,_src2
-    movq mm0,[SRC1]
-    movq mm1,[SRC2]
-    movq mm2,[SRC1+SRC_YSTRIDE]
-    lea SRC1,[SRC1+SRC_YSTRIDE*2]
-    movq mm3,[SRC2+SRC_YSTRIDE]
-    lea SRC2,[SRC2+SRC_YSTRIDE*2]
-    pxor mm7,mm7
-    movq mm4,[SRC1]
-    pcmpeqb mm6,mm6
-    movq mm5,[SRC2]
-    /*mm7={1}x8.*/
-    psubb mm7,mm6
-    /*Start averaging mm0 and mm1 into mm6.*/
-    movq mm6,mm0
-    pxor mm0,mm1
-    pavgb mm6,mm1
-    /*mm1 is free, start averaging mm3 into mm2 using mm1.*/
-    movq mm1,mm2
-    pand mm0,mm7
-    pavgb mm2,mm3
-    pxor mm1,mm3
-    /*mm3 is free.*/
-    psubb mm6,mm0
-    /*mm0 is free, start loading the next row.*/
-    movq mm0,[SRC1+SRC_YSTRIDE]
-    /*Start averaging mm5 and mm4 using mm3.*/
-    movq mm3,mm4
-    /*mm6 [row 0] is done; write it out.*/
-    movq [DST],mm6
-    pand mm1,mm7
-    pavgb mm4,mm5
-    psubb mm2,mm1
-    /*mm1 is free, continue loading the next row.*/
-    movq mm1,[SRC2+SRC_YSTRIDE]
-    pxor mm3,mm5
-    lea SRC1,[SRC1+SRC_YSTRIDE*2]
-    /*mm2 [row 1] is done; write it out.*/
-    movq [DST+DST_YSTRIDE],mm2
-    pand mm3,mm7
-    /*Start loading the next row.*/
-    movq mm2,[SRC1]
-    lea DST,[DST+DST_YSTRIDE*2]
-    psubb mm4,mm3
-    lea SRC2,[SRC2+SRC_YSTRIDE*2]
-    /*mm4 [row 2] is done; write it out.*/
-    movq [DST],mm4
-    /*Continue loading the next row.*/
-    movq mm3,[SRC2]
-    /*Start averaging mm0 and mm1 into mm6.*/
-    movq mm6,mm0
-    pxor mm0,mm1
-    /*Start loading the next row.*/
-    movq mm4,[SRC1+SRC_YSTRIDE]
-    pavgb mm6,mm1
-    /*mm1 is free; start averaging mm3 into mm2 using mm1.*/
-    movq mm1,mm2
-    pand mm0,mm7
-    /*Continue loading the next row.*/
-    movq mm5,[SRC2+SRC_YSTRIDE]
-    pavgb mm2,mm3
-    lea SRC1,[SRC1+SRC_YSTRIDE*2]
-    pxor mm1,mm3
-    /*mm3 is free.*/
-    psubb mm6,mm0
-    /*mm0 is free, start loading the next row.*/
-    movq mm0,[SRC1]
-    /*Start averaging mm5 into mm4 using mm3.*/
-    movq mm3,mm4
-    /*mm6 [row 3] is done; write it out.*/
-    movq [DST+DST_YSTRIDE],mm6
-    pand mm1,mm7
-    lea SRC2,[SRC2+SRC_YSTRIDE*2]
-    pavgb mm4,mm5
-    lea DST,[DST+DST_YSTRIDE*2]
-    psubb mm2,mm1
-    /*mm1 is free; continue loading the next row.*/
-    movq mm1,[SRC2]
-    pxor mm3,mm5
-    /*mm2 [row 4] is done; write it out.*/
-    movq [DST],mm2
-    pand mm3,mm7
-    /*Start loading the next row.*/
-    movq mm2,[SRC1+SRC_YSTRIDE]
-    psubb mm4,mm3
-    /*Start averaging mm0 and mm1 into mm6.*/
-    movq mm6,mm0
-    /*Continue loading the next row.*/
-    movq mm3,[SRC2+SRC_YSTRIDE]
-    /*mm4 [row 5] is done; write it out.*/
-    movq [DST+DST_YSTRIDE],mm4
-    pxor mm0,mm1
-    pavgb mm6,mm1
-    /*mm4 is free; start averaging mm3 into mm2 using mm4.*/
-    movq mm4,mm2
-    pand mm0,mm7
-    pavgb mm2,mm3
-    pxor mm4,mm3
-    lea DST,[DST+DST_YSTRIDE*2]
-    psubb mm6,mm0
-    pand mm4,mm7
-    /*mm6 [row 6] is done, write it out.*/
-    movq [DST],mm6
-    psubb mm2,mm4
-    /*mm2 [row 7] is done, write it out.*/
-    movq [DST+DST_YSTRIDE],mm2
-#undef SRC1
-#undef SRC2
-#undef SRC_YSTRIDE
-#undef DST_YSTRIDE
-#undef DST
-  }
-}
-
-unsigned oc_enc_frag_satd2_thresh_mmxext(const unsigned char *_src,
- const unsigned char *_ref1,const unsigned char *_ref2,int _ystride,
- unsigned _thresh){
-  OC_ALIGN8(unsigned char ref[64]);
-  oc_int_frag_copy2_mmxext(ref,8,_ref1,_ref2,_ystride);
-  return oc_int_frag_satd_thresh_mmxext(_src,_ystride,ref,8,_thresh);
-}
-
-unsigned oc_enc_frag_intra_satd_mmxext(const unsigned char *_src,
- int _ystride){
-  OC_ALIGN8(ogg_int16_t  buf[64]);
-  ogg_int16_t           *bufp;
-  unsigned               ret1;
-  unsigned               ret2;
-  bufp=buf;
-  __asm{
-#define SRC eax
-#define SRC4 esi
-#define BUF edi
-#define RET eax
-#define RET_WORD ax
-#define RET2 ecx
-#define YSTRIDE edx
-#define YSTRIDE3 ecx
-    mov SRC,_src
-    mov BUF,bufp
-    mov YSTRIDE,_ystride
-    /* src4 = src+4*ystride */
-    lea SRC4,[SRC+YSTRIDE*4]
-    /* ystride3 = 3*ystride */
-    lea YSTRIDE3,[YSTRIDE+YSTRIDE*2]
-    OC_LOAD_8x4(0x00)
-    OC_HADAMARD_8x4
-    OC_TRANSPOSE_4x4x2(0x00)
-    /*Finish swapping out this 8x4 block to make room for the next one.
-      mm0...mm3 have been swapped out already.*/
-    movq [0x00+BUF],mm4
-    movq [0x10+BUF],mm5
-    movq [0x20+BUF],mm6
-    movq [0x30+BUF],mm7
-    OC_LOAD_8x4(0x04)
-    OC_HADAMARD_8x4
-    OC_TRANSPOSE_4x4x2(0x08)
-    /*Here the first 4x4 block of output from the last transpose is the second
-      4x4 block of input for the next transform.
-      We have cleverly arranged that it already be in the appropriate place, so
-      we only have to do half the loads.*/
-    movq mm1,[0x10+BUF]
-    movq mm2,[0x20+BUF]
-    movq mm3,[0x30+BUF]
-    movq mm0,[0x00+BUF]
-    /*We split out the stages here so we can save the DC coefficient in the
-      middle.*/
-    OC_HADAMARD_AB_8x4
-    OC_HADAMARD_C_ABS_ACCUM_A_8x4(0x28,0x38)
-    movd RET,mm1
-    OC_HADAMARD_C_ABS_ACCUM_B_8x4(0x28,0x38)
-    /*Up to this point, everything fit in 16 bits (8 input + 1 for the
-      difference + 2*3 for the two 8-point 1-D Hadamards - 1 for the abs - 1
-      for the factor of two we dropped + 3 for the vertical accumulation).
-      Now we finally have to promote things to dwords.
-      We break this part out of OC_HADAMARD_ABS_ACCUM_8x4 to hide the long
-      latency of pmaddwd by starting the next series of loads now.*/
-    pmaddwd mm0,mm7
-    movq mm1,[0x50+BUF]
-    movq mm5,[0x58+BUF]
-    movq mm2,[0x60+BUF]
-    movq mm4,mm0
-    movq mm6,[0x68+BUF]
-    punpckhdq mm0,mm0
-    movq mm3,[0x70+BUF]
-    paddd mm4,mm0
-    movq mm7,[0x78+BUF]
-    movd RET2,mm4
-    movq mm0,[0x40+BUF]
-    movq mm4,[0x48+BUF]
-    OC_HADAMARD_ABS_ACCUM_8x4(0x68,0x78)
-    pmaddwd mm0,mm7
-    /*We assume that the DC coefficient is always positive (which is true,
-    because the input to the INTRA transform was not a difference).*/
-    movzx RET,RET_WORD
-    add RET2,RET2
-    sub RET2,RET
-    movq mm4,mm0
-    punpckhdq mm0,mm0
-    paddd mm4,mm0
-    movd RET,mm4
-    lea RET,[-64+RET2+RET*2]
-    mov [ret1],RET
-#undef SRC
-#undef SRC4
-#undef BUF
-#undef RET
-#undef RET_WORD
-#undef RET2
-#undef YSTRIDE
-#undef YSTRIDE3
-  }
-  return ret1;
-}
-
-void oc_enc_frag_sub_mmx(ogg_int16_t _residue[64],
- const unsigned char *_src, const unsigned char *_ref,int _ystride){
-  int i;
-  __asm  pxor mm7,mm7
-  for(i=4;i-->0;){
-    __asm{
-#define SRC edx
-#define YSTRIDE esi
-#define RESIDUE eax
-#define REF ecx
-      mov YSTRIDE,_ystride
-      mov RESIDUE,_residue
-      mov SRC,_src
-      mov REF,_ref
-      /*mm0=[src]*/
-      movq mm0,[SRC]
-      /*mm1=[ref]*/
-      movq mm1,[REF]
-      /*mm4=[src+ystride]*/
-      movq mm4,[SRC+YSTRIDE]
-      /*mm5=[ref+ystride]*/
-      movq mm5,[REF+YSTRIDE]
-      /*Compute [src]-[ref].*/
-      movq mm2,mm0
-      punpcklbw mm0,mm7
-      movq mm3,mm1
-      punpckhbw mm2,mm7
-      punpcklbw mm1,mm7
-      punpckhbw mm3,mm7
-      psubw mm0,mm1
-      psubw mm2,mm3
-      /*Compute [src+ystride]-[ref+ystride].*/
-      movq mm1,mm4
-      punpcklbw mm4,mm7
-      movq mm3,mm5
-      punpckhbw mm1,mm7
-      lea SRC,[SRC+YSTRIDE*2]
-      punpcklbw mm5,mm7
-      lea REF,[REF+YSTRIDE*2]
-      punpckhbw mm3,mm7
-      psubw mm4,mm5
-      psubw mm1,mm3
-      /*Write the answer out.*/
-      movq [RESIDUE+0x00],mm0
-      movq [RESIDUE+0x08],mm2
-      movq [RESIDUE+0x10],mm4
-      movq [RESIDUE+0x18],mm1
-      lea RESIDUE,[RESIDUE+0x20]
-      mov _residue,RESIDUE
-      mov _src,SRC
-      mov _ref,REF
-#undef SRC
-#undef YSTRIDE
-#undef RESIDUE
-#undef REF
-    }
-  }
-}
-
-void oc_enc_frag_sub_128_mmx(ogg_int16_t _residue[64],
- const unsigned char *_src,int _ystride){
-   __asm{
-#define YSTRIDE edx
-#define YSTRIDE3 edi
-#define RESIDUE ecx
-#define SRC eax
-    mov YSTRIDE,_ystride
-    mov RESIDUE,_residue
-    mov SRC,_src
-    /*mm0=[src]*/
-    movq mm0,[SRC]
-    /*mm1=[src+ystride]*/
-    movq mm1,[SRC+YSTRIDE]
-    /*mm6={-1}x4*/
-    pcmpeqw mm6,mm6
-    /*mm2=[src+2*ystride]*/
-    movq mm2,[SRC+YSTRIDE*2]
-    /*[ystride3]=3*[ystride]*/
-    lea YSTRIDE3,[YSTRIDE+YSTRIDE*2]
-    /*mm6={0x8000}x4*/
-    psllw mm6,15
-    /*mm3=[src+3*ystride]*/
-    movq mm3,[SRC+YSTRIDE3]
-    /*mm6={128}x4*/
-    psrlw mm6,8
-    /*mm7=0*/ 
-    pxor mm7,mm7
-    /*[src]=[src]+4*[ystride]*/
-    lea SRC,[SRC+YSTRIDE*4]
-    /*Compute [src]-128 and [src+ystride]-128*/
-    movq mm4,mm0
-    punpcklbw mm0,mm7
-    movq mm5,mm1
-    punpckhbw mm4,mm7
-    psubw mm0,mm6
-    punpcklbw mm1,mm7
-    psubw mm4,mm6
-    punpckhbw mm5,mm7
-    psubw mm1,mm6
-    psubw mm5,mm6
-    /*Write the answer out.*/
-    movq [RESIDUE+0x00],mm0
-    movq [RESIDUE+0x08],mm4
-    movq [RESIDUE+0x10],mm1
-    movq [RESIDUE+0x18],mm5
-    /*mm0=[src+4*ystride]*/
-    movq mm0,[SRC]
-    /*mm1=[src+5*ystride]*/
-    movq mm1,[SRC+YSTRIDE]
-    /*Compute [src+2*ystride]-128 and [src+3*ystride]-128*/
-    movq mm4,mm2
-    punpcklbw mm2,mm7
-    movq mm5,mm3
-    punpckhbw mm4,mm7
-    psubw mm2,mm6
-    punpcklbw mm3,mm7
-    psubw mm4,mm6
-    punpckhbw mm5,mm7
-    psubw mm3,mm6
-    psubw mm5,mm6
-    /*Write the answer out.*/
-    movq [RESIDUE+0x20],mm2
-    movq [RESIDUE+0x28],mm4
-    movq [RESIDUE+0x30],mm3
-    movq [RESIDUE+0x38],mm5
-    /*Compute [src+4*ystride]-128 and [src+5*ystride]-128*/
-    movq mm2,[SRC+YSTRIDE*2]
-    movq mm3,[SRC+YSTRIDE3]
-    movq mm4,mm0
-    punpcklbw mm0,mm7
-    movq mm5,mm1
-    punpckhbw mm4,mm7
-    psubw mm0,mm6
-    punpcklbw mm1,mm7
-    psubw mm4,mm6
-    punpckhbw mm5,mm7
-    psubw mm1,mm6
-    psubw mm5,mm6
-    /*Write the answer out.*/
-    movq [RESIDUE+0x40],mm0
-    movq [RESIDUE+0x48],mm4
-    movq [RESIDUE+0x50],mm1
-    movq [RESIDUE+0x58],mm5
-    /*Compute [src+6*ystride]-128 and [src+7*ystride]-128*/
-    movq mm4,mm2
-    punpcklbw mm2,mm7
-    movq mm5,mm3
-    punpckhbw mm4,mm7
-    psubw mm2,mm6
-    punpcklbw mm3,mm7
-    psubw mm4,mm6
-    punpckhbw mm5,mm7
-    psubw mm3,mm6
-    psubw mm5,mm6
-    /*Write the answer out.*/
-    movq [RESIDUE+0x60],mm2
-    movq [RESIDUE+0x68],mm4
-    movq [RESIDUE+0x70],mm3
-    movq [RESIDUE+0x78],mm5
-#undef YSTRIDE
-#undef YSTRIDE3
-#undef RESIDUE
-#undef SRC
-  }
-}
-
-void oc_enc_frag_copy2_mmxext(unsigned char *_dst,
- const unsigned char *_src1,const unsigned char *_src2,int _ystride){
-  oc_int_frag_copy2_mmxext(_dst,_ystride,_src1,_src2,_ystride);
-}
-
-#endif
+/********************************************************************
+ *                                                                  *
+ * THIS FILE IS PART OF THE OggTheora SOFTWARE CODEC SOURCE CODE.   *
+ * USE, DISTRIBUTION AND REPRODUCTION OF THIS LIBRARY SOURCE IS     *
+ * GOVERNED BY A BSD-STYLE SOURCE LICENSE INCLUDED WITH THIS SOURCE *
+ * IN 'COPYING'. PLEASE READ THESE TERMS BEFORE DISTRIBUTING.       *
+ *                                                                  *
+ * THE Theora SOURCE CODE IS COPYRIGHT (C) 2002-2009                *
+ * by the Xiph.Org Foundation http://www.xiph.org/                  *
+ *                                                                  *
+ ********************************************************************
+
+  function:
+  last mod: $Id: dsp_mmx.c 14579 2008-03-12 06:42:40Z xiphmont $
+
+ ********************************************************************/
+#include <stddef.h>
+#include "x86enc.h"
+
+#if defined(OC_X86_ASM)
+
+unsigned oc_enc_frag_sad_mmxext(const unsigned char *_src,
+ const unsigned char *_ref,int _ystride){
+  ptrdiff_t ret;
+  __asm{
+#define SRC esi
+#define REF edx
+#define YSTRIDE ecx
+#define YSTRIDE3 edi
+    mov YSTRIDE,_ystride
+    mov SRC,_src
+    mov REF,_ref
+    /*Load the first 4 rows of each block.*/
+    movq mm0,[SRC]
+    movq mm1,[REF]
+    movq mm2,[SRC][YSTRIDE]
+    movq mm3,[REF][YSTRIDE]
+    lea YSTRIDE3,[YSTRIDE+YSTRIDE*2]
+    movq mm4,[SRC+YSTRIDE*2]
+    movq mm5,[REF+YSTRIDE*2]
+    movq mm6,[SRC+YSTRIDE3]
+    movq mm7,[REF+YSTRIDE3]
+    /*Compute their SADs and add them in mm0*/
+    psadbw mm0,mm1
+    psadbw mm2,mm3
+    lea SRC,[SRC+YSTRIDE*4]
+    paddw mm0,mm2
+    lea REF,[REF+YSTRIDE*4]
+    /*Load the next 3 rows as registers become available.*/
+    movq mm2,[SRC]
+    movq mm3,[REF]
+    psadbw mm4,mm5
+    psadbw mm6,mm7
+    paddw mm0,mm4
+    movq mm5,[REF+YSTRIDE]
+    movq mm4,[SRC+YSTRIDE]
+    paddw mm0,mm6
+    movq mm7,[REF+YSTRIDE*2]
+    movq mm6,[SRC+YSTRIDE*2]
+    /*Start adding their SADs to mm0*/
+    psadbw mm2,mm3
+    psadbw mm4,mm5
+    paddw mm0,mm2
+    psadbw mm6,mm7
+    /*Load last row as registers become available.*/
+    movq mm2,[SRC+YSTRIDE3]
+    movq mm3,[REF+YSTRIDE3]
+    /*And finish adding up their SADs.*/
+    paddw mm0,mm4
+    psadbw mm2,mm3
+    paddw mm0,mm6
+    paddw mm0,mm2
+    movd [ret],mm0
+#undef SRC
+#undef REF
+#undef YSTRIDE
+#undef YSTRIDE3
+  }
+  return (unsigned)ret;
+}
+
+unsigned oc_enc_frag_sad_thresh_mmxext(const unsigned char *_src,
+ const unsigned char *_ref,int _ystride,unsigned _thresh){
+  /*Early termination is for suckers.*/
+  return oc_enc_frag_sad_mmxext(_src,_ref,_ystride);
+}
+
+#define OC_SAD2_LOOP __asm{ \
+  /*We want to compute (mm0+mm1>>1) on unsigned bytes without overflow, but \
+     pavgb computes (mm0+mm1+1>>1). \
+   The latter is exactly 1 too large when the low bit of two corresponding \
+    bytes is only set in one of them. \
+   Therefore we pxor the operands, pand to mask out the low bits, and psubb to \
+    correct the output of pavgb.*/ \
+  __asm  movq mm6,mm0 \
+  __asm  lea REF1,[REF1+YSTRIDE*2] \
+  __asm  pxor mm0,mm1 \
+  __asm  pavgb mm6,mm1 \
+  __asm  lea REF2,[REF2+YSTRIDE*2] \
+  __asm  movq mm1,mm2 \
+  __asm  pand mm0,mm7 \
+  __asm  pavgb mm2,mm3 \
+  __asm  pxor mm1,mm3 \
+  __asm  movq mm3,[REF2+YSTRIDE] \
+  __asm  psubb mm6,mm0 \
+  __asm  movq mm0,[REF1] \
+  __asm  pand mm1,mm7 \
+  __asm  psadbw mm4,mm6 \
+  __asm  movd mm6,RET \
+  __asm  psubb mm2,mm1 \
+  __asm  movq mm1,[REF2] \
+  __asm  lea SRC,[SRC+YSTRIDE*2] \
+  __asm  psadbw mm5,mm2 \
+  __asm  movq mm2,[REF1+YSTRIDE] \
+  __asm  paddw mm5,mm4 \
+  __asm  movq mm4,[SRC] \
+  __asm  paddw mm6,mm5 \
+  __asm  movq mm5,[SRC+YSTRIDE] \
+  __asm  movd RET,mm6 \
+}
+
+/*Same as above, but does not pre-load the next two rows.*/
+#define OC_SAD2_TAIL __asm{ \
+  __asm  movq mm6,mm0 \
+  __asm  pavgb mm0,mm1 \
+  __asm  pxor mm6,mm1 \
+  __asm  movq mm1,mm2 \
+  __asm  pand mm6,mm7 \
+  __asm  pavgb mm2,mm3 \
+  __asm  pxor mm1,mm3 \
+  __asm  psubb mm0,mm6 \
+  __asm  pand mm1,mm7 \
+  __asm  psadbw mm4,mm0 \
+  __asm  psubb mm2,mm1 \
+  __asm  movd mm6,RET \
+  __asm  psadbw mm5,mm2 \
+  __asm  paddw mm5,mm4 \
+  __asm  paddw mm6,mm5 \
+  __asm  movd RET,mm6 \
+}
+
+unsigned oc_enc_frag_sad2_thresh_mmxext(const unsigned char *_src,
+ const unsigned char *_ref1,const unsigned char *_ref2,int _ystride,
+ unsigned _thresh){
+  ptrdiff_t ret;
+  __asm{
+#define REF1 ecx
+#define REF2 edi
+#define YSTRIDE esi
+#define SRC edx
+#define RET eax
+    mov YSTRIDE,_ystride
+    mov SRC,_src
+    mov REF1,_ref1
+    mov REF2,_ref2
+    movq mm0,[REF1]
+    movq mm1,[REF2]
+    movq mm2,[REF1+YSTRIDE]
+    movq mm3,[REF2+YSTRIDE]
+    xor RET,RET
+    movq mm4,[SRC]
+    pxor mm7,mm7
+    pcmpeqb mm6,mm6
+    movq mm5,[SRC+YSTRIDE]
+    psubb mm7,mm6
+    OC_SAD2_LOOP
+    OC_SAD2_LOOP
+    OC_SAD2_LOOP
+    OC_SAD2_TAIL
+    mov [ret],RET
+#undef REF1
+#undef REF2
+#undef YSTRIDE
+#undef SRC
+#undef RET
+  }
+  return (unsigned)ret;
+}
+
+/*Load an 8x4 array of pixel values from %[src] and %[ref] and compute their
+  16-bit difference in mm0...mm7.*/
+#define OC_LOAD_SUB_8x4(_off) __asm{ \
+  __asm  movd mm0,[_off+SRC] \
+  __asm  movd mm4,[_off+REF] \
+  __asm  movd mm1,[_off+SRC+SRC_YSTRIDE] \
+  __asm  lea SRC,[SRC+SRC_YSTRIDE*2] \
+  __asm  movd mm5,[_off+REF+REF_YSTRIDE] \
+  __asm  lea REF,[REF+REF_YSTRIDE*2] \
+  __asm  movd mm2,[_off+SRC] \
+  __asm  movd mm7,[_off+REF] \
+  __asm  movd mm3,[_off+SRC+SRC_YSTRIDE] \
+  __asm  movd mm6,[_off+REF+REF_YSTRIDE] \
+  __asm  punpcklbw mm0,mm4 \
+  __asm  lea SRC,[SRC+SRC_YSTRIDE*2] \
+  __asm  punpcklbw mm4,mm4 \
+  __asm  lea REF,[REF+REF_YSTRIDE*2] \
+  __asm  psubw mm0,mm4 \
+  __asm  movd mm4,[_off+SRC] \
+  __asm  movq [_off*2+BUF],mm0 \
+  __asm  movd mm0,[_off+REF] \
+  __asm  punpcklbw mm1,mm5 \
+  __asm  punpcklbw mm5,mm5 \
+  __asm  psubw mm1,mm5 \
+  __asm  movd mm5,[_off+SRC+SRC_YSTRIDE] \
+  __asm  punpcklbw mm2,mm7 \
+  __asm  punpcklbw mm7,mm7 \
+  __asm  psubw mm2,mm7 \
+  __asm  movd mm7,[_off+REF+REF_YSTRIDE] \
+  __asm  punpcklbw mm3,mm6 \
+  __asm  lea SRC,[SRC+SRC_YSTRIDE*2] \
+  __asm  punpcklbw mm6,mm6 \
+  __asm  psubw mm3,mm6 \
+  __asm  movd mm6,[_off+SRC] \
+  __asm  punpcklbw mm4,mm0 \
+  __asm  lea REF,[REF+REF_YSTRIDE*2] \
+  __asm  punpcklbw mm0,mm0 \
+  __asm  lea SRC,[SRC+SRC_YSTRIDE*2] \
+  __asm  psubw mm4,mm0 \
+  __asm  movd mm0,[_off+REF] \
+  __asm  punpcklbw mm5,mm7 \
+  __asm  neg SRC_YSTRIDE \
+  __asm  punpcklbw mm7,mm7 \
+  __asm  psubw mm5,mm7 \
+  __asm  movd mm7,[_off+SRC+SRC_YSTRIDE] \
+  __asm  punpcklbw mm6,mm0 \
+  __asm  lea REF,[REF+REF_YSTRIDE*2] \
+  __asm  punpcklbw mm0,mm0 \
+  __asm  neg REF_YSTRIDE \
+  __asm  psubw mm6,mm0 \
+  __asm  movd mm0,[_off+REF+REF_YSTRIDE] \
+  __asm  lea SRC,[SRC+SRC_YSTRIDE*8] \
+  __asm  punpcklbw mm7,mm0 \
+  __asm  neg SRC_YSTRIDE \
+  __asm  punpcklbw mm0,mm0 \
+  __asm  lea REF,[REF+REF_YSTRIDE*8] \
+  __asm  psubw mm7,mm0 \
+  __asm  neg REF_YSTRIDE \
+  __asm  movq mm0,[_off*2+BUF] \
+}
+
+/*Load an 8x4 array of pixel values from %[src] into %%mm0...%%mm7.*/
+#define OC_LOAD_8x4(_off) __asm{ \
+  __asm  movd mm0,[_off+SRC] \
+  __asm  movd mm1,[_off+SRC+YSTRIDE] \
+  __asm  movd mm2,[_off+SRC+YSTRIDE*2] \
+  __asm  pxor mm7,mm7 \
+  __asm  movd mm3,[_off+SRC+YSTRIDE3] \
+  __asm  punpcklbw mm0,mm7 \
+  __asm  movd mm4,[_off+SRC4] \
+  __asm  punpcklbw mm1,mm7 \
+  __asm  movd mm5,[_off+SRC4+YSTRIDE] \
+  __asm  punpcklbw mm2,mm7 \
+  __asm  movd mm6,[_off+SRC4+YSTRIDE*2] \
+  __asm  punpcklbw mm3,mm7 \
+  __asm  movd mm7,[_off+SRC4+YSTRIDE3] \
+  __asm  punpcklbw mm4,mm4 \
+  __asm  punpcklbw mm5,mm5 \
+  __asm  psrlw mm4,8 \
+  __asm  psrlw mm5,8 \
+  __asm  punpcklbw mm6,mm6 \
+  __asm  punpcklbw mm7,mm7 \
+  __asm  psrlw mm6,8 \
+  __asm  psrlw mm7,8 \
+}
+
+/*Performs the first two stages of an 8-point 1-D Hadamard transform.
+  The transform is performed in place, except that outputs 0-3 are swapped with
+   outputs 4-7.
+  Outputs 2, 3, 6 and 7 from the second stage are negated (which allows us to
+   perform this stage in place with no temporary registers).*/
+#define OC_HADAMARD_AB_8x4 __asm{ \
+  /*Stage A: \
+    Outputs 0-3 are swapped with 4-7 here.*/ \
+  __asm  paddw mm5,mm1 \
+  __asm  paddw mm6,mm2 \
+  __asm  paddw mm1,mm1 \
+  __asm  paddw mm2,mm2 \
+  __asm  psubw mm1,mm5 \
+  __asm  psubw mm2,mm6 \
+  __asm  paddw mm7,mm3 \
+  __asm  paddw mm4,mm0 \
+  __asm  paddw mm3,mm3 \
+  __asm  paddw mm0,mm0 \
+  __asm  psubw mm3,mm7 \
+  __asm  psubw mm0,mm4 \
+   /*Stage B:*/ \
+  __asm  paddw mm0,mm2 \
+  __asm  paddw mm1,mm3 \
+  __asm  paddw mm4,mm6 \
+  __asm  paddw mm5,mm7 \
+  __asm  paddw mm2,mm2 \
+  __asm  paddw mm3,mm3 \
+  __asm  paddw mm6,mm6 \
+  __asm  paddw mm7,mm7 \
+  __asm  psubw mm2,mm0 \
+  __asm  psubw mm3,mm1 \
+  __asm  psubw mm6,mm4 \
+  __asm  psubw mm7,mm5 \
+}
+
+/*Performs the last stage of an 8-point 1-D Hadamard transform in place.
+  Outputs 1, 3, 5, and 7 are negated (which allows us to perform this stage in
+   place with no temporary registers).*/
+#define OC_HADAMARD_C_8x4 __asm{ \
+  /*Stage C:*/ \
+  __asm  paddw mm0,mm1 \
+  __asm  paddw mm2,mm3 \
+  __asm  paddw mm4,mm5 \
+  __asm  paddw mm6,mm7 \
+  __asm  paddw mm1,mm1 \
+  __asm  paddw mm3,mm3 \
+  __asm  paddw mm5,mm5 \
+  __asm  paddw mm7,mm7 \
+  __asm  psubw mm1,mm0 \
+  __asm  psubw mm3,mm2 \
+  __asm  psubw mm5,mm4 \
+  __asm  psubw mm7,mm6 \
+}
+
+/*Performs an 8-point 1-D Hadamard transform.
+  The transform is performed in place, except that outputs 0-3 are swapped with
+   outputs 4-7.
+  Outputs 1, 2, 5 and 6 are negated (which allows us to perform the transform
+   in place with no temporary registers).*/
+#define OC_HADAMARD_8x4 __asm{ \
+  OC_HADAMARD_AB_8x4 \
+  OC_HADAMARD_C_8x4 \
+}
+
+/*Performs the first part of the final stage of the Hadamard transform and
+   summing of absolute values.
+  At the end of this part, mm1 will contain the DC coefficient of the
+   transform.*/
+#define OC_HADAMARD_C_ABS_ACCUM_A_8x4(_r6,_r7) __asm{ \
+  /*We use the fact that \
+      (abs(a+b)+abs(a-b))/2=max(abs(a),abs(b)) \
+     to merge the final butterfly with the abs and the first stage of \
+     accumulation. \
+    Thus we can avoid using pabsw, which is not available until SSSE3. \
+    Emulating pabsw takes 3 instructions, so the straightforward MMXEXT \
+     implementation would be (3+3)*8+7=55 instructions (+4 for spilling \
+     registers). \
+    Even with pabsw, it would be (3+1)*8+7=39 instructions (with no spills). \
+    This implementation is only 26 (+4 for spilling registers).*/ \
+  __asm  movq [_r7+BUF],mm7 \
+  __asm  movq [_r6+BUF],mm6 \
+  /*mm7={0x7FFF}x4 \
+    mm0=max(abs(mm0),abs(mm1))-0x7FFF*/ \
+  __asm  pcmpeqb mm7,mm7 \
+  __asm  movq mm6,mm0 \
+  __asm  psrlw mm7,1 \
+  __asm  paddw mm6,mm1 \
+  __asm  pmaxsw mm0,mm1 \
+  __asm  paddsw mm6,mm7 \
+  __asm  psubw mm0,mm6 \
+  /*mm2=max(abs(mm2),abs(mm3))-0x7FFF \
+    mm4=max(abs(mm4),abs(mm5))-0x7FFF*/ \
+  __asm  movq mm6,mm2 \
+  __asm  movq mm1,mm4 \
+  __asm  pmaxsw mm2,mm3 \
+  __asm  pmaxsw mm4,mm5 \
+  __asm  paddw mm6,mm3 \
+  __asm  paddw mm1,mm5 \
+  __asm  movq mm3,[_r7+BUF] \
+}
+
+/*Performs the second part of the final stage of the Hadamard transform and
+   summing of absolute values.*/
+#define OC_HADAMARD_C_ABS_ACCUM_B_8x4(_r6,_r7) __asm{ \
+  __asm  paddsw mm6,mm7 \
+  __asm  movq mm5,[_r6+BUF] \
+  __asm  paddsw mm1,mm7 \
+  __asm  psubw mm2,mm6 \
+  __asm  psubw mm4,mm1 \
+  /*mm7={1}x4 (needed for the horizontal add that follows) \
+    mm0+=mm2+mm4+max(abs(mm3),abs(mm5))-0x7FFF*/ \
+  __asm  movq mm6,mm3 \
+  __asm  pmaxsw mm3,mm5 \
+  __asm  paddw mm0,mm2 \
+  __asm  paddw mm6,mm5 \
+  __asm  paddw mm0,mm4 \
+  __asm  paddsw mm6,mm7 \
+  __asm  paddw mm0,mm3 \
+  __asm  psrlw mm7,14 \
+  __asm  psubw mm0,mm6 \
+}
+
+/*Performs the last stage of an 8-point 1-D Hadamard transform, takes the
+   absolute value of each component, and accumulates everything into mm0.
+  This is the only portion of SATD which requires MMXEXT (we could use plain
+   MMX, but it takes 4 instructions and an extra register to work around the
+   lack of a pmaxsw, which is a pretty serious penalty).*/
+#define OC_HADAMARD_C_ABS_ACCUM_8x4(_r6,_r7) __asm{ \
+  OC_HADAMARD_C_ABS_ACCUM_A_8x4(_r6,_r7) \
+  OC_HADAMARD_C_ABS_ACCUM_B_8x4(_r6,_r7) \
+}
+
+/*Performs an 8-point 1-D Hadamard transform, takes the absolute value of each
+   component, and accumulates everything into mm0.
+  Note that mm0 will have an extra 4 added to each column, and that after
+   removing this value, the remainder will be half the conventional value.*/
+#define OC_HADAMARD_ABS_ACCUM_8x4(_r6,_r7) __asm{ \
+  OC_HADAMARD_AB_8x4 \
+  OC_HADAMARD_C_ABS_ACCUM_8x4(_r6,_r7) \
+}
+
+/*Performs two 4x4 transposes (mostly) in place.
+  On input, {mm0,mm1,mm2,mm3} contains rows {e,f,g,h}, and {mm4,mm5,mm6,mm7}
+   contains rows {a,b,c,d}.
+  On output, {0x40,0x50,0x60,0x70}+_off+BUF contains {e,f,g,h}^T, and
+   {mm4,mm5,mm6,mm7} contains the transposed rows {a,b,c,d}^T.*/
+#define OC_TRANSPOSE_4x4x2(_off) __asm{ \
+  /*First 4x4 transpose:*/ \
+  __asm  movq [0x10+_off+BUF],mm5 \
+  /*mm0 = e3 e2 e1 e0 \
+    mm1 = f3 f2 f1 f0 \
+    mm2 = g3 g2 g1 g0 \
+    mm3 = h3 h2 h1 h0*/ \
+  __asm  movq mm5,mm2 \
+  __asm  punpcklwd mm2,mm3 \
+  __asm  punpckhwd mm5,mm3 \
+  __asm  movq mm3,mm0 \
+  __asm  punpcklwd mm0,mm1 \
+  __asm  punpckhwd mm3,mm1 \
+  /*mm0 = f1 e1 f0 e0 \
+    mm3 = f3 e3 f2 e2 \
+    mm2 = h1 g1 h0 g0 \
+    mm5 = h3 g3 h2 g2*/ \
+  __asm  movq mm1,mm0 \
+  __asm  punpckldq mm0,mm2 \
+  __asm  punpckhdq mm1,mm2 \
+  __asm  movq mm2,mm3 \
+  __asm  punpckhdq mm3,mm5 \
+  __asm  movq [0x40+_off+BUF],mm0 \
+  __asm  punpckldq mm2,mm5 \
+  /*mm0 = h0 g0 f0 e0 \
+    mm1 = h1 g1 f1 e1 \
+    mm2 = h2 g2 f2 e2 \
+    mm3 = h3 g3 f3 e3*/ \
+  __asm  movq mm5,[0x10+_off+BUF] \
+  /*Second 4x4 transpose:*/ \
+  /*mm4 = a3 a2 a1 a0 \
+    mm5 = b3 b2 b1 b0 \
+    mm6 = c3 c2 c1 c0 \
+    mm7 = d3 d2 d1 d0*/ \
+  __asm  movq mm0,mm6 \
+  __asm  punpcklwd mm6,mm7 \
+  __asm  movq [0x50+_off+BUF],mm1 \
+  __asm  punpckhwd mm0,mm7 \
+  __asm  movq mm7,mm4 \
+  __asm  punpcklwd mm4,mm5 \
+  __asm  movq [0x60+_off+BUF],mm2 \
+  __asm  punpckhwd mm7,mm5 \
+  /*mm4 = b1 a1 b0 a0 \
+    mm7 = b3 a3 b2 a2 \
+    mm6 = d1 c1 d0 c0 \
+    mm0 = d3 c3 d2 c2*/ \
+  __asm  movq mm5,mm4 \
+  __asm  punpckldq mm4,mm6 \
+  __asm  movq [0x70+_off+BUF],mm3 \
+  __asm  punpckhdq mm5,mm6 \
+  __asm  movq mm6,mm7 \
+  __asm  punpckhdq mm7,mm0 \
+  __asm  punpckldq mm6,mm0 \
+  /*mm4 = d0 c0 b0 a0 \
+    mm5 = d1 c1 b1 a1 \
+    mm6 = d2 c2 b2 a2 \
+    mm7 = d3 c3 b3 a3*/ \
+}
+
+static unsigned oc_int_frag_satd_thresh_mmxext(const unsigned char *_src,
+ int _src_ystride,const unsigned char *_ref,int _ref_ystride,unsigned _thresh){
+  OC_ALIGN8(ogg_int16_t  buf[64]);
+  ogg_int16_t           *bufp;
+  unsigned               ret1;
+  unsigned               ret2;
+  bufp=buf;
+  __asm{
+#define SRC esi
+#define REF eax
+#define SRC_YSTRIDE ecx
+#define REF_YSTRIDE edx
+#define BUF edi
+#define RET eax
+#define RET2 edx
+    mov SRC,_src
+    mov SRC_YSTRIDE,_src_ystride
+    mov REF,_ref
+    mov REF_YSTRIDE,_ref_ystride
+    mov BUF,bufp
+    OC_LOAD_SUB_8x4(0x00)
+    OC_HADAMARD_8x4
+    OC_TRANSPOSE_4x4x2(0x00)
+    /*Finish swapping out this 8x4 block to make room for the next one.
+      mm0...mm3 have been swapped out already.*/
+    movq [0x00+BUF],mm4
+    movq [0x10+BUF],mm5
+    movq [0x20+BUF],mm6
+    movq [0x30+BUF],mm7
+    OC_LOAD_SUB_8x4(0x04)
+    OC_HADAMARD_8x4
+    OC_TRANSPOSE_4x4x2(0x08)
+    /*Here the first 4x4 block of output from the last transpose is the second
+       4x4 block of input for the next transform.
+      We have cleverly arranged that it already be in the appropriate place, so
+       we only have to do half the loads.*/
+    movq mm1,[0x10+BUF]
+    movq mm2,[0x20+BUF]
+    movq mm3,[0x30+BUF]
+    movq mm0,[0x00+BUF]
+    OC_HADAMARD_ABS_ACCUM_8x4(0x28,0x38)
+    /*Up to this point, everything fit in 16 bits (8 input + 1 for the
+       difference + 2*3 for the two 8-point 1-D Hadamards - 1 for the abs - 1
+       for the factor of two we dropped + 3 for the vertical accumulation).
+      Now we finally have to promote things to dwords.
+      We break this part out of OC_HADAMARD_ABS_ACCUM_8x4 to hide the long
+       latency of pmaddwd by starting the next series of loads now.*/
+    mov RET2,_thresh
+    pmaddwd mm0,mm7
+    movq mm1,[0x50+BUF]
+    movq mm5,[0x58+BUF]
+    movq mm4,mm0
+    movq mm2,[0x60+BUF]
+    punpckhdq mm0,mm0
+    movq mm6,[0x68+BUF]
+    paddd mm4,mm0
+    movq mm3,[0x70+BUF]
+    movd RET,mm4
+    movq mm7,[0x78+BUF]
+    /*The sums produced by OC_HADAMARD_ABS_ACCUM_8x4 each have an extra 4
+       added to them, and a factor of two removed; correct the final sum here.*/
+    lea RET,[RET+RET-32]
+    movq mm0,[0x40+BUF]
+    cmp RET,RET2
+    movq mm4,[0x48+BUF]
+    jae at_end
+    OC_HADAMARD_ABS_ACCUM_8x4(0x68,0x78)
+    pmaddwd mm0,mm7
+    /*There isn't much to stick in here to hide the latency this time, but the
+       alternative to pmaddwd is movq->punpcklwd->punpckhwd->paddd, whose
+       latency is even worse.*/
+    sub RET,32
+    movq mm4,mm0
+    punpckhdq mm0,mm0
+    paddd mm4,mm0
+    movd RET2,mm4
+    lea RET,[RET+RET2*2]
+    align 16
+at_end:
+    mov ret1,RET
+#undef SRC
+#undef REF
+#undef SRC_YSTRIDE
+#undef REF_YSTRIDE
+#undef BUF
+#undef RET
+#undef RET2
+  }
+  return ret1;
+}
+
+unsigned oc_enc_frag_satd_thresh_mmxext(const unsigned char *_src,
+ const unsigned char *_ref,int _ystride,unsigned _thresh){
+  return oc_int_frag_satd_thresh_mmxext(_src,_ystride,_ref,_ystride,_thresh);
+}
+
+
+/*Our internal implementation of frag_copy2 takes an extra stride parameter so
+   we can share code with oc_enc_frag_satd2_thresh_mmxext().*/
+static void oc_int_frag_copy2_mmxext(unsigned char *_dst,int _dst_ystride,
+ const unsigned char *_src1,const unsigned char *_src2,int _src_ystride){
+  __asm{
+    /*Load the first 3 rows.*/
+#define DST_YSTRIDE edi
+#define SRC_YSTRIDE esi
+#define DST eax
+#define SRC1 edx
+#define SRC2 ecx
+    mov DST_YSTRIDE,_dst_ystride
+    mov SRC_YSTRIDE,_src_ystride
+    mov DST,_dst
+    mov SRC1,_src1
+    mov SRC2,_src2
+    movq mm0,[SRC1]
+    movq mm1,[SRC2]
+    movq mm2,[SRC1+SRC_YSTRIDE]
+    lea SRC1,[SRC1+SRC_YSTRIDE*2]
+    movq mm3,[SRC2+SRC_YSTRIDE]
+    lea SRC2,[SRC2+SRC_YSTRIDE*2]
+    pxor mm7,mm7
+    movq mm4,[SRC1]
+    pcmpeqb mm6,mm6
+    movq mm5,[SRC2]
+    /*mm7={1}x8.*/
+    psubb mm7,mm6
+    /*Start averaging mm0 and mm1 into mm6.*/
+    movq mm6,mm0
+    pxor mm0,mm1
+    pavgb mm6,mm1
+    /*mm1 is free, start averaging mm3 into mm2 using mm1.*/
+    movq mm1,mm2
+    pand mm0,mm7
+    pavgb mm2,mm3
+    pxor mm1,mm3
+    /*mm3 is free.*/
+    psubb mm6,mm0
+    /*mm0 is free, start loading the next row.*/
+    movq mm0,[SRC1+SRC_YSTRIDE]
+    /*Start averaging mm5 and mm4 using mm3.*/
+    movq mm3,mm4
+    /*mm6 [row 0] is done; write it out.*/
+    movq [DST],mm6
+    pand mm1,mm7
+    pavgb mm4,mm5
+    psubb mm2,mm1
+    /*mm1 is free, continue loading the next row.*/
+    movq mm1,[SRC2+SRC_YSTRIDE]
+    pxor mm3,mm5
+    lea SRC1,[SRC1+SRC_YSTRIDE*2]
+    /*mm2 [row 1] is done; write it out.*/
+    movq [DST+DST_YSTRIDE],mm2
+    pand mm3,mm7
+    /*Start loading the next row.*/
+    movq mm2,[SRC1]
+    lea DST,[DST+DST_YSTRIDE*2]
+    psubb mm4,mm3
+    lea SRC2,[SRC2+SRC_YSTRIDE*2]
+    /*mm4 [row 2] is done; write it out.*/
+    movq [DST],mm4
+    /*Continue loading the next row.*/
+    movq mm3,[SRC2]
+    /*Start averaging mm0 and mm1 into mm6.*/
+    movq mm6,mm0
+    pxor mm0,mm1
+    /*Start loading the next row.*/
+    movq mm4,[SRC1+SRC_YSTRIDE]
+    pavgb mm6,mm1
+    /*mm1 is free; start averaging mm3 into mm2 using mm1.*/
+    movq mm1,mm2
+    pand mm0,mm7
+    /*Continue loading the next row.*/
+    movq mm5,[SRC2+SRC_YSTRIDE]
+    pavgb mm2,mm3
+    lea SRC1,[SRC1+SRC_YSTRIDE*2]
+    pxor mm1,mm3
+    /*mm3 is free.*/
+    psubb mm6,mm0
+    /*mm0 is free, start loading the next row.*/
+    movq mm0,[SRC1]
+    /*Start averaging mm5 into mm4 using mm3.*/
+    movq mm3,mm4
+    /*mm6 [row 3] is done; write it out.*/
+    movq [DST+DST_YSTRIDE],mm6
+    pand mm1,mm7
+    lea SRC2,[SRC2+SRC_YSTRIDE*2]
+    pavgb mm4,mm5
+    lea DST,[DST+DST_YSTRIDE*2]
+    psubb mm2,mm1
+    /*mm1 is free; continue loading the next row.*/
+    movq mm1,[SRC2]
+    pxor mm3,mm5
+    /*mm2 [row 4] is done; write it out.*/
+    movq [DST],mm2
+    pand mm3,mm7
+    /*Start loading the next row.*/
+    movq mm2,[SRC1+SRC_YSTRIDE]
+    psubb mm4,mm3
+    /*Start averaging mm0 and mm1 into mm6.*/
+    movq mm6,mm0
+    /*Continue loading the next row.*/
+    movq mm3,[SRC2+SRC_YSTRIDE]
+    /*mm4 [row 5] is done; write it out.*/
+    movq [DST+DST_YSTRIDE],mm4
+    pxor mm0,mm1
+    pavgb mm6,mm1
+    /*mm4 is free; start averaging mm3 into mm2 using mm4.*/
+    movq mm4,mm2
+    pand mm0,mm7
+    pavgb mm2,mm3
+    pxor mm4,mm3
+    lea DST,[DST+DST_YSTRIDE*2]
+    psubb mm6,mm0
+    pand mm4,mm7
+    /*mm6 [row 6] is done, write it out.*/
+    movq [DST],mm6
+    psubb mm2,mm4
+    /*mm2 [row 7] is done, write it out.*/
+    movq [DST+DST_YSTRIDE],mm2
+#undef SRC1
+#undef SRC2
+#undef SRC_YSTRIDE
+#undef DST_YSTRIDE
+#undef DST
+  }
+}
+
+unsigned oc_enc_frag_satd2_thresh_mmxext(const unsigned char *_src,
+ const unsigned char *_ref1,const unsigned char *_ref2,int _ystride,
+ unsigned _thresh){
+  OC_ALIGN8(unsigned char ref[64]);
+  oc_int_frag_copy2_mmxext(ref,8,_ref1,_ref2,_ystride);
+  return oc_int_frag_satd_thresh_mmxext(_src,_ystride,ref,8,_thresh);
+}
+
+unsigned oc_enc_frag_intra_satd_mmxext(const unsigned char *_src,
+ int _ystride){
+  OC_ALIGN8(ogg_int16_t  buf[64]);
+  ogg_int16_t           *bufp;
+  unsigned               ret1;
+  unsigned               ret2;
+  bufp=buf;
+  __asm{
+#define SRC eax
+#define SRC4 esi
+#define BUF edi
+#define RET eax
+#define RET_WORD ax
+#define RET2 ecx
+#define YSTRIDE edx
+#define YSTRIDE3 ecx
+    mov SRC,_src
+    mov BUF,bufp
+    mov YSTRIDE,_ystride
+    /* src4 = src+4*ystride */
+    lea SRC4,[SRC+YSTRIDE*4]
+    /* ystride3 = 3*ystride */
+    lea YSTRIDE3,[YSTRIDE+YSTRIDE*2]
+    OC_LOAD_8x4(0x00)
+    OC_HADAMARD_8x4
+    OC_TRANSPOSE_4x4x2(0x00)
+    /*Finish swapping out this 8x4 block to make room for the next one.
+      mm0...mm3 have been swapped out already.*/
+    movq [0x00+BUF],mm4
+    movq [0x10+BUF],mm5
+    movq [0x20+BUF],mm6
+    movq [0x30+BUF],mm7
+    OC_LOAD_8x4(0x04)
+    OC_HADAMARD_8x4
+    OC_TRANSPOSE_4x4x2(0x08)
+    /*Here the first 4x4 block of output from the last transpose is the second
+      4x4 block of input for the next transform.
+      We have cleverly arranged that it already be in the appropriate place, so
+      we only have to do half the loads.*/
+    movq mm1,[0x10+BUF]
+    movq mm2,[0x20+BUF]
+    movq mm3,[0x30+BUF]
+    movq mm0,[0x00+BUF]
+    /*We split out the stages here so we can save the DC coefficient in the
+      middle.*/
+    OC_HADAMARD_AB_8x4
+    OC_HADAMARD_C_ABS_ACCUM_A_8x4(0x28,0x38)
+    movd RET,mm1
+    OC_HADAMARD_C_ABS_ACCUM_B_8x4(0x28,0x38)
+    /*Up to this point, everything fit in 16 bits (8 input + 1 for the
+      difference + 2*3 for the two 8-point 1-D Hadamards - 1 for the abs - 1
+      for the factor of two we dropped + 3 for the vertical accumulation).
+      Now we finally have to promote things to dwords.
+      We break this part out of OC_HADAMARD_ABS_ACCUM_8x4 to hide the long
+      latency of pmaddwd by starting the next series of loads now.*/
+    pmaddwd mm0,mm7
+    movq mm1,[0x50+BUF]
+    movq mm5,[0x58+BUF]
+    movq mm2,[0x60+BUF]
+    movq mm4,mm0
+    movq mm6,[0x68+BUF]
+    punpckhdq mm0,mm0
+    movq mm3,[0x70+BUF]
+    paddd mm4,mm0
+    movq mm7,[0x78+BUF]
+    movd RET2,mm4
+    movq mm0,[0x40+BUF]
+    movq mm4,[0x48+BUF]
+    OC_HADAMARD_ABS_ACCUM_8x4(0x68,0x78)
+    pmaddwd mm0,mm7
+    /*We assume that the DC coefficient is always positive (which is true,
+    because the input to the INTRA transform was not a difference).*/
+    movzx RET,RET_WORD
+    add RET2,RET2
+    sub RET2,RET
+    movq mm4,mm0
+    punpckhdq mm0,mm0
+    paddd mm4,mm0
+    movd RET,mm4
+    lea RET,[-64+RET2+RET*2]
+    mov [ret1],RET
+#undef SRC
+#undef SRC4
+#undef BUF
+#undef RET
+#undef RET_WORD
+#undef RET2
+#undef YSTRIDE
+#undef YSTRIDE3
+  }
+  return ret1;
+}
+
+void oc_enc_frag_sub_mmx(ogg_int16_t _residue[64],
+ const unsigned char *_src, const unsigned char *_ref,int _ystride){
+  int i;
+  __asm  pxor mm7,mm7
+  for(i=4;i-->0;){
+    __asm{
+#define SRC edx
+#define YSTRIDE esi
+#define RESIDUE eax
+#define REF ecx
+      mov YSTRIDE,_ystride
+      mov RESIDUE,_residue
+      mov SRC,_src
+      mov REF,_ref
+      /*mm0=[src]*/
+      movq mm0,[SRC]
+      /*mm1=[ref]*/
+      movq mm1,[REF]
+      /*mm4=[src+ystride]*/
+      movq mm4,[SRC+YSTRIDE]
+      /*mm5=[ref+ystride]*/
+      movq mm5,[REF+YSTRIDE]
+      /*Compute [src]-[ref].*/
+      movq mm2,mm0
+      punpcklbw mm0,mm7
+      movq mm3,mm1
+      punpckhbw mm2,mm7
+      punpcklbw mm1,mm7
+      punpckhbw mm3,mm7
+      psubw mm0,mm1
+      psubw mm2,mm3
+      /*Compute [src+ystride]-[ref+ystride].*/
+      movq mm1,mm4
+      punpcklbw mm4,mm7
+      movq mm3,mm5
+      punpckhbw mm1,mm7
+      lea SRC,[SRC+YSTRIDE*2]
+      punpcklbw mm5,mm7
+      lea REF,[REF+YSTRIDE*2]
+      punpckhbw mm3,mm7
+      psubw mm4,mm5
+      psubw mm1,mm3
+      /*Write the answer out.*/
+      movq [RESIDUE+0x00],mm0
+      movq [RESIDUE+0x08],mm2
+      movq [RESIDUE+0x10],mm4
+      movq [RESIDUE+0x18],mm1
+      lea RESIDUE,[RESIDUE+0x20]
+      mov _residue,RESIDUE
+      mov _src,SRC
+      mov _ref,REF
+#undef SRC
+#undef YSTRIDE
+#undef RESIDUE
+#undef REF
+    }
+  }
+}
+
+void oc_enc_frag_sub_128_mmx(ogg_int16_t _residue[64],
+ const unsigned char *_src,int _ystride){
+   __asm{
+#define YSTRIDE edx
+#define YSTRIDE3 edi
+#define RESIDUE ecx
+#define SRC eax
+    mov YSTRIDE,_ystride
+    mov RESIDUE,_residue
+    mov SRC,_src
+    /*mm0=[src]*/
+    movq mm0,[SRC]
+    /*mm1=[src+ystride]*/
+    movq mm1,[SRC+YSTRIDE]
+    /*mm6={-1}x4*/
+    pcmpeqw mm6,mm6
+    /*mm2=[src+2*ystride]*/
+    movq mm2,[SRC+YSTRIDE*2]
+    /*[ystride3]=3*[ystride]*/
+    lea YSTRIDE3,[YSTRIDE+YSTRIDE*2]
+    /*mm6={0x8000}x4*/
+    psllw mm6,15
+    /*mm3=[src+3*ystride]*/
+    movq mm3,[SRC+YSTRIDE3]
+    /*mm6={128}x4*/
+    psrlw mm6,8
+    /*mm7=0*/ 
+    pxor mm7,mm7
+    /*[src]=[src]+4*[ystride]*/
+    lea SRC,[SRC+YSTRIDE*4]
+    /*Compute [src]-128 and [src+ystride]-128*/
+    movq mm4,mm0
+    punpcklbw mm0,mm7
+    movq mm5,mm1
+    punpckhbw mm4,mm7
+    psubw mm0,mm6
+    punpcklbw mm1,mm7
+    psubw mm4,mm6
+    punpckhbw mm5,mm7
+    psubw mm1,mm6
+    psubw mm5,mm6
+    /*Write the answer out.*/
+    movq [RESIDUE+0x00],mm0
+    movq [RESIDUE+0x08],mm4
+    movq [RESIDUE+0x10],mm1
+    movq [RESIDUE+0x18],mm5
+    /*mm0=[src+4*ystride]*/
+    movq mm0,[SRC]
+    /*mm1=[src+5*ystride]*/
+    movq mm1,[SRC+YSTRIDE]
+    /*Compute [src+2*ystride]-128 and [src+3*ystride]-128*/
+    movq mm4,mm2
+    punpcklbw mm2,mm7
+    movq mm5,mm3
+    punpckhbw mm4,mm7
+    psubw mm2,mm6
+    punpcklbw mm3,mm7
+    psubw mm4,mm6
+    punpckhbw mm5,mm7
+    psubw mm3,mm6
+    psubw mm5,mm6
+    /*Write the answer out.*/
+    movq [RESIDUE+0x20],mm2
+    movq [RESIDUE+0x28],mm4
+    movq [RESIDUE+0x30],mm3
+    movq [RESIDUE+0x38],mm5
+    /*Compute [src+4*ystride]-128 and [src+5*ystride]-128*/
+    movq mm2,[SRC+YSTRIDE*2]
+    movq mm3,[SRC+YSTRIDE3]
+    movq mm4,mm0
+    punpcklbw mm0,mm7
+    movq mm5,mm1
+    punpckhbw mm4,mm7
+    psubw mm0,mm6
+    punpcklbw mm1,mm7
+    psubw mm4,mm6
+    punpckhbw mm5,mm7
+    psubw mm1,mm6
+    psubw mm5,mm6
+    /*Write the answer out.*/
+    movq [RESIDUE+0x40],mm0
+    movq [RESIDUE+0x48],mm4
+    movq [RESIDUE+0x50],mm1
+    movq [RESIDUE+0x58],mm5
+    /*Compute [src+6*ystride]-128 and [src+7*ystride]-128*/
+    movq mm4,mm2
+    punpcklbw mm2,mm7
+    movq mm5,mm3
+    punpckhbw mm4,mm7
+    psubw mm2,mm6
+    punpcklbw mm3,mm7
+    psubw mm4,mm6
+    punpckhbw mm5,mm7
+    psubw mm3,mm6
+    psubw mm5,mm6
+    /*Write the answer out.*/
+    movq [RESIDUE+0x60],mm2
+    movq [RESIDUE+0x68],mm4
+    movq [RESIDUE+0x70],mm3
+    movq [RESIDUE+0x78],mm5
+#undef YSTRIDE
+#undef YSTRIDE3
+#undef RESIDUE
+#undef SRC
+  }
+}
+
+void oc_enc_frag_copy2_mmxext(unsigned char *_dst,
+ const unsigned char *_src1,const unsigned char *_src2,int _ystride){
+  oc_int_frag_copy2_mmxext(_dst,_ystride,_src1,_src2,_ystride);
+}
+
+#endif
diff --git a/thirdparty/libtheora/x86_vc/mmxfdct.c b/thirdparty/libtheora/x86_vc/mmxfdct.c
index dcf17c9fa7..d908ce2413 100644
--- a/thirdparty/libtheora/x86_vc/mmxfdct.c
+++ b/thirdparty/libtheora/x86_vc/mmxfdct.c
@@ -1,670 +1,670 @@
-/********************************************************************
- *                                                                  *
- * THIS FILE IS PART OF THE OggTheora SOFTWARE CODEC SOURCE CODE.   *
- * USE, DISTRIBUTION AND REPRODUCTION OF THIS LIBRARY SOURCE IS     *
- * GOVERNED BY A BSD-STYLE SOURCE LICENSE INCLUDED WITH THIS SOURCE *
- * IN 'COPYING'. PLEASE READ THESE TERMS BEFORE DISTRIBUTING.       *
- *                                                                  *
- * THE Theora SOURCE CODE IS COPYRIGHT (C) 1999-2006                *
- * by the Xiph.Org Foundation http://www.xiph.org/                  *
- *                                                                  *
- ********************************************************************/ 
- /*MMX fDCT implementation for x86_32*/
-/*$Id: fdct_ses2.c 14579 2008-03-12 06:42:40Z xiphmont $*/
-#include "x86enc.h"
-
-#if defined(OC_X86_ASM)
-
-#define OC_FDCT_STAGE1_8x4  __asm{ \
-  /*Stage 1:*/ \
-  /*mm0=t7'=t0-t7*/ \
-  __asm  psubw mm0,mm7 \
-  __asm  paddw mm7,mm7 \
-  /*mm1=t6'=t1-t6*/ \
-  __asm  psubw mm1, mm6 \
-  __asm  paddw mm6,mm6 \
-  /*mm2=t5'=t2-t5*/ \
-  __asm  psubw mm2,mm5 \
-  __asm  paddw mm5,mm5 \
-  /*mm3=t4'=t3-t4*/ \
-  __asm  psubw mm3,mm4 \
-  __asm  paddw mm4,mm4 \
-  /*mm7=t0'=t0+t7*/ \
-  __asm  paddw mm7,mm0 \
-  /*mm6=t1'=t1+t6*/  \
-  __asm  paddw mm6,mm1 \
-  /*mm5=t2'=t2+t5*/ \
-  __asm  paddw mm5,mm2 \
-  /*mm4=t3'=t3+t4*/ \
-  __asm  paddw mm4,mm3\
-}
-
-#define OC_FDCT8x4(_r0,_r1,_r2,_r3,_r4,_r5,_r6,_r7) __asm{ \
-  /*Stage 2:*/ \
-  /*mm7=t3''=t0'-t3'*/ \
-  __asm  psubw mm7,mm4 \
-  __asm  paddw mm4,mm4 \
-  /*mm6=t2''=t1'-t2'*/ \
-  __asm  psubw mm6,mm5 \
-  __asm  movq [Y+_r6],mm7 \
-  __asm  paddw mm5,mm5 \
-  /*mm1=t5''=t6'-t5'*/ \
-  __asm  psubw mm1,mm2 \
-  __asm  movq [Y+_r2],mm6 \
-  /*mm4=t0''=t0'+t3'*/ \
-  __asm  paddw mm4,mm7 \
-  __asm  paddw mm2,mm2 \
-  /*mm5=t1''=t1'+t2'*/ \
-  __asm  movq [Y+_r0],mm4 \
-  __asm  paddw mm5,mm6 \
-  /*mm2=t6''=t6'+t5'*/ \
-  __asm  paddw mm2,mm1 \
-  __asm  movq [Y+_r4],mm5 \
-  /*mm0=t7', mm1=t5'', mm2=t6'', mm3=t4'.*/ \
-  /*mm4, mm5, mm6, mm7 are free.*/ \
-  /*Stage 3:*/ \
-  /*mm6={2}x4, mm7={27146,0xB500>>1}x2*/ \
-  __asm  mov A,0x5A806A0A \
-  __asm  pcmpeqb mm6,mm6 \
-  __asm  movd mm7,A \
-  __asm  psrlw mm6,15 \
-  __asm  punpckldq mm7,mm7 \
-  __asm  paddw mm6,mm6 \
-  /*mm0=0, m2={-1}x4 \
-    mm5:mm4=t5''*27146+0xB500*/ \
-  __asm  movq mm4,mm1 \
-  __asm  movq mm5,mm1 \
-  __asm  punpcklwd mm4,mm6 \
-  __asm  movq [Y+_r3],mm2 \
-  __asm  pmaddwd mm4,mm7 \
-  __asm  movq [Y+_r7],mm0 \
-  __asm  punpckhwd mm5,mm6 \
-  __asm  pxor mm0,mm0 \
-  __asm  pmaddwd mm5,mm7 \
-  __asm  pcmpeqb mm2,mm2 \
-  /*mm2=t6'', mm1=t5''+(t5''!=0) \
-    mm4=(t5''*27146+0xB500>>16)*/ \
-  __asm  pcmpeqw mm0,mm1 \
-  __asm  psrad mm4,16 \
-  __asm  psubw mm0,mm2 \
-  __asm  movq mm2, [Y+_r3] \
-  __asm  psrad mm5,16 \
-  __asm  paddw mm1,mm0 \
-  __asm  packssdw mm4,mm5 \
-  /*mm4=s=(t5''*27146+0xB500>>16)+t5''+(t5''!=0)>>1*/ \
-  __asm  paddw mm4,mm1 \
-  __asm  movq mm0, [Y+_r7] \
-  __asm  psraw mm4,1 \
-  __asm  movq mm1,mm3 \
-  /*mm3=t4''=t4'+s*/ \
-  __asm  paddw mm3,mm4 \
-  /*mm1=t5'''=t4'-s*/ \
-  __asm  psubw mm1,mm4 \
-  /*mm1=0, mm3={-1}x4 \
-    mm5:mm4=t6''*27146+0xB500*/ \
-  __asm  movq mm4,mm2 \
-  __asm  movq mm5,mm2 \
-  __asm  punpcklwd mm4,mm6 \
-  __asm  movq [Y+_r5],mm1 \
-  __asm  pmaddwd mm4,mm7 \
-  __asm  movq [Y+_r1],mm3 \
-  __asm  punpckhwd mm5,mm6 \
-  __asm  pxor mm1,mm1 \
-  __asm  pmaddwd mm5,mm7 \
-  __asm  pcmpeqb mm3,mm3 \
-  /*mm2=t6''+(t6''!=0), mm4=(t6''*27146+0xB500>>16)*/ \
-  __asm  psrad mm4,16 \
-  __asm  pcmpeqw mm1,mm2 \
-  __asm  psrad mm5,16 \
-  __asm  psubw mm1,mm3 \
-  __asm  packssdw mm4,mm5 \
-  __asm  paddw mm2,mm1 \
-  /*mm1=t1'' \
-    mm4=s=(t6''*27146+0xB500>>16)+t6''+(t6''!=0)>>1*/ \
-  __asm  paddw mm4,mm2 \
-  __asm  movq mm1,[Y+_r4] \
-  __asm  psraw mm4,1 \
-  __asm  movq mm2,mm0 \
-  /*mm7={54491-0x7FFF,0x7FFF}x2 \
-    mm0=t7''=t7'+s*/ \
-  __asm  paddw mm0,mm4 \
-  /*mm2=t6'''=t7'-s*/ \
-  __asm  psubw mm2,mm4 \
-  /*Stage 4:*/ \
-  /*mm0=0, mm2=t0'' \
-    mm5:mm4=t1''*27146+0xB500*/ \
-  __asm  movq mm4,mm1 \
-  __asm  movq mm5,mm1 \
-  __asm  punpcklwd mm4,mm6 \
-  __asm  movq [Y+_r3],mm2 \
-  __asm  pmaddwd mm4,mm7 \
-  __asm  movq mm2,[Y+_r0] \
-  __asm  punpckhwd mm5,mm6 \
-  __asm  movq [Y+_r7],mm0 \
-  __asm  pmaddwd mm5,mm7 \
-  __asm  pxor mm0,mm0 \
-  /*mm7={27146,0x4000>>1}x2 \
-    mm0=s=(t1''*27146+0xB500>>16)+t1''+(t1''!=0)*/ \
-  __asm  psrad mm4,16 \
-  __asm  mov A,0x20006A0A \
-  __asm  pcmpeqw mm0,mm1 \
-  __asm  movd mm7,A \
-  __asm  psrad mm5,16 \
-  __asm  psubw mm0,mm3 \
-  __asm  packssdw mm4,mm5 \
-  __asm  paddw mm0,mm1 \
-  __asm  punpckldq mm7,mm7 \
-  __asm  paddw mm0,mm4 \
-  /*mm6={0x00000E3D}x2 \
-    mm1=-(t0''==0), mm5:mm4=t0''*27146+0x4000*/ \
-  __asm  movq mm4,mm2 \
-  __asm  movq mm5,mm2 \
-  __asm  punpcklwd mm4,mm6 \
-  __asm  mov A,0x0E3D \
-  __asm  pmaddwd mm4,mm7 \
-  __asm  punpckhwd mm5,mm6 \
-  __asm  movd mm6,A \
-  __asm  pmaddwd mm5,mm7 \
-  __asm  pxor mm1,mm1 \
-  __asm  punpckldq mm6,mm6 \
-  __asm  pcmpeqw mm1,mm2 \
-  /*mm4=r=(t0''*27146+0x4000>>16)+t0''+(t0''!=0)*/ \
-  __asm  psrad mm4,16 \
-  __asm  psubw mm1,mm3 \
-  __asm  psrad mm5,16 \
-  __asm  paddw mm2,mm1 \
-  __asm  packssdw mm4,mm5 \
-  __asm  movq mm1,[Y+_r5] \
-  __asm  paddw mm4,mm2 \
-  /*mm2=t6'', mm0=_y[0]=u=r+s>>1 \
-    The naive implementation could cause overflow, so we use \
-     u=(r&s)+((r^s)>>1).*/ \
-  __asm  movq mm2,[Y+_r3] \
-  __asm  movq mm7,mm0 \
-  __asm  pxor mm0,mm4 \
-  __asm  pand mm7,mm4 \
-  __asm  psraw mm0,1 \
-  __asm  mov A,0x7FFF54DC \
-  __asm  paddw mm0,mm7 \
-  __asm  movd mm7,A \
-  /*mm7={54491-0x7FFF,0x7FFF}x2 \
-    mm4=_y[4]=v=r-u*/ \
-  __asm  psubw mm4,mm0 \
-  __asm  punpckldq mm7,mm7 \
-  __asm  movq [Y+_r4],mm4 \
-  /*mm0=0, mm7={36410}x4 \
-    mm1=(t5'''!=0), mm5:mm4=54491*t5'''+0x0E3D*/ \
-  __asm  movq mm4,mm1 \
-  __asm  movq mm5,mm1 \
-  __asm  punpcklwd mm4,mm1 \
-  __asm  mov A,0x8E3A8E3A \
-  __asm  pmaddwd mm4,mm7 \
-  __asm  movq [Y+_r0],mm0 \
-  __asm  punpckhwd mm5,mm1 \
-  __asm  pxor mm0,mm0 \
-  __asm  pmaddwd mm5,mm7 \
-  __asm  pcmpeqw mm1,mm0 \
-  __asm  movd mm7,A \
-  __asm  psubw mm1,mm3 \
-  __asm  punpckldq mm7,mm7 \
-  __asm  paddd mm4,mm6 \
-  __asm  paddd mm5,mm6 \
-  /*mm0=0 \
-    mm3:mm1=36410*t6'''+((t5'''!=0)<<16)*/ \
-  __asm  movq mm6,mm2 \
-  __asm  movq mm3,mm2 \
-  __asm  pmulhw mm6,mm7 \
-  __asm  paddw mm1,mm2 \
-  __asm  pmullw mm3,mm7 \
-  __asm  pxor mm0,mm0 \
-  __asm  paddw mm6,mm1 \
-  __asm  movq mm1,mm3 \
-  __asm  punpckhwd mm3,mm6 \
-  __asm  punpcklwd mm1,mm6 \
-  /*mm3={-1}x4, mm6={1}x4 \
-    mm4=_y[5]=u=(54491*t5'''+36410*t6'''+0x0E3D>>16)+(t5'''!=0)*/ \
-  __asm  paddd mm5,mm3 \
-  __asm  paddd mm4,mm1 \
-  __asm  psrad mm5,16 \
-  __asm  pxor mm6,mm6 \
-  __asm  psrad mm4,16 \
-  __asm  pcmpeqb mm3,mm3 \
-  __asm  packssdw mm4,mm5 \
-  __asm  psubw mm6,mm3 \
-  /*mm1=t7'', mm7={26568,0x3400}x2 \
-    mm2=s=t6'''-(36410*u>>16)*/ \
-  __asm  movq mm1,mm4 \
-  __asm  mov A,0x340067C8 \
-  __asm  pmulhw mm4,mm7 \
-  __asm  movd mm7,A \
-  __asm  movq [Y+_r5],mm1 \
-  __asm  punpckldq mm7,mm7 \
-  __asm  paddw mm4,mm1 \
-  __asm  movq mm1,[Y+_r7] \
-  __asm  psubw mm2,mm4 \
-  /*mm6={0x00007B1B}x2 \
-    mm0=(s!=0), mm5:mm4=s*26568+0x3400*/ \
-  __asm  movq mm4,mm2 \
-  __asm  movq mm5,mm2 \
-  __asm  punpcklwd mm4,mm6 \
-  __asm  pcmpeqw mm0,mm2 \
-  __asm  pmaddwd mm4,mm7 \
-  __asm  mov A,0x7B1B \
-  __asm  punpckhwd mm5,mm6 \
-  __asm  movd mm6,A \
-  __asm  pmaddwd mm5,mm7 \
-  __asm  psubw mm0,mm3 \
-  __asm  punpckldq mm6,mm6 \
-  /*mm7={64277-0x7FFF,0x7FFF}x2 \
-    mm2=_y[3]=v=(s*26568+0x3400>>17)+s+(s!=0)*/ \
-  __asm  psrad mm4,17 \
-  __asm  paddw mm2,mm0 \
-  __asm  psrad mm5,17 \
-  __asm  mov A,0x7FFF7B16 \
-  __asm  packssdw mm4,mm5 \
-  __asm  movd mm7,A \
-  __asm  paddw mm2,mm4 \
-  __asm  punpckldq mm7,mm7 \
-  /*mm0=0, mm7={12785}x4 \
-    mm1=(t7''!=0), mm2=t4'', mm5:mm4=64277*t7''+0x7B1B*/ \
-  __asm  movq mm4,mm1 \
-  __asm  movq mm5,mm1 \
-  __asm  movq [Y+_r3],mm2 \
-  __asm  punpcklwd mm4,mm1 \
-  __asm  movq mm2,[Y+_r1] \
-  __asm  pmaddwd mm4,mm7 \
-  __asm  mov A,0x31F131F1 \
-  __asm  punpckhwd mm5,mm1 \
-  __asm  pxor mm0,mm0 \
-  __asm  pmaddwd mm5,mm7 \
-  __asm  pcmpeqw mm1,mm0 \
-  __asm  movd mm7,A \
-  __asm  psubw mm1,mm3 \
-  __asm  punpckldq mm7,mm7 \
-  __asm  paddd mm4,mm6 \
-  __asm  paddd mm5,mm6 \
-  /*mm3:mm1=12785*t4'''+((t7''!=0)<<16)*/ \
-  __asm  movq mm6,mm2 \
-  __asm  movq mm3,mm2 \
-  __asm  pmulhw mm6,mm7 \
-  __asm  pmullw mm3,mm7 \
-  __asm  paddw mm6,mm1 \
-  __asm  movq mm1,mm3 \
-  __asm  punpckhwd mm3,mm6 \
-  __asm  punpcklwd mm1,mm6 \
-  /*mm3={-1}x4, mm6={1}x4 \
-    mm4=_y[1]=u=(12785*t4'''+64277*t7''+0x7B1B>>16)+(t7''!=0)*/ \
-  __asm  paddd mm5,mm3 \
-  __asm  paddd mm4,mm1 \
-  __asm  psrad mm5,16 \
-  __asm  pxor mm6,mm6 \
-  __asm  psrad mm4,16 \
-  __asm  pcmpeqb mm3,mm3 \
-  __asm  packssdw mm4,mm5 \
-  __asm  psubw mm6,mm3 \
-  /*mm1=t3'', mm7={20539,0x3000}x2 \
-    mm4=s=(12785*u>>16)-t4''*/ \
-  __asm  movq [Y+_r1],mm4 \
-  __asm  pmulhw mm4,mm7 \
-  __asm  mov A,0x3000503B \
-  __asm  movq mm1,[Y+_r6] \
-  __asm  movd mm7,A \
-  __asm  psubw mm4,mm2 \
-  __asm  punpckldq mm7,mm7 \
-  /*mm6={0x00006CB7}x2 \
-    mm0=(s!=0), mm5:mm4=s*20539+0x3000*/ \
-  __asm  movq mm5,mm4 \
-  __asm  movq mm2,mm4 \
-  __asm  punpcklwd mm4,mm6 \
-  __asm  pcmpeqw mm0,mm2 \
-  __asm  pmaddwd mm4,mm7 \
-  __asm  mov A,0x6CB7 \
-  __asm  punpckhwd mm5,mm6 \
-  __asm  movd mm6,A \
-  __asm  pmaddwd mm5,mm7 \
-  __asm  psubw mm0,mm3 \
-  __asm  punpckldq mm6,mm6 \
-  /*mm7={60547-0x7FFF,0x7FFF}x2 \
-    mm2=_y[7]=v=(s*20539+0x3000>>20)+s+(s!=0)*/ \
-  __asm  psrad mm4,20 \
-  __asm  paddw mm2,mm0 \
-  __asm  psrad mm5,20 \
-  __asm  mov A,0x7FFF6C84 \
-  __asm  packssdw mm4,mm5 \
-  __asm  movd mm7,A \
-  __asm  paddw mm2,mm4 \
-  __asm  punpckldq mm7,mm7 \
-  /*mm0=0, mm7={25080}x4 \
-    mm2=t2'', mm5:mm4=60547*t3''+0x6CB7*/ \
-  __asm  movq mm4,mm1 \
-  __asm  movq mm5,mm1 \
-  __asm  movq [Y+_r7],mm2 \
-  __asm  punpcklwd mm4,mm1 \
-  __asm  movq mm2,[Y+_r2] \
-  __asm  pmaddwd mm4,mm7 \
-  __asm  mov A,0x61F861F8 \
-  __asm  punpckhwd mm5,mm1 \
-  __asm  pxor mm0,mm0 \
-  __asm  pmaddwd mm5,mm7 \
-  __asm  movd mm7,A \
-  __asm  pcmpeqw mm1,mm0 \
-  __asm  psubw mm1,mm3 \
-  __asm  punpckldq mm7,mm7 \
-  __asm  paddd mm4,mm6 \
-  __asm  paddd mm5,mm6 \
-  /*mm3:mm1=25080*t2''+((t3''!=0)<<16)*/ \
-  __asm  movq mm6,mm2 \
-  __asm  movq mm3,mm2 \
-  __asm  pmulhw mm6,mm7 \
-  __asm  pmullw mm3,mm7 \
-  __asm  paddw mm6,mm1 \
-  __asm  movq mm1,mm3 \
-  __asm  punpckhwd mm3,mm6 \
-  __asm  punpcklwd mm1,mm6 \
-  /*mm1={-1}x4 \
-    mm4=u=(25080*t2''+60547*t3''+0x6CB7>>16)+(t3''!=0)*/ \
-  __asm  paddd mm5,mm3 \
-  __asm  paddd mm4,mm1 \
-  __asm  psrad mm5,16 \
-  __asm  mov A,0x28005460 \
-  __asm  psrad mm4,16 \
-  __asm  pcmpeqb mm1,mm1 \
-  __asm  packssdw mm4,mm5 \
-  /*mm5={1}x4, mm6=_y[2]=u, mm7={21600,0x2800}x2 \
-    mm4=s=(25080*u>>16)-t2''*/ \
-  __asm  movq mm6,mm4 \
-  __asm  pmulhw mm4,mm7 \
-  __asm  pxor mm5,mm5 \
-  __asm  movd mm7,A \
-  __asm  psubw mm5,mm1 \
-  __asm  punpckldq mm7,mm7 \
-  __asm  psubw mm4,mm2 \
-  /*mm2=s+(s!=0) \
-    mm4:mm3=s*21600+0x2800*/ \
-  __asm  movq mm3,mm4 \
-  __asm  movq mm2,mm4 \
-  __asm  punpckhwd mm4,mm5 \
-  __asm  pcmpeqw mm0,mm2 \
-  __asm  pmaddwd mm4,mm7 \
-  __asm  psubw mm0,mm1 \
-  __asm  punpcklwd mm3,mm5 \
-  __asm  paddw mm2,mm0 \
-  __asm  pmaddwd mm3,mm7 \
-  /*mm0=_y[4], mm1=_y[7], mm4=_y[0], mm5=_y[5] \
-    mm3=_y[6]=v=(s*21600+0x2800>>18)+s+(s!=0)*/ \
-  __asm  movq mm0,[Y+_r4] \
-  __asm  psrad mm4,18 \
-  __asm  movq mm5,[Y+_r5] \
-  __asm  psrad mm3,18 \
-  __asm  movq mm1,[Y+_r7] \
-  __asm  packssdw mm3,mm4 \
-  __asm  movq mm4,[Y+_r0] \
-  __asm  paddw mm3,mm2 \
-}
-
-/*On input, mm4=_y[0], mm6=_y[2], mm0=_y[4], mm5=_y[5], mm3=_y[6], mm1=_y[7].
-  On output, {_y[4],mm1,mm2,mm3} contains the transpose of _y[4...7] and
-   {mm4,mm5,mm6,mm7} contains the transpose of _y[0...3].*/
-#define OC_TRANSPOSE8x4(_r0,_r1,_r2,_r3,_r4,_r5,_r6,_r7) __asm{ \
-  /*First 4x4 transpose:*/ \
-  /*mm0 = e3 e2 e1 e0 \
-    mm5 = f3 f2 f1 f0 \
-    mm3 = g3 g2 g1 g0 \
-    mm1 = h3 h2 h1 h0*/ \
-  __asm  movq mm2,mm0 \
-  __asm  punpcklwd mm0,mm5 \
-  __asm  punpckhwd mm2,mm5 \
-  __asm  movq mm5,mm3 \
-  __asm  punpcklwd mm3,mm1 \
-  __asm  punpckhwd mm5,mm1 \
-  /*mm0 = f1 e1 f0 e0 \
-    mm2 = f3 e3 f2 e2 \
-    mm3 = h1 g1 h0 g0 \
-    mm5 = h3 g3 h2 g2*/ \
-  __asm  movq mm1,mm0 \
-  __asm  punpckldq mm0,mm3 \
-  __asm  movq [Y+_r4],mm0 \
-  __asm  punpckhdq mm1,mm3 \
-  __asm  movq mm0,[Y+_r1] \
-  __asm  movq mm3,mm2 \
-  __asm  punpckldq mm2,mm5 \
-  __asm  punpckhdq mm3,mm5 \
-  __asm  movq mm5,[Y+_r3] \
-  /*_y[4] = h0 g0 f0 e0 \
-   mm1  = h1 g1 f1 e1 \
-   mm2  = h2 g2 f2 e2 \
-   mm3  = h3 g3 f3 e3*/ \
-  /*Second 4x4 transpose:*/ \
-  /*mm4 = a3 a2 a1 a0 \
-    mm0 = b3 b2 b1 b0 \
-    mm6 = c3 c2 c1 c0 \
-    mm5 = d3 d2 d1 d0*/ \
-  __asm  movq mm7,mm4 \
-  __asm  punpcklwd mm4,mm0 \
-  __asm  punpckhwd mm7,mm0 \
-  __asm  movq mm0,mm6 \
-  __asm  punpcklwd mm6,mm5 \
-  __asm  punpckhwd mm0,mm5 \
-  /*mm4 = b1 a1 b0 a0 \
-    mm7 = b3 a3 b2 a2 \
-    mm6 = d1 c1 d0 c0 \
-    mm0 = d3 c3 d2 c2*/ \
-  __asm  movq mm5,mm4 \
-  __asm  punpckldq mm4,mm6 \
-  __asm  punpckhdq mm5,mm6 \
-  __asm  movq mm6,mm7 \
-  __asm  punpckhdq mm7,mm0 \
-  __asm  punpckldq mm6,mm0 \
-  /*mm4 = d0 c0 b0 a0 \
-    mm5 = d1 c1 b1 a1 \
-    mm6 = d2 c2 b2 a2 \
-    mm7 = d3 c3 b3 a3*/ \
-}
-
-/*MMX implementation of the fDCT.*/
-void oc_enc_fdct8x8_mmx(ogg_int16_t _y[64],const ogg_int16_t _x[64]){
-  ptrdiff_t a;
-  __asm{
-#define Y eax
-#define A ecx
-#define X edx
-    /*Add two extra bits of working precision to improve accuracy; any more and
-       we could overflow.*/
-    /*We also add biases to correct for some systematic error that remains in
-       the full fDCT->iDCT round trip.*/
-    mov X, _x
-    mov Y, _y
-    movq mm0,[0x00+X]
-    movq mm1,[0x10+X]
-    movq mm2,[0x20+X]
-    movq mm3,[0x30+X]
-    pcmpeqb mm4,mm4
-    pxor mm7,mm7
-    movq mm5,mm0
-    psllw mm0,2
-    pcmpeqw mm5,mm7
-    movq mm7,[0x70+X]
-    psllw mm1,2
-    psubw mm5,mm4
-    psllw mm2,2
-    mov A,1
-    pslld mm5,16
-    movd mm6,A
-    psllq mm5,16
-    mov A,0x10001
-    psllw mm3,2
-    movd mm4,A
-    punpckhwd mm5,mm6
-    psubw mm1,mm6
-    movq mm6,[0x60+X]
-    paddw mm0,mm5
-    movq mm5,[0x50+X]
-    paddw mm0,mm4
-    movq mm4,[0x40+X]
-    /*We inline stage1 of the transform here so we can get better instruction
-       scheduling with the shifts.*/
-    /*mm0=t7'=t0-t7*/
-    psllw mm7,2
-    psubw mm0,mm7
-    psllw mm6,2
-    paddw mm7,mm7
-    /*mm1=t6'=t1-t6*/
-    psllw mm5,2
-    psubw mm1,mm6
-    psllw mm4,2
-    paddw mm6,mm6
-    /*mm2=t5'=t2-t5*/
-    psubw mm2,mm5
-    paddw mm5,mm5
-    /*mm3=t4'=t3-t4*/
-    psubw mm3,mm4
-    paddw mm4,mm4
-    /*mm7=t0'=t0+t7*/
-    paddw mm7,mm0
-    /*mm6=t1'=t1+t6*/
-    paddw mm6,mm1
-    /*mm5=t2'=t2+t5*/
-    paddw mm5,mm2
-    /*mm4=t3'=t3+t4*/
-    paddw mm4,mm3
-    OC_FDCT8x4(0x00,0x10,0x20,0x30,0x40,0x50,0x60,0x70)
-    OC_TRANSPOSE8x4(0x00,0x10,0x20,0x30,0x40,0x50,0x60,0x70)
-    /*Swap out this 8x4 block for the next one.*/
-    movq mm0,[0x08+X]
-    movq [0x30+Y],mm7
-    movq mm7,[0x78+X]
-    movq [0x50+Y],mm1
-    movq mm1,[0x18+X]
-    movq [0x20+Y],mm6
-    movq mm6,[0x68+X]
-    movq [0x60+Y],mm2
-    movq mm2,[0x28+X]
-    movq [0x10+Y],mm5
-    movq mm5,[0x58+X]
-    movq [0x70+Y],mm3
-    movq mm3,[0x38+X]
-    /*And increase its working precision, too.*/
-    psllw mm0,2
-    movq [0x00+Y],mm4
-    psllw mm7,2
-    movq mm4,[0x48+X]
-    /*We inline stage1 of the transform here so we can get better instruction
-       scheduling with the shifts.*/
-    /*mm0=t7'=t0-t7*/
-    psubw mm0,mm7
-    psllw mm1,2
-    paddw mm7,mm7
-    psllw mm6,2
-    /*mm1=t6'=t1-t6*/
-    psubw mm1,mm6
-    psllw mm2,2
-    paddw mm6,mm6
-    psllw mm5,2
-    /*mm2=t5'=t2-t5*/
-    psubw mm2,mm5
-    psllw mm3,2
-    paddw mm5,mm5
-    psllw mm4,2
-    /*mm3=t4'=t3-t4*/
-    psubw mm3,mm4
-    paddw mm4,mm4
-    /*mm7=t0'=t0+t7*/
-    paddw mm7,mm0
-    /*mm6=t1'=t1+t6*/
-    paddw mm6,mm1
-    /*mm5=t2'=t2+t5*/
-    paddw mm5,mm2
-    /*mm4=t3'=t3+t4*/
-    paddw mm4,mm3
-    OC_FDCT8x4(0x08,0x18,0x28,0x38,0x48,0x58,0x68,0x78)
-    OC_TRANSPOSE8x4(0x08,0x18,0x28,0x38,0x48,0x58,0x68,0x78)
-    /*Here the first 4x4 block of output from the last transpose is the second
-       4x4 block of input for the next transform.
-      We have cleverly arranged that it already be in the appropriate place,
-       so we only have to do half the stores and loads.*/
-    movq mm0,[0x00+Y]
-    movq [0x58+Y],mm1
-    movq mm1,[0x10+Y]
-    movq [0x68+Y],mm2
-    movq mm2,[0x20+Y]
-    movq [0x78+Y],mm3
-    movq mm3,[0x30+Y]
-    OC_FDCT_STAGE1_8x4
-    OC_FDCT8x4(0x00,0x10,0x20,0x30,0x08,0x18,0x28,0x38)
-    OC_TRANSPOSE8x4(0x00,0x10,0x20,0x30,0x08,0x18,0x28,0x38)
-    /*mm0={-2}x4*/
-    pcmpeqw mm0,mm0
-    paddw mm0,mm0
-    /*Round the results.*/
-    psubw mm1,mm0
-    psubw mm2,mm0
-    psraw mm1,2
-    psubw mm3,mm0
-    movq [0x18+Y],mm1
-    psraw mm2,2
-    psubw mm4,mm0
-    movq mm1,[0x08+Y]
-    psraw mm3,2
-    psubw mm5,mm0
-    psraw mm4,2
-    psubw mm6,mm0
-    psraw mm5,2
-    psubw mm7,mm0
-    psraw mm6,2
-    psubw mm1,mm0
-    psraw mm7,2
-    movq mm0,[0x40+Y]
-    psraw mm1,2
-    movq [0x30+Y],mm7
-    movq mm7,[0x78+Y]
-    movq [0x08+Y],mm1
-    movq mm1,[0x50+Y]
-    movq [0x20+Y],mm6
-    movq mm6,[0x68+Y]
-    movq [0x28+Y],mm2
-    movq mm2,[0x60+Y]
-    movq [0x10+Y],mm5
-    movq mm5,[0x58+Y]
-    movq [0x38+Y],mm3
-    movq mm3,[0x70+Y]
-    movq [0x00+Y],mm4
-    movq mm4,[0x48+Y]
-    OC_FDCT_STAGE1_8x4
-    OC_FDCT8x4(0x40,0x50,0x60,0x70,0x48,0x58,0x68,0x78)
-    OC_TRANSPOSE8x4(0x40,0x50,0x60,0x70,0x48,0x58,0x68,0x78)
-    /*mm0={-2}x4*/
-    pcmpeqw mm0,mm0
-    paddw mm0,mm0
-    /*Round the results.*/
-    psubw mm1,mm0
-    psubw mm2,mm0
-    psraw mm1,2
-    psubw mm3,mm0
-    movq [0x58+Y],mm1
-    psraw mm2,2
-    psubw mm4,mm0
-    movq mm1,[0x48+Y]
-    psraw mm3,2
-    psubw mm5,mm0
-    movq [0x68+Y],mm2
-    psraw mm4,2
-    psubw mm6,mm0
-    movq [0x78+Y],mm3
-    psraw mm5,2
-    psubw mm7,mm0
-    movq [0x40+Y],mm4
-    psraw mm6,2
-    psubw mm1,mm0
-    movq [0x50+Y],mm5
-    psraw mm7,2
-    movq [0x60+Y],mm6
-    psraw mm1,2
-    movq [0x70+Y],mm7
-    movq [0x48+Y],mm1
-#undef Y
-#undef A
-#undef X
-  }
-}
-
-#endif
+/********************************************************************
+ *                                                                  *
+ * THIS FILE IS PART OF THE OggTheora SOFTWARE CODEC SOURCE CODE.   *
+ * USE, DISTRIBUTION AND REPRODUCTION OF THIS LIBRARY SOURCE IS     *
+ * GOVERNED BY A BSD-STYLE SOURCE LICENSE INCLUDED WITH THIS SOURCE *
+ * IN 'COPYING'. PLEASE READ THESE TERMS BEFORE DISTRIBUTING.       *
+ *                                                                  *
+ * THE Theora SOURCE CODE IS COPYRIGHT (C) 1999-2006                *
+ * by the Xiph.Org Foundation http://www.xiph.org/                  *
+ *                                                                  *
+ ********************************************************************/ 
+ /*MMX fDCT implementation for x86_32*/
+/*$Id: fdct_ses2.c 14579 2008-03-12 06:42:40Z xiphmont $*/
+#include "x86enc.h"
+
+#if defined(OC_X86_ASM)
+
+#define OC_FDCT_STAGE1_8x4  __asm{ \
+  /*Stage 1:*/ \
+  /*mm0=t7'=t0-t7*/ \
+  __asm  psubw mm0,mm7 \
+  __asm  paddw mm7,mm7 \
+  /*mm1=t6'=t1-t6*/ \
+  __asm  psubw mm1, mm6 \
+  __asm  paddw mm6,mm6 \
+  /*mm2=t5'=t2-t5*/ \
+  __asm  psubw mm2,mm5 \
+  __asm  paddw mm5,mm5 \
+  /*mm3=t4'=t3-t4*/ \
+  __asm  psubw mm3,mm4 \
+  __asm  paddw mm4,mm4 \
+  /*mm7=t0'=t0+t7*/ \
+  __asm  paddw mm7,mm0 \
+  /*mm6=t1'=t1+t6*/  \
+  __asm  paddw mm6,mm1 \
+  /*mm5=t2'=t2+t5*/ \
+  __asm  paddw mm5,mm2 \
+  /*mm4=t3'=t3+t4*/ \
+  __asm  paddw mm4,mm3\
+}
+
+#define OC_FDCT8x4(_r0,_r1,_r2,_r3,_r4,_r5,_r6,_r7) __asm{ \
+  /*Stage 2:*/ \
+  /*mm7=t3''=t0'-t3'*/ \
+  __asm  psubw mm7,mm4 \
+  __asm  paddw mm4,mm4 \
+  /*mm6=t2''=t1'-t2'*/ \
+  __asm  psubw mm6,mm5 \
+  __asm  movq [Y+_r6],mm7 \
+  __asm  paddw mm5,mm5 \
+  /*mm1=t5''=t6'-t5'*/ \
+  __asm  psubw mm1,mm2 \
+  __asm  movq [Y+_r2],mm6 \
+  /*mm4=t0''=t0'+t3'*/ \
+  __asm  paddw mm4,mm7 \
+  __asm  paddw mm2,mm2 \
+  /*mm5=t1''=t1'+t2'*/ \
+  __asm  movq [Y+_r0],mm4 \
+  __asm  paddw mm5,mm6 \
+  /*mm2=t6''=t6'+t5'*/ \
+  __asm  paddw mm2,mm1 \
+  __asm  movq [Y+_r4],mm5 \
+  /*mm0=t7', mm1=t5'', mm2=t6'', mm3=t4'.*/ \
+  /*mm4, mm5, mm6, mm7 are free.*/ \
+  /*Stage 3:*/ \
+  /*mm6={2}x4, mm7={27146,0xB500>>1}x2*/ \
+  __asm  mov A,0x5A806A0A \
+  __asm  pcmpeqb mm6,mm6 \
+  __asm  movd mm7,A \
+  __asm  psrlw mm6,15 \
+  __asm  punpckldq mm7,mm7 \
+  __asm  paddw mm6,mm6 \
+  /*mm0=0, m2={-1}x4 \
+    mm5:mm4=t5''*27146+0xB500*/ \
+  __asm  movq mm4,mm1 \
+  __asm  movq mm5,mm1 \
+  __asm  punpcklwd mm4,mm6 \
+  __asm  movq [Y+_r3],mm2 \
+  __asm  pmaddwd mm4,mm7 \
+  __asm  movq [Y+_r7],mm0 \
+  __asm  punpckhwd mm5,mm6 \
+  __asm  pxor mm0,mm0 \
+  __asm  pmaddwd mm5,mm7 \
+  __asm  pcmpeqb mm2,mm2 \
+  /*mm2=t6'', mm1=t5''+(t5''!=0) \
+    mm4=(t5''*27146+0xB500>>16)*/ \
+  __asm  pcmpeqw mm0,mm1 \
+  __asm  psrad mm4,16 \
+  __asm  psubw mm0,mm2 \
+  __asm  movq mm2, [Y+_r3] \
+  __asm  psrad mm5,16 \
+  __asm  paddw mm1,mm0 \
+  __asm  packssdw mm4,mm5 \
+  /*mm4=s=(t5''*27146+0xB500>>16)+t5''+(t5''!=0)>>1*/ \
+  __asm  paddw mm4,mm1 \
+  __asm  movq mm0, [Y+_r7] \
+  __asm  psraw mm4,1 \
+  __asm  movq mm1,mm3 \
+  /*mm3=t4''=t4'+s*/ \
+  __asm  paddw mm3,mm4 \
+  /*mm1=t5'''=t4'-s*/ \
+  __asm  psubw mm1,mm4 \
+  /*mm1=0, mm3={-1}x4 \
+    mm5:mm4=t6''*27146+0xB500*/ \
+  __asm  movq mm4,mm2 \
+  __asm  movq mm5,mm2 \
+  __asm  punpcklwd mm4,mm6 \
+  __asm  movq [Y+_r5],mm1 \
+  __asm  pmaddwd mm4,mm7 \
+  __asm  movq [Y+_r1],mm3 \
+  __asm  punpckhwd mm5,mm6 \
+  __asm  pxor mm1,mm1 \
+  __asm  pmaddwd mm5,mm7 \
+  __asm  pcmpeqb mm3,mm3 \
+  /*mm2=t6''+(t6''!=0), mm4=(t6''*27146+0xB500>>16)*/ \
+  __asm  psrad mm4,16 \
+  __asm  pcmpeqw mm1,mm2 \
+  __asm  psrad mm5,16 \
+  __asm  psubw mm1,mm3 \
+  __asm  packssdw mm4,mm5 \
+  __asm  paddw mm2,mm1 \
+  /*mm1=t1'' \
+    mm4=s=(t6''*27146+0xB500>>16)+t6''+(t6''!=0)>>1*/ \
+  __asm  paddw mm4,mm2 \
+  __asm  movq mm1,[Y+_r4] \
+  __asm  psraw mm4,1 \
+  __asm  movq mm2,mm0 \
+  /*mm7={54491-0x7FFF,0x7FFF}x2 \
+    mm0=t7''=t7'+s*/ \
+  __asm  paddw mm0,mm4 \
+  /*mm2=t6'''=t7'-s*/ \
+  __asm  psubw mm2,mm4 \
+  /*Stage 4:*/ \
+  /*mm0=0, mm2=t0'' \
+    mm5:mm4=t1''*27146+0xB500*/ \
+  __asm  movq mm4,mm1 \
+  __asm  movq mm5,mm1 \
+  __asm  punpcklwd mm4,mm6 \
+  __asm  movq [Y+_r3],mm2 \
+  __asm  pmaddwd mm4,mm7 \
+  __asm  movq mm2,[Y+_r0] \
+  __asm  punpckhwd mm5,mm6 \
+  __asm  movq [Y+_r7],mm0 \
+  __asm  pmaddwd mm5,mm7 \
+  __asm  pxor mm0,mm0 \
+  /*mm7={27146,0x4000>>1}x2 \
+    mm0=s=(t1''*27146+0xB500>>16)+t1''+(t1''!=0)*/ \
+  __asm  psrad mm4,16 \
+  __asm  mov A,0x20006A0A \
+  __asm  pcmpeqw mm0,mm1 \
+  __asm  movd mm7,A \
+  __asm  psrad mm5,16 \
+  __asm  psubw mm0,mm3 \
+  __asm  packssdw mm4,mm5 \
+  __asm  paddw mm0,mm1 \
+  __asm  punpckldq mm7,mm7 \
+  __asm  paddw mm0,mm4 \
+  /*mm6={0x00000E3D}x2 \
+    mm1=-(t0''==0), mm5:mm4=t0''*27146+0x4000*/ \
+  __asm  movq mm4,mm2 \
+  __asm  movq mm5,mm2 \
+  __asm  punpcklwd mm4,mm6 \
+  __asm  mov A,0x0E3D \
+  __asm  pmaddwd mm4,mm7 \
+  __asm  punpckhwd mm5,mm6 \
+  __asm  movd mm6,A \
+  __asm  pmaddwd mm5,mm7 \
+  __asm  pxor mm1,mm1 \
+  __asm  punpckldq mm6,mm6 \
+  __asm  pcmpeqw mm1,mm2 \
+  /*mm4=r=(t0''*27146+0x4000>>16)+t0''+(t0''!=0)*/ \
+  __asm  psrad mm4,16 \
+  __asm  psubw mm1,mm3 \
+  __asm  psrad mm5,16 \
+  __asm  paddw mm2,mm1 \
+  __asm  packssdw mm4,mm5 \
+  __asm  movq mm1,[Y+_r5] \
+  __asm  paddw mm4,mm2 \
+  /*mm2=t6'', mm0=_y[0]=u=r+s>>1 \
+    The naive implementation could cause overflow, so we use \
+     u=(r&s)+((r^s)>>1).*/ \
+  __asm  movq mm2,[Y+_r3] \
+  __asm  movq mm7,mm0 \
+  __asm  pxor mm0,mm4 \
+  __asm  pand mm7,mm4 \
+  __asm  psraw mm0,1 \
+  __asm  mov A,0x7FFF54DC \
+  __asm  paddw mm0,mm7 \
+  __asm  movd mm7,A \
+  /*mm7={54491-0x7FFF,0x7FFF}x2 \
+    mm4=_y[4]=v=r-u*/ \
+  __asm  psubw mm4,mm0 \
+  __asm  punpckldq mm7,mm7 \
+  __asm  movq [Y+_r4],mm4 \
+  /*mm0=0, mm7={36410}x4 \
+    mm1=(t5'''!=0), mm5:mm4=54491*t5'''+0x0E3D*/ \
+  __asm  movq mm4,mm1 \
+  __asm  movq mm5,mm1 \
+  __asm  punpcklwd mm4,mm1 \
+  __asm  mov A,0x8E3A8E3A \
+  __asm  pmaddwd mm4,mm7 \
+  __asm  movq [Y+_r0],mm0 \
+  __asm  punpckhwd mm5,mm1 \
+  __asm  pxor mm0,mm0 \
+  __asm  pmaddwd mm5,mm7 \
+  __asm  pcmpeqw mm1,mm0 \
+  __asm  movd mm7,A \
+  __asm  psubw mm1,mm3 \
+  __asm  punpckldq mm7,mm7 \
+  __asm  paddd mm4,mm6 \
+  __asm  paddd mm5,mm6 \
+  /*mm0=0 \
+    mm3:mm1=36410*t6'''+((t5'''!=0)<<16)*/ \
+  __asm  movq mm6,mm2 \
+  __asm  movq mm3,mm2 \
+  __asm  pmulhw mm6,mm7 \
+  __asm  paddw mm1,mm2 \
+  __asm  pmullw mm3,mm7 \
+  __asm  pxor mm0,mm0 \
+  __asm  paddw mm6,mm1 \
+  __asm  movq mm1,mm3 \
+  __asm  punpckhwd mm3,mm6 \
+  __asm  punpcklwd mm1,mm6 \
+  /*mm3={-1}x4, mm6={1}x4 \
+    mm4=_y[5]=u=(54491*t5'''+36410*t6'''+0x0E3D>>16)+(t5'''!=0)*/ \
+  __asm  paddd mm5,mm3 \
+  __asm  paddd mm4,mm1 \
+  __asm  psrad mm5,16 \
+  __asm  pxor mm6,mm6 \
+  __asm  psrad mm4,16 \
+  __asm  pcmpeqb mm3,mm3 \
+  __asm  packssdw mm4,mm5 \
+  __asm  psubw mm6,mm3 \
+  /*mm1=t7'', mm7={26568,0x3400}x2 \
+    mm2=s=t6'''-(36410*u>>16)*/ \
+  __asm  movq mm1,mm4 \
+  __asm  mov A,0x340067C8 \
+  __asm  pmulhw mm4,mm7 \
+  __asm  movd mm7,A \
+  __asm  movq [Y+_r5],mm1 \
+  __asm  punpckldq mm7,mm7 \
+  __asm  paddw mm4,mm1 \
+  __asm  movq mm1,[Y+_r7] \
+  __asm  psubw mm2,mm4 \
+  /*mm6={0x00007B1B}x2 \
+    mm0=(s!=0), mm5:mm4=s*26568+0x3400*/ \
+  __asm  movq mm4,mm2 \
+  __asm  movq mm5,mm2 \
+  __asm  punpcklwd mm4,mm6 \
+  __asm  pcmpeqw mm0,mm2 \
+  __asm  pmaddwd mm4,mm7 \
+  __asm  mov A,0x7B1B \
+  __asm  punpckhwd mm5,mm6 \
+  __asm  movd mm6,A \
+  __asm  pmaddwd mm5,mm7 \
+  __asm  psubw mm0,mm3 \
+  __asm  punpckldq mm6,mm6 \
+  /*mm7={64277-0x7FFF,0x7FFF}x2 \
+    mm2=_y[3]=v=(s*26568+0x3400>>17)+s+(s!=0)*/ \
+  __asm  psrad mm4,17 \
+  __asm  paddw mm2,mm0 \
+  __asm  psrad mm5,17 \
+  __asm  mov A,0x7FFF7B16 \
+  __asm  packssdw mm4,mm5 \
+  __asm  movd mm7,A \
+  __asm  paddw mm2,mm4 \
+  __asm  punpckldq mm7,mm7 \
+  /*mm0=0, mm7={12785}x4 \
+    mm1=(t7''!=0), mm2=t4'', mm5:mm4=64277*t7''+0x7B1B*/ \
+  __asm  movq mm4,mm1 \
+  __asm  movq mm5,mm1 \
+  __asm  movq [Y+_r3],mm2 \
+  __asm  punpcklwd mm4,mm1 \
+  __asm  movq mm2,[Y+_r1] \
+  __asm  pmaddwd mm4,mm7 \
+  __asm  mov A,0x31F131F1 \
+  __asm  punpckhwd mm5,mm1 \
+  __asm  pxor mm0,mm0 \
+  __asm  pmaddwd mm5,mm7 \
+  __asm  pcmpeqw mm1,mm0 \
+  __asm  movd mm7,A \
+  __asm  psubw mm1,mm3 \
+  __asm  punpckldq mm7,mm7 \
+  __asm  paddd mm4,mm6 \
+  __asm  paddd mm5,mm6 \
+  /*mm3:mm1=12785*t4'''+((t7''!=0)<<16)*/ \
+  __asm  movq mm6,mm2 \
+  __asm  movq mm3,mm2 \
+  __asm  pmulhw mm6,mm7 \
+  __asm  pmullw mm3,mm7 \
+  __asm  paddw mm6,mm1 \
+  __asm  movq mm1,mm3 \
+  __asm  punpckhwd mm3,mm6 \
+  __asm  punpcklwd mm1,mm6 \
+  /*mm3={-1}x4, mm6={1}x4 \
+    mm4=_y[1]=u=(12785*t4'''+64277*t7''+0x7B1B>>16)+(t7''!=0)*/ \
+  __asm  paddd mm5,mm3 \
+  __asm  paddd mm4,mm1 \
+  __asm  psrad mm5,16 \
+  __asm  pxor mm6,mm6 \
+  __asm  psrad mm4,16 \
+  __asm  pcmpeqb mm3,mm3 \
+  __asm  packssdw mm4,mm5 \
+  __asm  psubw mm6,mm3 \
+  /*mm1=t3'', mm7={20539,0x3000}x2 \
+    mm4=s=(12785*u>>16)-t4''*/ \
+  __asm  movq [Y+_r1],mm4 \
+  __asm  pmulhw mm4,mm7 \
+  __asm  mov A,0x3000503B \
+  __asm  movq mm1,[Y+_r6] \
+  __asm  movd mm7,A \
+  __asm  psubw mm4,mm2 \
+  __asm  punpckldq mm7,mm7 \
+  /*mm6={0x00006CB7}x2 \
+    mm0=(s!=0), mm5:mm4=s*20539+0x3000*/ \
+  __asm  movq mm5,mm4 \
+  __asm  movq mm2,mm4 \
+  __asm  punpcklwd mm4,mm6 \
+  __asm  pcmpeqw mm0,mm2 \
+  __asm  pmaddwd mm4,mm7 \
+  __asm  mov A,0x6CB7 \
+  __asm  punpckhwd mm5,mm6 \
+  __asm  movd mm6,A \
+  __asm  pmaddwd mm5,mm7 \
+  __asm  psubw mm0,mm3 \
+  __asm  punpckldq mm6,mm6 \
+  /*mm7={60547-0x7FFF,0x7FFF}x2 \
+    mm2=_y[7]=v=(s*20539+0x3000>>20)+s+(s!=0)*/ \
+  __asm  psrad mm4,20 \
+  __asm  paddw mm2,mm0 \
+  __asm  psrad mm5,20 \
+  __asm  mov A,0x7FFF6C84 \
+  __asm  packssdw mm4,mm5 \
+  __asm  movd mm7,A \
+  __asm  paddw mm2,mm4 \
+  __asm  punpckldq mm7,mm7 \
+  /*mm0=0, mm7={25080}x4 \
+    mm2=t2'', mm5:mm4=60547*t3''+0x6CB7*/ \
+  __asm  movq mm4,mm1 \
+  __asm  movq mm5,mm1 \
+  __asm  movq [Y+_r7],mm2 \
+  __asm  punpcklwd mm4,mm1 \
+  __asm  movq mm2,[Y+_r2] \
+  __asm  pmaddwd mm4,mm7 \
+  __asm  mov A,0x61F861F8 \
+  __asm  punpckhwd mm5,mm1 \
+  __asm  pxor mm0,mm0 \
+  __asm  pmaddwd mm5,mm7 \
+  __asm  movd mm7,A \
+  __asm  pcmpeqw mm1,mm0 \
+  __asm  psubw mm1,mm3 \
+  __asm  punpckldq mm7,mm7 \
+  __asm  paddd mm4,mm6 \
+  __asm  paddd mm5,mm6 \
+  /*mm3:mm1=25080*t2''+((t3''!=0)<<16)*/ \
+  __asm  movq mm6,mm2 \
+  __asm  movq mm3,mm2 \
+  __asm  pmulhw mm6,mm7 \
+  __asm  pmullw mm3,mm7 \
+  __asm  paddw mm6,mm1 \
+  __asm  movq mm1,mm3 \
+  __asm  punpckhwd mm3,mm6 \
+  __asm  punpcklwd mm1,mm6 \
+  /*mm1={-1}x4 \
+    mm4=u=(25080*t2''+60547*t3''+0x6CB7>>16)+(t3''!=0)*/ \
+  __asm  paddd mm5,mm3 \
+  __asm  paddd mm4,mm1 \
+  __asm  psrad mm5,16 \
+  __asm  mov A,0x28005460 \
+  __asm  psrad mm4,16 \
+  __asm  pcmpeqb mm1,mm1 \
+  __asm  packssdw mm4,mm5 \
+  /*mm5={1}x4, mm6=_y[2]=u, mm7={21600,0x2800}x2 \
+    mm4=s=(25080*u>>16)-t2''*/ \
+  __asm  movq mm6,mm4 \
+  __asm  pmulhw mm4,mm7 \
+  __asm  pxor mm5,mm5 \
+  __asm  movd mm7,A \
+  __asm  psubw mm5,mm1 \
+  __asm  punpckldq mm7,mm7 \
+  __asm  psubw mm4,mm2 \
+  /*mm2=s+(s!=0) \
+    mm4:mm3=s*21600+0x2800*/ \
+  __asm  movq mm3,mm4 \
+  __asm  movq mm2,mm4 \
+  __asm  punpckhwd mm4,mm5 \
+  __asm  pcmpeqw mm0,mm2 \
+  __asm  pmaddwd mm4,mm7 \
+  __asm  psubw mm0,mm1 \
+  __asm  punpcklwd mm3,mm5 \
+  __asm  paddw mm2,mm0 \
+  __asm  pmaddwd mm3,mm7 \
+  /*mm0=_y[4], mm1=_y[7], mm4=_y[0], mm5=_y[5] \
+    mm3=_y[6]=v=(s*21600+0x2800>>18)+s+(s!=0)*/ \
+  __asm  movq mm0,[Y+_r4] \
+  __asm  psrad mm4,18 \
+  __asm  movq mm5,[Y+_r5] \
+  __asm  psrad mm3,18 \
+  __asm  movq mm1,[Y+_r7] \
+  __asm  packssdw mm3,mm4 \
+  __asm  movq mm4,[Y+_r0] \
+  __asm  paddw mm3,mm2 \
+}
+
+/*On input, mm4=_y[0], mm6=_y[2], mm0=_y[4], mm5=_y[5], mm3=_y[6], mm1=_y[7].
+  On output, {_y[4],mm1,mm2,mm3} contains the transpose of _y[4...7] and
+   {mm4,mm5,mm6,mm7} contains the transpose of _y[0...3].*/
+#define OC_TRANSPOSE8x4(_r0,_r1,_r2,_r3,_r4,_r5,_r6,_r7) __asm{ \
+  /*First 4x4 transpose:*/ \
+  /*mm0 = e3 e2 e1 e0 \
+    mm5 = f3 f2 f1 f0 \
+    mm3 = g3 g2 g1 g0 \
+    mm1 = h3 h2 h1 h0*/ \
+  __asm  movq mm2,mm0 \
+  __asm  punpcklwd mm0,mm5 \
+  __asm  punpckhwd mm2,mm5 \
+  __asm  movq mm5,mm3 \
+  __asm  punpcklwd mm3,mm1 \
+  __asm  punpckhwd mm5,mm1 \
+  /*mm0 = f1 e1 f0 e0 \
+    mm2 = f3 e3 f2 e2 \
+    mm3 = h1 g1 h0 g0 \
+    mm5 = h3 g3 h2 g2*/ \
+  __asm  movq mm1,mm0 \
+  __asm  punpckldq mm0,mm3 \
+  __asm  movq [Y+_r4],mm0 \
+  __asm  punpckhdq mm1,mm3 \
+  __asm  movq mm0,[Y+_r1] \
+  __asm  movq mm3,mm2 \
+  __asm  punpckldq mm2,mm5 \
+  __asm  punpckhdq mm3,mm5 \
+  __asm  movq mm5,[Y+_r3] \
+  /*_y[4] = h0 g0 f0 e0 \
+   mm1  = h1 g1 f1 e1 \
+   mm2  = h2 g2 f2 e2 \
+   mm3  = h3 g3 f3 e3*/ \
+  /*Second 4x4 transpose:*/ \
+  /*mm4 = a3 a2 a1 a0 \
+    mm0 = b3 b2 b1 b0 \
+    mm6 = c3 c2 c1 c0 \
+    mm5 = d3 d2 d1 d0*/ \
+  __asm  movq mm7,mm4 \
+  __asm  punpcklwd mm4,mm0 \
+  __asm  punpckhwd mm7,mm0 \
+  __asm  movq mm0,mm6 \
+  __asm  punpcklwd mm6,mm5 \
+  __asm  punpckhwd mm0,mm5 \
+  /*mm4 = b1 a1 b0 a0 \
+    mm7 = b3 a3 b2 a2 \
+    mm6 = d1 c1 d0 c0 \
+    mm0 = d3 c3 d2 c2*/ \
+  __asm  movq mm5,mm4 \
+  __asm  punpckldq mm4,mm6 \
+  __asm  punpckhdq mm5,mm6 \
+  __asm  movq mm6,mm7 \
+  __asm  punpckhdq mm7,mm0 \
+  __asm  punpckldq mm6,mm0 \
+  /*mm4 = d0 c0 b0 a0 \
+    mm5 = d1 c1 b1 a1 \
+    mm6 = d2 c2 b2 a2 \
+    mm7 = d3 c3 b3 a3*/ \
+}
+
+/*MMX implementation of the fDCT.*/
+void oc_enc_fdct8x8_mmx(ogg_int16_t _y[64],const ogg_int16_t _x[64]){
+  ptrdiff_t a;
+  __asm{
+#define Y eax
+#define A ecx
+#define X edx
+    /*Add two extra bits of working precision to improve accuracy; any more and
+       we could overflow.*/
+    /*We also add biases to correct for some systematic error that remains in
+       the full fDCT->iDCT round trip.*/
+    mov X, _x
+    mov Y, _y
+    movq mm0,[0x00+X]
+    movq mm1,[0x10+X]
+    movq mm2,[0x20+X]
+    movq mm3,[0x30+X]
+    pcmpeqb mm4,mm4
+    pxor mm7,mm7
+    movq mm5,mm0
+    psllw mm0,2
+    pcmpeqw mm5,mm7
+    movq mm7,[0x70+X]
+    psllw mm1,2
+    psubw mm5,mm4
+    psllw mm2,2
+    mov A,1
+    pslld mm5,16
+    movd mm6,A
+    psllq mm5,16
+    mov A,0x10001
+    psllw mm3,2
+    movd mm4,A
+    punpckhwd mm5,mm6
+    psubw mm1,mm6
+    movq mm6,[0x60+X]
+    paddw mm0,mm5
+    movq mm5,[0x50+X]
+    paddw mm0,mm4
+    movq mm4,[0x40+X]
+    /*We inline stage1 of the transform here so we can get better instruction
+       scheduling with the shifts.*/
+    /*mm0=t7'=t0-t7*/
+    psllw mm7,2
+    psubw mm0,mm7
+    psllw mm6,2
+    paddw mm7,mm7
+    /*mm1=t6'=t1-t6*/
+    psllw mm5,2
+    psubw mm1,mm6
+    psllw mm4,2
+    paddw mm6,mm6
+    /*mm2=t5'=t2-t5*/
+    psubw mm2,mm5
+    paddw mm5,mm5
+    /*mm3=t4'=t3-t4*/
+    psubw mm3,mm4
+    paddw mm4,mm4
+    /*mm7=t0'=t0+t7*/
+    paddw mm7,mm0
+    /*mm6=t1'=t1+t6*/
+    paddw mm6,mm1
+    /*mm5=t2'=t2+t5*/
+    paddw mm5,mm2
+    /*mm4=t3'=t3+t4*/
+    paddw mm4,mm3
+    OC_FDCT8x4(0x00,0x10,0x20,0x30,0x40,0x50,0x60,0x70)
+    OC_TRANSPOSE8x4(0x00,0x10,0x20,0x30,0x40,0x50,0x60,0x70)
+    /*Swap out this 8x4 block for the next one.*/
+    movq mm0,[0x08+X]
+    movq [0x30+Y],mm7
+    movq mm7,[0x78+X]
+    movq [0x50+Y],mm1
+    movq mm1,[0x18+X]
+    movq [0x20+Y],mm6
+    movq mm6,[0x68+X]
+    movq [0x60+Y],mm2
+    movq mm2,[0x28+X]
+    movq [0x10+Y],mm5
+    movq mm5,[0x58+X]
+    movq [0x70+Y],mm3
+    movq mm3,[0x38+X]
+    /*And increase its working precision, too.*/
+    psllw mm0,2
+    movq [0x00+Y],mm4
+    psllw mm7,2
+    movq mm4,[0x48+X]
+    /*We inline stage1 of the transform here so we can get better instruction
+       scheduling with the shifts.*/
+    /*mm0=t7'=t0-t7*/
+    psubw mm0,mm7
+    psllw mm1,2
+    paddw mm7,mm7
+    psllw mm6,2
+    /*mm1=t6'=t1-t6*/
+    psubw mm1,mm6
+    psllw mm2,2
+    paddw mm6,mm6
+    psllw mm5,2
+    /*mm2=t5'=t2-t5*/
+    psubw mm2,mm5
+    psllw mm3,2
+    paddw mm5,mm5
+    psllw mm4,2
+    /*mm3=t4'=t3-t4*/
+    psubw mm3,mm4
+    paddw mm4,mm4
+    /*mm7=t0'=t0+t7*/
+    paddw mm7,mm0
+    /*mm6=t1'=t1+t6*/
+    paddw mm6,mm1
+    /*mm5=t2'=t2+t5*/
+    paddw mm5,mm2
+    /*mm4=t3'=t3+t4*/
+    paddw mm4,mm3
+    OC_FDCT8x4(0x08,0x18,0x28,0x38,0x48,0x58,0x68,0x78)
+    OC_TRANSPOSE8x4(0x08,0x18,0x28,0x38,0x48,0x58,0x68,0x78)
+    /*Here the first 4x4 block of output from the last transpose is the second
+       4x4 block of input for the next transform.
+      We have cleverly arranged that it already be in the appropriate place,
+       so we only have to do half the stores and loads.*/
+    movq mm0,[0x00+Y]
+    movq [0x58+Y],mm1
+    movq mm1,[0x10+Y]
+    movq [0x68+Y],mm2
+    movq mm2,[0x20+Y]
+    movq [0x78+Y],mm3
+    movq mm3,[0x30+Y]
+    OC_FDCT_STAGE1_8x4
+    OC_FDCT8x4(0x00,0x10,0x20,0x30,0x08,0x18,0x28,0x38)
+    OC_TRANSPOSE8x4(0x00,0x10,0x20,0x30,0x08,0x18,0x28,0x38)
+    /*mm0={-2}x4*/
+    pcmpeqw mm0,mm0
+    paddw mm0,mm0
+    /*Round the results.*/
+    psubw mm1,mm0
+    psubw mm2,mm0
+    psraw mm1,2
+    psubw mm3,mm0
+    movq [0x18+Y],mm1
+    psraw mm2,2
+    psubw mm4,mm0
+    movq mm1,[0x08+Y]
+    psraw mm3,2
+    psubw mm5,mm0
+    psraw mm4,2
+    psubw mm6,mm0
+    psraw mm5,2
+    psubw mm7,mm0
+    psraw mm6,2
+    psubw mm1,mm0
+    psraw mm7,2
+    movq mm0,[0x40+Y]
+    psraw mm1,2
+    movq [0x30+Y],mm7
+    movq mm7,[0x78+Y]
+    movq [0x08+Y],mm1
+    movq mm1,[0x50+Y]
+    movq [0x20+Y],mm6
+    movq mm6,[0x68+Y]
+    movq [0x28+Y],mm2
+    movq mm2,[0x60+Y]
+    movq [0x10+Y],mm5
+    movq mm5,[0x58+Y]
+    movq [0x38+Y],mm3
+    movq mm3,[0x70+Y]
+    movq [0x00+Y],mm4
+    movq mm4,[0x48+Y]
+    OC_FDCT_STAGE1_8x4
+    OC_FDCT8x4(0x40,0x50,0x60,0x70,0x48,0x58,0x68,0x78)
+    OC_TRANSPOSE8x4(0x40,0x50,0x60,0x70,0x48,0x58,0x68,0x78)
+    /*mm0={-2}x4*/
+    pcmpeqw mm0,mm0
+    paddw mm0,mm0
+    /*Round the results.*/
+    psubw mm1,mm0
+    psubw mm2,mm0
+    psraw mm1,2
+    psubw mm3,mm0
+    movq [0x58+Y],mm1
+    psraw mm2,2
+    psubw mm4,mm0
+    movq mm1,[0x48+Y]
+    psraw mm3,2
+    psubw mm5,mm0
+    movq [0x68+Y],mm2
+    psraw mm4,2
+    psubw mm6,mm0
+    movq [0x78+Y],mm3
+    psraw mm5,2
+    psubw mm7,mm0
+    movq [0x40+Y],mm4
+    psraw mm6,2
+    psubw mm1,mm0
+    movq [0x50+Y],mm5
+    psraw mm7,2
+    movq [0x60+Y],mm6
+    psraw mm1,2
+    movq [0x70+Y],mm7
+    movq [0x48+Y],mm1
+#undef Y
+#undef A
+#undef X
+  }
+}
+
+#endif
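
Note on the `OC_TRANSPOSE8x4` macro in the file above: each 4x4 transpose is built from two rounds of unpack instructions, first on 16-bit words (`punpcklwd`/`punpckhwd`), then on 32-bit dwords (`punpckldq`/`punpckhdq`). The following is a minimal scalar sketch of that idea in plain C, not the libtheora code itself. The type `v4` and the helpers `unpack_lo_w`, `unpack_hi_w`, `unpack_lo_d`, `unpack_hi_d` and `transpose4x4` are hypothetical names introduced only for illustration; each helper models one MMX register as four 16-bit lanes with lane 0 lowest.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of one MMX register: four 16-bit lanes, w[0] lowest. */
typedef struct { int16_t w[4]; } v4;

/* Models punpcklwd dst,src: interleave the two low words of dst and src. */
static v4 unpack_lo_w(v4 d, v4 s) {
  v4 r = {{ d.w[0], s.w[0], d.w[1], s.w[1] }};
  return r;
}

/* Models punpckhwd dst,src: interleave the two high words of dst and src. */
static v4 unpack_hi_w(v4 d, v4 s) {
  v4 r = {{ d.w[2], s.w[2], d.w[3], s.w[3] }};
  return r;
}

/* Models punpckldq dst,src: low dword of dst, then low dword of src. */
static v4 unpack_lo_d(v4 d, v4 s) {
  v4 r = {{ d.w[0], d.w[1], s.w[0], s.w[1] }};
  return r;
}

/* Models punpckhdq dst,src: high dword of dst, then high dword of src. */
static v4 unpack_hi_d(v4 d, v4 s) {
  v4 r = {{ d.w[2], d.w[3], s.w[2], s.w[3] }};
  return r;
}

/* Transposes the 4x4 block held in row[0..3] (rows a, b, c, d) in place. */
static void transpose4x4(v4 row[4]) {
  /* Round 1: pair up words from adjacent rows. */
  v4 ab_lo = unpack_lo_w(row[0], row[1]); /* b1 a1 b0 a0 */
  v4 ab_hi = unpack_hi_w(row[0], row[1]); /* b3 a3 b2 a2 */
  v4 cd_lo = unpack_lo_w(row[2], row[3]); /* d1 c1 d0 c0 */
  v4 cd_hi = unpack_hi_w(row[2], row[3]); /* d3 c3 d2 c2 */
  /* Round 2: pair up dwords to complete each column. */
  row[0] = unpack_lo_d(ab_lo, cd_lo);     /* d0 c0 b0 a0 */
  row[1] = unpack_hi_d(ab_lo, cd_lo);     /* d1 c1 b1 a1 */
  row[2] = unpack_lo_d(ab_hi, cd_hi);     /* d2 c2 b2 a2 */
  row[3] = unpack_hi_d(ab_hi, cd_hi);     /* d3 c3 b3 a3 */
}
```

The lane comments use the same high-to-low notation as the macro's own comments, so the intermediate values can be compared line by line. The real macro additionally interleaves loads and stores through `[Y+_rN]` because only eight MMX registers are available; this sketch ignores that scheduling.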
diff --git a/thirdparty/nanosvg/LICENSE.txt b/thirdparty/nanosvg/LICENSE.txt
index 6fde401cb2..f896f2eb0f 100644
--- a/thirdparty/nanosvg/LICENSE.txt
+++ b/thirdparty/nanosvg/LICENSE.txt
@@ -1,18 +1,18 @@
-Copyright (c) 2013-14 Mikko Mononen memon@inside.org
-
-This software is provided 'as-is', without any express or implied
-warranty.  In no event will the authors be held liable for any damages
-arising from the use of this software.
-
-Permission is granted to anyone to use this software for any purpose,
-including commercial applications, and to alter it and redistribute it
-freely, subject to the following restrictions:
-
-1. The origin of this software must not be misrepresented; you must not
-claim that you wrote the original software. If you use this software
-in a product, an acknowledgment in the product documentation would be
-appreciated but is not required.
-2. Altered source versions must be plainly marked as such, and must not be
-misrepresented as being the original software.
-3. This notice may not be removed or altered from any source distribution.
-
+Copyright (c) 2013-14 Mikko Mononen memon@inside.org
+
+This software is provided 'as-is', without any express or implied
+warranty.  In no event will the authors be held liable for any damages
+arising from the use of this software.
+
+Permission is granted to anyone to use this software for any purpose,
+including commercial applications, and to alter it and redistribute it
+freely, subject to the following restrictions:
+
+1. The origin of this software must not be misrepresented; you must not
+claim that you wrote the original software. If you use this software
+in a product, an acknowledgment in the product documentation would be
+appreciated but is not required.
+2. Altered source versions must be plainly marked as such, and must not be
+misrepresented as being the original software.
+3. This notice may not be removed or altered from any source distribution.
+