 Status Closed
 Percent Complete
 Task Type Patches
 Category Codecs
 Assigned To Noone
 Operating System All players
 Severity Low
 Priority Very Low
 Reported Version Release 3.6
 Due in Version Undecided

Due Date
Undecided
 Votes
 Private
FS#11498  Speed optimization to libwmapro
This flyspray contains speed optimizations to libwmapro.
v01 introduces asm routines for multiplications. Furthermore it adds fixmul31 and fixmul16 as faster routines can be used for those.
Decoding speed for wmapro_173k.wma on PP5022: 43.1 MHz (svn: 49.6 MHz)
Test on Coldfire is needed.
Closed by mtarek16
20100805 18:56
Reason for closing: Accepted
Additional comments about closing:
20100805 18:56
Reason for closing: Accepted
Additional comments about closing:
Committed in r27703.
Loading...
Available keyboard shortcuts
 Alt + ⇧ Shift + l Login Dialog / Logout
 Alt + ⇧ Shift + a Add new task
 Alt + ⇧ Shift + m My searches
 Alt + ⇧ Shift + t focus taskid search
Tasklist
 o open selected task
 j move cursor down
 k move cursor up
Task Details
 n Next task
 p Previous task
 Alt + ⇧ Shift + e ↵ Enter Edit this task
 Alt + ⇧ Shift + w watch task
 Alt + ⇧ Shift + y Close Task
Task Editing
 Alt + ⇧ Shift + s save task
For coldfire it probably makes sense to create two variants of vector_fixmul_scalar for the two different shifts used (16 and 24) since the fixmulshift function is pretty slow with 2 branches (othoh those branches could be eliminated here anyway since the shift is known to be less than 31).
edit: the coldifre fixmul32 function in libwma/wmafixed.h is doing the same as the fixmul16 in this patch but is 2 cycles faster and uses one register less
I did not see there are only used two variant of vector_fixmul_scalar(). You are right, we should definately define a 24 and a 16 bitshift variant of either vector_fixmul_scalar().
I will change Coldfire's fixmul16() to your proposed implementation. Is this also valid for other codecs implementations (e.g. atrac, mpc)? Or does this faster implementation use any knowledge about the codec's fixed point representation (e.g. 14 bits fract part)…
it does what the one in this patch does, takes the lower 16 bits of the high half of the 64bit result and make them the high 16 bits of the 32 bit final result and the high 16 bits of the lower half and make them the lower 16 bits of the final 32 bit result or (int32_t)((int64_t)x*y»16), but you are right that it will not work for shifts other than 16.
Define fixmul24(), define two specific variants of vector_fixmul_scalar() (vector_fixmul_scalar_16 and vector_fixmul_scalar_24). Speed is same as measured with v01.
Define IRAM macros for different CPU types. Use IRAM for <WMAProDecodeCtx.tmp> and as many <WMAProChannelCtx.out> as fit to IRAM of the specific model.
Decoding speed for wmapro_173k.wma on PP5022: 37.5 MHz (svn: 49.6 MHz)
Differ between
a) models with large IRAM → put <WMAProDecodeCtx.tmp> to IRAM.
b) models with normal IRAM → cannot put <WMAProDecodeCtx.tmp> to IRAM, but move several window tables to IRAM as second best option.
Edit: Histogram of window length for all rockbox wmapro samples is:
len 128 = 1600 calls
len 256 = 370 calls
len 512 = 370 calls
len 1024 = 960 calls
len 2048 = 5700 calls
len 4096 = 0 calls
Decoding speed for wmapro_173k.wma on PP5022: 37.3 MHz (svn: 49.6 MHz)
Edit 2:
mcf5249 wmapro_173k.wma; svn: 179.40MHz 69.22% realtime; wmapro_v04.patch: 117.63MHz 105.57% realtime
This patch produces non interleaved output from wma pro decoder. Works directly on the buffers in wmaprodec.c instead of copying the data.
Minor change. We do not need vector_fixmul_scalar_16(), use vector_fixmul_scalar_24() instead. vector_fixmul_scalar_16() was only used to scale with sqrt(2), we can use higher precision for this rare use case.
Submitted wmapro_v05 with r27582.
Edit:
Submitted noninterleaving patch with r27583.
Add some comments and unroll loop in vector_fixmul_scalar(). Next step is to add some arm asm for vector_fixmul_scalar() and maybe vector_fixmul_window().
Decoding speed for wmapro_173k.wma on PP5022: 35.6 MHz (svn r27589: 35.9 MHz)
v06 submitted with r27595.
v07: Approach to use larger fract part for mdct and windowing. This preserves higher precision. The larger fract part is introduced via using fixmul16 instead of fixmul24 in the vector_fixmul_scalar() function. The asm for Coldfire has not been changed yet, especially Coldfire will speedup a bit through this change as well.
ToDo: Somebody to change the Coldfire asm and some cleanup of this change.
Doing this with the current coldfire fixmul16 asm is pretty trivial but it think there's a better way: switching the emac to integer mode because we can then get the result with only one multiplication instead of the 2 used in the fixmul16 function, otoh, if this multiplication overflows 48 bits that will not work very well i think, do you know if this can overflow into the top 16 bits of the 64 bit result?
If would overflow into the upper 16 bits the result after fixmul16 would overflow anyway. The deeper I think about this change the less convinced I am it will not internally overflow…
1) Requantized values were seen to be <600 * (1«shift), shift's worst case is for 128size window → shift = 17  (83) = 12. ⇒ 600*(1«12) = ~(1«22)
2) quant (the scaling) was seen to go up to idx 106 ⇒ 10^(156/20) = ~(1«26)
3) This results in (1«22)*(1«26) = (1«48)
quite tight….
Some cleanup. Still the fixmul16implementation must be merged to the Coldfire asm of vector_fixmul_scalar().
v08 was committed in r27703. Thanks !
Task left open till coldfire vector_fixmul_scalar is done.
here are my takes at the CF asm of fixmul_scalar : http://www.pastie.org/1076754
Hi, i tested a version similar to your version 1 and also another approach switching the emac to unsigned integer mode and using that to get the lower half of the result but both turned out to be slower than the c loop using the fixmul16 macro so i think we can leave this as it is and close this task.
edit: my idea above about using the extension word of the emac to the the full result with one multiplication can not work.
Is version 2 correct then ? If so, it would be faster than the c loop.
no version 2 is not correct, we cannot get away with doing only one multiply because the multiplier yields only a 40 bit result, i was confused and thought it gave a 48 bit result but that is not the case
just one thing iiuc, you're talking about integer mode, version 2 is intended to be in fractional mode.
It's wrong .. totally wrong. I'll just remove the coldfire part.