| ============================================= |
| Snow Video Codec Specification Draft 20080110 |
| ============================================= |
| |
| Introduction: |
| ============= |
| This specification describes the Snow bitstream syntax and semantics as |
| well as the formal Snow decoding process. |
| |
| The decoding process is described precisely and any compliant decoder |
| MUST produce the exact same output for a spec-conformant Snow stream. |
| For encoding, though, any process which generates a stream compliant with |
| the syntactic and semantic requirements and which is decodable by |
| the process described in this spec shall be considered a conformant |
| Snow encoder. |
| |
| Definitions: |
| ============ |
| |
| MUST the specified part must be done this way to conform to this standard |
| SHOULD it is recommended to be done that way, but it is not strictly required |
| |
| ilog2(x) is the logarithm of x to base 2, rounded down |
| ilog2(0) = 0 |
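| |
| as an illustration, ilog2() could be implemented like this (a minimal |
| sketch, not part of the bitstream definition): |
| |
|     static int ilog2(unsigned x) |
|     { |
|         int n = 0; |
|         while (x > 1) {    /* floor(log2(x)) by repeated halving */ |
|             x >>= 1; |
|             n++; |
|         } |
|         return n;          /* ilog2(0) = ilog2(1) = 0 */ |
|     } |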
| |
| Type definitions: |
| ================= |
| |
| b 1-bit range coded |
| u unsigned scalar value range coded |
| s signed scalar value range coded |
| |
| |
| Bitstream syntax: |
| ================= |
| |
| frame: |
| header |
| prediction |
| residual |
| |
| header: |
| keyframe b MID_STATE |
| if(keyframe || always_reset) |
| reset_contexts |
| if(keyframe){ |
| version u header_state |
| always_reset b header_state |
| temporal_decomposition_type u header_state |
| temporal_decomposition_count u header_state |
| spatial_decomposition_count u header_state |
| colorspace_type u header_state |
| if (nb_planes > 2) { |
| chroma_h_shift u header_state |
| chroma_v_shift u header_state |
| } |
| spatial_scalability b header_state |
| max_ref_frames-1 u header_state |
| qlogs |
| } |
| if(!keyframe){ |
| update_mc b header_state |
| if(update_mc){ |
| for(plane=0; plane<nb_plane_types; plane++){ |
| diag_mc b header_state |
| htaps/2-1 u header_state |
| for(i= p->htaps/2; i; i--) |
| |hcoeff[i]| u header_state |
| } |
| } |
| update_qlogs b header_state |
| if(update_qlogs){ |
| spatial_decomposition_count u header_state |
| qlogs |
| } |
| } |
| |
| spatial_decomposition_type s header_state |
| qlog s header_state |
| mv_scale s header_state |
| qbias s header_state |
| block_max_depth s header_state |
| |
| qlogs: |
| for(plane=0; plane<nb_plane_types; plane++){ |
| quant_table[plane][0][0] s header_state |
| for(level=0; level < spatial_decomposition_count; level++){ |
| quant_table[plane][level][1] s header_state |
| quant_table[plane][level][3] s header_state |
| } |
| } |
| |
| reset_contexts: |
| *_state[*]= MID_STATE |
| |
| prediction: |
| for(y=0; y<block_count_vertical; y++) |
| for(x=0; x<block_count_horizontal; x++) |
| block(0) |
| |
| block(level): |
| mvx_diff=mvy_diff=y_diff=cb_diff=cr_diff=0 |
| if(keyframe){ |
| intra=1 |
| }else{ |
| if(level!=block_max_depth){ |
| s_context= 2*left->level + 2*top->level + topleft->level + topright->level |
| leaf b block_state[4 + s_context] |
| } |
| if(level==block_max_depth || leaf){ |
| intra b block_state[1 + left->intra + top->intra] |
| if(intra){ |
| y_diff s block_state[32] |
| cb_diff s block_state[64] |
| cr_diff s block_state[96] |
| }else{ |
| ref_context= ilog2(2*left->ref) + ilog2(2*top->ref) |
| if(ref_frames > 1) |
| ref u block_state[128 + 1024 + 32*ref_context] |
| mx_context= ilog2(2*abs(left->mx - top->mx)) |
| my_context= ilog2(2*abs(left->my - top->my)) |
| mvx_diff s block_state[128 + 32*(mx_context + 16*!!ref)] |
| mvy_diff s block_state[128 + 32*(my_context + 16*!!ref)] |
| } |
| }else{ |
| block(level+1) |
| block(level+1) |
| block(level+1) |
| block(level+1) |
| } |
| } |
| |
| |
| residual: |
| residual2(luma) |
| if (nb_planes > 2) { |
| residual2(chroma_cr) |
| residual2(chroma_cb) |
| } |
| |
| residual2: |
| for(level=0; level<spatial_decomposition_count; level++){ |
| if(level==0) |
| subband(LL, 0) |
| subband(HL, level) |
| subband(LH, level) |
| subband(HH, level) |
| } |
| |
| subband: |
| FIXME |
| |
| nb_plane_types = gray ? 1 : 2; |
| |
| Tag description: |
| ---------------- |
| |
| version |
| 0 |
| this MUST NOT change within a bitstream |
| |
| always_reset |
| if 1 then the range coder contexts will be reset after each frame |
| |
| temporal_decomposition_type |
| 0 |
| |
| temporal_decomposition_count |
| 0 |
| |
| spatial_decomposition_count |
| FIXME |
| |
| colorspace_type |
| 0 unspecified YCbCr |
| 1 Gray |
| 2 Gray + Alpha |
| 3 GBR |
| 4 GBRA |
| this MUST NOT change within a bitstream |
| |
| chroma_h_shift |
| log2(luma.width / chroma.width) |
| this MUST NOT change within a bitstream |
| |
| chroma_v_shift |
| log2(luma.height / chroma.height) |
| this MUST NOT change within a bitstream |
| |
| spatial_scalability |
| 0 |
| |
| max_ref_frames |
| maximum number of reference frames |
| this MUST NOT change within a bitstream |
| |
| update_mc |
| indicates that motion compensation filter parameters are stored in the |
| header |
| |
| diag_mc |
| flag to enable faster diagonal interpolation |
| this SHOULD be 1 unless it turns out to be covered by a valid patent |
| |
| htaps |
| number of half pel interpolation filter taps, MUST be even, >0 and <10 |
| |
| hcoeff |
| half pel interpolation filter coefficients; hcoeff[0] are the 2 middle |
| coefficients, hcoeff[1] the next outer pair and so on, resulting in a filter |
| like: ...hcoeff[2], hcoeff[1], hcoeff[0], hcoeff[0], hcoeff[1], hcoeff[2]... |
| the sign of the coefficients is not explicitly stored but alternates |
| after each coefficient, with hcoeff[0] being positive, so ...,+,-,+,-,+,+,-,+,-,+,... |
| hcoeff[0] is not explicitly stored either but found by subtracting the sum |
| of all stored coefficients (with their signs) from 32 |
| hcoeff[0]= 32 - hcoeff[1] - hcoeff[2] - ... |
| a good choice for hcoeff and htaps is |
| htaps= 6 |
| hcoeff={40,-10,2} |
| an alternative which requires more computations at both encoder and |
| decoder side and may or may not be better is |
| htaps= 8 |
| hcoeff={42,-14,6,-2} |
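| |
| a minimal sketch of how a decoder could rebuild the signed coefficients |
| from the stored magnitudes (mag[i] and n are illustrative names for the |
| decoded |hcoeff[i]| values and their count): |
| |
|     int sum = 0; |
|     for (i = n; i; i--) { |
|         hcoeff[i] = mag[i] * (1 - 2*(i & 1)); /* alternating signs, hcoeff[1] negative */ |
|         sum += hcoeff[i]; |
|     } |
|     hcoeff[0] = 32 - sum;                     /* hcoeff[0] is derived, not stored */ |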
| |
| |
| ref_frames |
| minimum of the number of available reference frames and max_ref_frames |
| for example the first frame after a key frame always has ref_frames=1 |
| |
| spatial_decomposition_type |
| wavelet type |
| 0 is a 9/7 symmetric compact integer wavelet |
| 1 is a 5/3 symmetric compact integer wavelet |
| others are reserved |
| stored as delta from last, last is reset to 0 if always_reset || keyframe |
| |
| qlog |
| quality (logarithmic quantizer scale) |
| stored as delta from last, last is reset to 0 if always_reset || keyframe |
| |
| mv_scale |
| stored as delta from last, last is reset to 0 if always_reset || keyframe |
| FIXME check that everything works fine if this changes between frames |
| |
| qbias |
| dequantization bias |
| stored as delta from last, last is reset to 0 if always_reset || keyframe |
| |
| block_max_depth |
| maximum depth of the block tree |
| stored as delta from last, last is reset to 0 if always_reset || keyframe |
| |
| quant_table |
| quantization table |
| |
| |
| Highlevel bitstream structure: |
| ============================== |
| -------------------------------------------- |
| | Header | |
| -------------------------------------------- |
| | ------------------------------------ | |
| | | Block0 | | |
| | | split? | | |
| | | yes no | | |
| | | ......... intra? | | |
| | | : Block01 : yes no | | |
| | | : Block02 : ....... .......... | | |
| | | : Block03 : : y DC : : ref index: | | |
| | | : Block04 : : cb DC : : motion x : | | |
| | | ......... : cr DC : : motion y : | | |
| | | ....... .......... | | |
| | ------------------------------------ | |
| | ------------------------------------ | |
| | | Block1 | | |
| | ... | |
| -------------------------------------------- |
| | ------------ ------------ ------------ | |
| || Y subbands | | Cb subbands| | Cr subbands|| |
| || --- --- | | --- --- | | --- --- || |
| || |LL0||HL0| | | |LL0||HL0| | | |LL0||HL0| || |
| || --- --- | | --- --- | | --- --- || |
| || --- --- | | --- --- | | --- --- || |
| || |LH0||HH0| | | |LH0||HH0| | | |LH0||HH0| || |
| || --- --- | | --- --- | | --- --- || |
| || --- --- | | --- --- | | --- --- || |
| || |HL1||LH1| | | |HL1||LH1| | | |HL1||LH1| || |
| || --- --- | | --- --- | | --- --- || |
| || --- --- | | --- --- | | --- --- || |
| || |HH1||HL2| | | |HH1||HL2| | | |HH1||HL2| || |
| || ... | | ... | | ... || |
| | ------------ ------------ ------------ | |
| -------------------------------------------- |
| |
| Decoding process: |
| ================= |
| |
| ------------ |
| | | |
| | Subbands | |
| ------------ | | |
| | | ------------ |
| | Intra DC | | |
| | | LL0 subband prediction |
| ------------ | |
| \ Dequantization |
| ------------------- \ | |
| | Reference frames | \ IDWT |
| | ------- ------- | Motion \ | |
| ||Frame 0| |Frame 1|| Compensation . OBMC v ------- |
| | ------- ------- | --------------. \------> + --->|Frame n|-->output |
| | ------- ------- | ------- |
| ||Frame 2| |Frame 3||<----------------------------------/ |
| | ... | |
| ------------------- |
| |
| |
| Range Coder: |
| ============ |
| |
| Binary Range Coder: |
| ------------------- |
| The implemented range coder is an adapted version based upon "Range encoding: |
| an algorithm for removing redundancy from a digitised message." by G. N. N. |
| Martin. |
| The symbols encoded by the Snow range coder are bits (0|1). The |
| associated probabilities are not fixed but change depending on the symbol |
| mix seen so far. |
| |
| |
| bit seen | new state |
| ---------+----------------------------------------------- |
| 0 | 256 - state_transition_table[256 - old_state]; |
| 1 | state_transition_table[ old_state]; |
| |
| state_transition_table = { |
| 0, 0, 0, 0, 0, 0, 0, 0, 20, 21, 22, 23, 24, 25, 26, 27, |
| 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 37, 38, 39, 40, 41, 42, |
| 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 56, 57, |
| 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, |
| 74, 75, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, |
| 89, 90, 91, 92, 93, 94, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, |
| 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 114, 115, 116, 117, 118, |
| 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 133, |
| 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, |
| 150, 151, 152, 152, 153, 154, 155, 156, 157, 158, 159, 160, 161, 162, 163, 164, |
| 165, 166, 167, 168, 169, 170, 171, 171, 172, 173, 174, 175, 176, 177, 178, 179, |
| 180, 181, 182, 183, 184, 185, 186, 187, 188, 189, 190, 190, 191, 192, 194, 194, |
| 195, 196, 197, 198, 199, 200, 201, 202, 202, 204, 205, 206, 207, 208, 209, 209, |
| 210, 211, 212, 213, 215, 215, 216, 217, 218, 219, 220, 220, 222, 223, 224, 225, |
| 226, 227, 227, 229, 229, 230, 231, 232, 234, 234, 235, 236, 237, 238, 239, 240, |
| 241, 242, 243, 244, 245, 246, 247, 248, 248, 0, 0, 0, 0, 0, 0, 0}; |
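| |
| a minimal sketch of the adaptation rule above, applied after each decoded |
| bit (the sketch assumes the state is non-zero, so 256 - old_state stays |
| within the table): |
| |
|     static int update_state(int old_state, int bit) |
|     { |
|         if (bit) |
|             return state_transition_table[old_state]; |
|         else |
|             return 256 - state_transition_table[256 - old_state]; |
|     } |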
| |
| FIXME |
| |
| |
| Range Coding of integers: |
| ------------------------- |
| FIXME |
| |
| |
| Neighboring Blocks: |
| =================== |
| left and top are set to the respective neighboring blocks unless they are |
| outside of the image, in which case they are set to the Null block |
| |
| top-left is set to the top left block unless it is outside of the image, in |
| which case it is set to the left block |
| |
| if (this block has no larger parent block or it is at the left side of its |
| parent block) and the top right block is not outside of the image, then the |
| top right block is used for top-right, otherwise the top-left block is used |
| |
| Null block |
| y,cb,cr are 128 |
| level, ref, mx and my are 0 |
| |
| |
| Motion Vector Prediction: |
| ========================= |
| 1. the motion vectors of all the neighboring blocks are scaled to |
| compensate for the difference of reference frames |
| |
| scaled_mv= (mv * (256 * (current_reference+1) / (mv.reference+1)) + 128)>>8 |
| |
| 2. the median of the scaled left, top and top-right vectors is used as |
| motion vector prediction |
| |
| 3. the used motion vector is the sum of the predictor and |
| (mvx_diff, mvy_diff)*mv_scale |
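| |
| a minimal sketch of the three steps, assuming left, top and topright are |
| the neighboring blocks as defined in the previous chapter, ref is the |
| reference index of the current block and median3() returns the middle one |
| of its 3 arguments: |
| |
|     static int scale_mv(int mv, int mv_ref, int cur_ref) |
|     { |
|         return (mv * (256 * (cur_ref + 1) / (mv_ref + 1)) + 128) >> 8; |
|     } |
| |
|     pred_mx = median3(scale_mv(left->mx,     left->ref,     ref), |
|                       scale_mv(top->mx,      top->ref,      ref), |
|                       scale_mv(topright->mx, topright->ref, ref)); |
|     pred_my = median3(scale_mv(left->my,     left->ref,     ref), |
|                       scale_mv(top->my,      top->ref,      ref), |
|                       scale_mv(topright->my, topright->ref, ref)); |
|     mx = pred_mx + mvx_diff * mv_scale; |
|     my = pred_my + mvy_diff * mv_scale; |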
| |
| |
| Intra DC Prediction: |
| ==================== |
| the luma and chroma values of the left block are used as predictors |
| |
| the used luma and chroma values are the sums of the predictor and y_diff, |
| cb_diff and cr_diff respectively |
| to reverse this in the decoder apply the following: |
| block[y][x].dc[0] = block[y][x-1].dc[0] + y_diff; |
| block[y][x].dc[1] = block[y][x-1].dc[1] + cb_diff; |
| block[y][x].dc[2] = block[y][x-1].dc[2] + cr_diff; |
| block[*][-1].dc[*]= 128; |
| |
| |
| Motion Compensation: |
| ==================== |
| |
| Halfpel interpolation: |
| ---------------------- |
| Halfpel interpolation is done by convolution with the halfpel filter stored |
| in the header: |
| |
| horizontal halfpel samples are found by |
| H1[y][x] = hcoeff[0]*(F[y][x ] + F[y][x+1]) |
| + hcoeff[1]*(F[y][x-1] + F[y][x+2]) |
| + hcoeff[2]*(F[y][x-2] + F[y][x+3]) |
| + ... |
| h1[y][x] = (H1[y][x] + 32)>>6; |
| |
| vertical halfpel samples are found by |
| H2[y][x] = hcoeff[0]*(F[y ][x] + F[y+1][x]) |
| + hcoeff[1]*(F[y-1][x] + F[y+2][x]) |
| + ... |
| h2[y][x] = (H2[y][x] + 32)>>6; |
| |
| vertical+horizontal halfpel samples are found by either of the following |
| two equivalent expressions |
| H3[y][x] = hcoeff[0]*(H2[y][x ] + H2[y][x+1]) |
| + hcoeff[1]*(H2[y][x-1] + H2[y][x+2]) |
| + ... |
| H3[y][x] = hcoeff[0]*(H1[y ][x] + H1[y+1][x]) |
| + hcoeff[1]*(H1[y-1][x] + H1[y+2][x]) |
| + ... |
| h3[y][x] = (H3[y][x] + 2048)>>12; |
| |
| |
| F H1 F |
| | | | |
| | | | |
| | | | |
| F H1 F |
| | | | |
| | | | |
| | | | |
| F-------F-------F-> H1<-F-------F-------F |
| v v v |
| H2 H3 H2 |
| ^ ^ ^ |
| F-------F-------F-> H1<-F-------F-------F |
| | | | |
| | | | |
| | | | |
| F H1 F |
| | | | |
| | | | |
| | | | |
| F H1 F |
| |
| |
| unavailable fullpel samples (outside the picture for example) shall be equal |
| to the closest available fullpel sample |
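| |
| a minimal sketch of the horizontal halfpel case, assuming F points to the |
| fullpel reference plane (width x height samples, stride samples per line) |
| and hcoeff[]/htaps come from the header; the clipping of the x indices |
| implements the edge rule above: |
| |
|     static int clip(int v, int min, int max) |
|     { |
|         return v < min ? min : v > max ? max : v; |
|     } |
| |
|     int halfpel_h(const uint8_t *F, int stride, int width, int x, int y, |
|                   const int *hcoeff, int htaps) |
|     { |
|         int i, H1 = 0; |
|         for (i = 0; i < htaps/2; i++) { |
|             int x0 = clip(x     - i, 0, width - 1); |
|             int x1 = clip(x + 1 + i, 0, width - 1); |
|             H1 += hcoeff[i] * (F[y*stride + x0] + F[y*stride + x1]); |
|         } |
|         return (H1 + 32) >> 6;             /* h1[y][x] */ |
|     } |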
| |
| |
| Smaller pel interpolation: |
| -------------------------- |
| if diag_mc is set then points which lie on a line between 2 vertically, |
| horizontally or diagonally adjacent halfpel points shall be interpolated |
| linearly with rounding to nearest and halfway values rounded up. |
| points which lie on 2 diagonals at the same time should only use the one |
| diagonal not containing the fullpel point |
| |
| |
| |
| F-->O---q---O<--h1->O---q---O<--F |
| v \ / v \ / v |
| O O O O O O O |
| | / | \ | |
| q q q q q |
| | / | \ | |
| O O O O O O O |
| ^ / \ ^ / \ ^ |
| h2-->O---q---O<--h3->O---q---O<--h2 |
| v \ / v \ / v |
| O O O O O O O |
| | \ | / | |
| q q q q q |
| | \ | / | |
| O O O O O O O |
| ^ / \ ^ / \ ^ |
| F-->O---q---O<--h1->O---q---O<--F |
| |
| |
| |
| the remaining points shall be bilinearly interpolated from up to 4 |
| surrounding halfpel and fullpel points; again, rounding shall be to |
| nearest with halfway values rounded up |
| |
| compliant Snow decoders MUST support 1-1/8 pel luma and 1/2-1/16 pel chroma |
| interpolation at least |
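| |
| purely as an illustration of this rounding convention (not of the exact |
| sample positions), a linear interpolation of two non-negative values a and |
| b at position frac/den (den a power of two) that rounds to nearest with |
| halfway values rounded up could look like: |
| |
|     static int lerp(int a, int b, int frac, int den) |
|     { |
|         return (a * (den - frac) + b * frac + den/2) / den; |
|     } |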
| |
| |
| Overlapped block motion compensation: |
| ------------------------------------- |
| FIXME |
| |
| LL band prediction: |
| =================== |
| Each sample in the LL0 subband is predicted by the median of the left, top and |
| left+top-topleft samples; samples outside the subband shall be considered to |
| be 0. To reverse this prediction in the decoder apply the following: |
| for(y=0; y<height; y++){ |
| for(x=0; x<width; x++){ |
| sample[y][x] += median(sample[y-1][x], |
| sample[y][x-1], |
| sample[y-1][x]+sample[y][x-1]-sample[y-1][x-1]); |
| } |
| } |
| sample[-1][*]=sample[*][-1]= 0; |
| width and height here are the width and height of the LL0 subband, not of |
| the final video |
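| |
| median() above is the normal 3 value median; a minimal sketch: |
| |
|     static int median(int a, int b, int c) |
|     { |
|         if (a > b) { int t = a; a = b; b = t; } |
|         return b < c ? b : (a > c ? a : c); |
|     } |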
| |
| |
| Dequantization: |
| =============== |
| FIXME |
| |
| Wavelet Transform: |
| ================== |
| |
| Snow supports 2 wavelet transforms: the symmetric biorthogonal 5/3 integer |
| transform and an integer approximation of the symmetric biorthogonal 9/7 |
| Daubechies wavelet. |
| |
| 2D IDWT (inverse discrete wavelet transform) |
| -------------------------------------------- |
| The 2D IDWT applies a 2D filter recursively, each time combining the |
| 4 lowest frequency subbands into a single subband until only 1 subband |
| remains. |
| The 2D filter is done by first applying a 1D filter in the vertical direction |
| and then applying it in the horizontal one. |
| --------------- --------------- --------------- --------------- |
| |LL0|HL0| | | | | | | | | | | | |
| |---+---| HL1 | | L0|H0 | HL1 | | LL1 | HL1 | | | | |
| |LH0|HH0| | | | | | | | | | | | |
| |-------+-------|->|-------+-------|->|-------+-------|->| L1 | H1 |->... |
| | | | | | | | | | | | | |
| | LH1 | HH1 | | LH1 | HH1 | | LH1 | HH1 | | | | |
| | | | | | | | | | | | | |
| --------------- --------------- --------------- --------------- |
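| |
| a minimal sketch of the recursion, assuming idwt_1d_vertical() and |
| idwt_1d_horizontal() (hypothetical helpers) apply the 1D filter from the |
| next chapter to every column respectively every row of the given area: |
| |
|     for (i = spatial_decomposition_count - 1; i >= 0; i--) { |
|         int w = width  >> i;  /* the first pass covers only the area of   */ |
|         int h = height >> i;  /* the 4 lowest frequency subbands, the     */ |
|                               /* last pass covers the full plane          */ |
|         idwt_1d_vertical  (buffer, w, h);  /* first vertically ...        */ |
|         idwt_1d_horizontal(buffer, w, h);  /* ... then horizontally       */ |
|     } |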
| |
| |
| 1D Filter: |
| ---------- |
| 1. interleave the samples of the low and high frequency subbands like |
| s={L0, H0, L1, H1, L2, H2, L3, H3, ... } |
| note that this can end with an L or an H; the number of elements shall be w |
| s[-1] shall be considered equivalent to s[1 ] |
| s[w ] shall be considered equivalent to s[w-2] |
| |
| 2. perform the lifting steps in order as described below |
| |
| 5/3 Integer filter: |
| 1. s[i] -= (s[i-1] + s[i+1] + 2)>>2; for all even i < w |
| 2. s[i] += (s[i-1] + s[i+1] )>>1; for all odd i < w |
| |
| \ | /|\ | /|\ | /|\ | /|\ |
| \|/ | \|/ | \|/ | \|/ | |
| + | + | + | + | -1/4 |
| /|\ | /|\ | /|\ | /|\ | |
| / | \|/ | \|/ | \|/ | \|/ |
| | + | + | + | + +1/2 |
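| |
| a minimal sketch of the 5/3 inverse 1D filter on one line of w interleaved |
| samples s[] (mirror() implements the boundary rule above, the names are |
| illustrative): |
| |
|     static int mirror(int i, int w) |
|     { |
|         if (i < 0)  return -i;           /* s[-1] -> s[1]   */ |
|         if (i >= w) return 2*w - 2 - i;  /* s[w]  -> s[w-2] */ |
|         return i; |
|     } |
| |
|     static void idwt53_1d(int *s, int w) |
|     { |
|         int i; |
|         for (i = 0; i < w; i += 2)       /* lifting step 1, even samples */ |
|             s[i] -= (s[mirror(i-1, w)] + s[mirror(i+1, w)] + 2) >> 2; |
|         for (i = 1; i < w; i += 2)       /* lifting step 2, odd samples  */ |
|             s[i] += (s[mirror(i-1, w)] + s[mirror(i+1, w)]    ) >> 1; |
|     } |
| |
| the 9/7 filter below follows the same pattern with its 4 lifting steps |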
| |
| |
| Snow's 9/7 Integer filter: |
| 1. s[i] -= (3*(s[i-1] + s[i+1]) + 4)>>3; for all even i < w |
| 2. s[i] -= s[i-1] + s[i+1] ; for all odd i < w |
| 3. s[i] += ( s[i-1] + s[i+1] + 4*s[i] + 8)>>4; for all even i < w |
| 4. s[i] += (3*(s[i-1] + s[i+1]) )>>1; for all odd i < w |
| |
| \ | /|\ | /|\ | /|\ | /|\ |
| \|/ | \|/ | \|/ | \|/ | |
| + | + | + | + | -3/8 |
| /|\ | /|\ | /|\ | /|\ | |
| / | \|/ | \|/ | \|/ | \|/ |
| (| + (| + (| + (| + -1 |
| \ + /|\ + /|\ + /|\ + /|\ +1/4 |
| \|/ | \|/ | \|/ | \|/ | |
| + | + | + | + | +1/16 |
| /|\ | /|\ | /|\ | /|\ | |
| / | \|/ | \|/ | \|/ | \|/ |
| | + | + | + | + +3/2 |
| |
| optimization tips: |
| the following pairs of expressions are exactly identical |
| (3a)>>1 == a + (a>>1) |
| (a + 4b + 8)>>4 == ((a>>2) + b + 2)>>2 |
| |
| 16bit implementation note: |
| The IDWT can be implemented with 16 bits, but this requires some care to |
| prevent overflows; the following list gives the minimum number of bits needed |
| for some terms |
| 1. lifting step |
| A= s[i-1] + s[i+1] 16bit |
| 3*A + 4 18bit |
| A + (A>>1) + 2 17bit |
| |
| 3. lifting step |
| s[i-1] + s[i+1] 17bit |
| |
| 4. lifting step |
| 3*(s[i-1] + s[i+1]) 17bit |
| |
| |
| TODO: |
| ===== |
| Important: |
| finetune initial contexts |
| flip wavelet? |
| try to use the wavelet transformed predicted image (motion compensated image) as context for coding the residual coefficients |
| try the MV length as context for coding the residual coefficients |
| use extradata for stuff which is in the keyframes now? |
| implement per picture halfpel interpolation |
| try different range coder state transition tables for different contexts |
| |
| Not Important: |
| compare the 6 tap and 8 tap hpel filters (psnr/bitrate and subjective quality) |
| spatial_scalability b vs u (!= 0 breaks syntax anyway so we can add a u later) |
| |
| |
| Credits: |
| ======== |
| Michael Niedermayer |
| Loren Merritt |
| |
| |
| Copyright: |
| ========== |
| GPL + GFDL + whatever is needed to make this an RFC |