A dissertation submitted to the Department of Computer Science, Faculty of Science at the University of Cape Town in partial fulfilment of the ...
ITU-T Series H Supplement 1 describes the factors to be taken into account when low bit-rate video is used for Sign Language and lip-reading telecommunications. The document sets out performance requirements that should be met to ensure a successful person-to-person conversation using a video communication system. In setting the requirements, video compression is ignored and the focus is on resolution and frame rate. These requirements should not, however, be taken as fixed and absolute; depending on the situation they may need to be more stringent or more relaxed.
The document shows that 20 frames per second provides good usability for both sign language and lip-reading, and that video remains understandable at 12 frames per second. Between 8 and 12 frames per second usability becomes very limited, and below 8 frames per second the video has no practical usefulness.
When looking at resolution for person-to-person sign language video communication, Supplement 1 concludes that it is possible to use Quarter Common Intermediate Format (QCIF) (176 x 144 pixels) resolution, with an increase to CIF (352 x 288 pixels) giving better language perception. Sub Quarter Common Intermediate Format (SQCIF) (112 x 96 pixels) is too coarse for reliable perception, with some signs occasionally perceivable.
The application profile concludes with the basic performance goal of 25-30 frames per second at CIF (352 x 288 pixels) resolution, dropping, if needed in very low bit-rate environments, to 12-15 frames per second at a resolution of 176 x 144 pixels.
2.6.2 Subjective and Objective evaluation of video quality

When looking at Sign Language communications over limited-bandwidth communications channels such as the cellular telephone network, an appropriate quality measurement is needed to compare different video parameters. In a subjective evaluation, video sequences are shown to a group of viewers. The viewers' opinions of the video material are then captured, assigned a numeric value and averaged to provide a quality measurement for the video sequence. The details of the testing can vary depending on the objective of the testing and the aspect of the video being evaluated.
Objective video quality metrics are mathematical models that approximate the results of subjective quality assessments as closely as possible. Metrics such as mean square error (MSE) and peak signal-to-noise ratio (PSNR) are the most widely used objective measures for evaluating video. These techniques, however, focus on traditional quality in terms of aesthetics. More recent objective quality measures, modelling the human visual system, have shown substantial improvements over MSE and PSNR in predicting aesthetic quality. But as Ciaramello et al. state, sign language video is a communications tool, and its quality must be judged in terms of intelligibility.
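PSNR is derived directly from the MSE between a reference frame and a degraded frame. A minimal sketch in Python (using NumPy; the peak value of 255 assumes 8-bit samples):

```python
import numpy as np

def mse(reference: np.ndarray, degraded: np.ndarray) -> float:
    """Mean square error between two frames of identical shape."""
    diff = reference.astype(np.float64) - degraded.astype(np.float64)
    return float(np.mean(diff ** 2))

def psnr(reference: np.ndarray, degraded: np.ndarray, peak: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB; infinite for identical frames."""
    error = mse(reference, degraded)
    if error == 0.0:
        return float("inf")
    return 10.0 * np.log10(peak ** 2 / error)
```

Note that PSNR weights every pixel equally, which is exactly why it correlates poorly with sign language intelligibility: errors in the hands and face matter far more than errors in the background.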
Ciaramello et al. demonstrated that PSNR is not a good measure of intelligibility in Sign Language video material, and proceeded to propose and evaluate a metric based on the spatial structure of ASL, computed as a function of the MSE in both the hands and the face. The proposed metric gave a substantial improvement over PSNR.
The user experience of MobileASL was evaluated in a laboratory setting, with both subjective and objective measures. The subjective measurements were done in a conversational setting, with two participants conversing in Sign Language using cellphones. The quality of the video was measured subjectively by how hard or easy it was to understand, captured through a 5-question questionnaire. The survey questions were the following:
1. During the video, how often did you have to guess what the signer was saying (where 1 is never and 5 is all the time)?
2. How difficult would you say it was to comprehend the video (where 1 is very easy and 5 is very difficult)?
3. Changing the frame rate of the video can be distracting. How would you rate the annoyance level of the video (where 1 is not annoying at all and 5 is extremely annoying)?
4. The video quality over a cell phone is not as good as video quality when communicating via the Internet (e.g., by using a web cam) or over a set top box. However, cell phones are convenient since they are mobile. Given the quality of conversation you just experienced, how often would you use the mobile phone for making video calls versus just using your regular version of communication (e.g., go home to use the Internet or set top box, or just text)?
5. If video of this quality were available on the cell phone, would you use it?
The objective measure of video quality was a count of the number of repair requests, the number of times each requester asked for a repeat, and the number of conversational breakdowns. These were all calculated from the videotaped user study sessions, during which participants were having conversations using phones set on a table in front of them.
As Nakazono et al. state, in evaluating Sign Language video we must evaluate how well the linguistic information is transmitted, and should be careful not to be swayed by the impression made by the appearance of the video. They used two kinds of evaluations: the intelligibility test and the opinion test.
In the intelligibility test a short video sequence of sign language is presented to subjects, who are instructed to write down the contents of the sentences. The dictated sentences are then scored from 0 to 3, taking care not to be affected by differences in the subjects' ability in written language.
In the opinion test a short video sequence of sign language is presented to subjects, who are asked to evaluate the intelligibility of the sign language at five levels, from 1 to 5; the mean score is used as the evaluated value for the data. In this study subjects were asked to evaluate the intelligibility of the sign video, not their preference regarding picture quality.
Ciaramello et al. used a four-question, multiple-choice survey given on a computer at the end of each video in their subjective sign language video evaluation. The first question, "What was the name of the main character in the story?", was asked to encourage the participants to pay close attention to the contents of the video, and was not used in any statistical tabulation. The second question was "How difficult would you say it was to comprehend the video?", with five possible answers: very easy (1.00), easy (0.75), neither easy nor difficult (0.50), difficult (0.25) and very difficult (0.00). The third question asked "How would you rate the annoyance level of the video?", this time with four possible answers: not at all annoying (1.00), a little annoying (0.66), somewhat annoying (0.33) and extremely annoying (0.00). The fourth question asked if the participant would use a video cell phone at this video quality. The subjective intelligibility and annoyance ratings for each video were calculated by averaging the participants' answers to the second and third questions.
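The per-video aggregation described above amounts to mapping each categorical answer to its numeric value and averaging over participants. A small illustrative sketch (the answer wordings and values are those stated above; the function name is hypothetical):

```python
# Numeric values assigned to each answer of the comprehension question.
COMPREHENSION = {
    "very easy": 1.00, "easy": 0.75, "neither easy nor difficult": 0.50,
    "difficult": 0.25, "very difficult": 0.00,
}
# Numeric values for the four-point annoyance question.
ANNOYANCE = {
    "not at all annoying": 1.00, "a little annoying": 0.66,
    "somewhat annoying": 0.33, "extremely annoying": 0.00,
}

def video_rating(answers: list, scale: dict) -> float:
    """Average the participants' answers for one video on one scale."""
    return sum(scale[a] for a in answers) / len(answers)
```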
2.7 Summary

Sign Language, being a visual language that conveys meaning through a combination of hand shapes, movement of the hands and arms, and facial expressions, requires a visual telecommunication channel, making video the only appropriate means of first-language telecommunications for the Deaf community.
Video quality can be evaluated either subjectively, capturing viewers' opinions of the video material, or objectively, using mathematical analysis of the video. Objective evaluations, although good at predicting perceived quality in terms of aesthetics, are less applicable to quantifying the intelligibility of video material, since intelligibility involves far more than whether the video looks good. In addition, Sign Language is not a single language but has many variations across the world, as well as different dialects within the same sign language, such as SASL.
Video communications using mobile phones presents three main challenges: low bandwidth, low processing speed and limited battery life. In an attempt to overcome these challenges, Sign Language specific video compression techniques have been investigated, but these techniques rely on modified versions of the standard video encoders to provide better compression, which is not possible to implement on all phones, especially at the lower end of the market (the target audience of this research).
This research is not focused on video compression schemes, but on the effect of the reduction of video resolution and frame rate on the intelligibility of video containing SASL. The objective is to evaluate the intelligibility of the sign language video, not the picture quality of the video.
3 Pilot user study (Experiment 1)

Based on the ITU requirements and limitations (see Section 2.8.1), and on the aim of subjective evaluation of Sign Language video on a cell phone, a pilot study was conducted to validate the questionnaire with the Deaf participants for evaluating the intelligibility of SASL video on a cell phone (see Appendix A).
3.1 Aim

The pilot user study aimed to validate the questionnaire with the Deaf participants for evaluating the intelligibility of SASL video on a cell phone, and to uncover any problems with the planned experimental setup. Reducing the video resolution and frame rate is the simplest way to reduce video file size, and thus the amount of data that must be transferred over the cell phone network. This experiment only looked at the impact of video resolution and frame rate, with compression constrained to 256 kbps in all of the test videos.
3.2 Background

The size of a video file is determined by three basic settings: the video resolution (spatial resolution), the video frame rate (temporal resolution) and how the video has been compressed.
3.2.1 Video Resolution

Video resolution is the size (width and height) of the frames in the video. The lower the resolution, the less detail in the video content and the less storage is needed per video frame.
This experiment will be looking at two resolutions, namely:
320 x 240 (Quarter Video Graphics Array (QVGA))
176 x 144 (3GP)

The resolution of 352 x 288, although an industry standard for video compression and used for capturing video on cell phones, is higher than the physical screens of the cell phones used in this study can display, and was for this reason dropped from the study. Going above 320 x 240 would have been desirable, but the capabilities of cheaper cell phones made this infeasible.
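The two test resolutions differ substantially in raw pixel count, and hence in per-frame storage before compression. A quick comparison (assuming 12 bits per pixel for YUV 4:2:0, a common capture format; this bit depth is an illustrative assumption, not stated in the study):

```python
def raw_frame_bytes(width: int, height: int, bits_per_pixel: int = 12) -> int:
    """Uncompressed size of one video frame in bytes (YUV 4:2:0 assumed)."""
    return width * height * bits_per_pixel // 8

qvga = raw_frame_bytes(320, 240)  # 320 x 240 (QVGA): 115200 bytes per frame
qcif = raw_frame_bytes(176, 144)  # 176 x 144 (3GP):   38016 bytes per frame
```

Per frame, QVGA thus carries roughly three times the raw data of the 176 x 144 resolution before any compression is applied.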
3.2.2 Video Frame Rate

The video frame rate is the number of frames of video stored and displayed per second. The lower the frame rate, the less storage is needed per second of video, but at lower frame rates less detail is visible in objects in motion and blurring of the image starts occurring, which can become a problem especially in Sign Language.
This experiment will be looking at the following three frame rate values:
30 frames per second
15 frames per second
10 frames per second

3.2.3 Video Compression

Video compression processes the frames of the video, at the given resolution and frame rate, to further reduce the amount of storage required by the video. The size reduction and resulting quality of the final video depend not only on which video compression algorithm is used, but also on which compression and quality settings are used. In general, the more the video is compressed, the lower the quality and the smaller the file size.
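When the encoder is held to a fixed target bit rate, as in this experiment, the compressed file size depends essentially on bit rate and duration alone, while resolution and frame rate determine how much quality the encoder can preserve within that budget. A rough size estimate (the 256 kbps figure is the constraint used for the test videos; clip duration is an illustrative value):

```python
def video_size_bytes(bitrate_kbps: float, duration_s: float) -> float:
    """Approximate compressed file size: bit rate (kilobits/s) x duration."""
    return bitrate_kbps * 1000 * duration_s / 8

# A hypothetical 10-second clip at the experiment's 256 kbps budget:
size = video_size_bytes(256, 10)  # 320000 bytes, i.e. roughly 313 KB
```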
In this experiment video compression was kept to a minimum and consistent across the twelve video clips, so that only the impact of resolution and frame rate on the size and intelligibility of the video would be seen.
3.3 Procedure

3.3.1 Participants

Five adult members of the Deaf community (five men, no women), ranging in age from 33 to 46 (mean = 36), participated in this study. All were native signers and had used SASL as their principal mode of communication all their lives. The five participants were all staff members of DCCT, and had English as their language of literacy, regardless of what their hearing families used.
All participants were introduced to the experiment and each signed a consent form to confirm that they fully understood the project, agreed to participate and understood that all information provided would be kept confidential.
3.3.2 Experimental setup

The participants were gathered in a high-ceilinged, open room with fluorescent lighting and windows on one side. They were seated at desks arranged in a half circle, two participants to a desk. In front of each participant was a pack of 12 questionnaires numbered A1-A12, B1-B12, and so forth, a pen, and a Nokia N96 cell phone preloaded with the correspondingly numbered video clips.
All communications between the researcher and participants were interpreted by a certified SASL interpreter who was known to the participants. Although the questionnaires were explained in SASL and all queries were answered through the SASL interpreter, the questionnaires were provided in written English and answered in written English.
The participants were introduced to the experiment with the help of the SASL interpreter. It was made clear during the introduction that the focus of the experiment was on evaluating the quality of the video clips and the intelligibility of the SASL in the video clips at different quality settings, and not to evaluate the participants’ proficiency in SASL.
Since written/spoken English is not the participants' first language, and the questionnaire required the participants to write down what they understood the Sign Language video clip to contain, all participants were asked if they were comfortable writing their answers out. They were given the option of giving their responses to the questionnaire through the interpreter. None of the participants took this option; all indicated that they were comfortable reading the questionnaire and writing down their responses in English.
The participants were asked to view each video clip only once and then complete the questionnaire for that clip, rating its intelligibility without reviewing the clip. This was done to capture the participant's initial response to the video clip, and to prevent participants from re-watching sections of the clip that were unclear; any unclear sections should instead be reflected in the answers for that clip.