Topics:

http://research.microsoft.com/~horvitz/cacm-attention.htm
http://research.microsoft.com/~horvitz/seer.HTM

  - Motivation
  - (previous approaches and their drawbacks)
  - LHMMs
  - Description of experiments
  - Results
  - further work
    - AHMM, the two papers on it.


Questions:
  - LHMM == HHMM?
    Answer: LHMMs and HHMMs are not the same thing: in LHMMs, the
    features first pass through all the HMMs of the lowest layer. The
    input to the next-higher layer is then the vector of
    classification results, so the HMMs are essentially separate from
    one another. In HHMMs, by contrast, you have a single network of
    states, which can in turn themselves be HMMs.
  - What is a binaural microphone?
    - stereo microphone
  - LHMMs formally
    - yes, but not too much detail
  - Explain the Baum-Welch algorithm? Viterbi? Name them, and explain
    them a bit.
  - maxbelief <-> distributional?
    maxbelief: next layer gets the index of the HMM with the maximum
    log-likelihood
    distributional: next layer gets the full vector of
    log-likelihoods
  - audio: linear predictive coding coefficients?
    standard procedure in signal processing
  - video: why skin color in addition to face pixels?
    ...
  - talk about security?
    maybe
  - what exactly is seer? self-developed?
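For the maxbelief vs. distributional question, a tiny Python sketch of the difference (the model names and log-likelihood values are made up for illustration):

```python
import math

# Hypothetical per-HMM log-likelihoods of the current observation window.
log_likelihoods = {"speech": -12.3, "music": -45.1, "silence": -30.7}

# maxbelief: the next layer receives only the label of the HMM with the
# maximum log-likelihood.
maxbelief_output = max(log_likelihoods, key=log_likelihoods.get)

# distributional: the next layer receives the full likelihood vector
# (normalized here for readability).
total = sum(math.exp(v) for v in log_likelihoods.values())
distributional_output = {k: math.exp(v) / total for k, v in log_likelihoods.items()}

print(maxbelief_output)
print(distributional_output)
```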

Slides:

| --
| 
|   Title slide
| 
| --
| 
|   Structure of this talk:
| 
| --
| 
|   Motivation: Why human action recognition?
| 
|   (maybe two or three slides)
| 
|   * automated surveillance
|     - elderly and ill persons
|     - children
|   * multimodal human-computer interaction
|     (- things can only get better than they are)
|     - more natural communication
|     - new kinds of services
| 
| --
| 
|   Aim
| 
|   Improve (or introduce) 'context-awareness' of computers
| 
|   context == identity, location, intentions, recent activities
| 
| --
| 
|   Action Recognition
| 
|   Not simple movements (like waving a hand or a pointing gesture), but
|   more complex actions (like talking on the phone or having a
|   face-to-face discussion).
| 
|   We want this in real time.
| 
|   Reference: Recognizing American Sign Language
| 
| --
| 
|   Experiment Setup
| 
|   Binaural microphones (two of them; capture ambient noise; used for
|     sound classification and localization)
|   USB camera (30 fps; used to determine the number of persons present
|     in the scene)
|   Keyboard and mouse (keep a history of keyboard and mouse events
|     during the last 5 seconds)
| 
| --
| 
|   Classify office situations into the following:
| 
|   * Phone conversation
|   * Face-to-face conversation
|   * Ongoing presentation
|   * Distant conversation
|   * Nobody present
|   * User present and engaged in some other activity
| 
|   proposed by [...] as indicators for a person's availability
| 
| --
| 
|   Which kind of probabilistic model?
| 
|   Many past works successfully used HMMs or extensions of them, so
|   the authors took a similar approach.
| 
|   Other probabilistic models have been used as well, such as a
|   Bayesian approach (see American Sign Language) or stochastic
|   context-free grammars. (Also: Bayesian networks, see beginning of
|   second page)
| 
| --
| 
|   Repetition: Hidden Markov Model (HMM)
| 
|   * set of states, initial state
|   * state transition probability distribution
|   * output probability distribution for each state
| 
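Possible snippet to have ready for this slide: a minimal discrete-observation HMM (toy parameters, not from the paper) plus the forward algorithm, which evaluates P(observations | model):

```python
# Toy two-state HMM.
states = [0, 1]
pi = [0.6, 0.4]                    # initial state distribution
A = [[0.7, 0.3], [0.4, 0.6]]       # state transition probabilities
B = [[0.9, 0.1], [0.2, 0.8]]       # output probabilities per state

def forward(obs):
    """Evaluate the model: sum the probability of obs over all state paths."""
    alpha = [pi[s] * B[s][obs[0]] for s in states]
    for o in obs[1:]:
        alpha = [sum(alpha[p] * A[p][s] for p in states) * B[s][o]
                 for s in states]
    return sum(alpha)

print(forward([0, 1, 0]))
```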
| --
| 
|   Repetition: Hidden Markov Model (HMM) (2)
| 
|   Use it to:
|   * Evaluate (forward algorithm)
|   * Decode (Viterbi algorithm)
|   * Train (Baum-Welch algorithm)
| 
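For the decoding bullet, a matching Viterbi sketch over the same kind of toy two-state HMM (illustrative parameters only): it recovers the single most likely hidden-state sequence.

```python
pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.9, 0.1], [0.2, 0.8]]

def viterbi(obs):
    n = len(pi)
    delta = [pi[s] * B[s][obs[0]] for s in range(n)]   # best path prob so far
    back = []                                          # backpointers per step
    for o in obs[1:]:
        prev = delta
        back.append([max(range(n), key=lambda p: prev[p] * A[p][s])
                     for s in range(n)])
        delta = [max(prev[p] * A[p][s] for p in range(n)) * B[s][o]
                 for s in range(n)]
    path = [max(range(n), key=lambda s: delta[s])]     # best final state
    for ptr in reversed(back):                         # backtrack
        path.append(ptr[path[-1]])
    return list(reversed(path))

print(viterbi([0, 0, 1, 1]))
```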
| --
| 
|   Repetition: Baum-Welch-Algorithm
| 
|   (maybe later, see "learning seer")
| 
| --
| 
|   Drawbacks of HMMs
| 
|   (see last few sentences on the first page)
| 
|   large parameter space needed for training, classification
|     accuracies too low (last paragraph in the first column of second
|     page)
| 
|   new training was required when moving the system to another place
| 
| --
| 
|   Something better needed
| 
|   robust to changes of lighting and acoustics
| 
|   "search for a representation that maps more naturally to the
|   problem space" (psychologists have found that many human behaviors
|   are hierarchically structured)
| 
|   => multilevel representation needed, for explanations at multiple
|     temporal granularities
| 
| --
| 
|   Layered HMMs (probably more than one slide)
| 
|   * layer architecture where each layer consists of a set of HMMs
|   * each layer is connected to the next one via its 'inferential'
|     results
|   * every layer for a different temporal granularity
| 
|   Formally... (second page, second column, middle)
| 
|   * each HMM is trained with the Baum-Welch algorithm
|   * each layer can be trained, and can run inference, independently
|     - the lowest layer can be retrained when moving to a new office
| 
|   * two approaches to inference with LHMMs:
|     maxbelief and distributional (...describe them in detail...)
|     see beginning of page three
| 
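Possible snippet for this slide: a sketch of the layered idea, where each layer runs a bank of independently trained HMMs and passes only its classification results upward. The bank contents below are stand-in scoring functions with made-up values, not real HMMs.

```python
def run_bank(bank, features):
    """Score a feature window with every model in a bank; return the
    vector of log-likelihoods (the 'distributional' variant)."""
    return {name: score(features) for name, score in bank.items()}

# Hypothetical layer-1 banks with fixed illustrative scores.
audio_bank = {"speech": lambda x: -10.0, "silence": lambda x: -40.0}
video_bank = {"one_person": lambda x: -8.0, "nobody": lambda x: -35.0}

audio_out = run_bank(audio_bank, [0.1, 0.2])   # raw audio features
video_out = run_bank(video_bank, [0.5])        # raw video features

# The next layer never sees the raw features, only the banks' outputs.
layer2_observation = list(audio_out.values()) + list(video_out.values())
print(layer2_observation)
```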
| --
| 
|   Decomposition per temporal granularity
| 
|   Layers generate one observation every n time intervals
| 
|   (example), (picture)
| 
|   (lowest level gets the features extracted from the raw sensor data)
|   (any other level gets results from previous level)
|   (the higher the level, the larger the time scale and therefore the
|   larger the level of abstraction)
|   (each layer performs time compression)
|   (n (== T^L) for each layer is determined by intuition: sensor
|   signals: 100 milliseconds; outputs of first layer: less than one
|   second; second layer: 5-10 seconds)
| 
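Possible snippet for the time-compression point: each layer emits one output per window of n inputs from the layer below, so each higher layer works at a coarser time scale. Window sizes and summary functions are illustrative.

```python
def compress(stream, n, summarize):
    """Split a stream into consecutive windows of n items, summarize each."""
    return [summarize(stream[i:i + n])
            for i in range(0, len(stream) - n + 1, n)]

# e.g. 1 raw sample per 100 ms -> layer 1 emits once per second (n = 10),
# layer 2 once per 5 seconds (n = 5 layer-1 outputs).
sensor = list(range(100))                       # 10 s of raw samples
layer1 = compress(sensor, 10, lambda w: sum(w) / len(w))
layer2 = compress(layer1, 5, lambda w: max(w))

print(len(sensor), len(layer1), len(layer2))    # 100 10 2
```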
| --
| 
|   The system is called SEER: a two-layered HMM structure
| 
| --
| 
|   Feature extraction
| 
|   audio: compute linear predictive coding coefficients, use the 7
|     principal coefficients
|     also: other features (like energy)
|     locate source of sound using the Time Delay of Arrival method
|     (more: page 3 and 4)
| 
|   video: 
|     - density of skin color in the image
|     - density of motion in the image
|     - density of foreground pixels in the image
|     - density of face pixels in the image
|     (page 4, third paragraph)
|     (question: why skin color as well as face pixels?)
| 
|   mouse/keyboard:
|     - keep the last 5 seconds of mouse and keyboard events
| 
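Possible snippet for the video features: each "density" is just the fraction of pixels passing some per-pixel test. The skin rule below is a crude illustrative RGB heuristic, not the paper's actual classifier.

```python
def density(image, predicate):
    """Fraction of pixels in the image for which predicate(pixel) holds."""
    pixels = [px for row in image for px in row]
    return sum(1 for px in pixels if predicate(px)) / len(pixels)

def looks_like_skin(px):
    r, g, b = px
    return r > 95 and g > 40 and b > 20 and r > g and r > b  # toy heuristic

# A tiny 2x2 "frame" of RGB pixels, purely for illustration.
frame = [[(200, 120, 90), (10, 10, 10)],
         [(180, 110, 80), (20, 30, 40)]]

print(density(frame, looks_like_skin))   # 0.5
```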
| --
| 
|   Architecture of SEER
| 
|   * two-level HMM cascade with three processing layers
|     - lowest layer: capture input data and compute feature vectors
|     - middle layer: two banks of distinct HMMs to classify the audio
|       and video feature vectors
|       - train one set of HMMs: human speech, music, silence, ambient
|         noise, phone ringing, keyboard typing
|       - train the other set of HMMs: nobody present, one person
|         present (semi-static), one active person present, multiple
|         people present
|     - results of the two HMM banks, and mouse/keyboard events, and
|       sound localization passed to the third layer. Also a bank of
|       discriminative HMMs here.
| 
| --
| 
|   Learning SEER
| 
|   Using the Baum-Welch algorithm
| 
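Possible snippet to have ready here: one Baum-Welch (EM) re-estimation pass for a small discrete HMM, written without scaling (adequate for short toy sequences; all parameter values are made up, not SEER's).

```python
def baum_welch_step(obs, pi, A, B):
    n, T = len(pi), len(obs)
    # E-step, forward pass: alpha[t][s] = P(obs[0..t], state s at t).
    alpha = [[0.0] * n for _ in range(T)]
    for s in range(n):
        alpha[0][s] = pi[s] * B[s][obs[0]]
    for t in range(1, T):
        for s in range(n):
            alpha[t][s] = sum(alpha[t-1][p] * A[p][s]
                              for p in range(n)) * B[s][obs[t]]
    # E-step, backward pass: beta[t][s] = P(obs[t+1..] | state s at t).
    beta = [[1.0] * n for _ in range(T)]
    for t in range(T - 2, -1, -1):
        for s in range(n):
            beta[t][s] = sum(A[s][q] * B[q][obs[t+1]] * beta[t+1][q]
                             for q in range(n))
    evidence = sum(alpha[T-1][s] for s in range(n))
    # gamma[t][s] = P(state s at time t | obs, model).
    gamma = [[alpha[t][s] * beta[t][s] / evidence for s in range(n)]
             for t in range(T)]
    # M-step: re-estimate pi, A, B from the expected counts.
    new_pi = gamma[0]
    denom = [sum(gamma[t][p] for t in range(T - 1)) for p in range(n)]
    new_A = [[sum(alpha[t][p] * A[p][q] * B[q][obs[t+1]] * beta[t+1][q]
                  for t in range(T - 1)) / evidence / denom[p]
              for q in range(n)] for p in range(n)]
    gdenom = [sum(gamma[t][s] for t in range(T)) for s in range(n)]
    new_B = [[sum(gamma[t][s] for t in range(T) if obs[t] == o) / gdenom[s]
              for o in range(len(B[0]))] for s in range(n)]
    return new_pi, new_A, new_B

pi, A, B = [0.5, 0.5], [[0.7, 0.3], [0.4, 0.6]], [[0.8, 0.2], [0.3, 0.7]]
pi, A, B = baum_welch_step([0, 1, 0, 0, 1], pi, A, B)
print(pi, A, B)
```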
| --
| 
|   Experiments: LHMM
| 
|   * tested in multiple offices, with different users, for several
|     weeks.
|   * the higher-level layers of SEER are relatively robust to changes
|     in the environment
|     (only some of the lower-level models needed retraining)
|     (this is the big advantage of the layered structure: less
|     retraining)
| 
| --
| 
|   Experiments: comparison to standard HMMs
| 
|   * concatenate all the feature vector data into one new, large
|     feature vector
| 
|   => Number of parameters to estimate is much higher
|   => Inputs to each level are more stable in LHMMs
|      (because they have already been filtered)
| 
|   => Encoding prior knowledge about the problem in the structure of
|      the models decomposes the problem and reduces the dimensionality
|      of the overall problem.
|      For the same amount of training data, LHMMs have superior
|      performance
| 
|   (it is not considerably more difficult to determine the structure
|   of LHMMs than that of HMMs; it only takes more work to estimate the
|   number of layers and the time granularity of each of them)
| 
| 
| --
| 
|   Results
| 
|   [figure 2 from page 5]
| 
| --
| 
| 
|   One additional set of experiments:
| 
|   test LHMMs against HMMs on 60 minutes of recorded office activity
|   (10 minutes per activity, 6 activities, 3 users)
| 
|   ignore the first few seconds of each dataset
| 
|   used 50% of data for training, 50% for testing.
| 
|   (results in table on 5th page)
| 
| --
| 
|   Conclusions:
| 
|   1. For the same amount of training data, the accuracy
|   of LHMMs is significantly higher than that of HMMs.
| 
|   2. LHMMs are more robust to changes in the environment than HMMs.
| 
|   3. The discriminative power of LHMMs is notably higher
|   than that of HMMs.
| 
| --
| 
|   Summary
| 
| --
| 
|   Security?
