Power Glove Gesture Recognition

Foreword:

	Perhaps everyone who has used the Nintendo Power Glove, whether it be for 
entertainment or development purposes, has gained insight into its usefulness as a gesture 
sensing device - something that "knows" what you are trying to say with your body.  Of 
course, even the Power Glove is limited to hand and wrist input, but it is clearly a far more 
natural input device than the keyboard, mouse, or even the touch screen.  To be more 
quantitative, the glove offers fully 8 channels of completely independent input, as opposed 
to 4 for the mouse or joystick (X, Y, and 2 buttons) and (essentially) 1 for the keyboard.  
It is as though you are playing an 8 string guitar that your computer can "hear" perfectly, 
(although some of the strings only have 4 frets.)  What I want to implement is a program 
to take the flood of glove input and boil it down to easily-interpreted commands like 
"punch", "move left", "move forward", "twist left", "twist right", "fist", and so on and so 
forth.  It is much easier to write a useful application when this filtering has already been 
taken care of.  I see the first and most important application of this implementation to be 
virtual reality.  Simply put, VR is the attempt to make a computer "look and feel" more 
like real life and less like a stupid machine.  The glove puts people in closer contact with 
computer mechanics, and perhaps more importantly, it puts the machine in closer contact 
with the human user!  This much power will probably find other applications as well, and I 
list some potential ones later in the text.
	This paper gives a brief description of how I will implement an excellent gesture 
recognition system.  At this point I'm not asking for input on what you think of this, so if 
you don't like it, ignore it.  The discussion is very technical and contains a lot of "stream of 
consciousness" writing.  If this bothers you, I'm sorry, I don't have a lot of time to polish it 
at this point.  I need to start coding as quickly as possible.  If it bugs you that I'm releasing 
this before the code is done, well, read the afterword, then go back to your private, selfish 
excuse for a life.  The rest of us are getting down to work.

Section I - How simple should recognized gestures be?

Fingers:  Thumb, Index, Middle, Ring.
Location: X, Y, Z, and Rotation.

A gesture consists of a finger "delta" and/or a location "delta" ("delta" = "change in glove 
input").  More subtle observation - we may wish to ignore the delta of any of the eight 
parameters.  Do finger deltas need to be handled differently?  (My answer turns out to be 
an emphatic "NO".)  Big issue is one of gesture "combination".  Should two identical rapid 
gestures be interpreted as a "double-click"?  I think so, (though later I'll reject this.)  The 
concept can be extended like so:  Say the gestures listed in the 1st paragraph can be 
sensed.  If several gestures occur rapidly, there might be a gesture in the event queue like 
"punch-twirl-punch-punch"  which would be distinguished from "punch", "twirl", "punch", 
"punch" (4 separate queue events) by the delay between the gestures.  Clearly a queue of 
gestures would be required, similar to an event queue for mouse clicks.
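
As a rough illustration of combination by delay, the sketch below joins time-stamped gestures into one hyphenated queue entry whenever the pause between neighbours is short. The function name `combine`, the `max_gap` parameter, and the use of click counts as time stamps are assumptions made for this example, not part of the planned code.

```python
def combine(events, max_gap):
    """Group (click, name) pairs into combinations.

    Gestures separated by at most max_gap clicks are joined with "-";
    a longer pause starts a new queue entry.
    """
    combos = []
    last_click = None
    for click, name in events:
        if last_click is not None and click - last_click <= max_gap:
            combos[-1] = combos[-1] + "-" + name
        else:
            combos.append(name)
        last_click = click
    return combos


rapid = [(0, "punch"), (3, "twirl"), (6, "punch"), (9, "punch")]
slow = [(0, "punch"), (40, "twirl"), (80, "punch"), (120, "punch")]
print(combine(rapid, max_gap=5))  # ['punch-twirl-punch-punch']
print(combine(slow, max_gap=5))   # ['punch', 'twirl', 'punch', 'punch']
```

The same four gestures produce one queue entry or four, depending only on the delay between them.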
	Another possibility is the "shifting" of gestures.  That is, if a gesture does not use a 
parameter delta (in other words "an input channel is ignored in a particular gesture"), that 
parameter could be used as a kind of "shift key".  The most logical extension would work 
like this:
  - User picks a "shift" gesture, such as holding the thumb close to the palm.
  - The shift gesture parameter (in this case the thumb delta) cannot be used in any of the 
"regular" gestures.  (This is to prevent "overlap"; see below for more.)
  - Now any regular gesture can be shifted, giving twice as many fundamental gestures.
  - The # of fundamental gestures could in fact be doubled a few more times by adding 
additional "shifting" gestures in the manner of the <Ctrl> and <Alt> keys on most PC 
keyboards.  (One idea I had was to change the meaning of gestures based on the 
"quadrant" that the user is pointing at - upper left, upper right, lower left and lower right.  
E.g. "fist" can be broken into "upper left fist", "lower right fist", etc.)  HOWEVER, using 
shifting gestures encourages modality in the client application, which is usually very 
frustrating to the end-user learning an application.  The human mind, really, only operates 
in one "mode".  I imagine that this is disturbing to some of you.  User interfaces have 
progressed past the "insert mode", "delete mode", "change mode" type.  Equally offensive 
to me is the "just hold down the shift key to modify commands in such-and-such a 
way" type.  (E.g. shift-left-mouse-button to copy, ctrl-left-mouse-button to move, or Shift-F10 
to print and Ctrl-F10 for print preview.)  The human mind can't handle layers like that, but 
the computer will store them forever.  How many times have you moved a file when you 
wanted to copy it, or printed a file when all you wanted was a preview?  This happens 
most frequently in beginning users, and users who have not used a program in a while.  So 
those of you who think that "shifting" gestures is neat had better grow up.  If I choose to 
implement them, it will be for a better reason than that.
  -  Another basic problem is that to be effective, the shifting gestures must lock out an 
entire channel of input which may limit the variety and intuitiveness of the "unshifted" 
gestures.  (What's the difference between a "shifted-fist" and an (unshifted) "fist" if the 
shifting gesture is the thumb tucked into the palm?)  More and more I'm thinking one shift 
gesture is probably too much.  Might want to make it optional for certain applications.
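
The lock-out problem can be made concrete with a small sketch. It assumes each gesture carries an 8 bit channel mask; the bit order and the helper names `overlaps_shift` and `apply_shift` are hypothetical and chosen only for this example.

```python
# Hypothetical channel bit assignments: one bit per input channel.
CHANNELS = ["X", "Y", "Z", "Rotation", "Thumb", "Index", "Middle", "Ring"]
BIT = {name: 1 << i for i, name in enumerate(CHANNELS)}

SHIFT_CHANNEL = BIT["Thumb"]  # the shift gesture owns the thumb channel


def overlaps_shift(channel_mask):
    """True if a regular gesture uses the channel reserved for shifting."""
    return bool(channel_mask & SHIFT_CHANNEL)


def apply_shift(name, thumb_tucked):
    """Return the gesture name the client would see."""
    return "shifted-" + name if thumb_tucked else name


punch_mask = BIT["Z"]
fist_mask = BIT["Thumb"] | BIT["Index"] | BIT["Middle"] | BIT["Ring"]

print(overlaps_shift(punch_mask))  # False: "punch" can be shifted
print(overlaps_shift(fist_mask))   # True: "fist" collides with the shift
print(apply_shift("punch", thumb_tucked=True))  # shifted-punch
```

A "fist" that involves the thumb fails the overlap check, which is exactly the shifted-fist ambiguity described above.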
	It would be useful for the gesture recognition system to do automatic "globbing".  
E.g. the "punch-twirl-punch-punch" command would automatically be shortened to 
something like "knockout".  This is different from gesture combination in that it does not 
directly involve time.  Example - "punch" followed later by another "punch" could be 
globbed to "twopunch".  This way, "punch", "knockout", and "twopunch" would all be 
separate gestures, even though they all involve a rapid thrust toward the screen.  The 
client app would not need to do any special interpretation to receive them correctly.
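
One possible reading of globbing is a lookup table from gesture sequences to a single replacement gesture. The sketch below assumes that reading; the `GLOBS` table and the longest-match-first rule are illustrative choices, and it works on a finished list rather than a live queue.

```python
# Hypothetical glob table: a sequence of elemental gestures on the left,
# the single gesture delivered to the client on the right.
GLOBS = {
    ("punch", "twirl", "punch", "punch"): "knockout",
    ("punch", "punch"): "twopunch",
}


def glob(gestures):
    """Replace known sequences with their glob name, longest match first."""
    patterns = sorted(GLOBS, key=len, reverse=True)
    out = []
    i = 0
    while i < len(gestures):
        for pattern in patterns:
            if tuple(gestures[i:i + len(pattern)]) == pattern:
                out.append(GLOBS[pattern])
                i += len(pattern)
                break
        else:
            # No pattern starts here; pass the gesture through unchanged.
            out.append(gestures[i])
            i += 1
    return out


print(glob(["punch", "twirl", "punch", "punch"]))  # ['knockout']
print(glob(["punch", "punch", "fist"]))            # ['twopunch', 'fist']
print(glob(["punch", "fist"]))                     # ['punch', 'fist']
```

A real-time version would also have to decide how long to wait before concluding that a longer sequence is not coming.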
	There is more about shifting, combination, and globbing later in this article.

Section II - Uses of this implementation

	The intended implementation would allow the end user to define the gestures 
he/she is most comfortable with.  What the computer is supposed to do when a gesture is 
received is not the subject of this article.  There are a myriad of applications that could use 
gesture receipt to trigger a function.  Ideas:
	Virtual Reality
		Communicating between multiple users
		Constructing rooms & objects
		Movement within the world
		"Training" of autonomous moving objects
	Disabled persons
		Simplified communication (speech synthesis, text generation)
		Therapy, muscular exercise
	Windowing/User Interfaces
		Sizing, moving, selecting windows and data chunks
	Education
		Training for sign language
		Visual computer programming
		Introductions to computers
	Multimedia applications
		Moving between subjects.
	Games
	Control of unusual peripherals (robotic arms)

	Perhaps with the glove and good gesture recognition, we can break out of "flat" 
computer interfaces and put computers within the reach of more people.  Keep your 
fingers crossed!

Section III - The implementation itself

	We need a way to "name" the gestures without using the keyboard.  One way is to 
have a built-in gesture set designed for entering alphabetic characters (similar to the way 
hearing-impaired people fingerspell proper names).  Another way is to just have a long list of 
gestures that can be assigned, and you just select the gesture you want to define menu-
style.  I think we want a combination of the above - I'll supply a list but users can also add 
their own.  Keep in mind that these ASCII names do not go into the event queue, they are 
there solely for the convenience of the application.  I imagine that once the gestures are 
loaded from disk and any modifications are made by the end user, the names can be 
removed from memory.  They only need to be reloaded if the user wants to change the 
definition.  The app refers to the gestures via a pointer (or "handle").  (Side note: I've read 
some discussion on CompuServe about whether to use #defines or constant strings to refer 
to the gestures.  NEITHER OF THESE METHODS IS FLEXIBLE ENOUGH.)  Clearly 
I need a way of saving/loading gestures/gesture-sets to/from disk.  I will avoid "binary 
dumps"; however, don't expect them to be easily interpretable.  (Although you never know!)
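
A minimal sketch of the name/handle split, assuming integer handles stand in for the pointers and a one-pair-per-line text format stands in for the save file. The class `GestureNames` and its methods are hypothetical; the actual file format is not specified here.

```python
class GestureNames:
    """Maps ASCII gesture names to integer handles (hypothetical design)."""

    def __init__(self):
        self._names = []  # the index in this list is the handle

    def define(self, name):
        """Return the handle for name, adding it if it is new."""
        if name not in self._names:
            self._names.append(name)
        return self._names.index(name)

    def name_of(self, handle):
        return self._names[handle]

    def dumps(self):
        """Serialize as plain text, one 'handle name' pair per line."""
        return "".join(f"{h} {n}\n" for h, n in enumerate(self._names))

    @classmethod
    def loads(cls, text):
        """Rebuild the table; lines are assumed to be in handle order."""
        table = cls()
        for line in text.splitlines():
            _handle, name = line.split(" ", 1)
            table.define(name)
        return table


names = GestureNames()
PUNCH = names.define("punch")
FIST = names.define("fist")

saved = names.dumps()
restored = GestureNames.loads(saved)
print(restored.name_of(FIST))  # fist
```

The application only ever holds `PUNCH` and `FIST`; the table of names can be discarded and reloaded when the user wants to edit definitions.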
	Okay.  I am assuming that the glove will be sampled at regular intervals.  The 
length of this interval is not terribly important, but most applications are going to need 
real-time input.  When I refer to a "click" or a "time click" below, I mean the time between 
glove samples.  I suppose you could still do gesture recognition if the glove is sampled 
irregularly, as long as the time between each sample is recorded, or maybe a time stamp on 
each sample or something like that.  It would still be a pain in the butt compared to regular 
sampling.  Basic objects are shown below:


Sample:
	One UNSIGNED byte for each of the following X, Y, Z, Rotation, Thumb, Index, 
Middle, Ring.  (Practically speaking, the straight glove data.)

Gesture:
	1 word pointer to a Recognizable.

Recognizable:
	One SIGNED byte for each of the following: delta X, delta Y, delta Z, delta 
Rotation, delta Thumb, delta Index, delta Middle, delta Ring
	8 bit value with each bit signifying whether the corresponding "channel" is to be 
used or ignored for the gesture.
	One size_t time (in clicks) for the length of the gesture.

EventQueue:
	A queue of Gestures.

SampleBuffer:
	A random access array of Samples (updated after each click.)  Must hold as many 
samples as the largest Recognizable in the GestureList.

GestureList:
	A sorted list of Recognizable objects.  The sorting key is the length of the gesture.
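
The objects above might be sketched as follows. Python is used for brevity; the field names, the bit order of the channel mask, and the example delta values are assumptions. The buffer is given one extra slot so that the current Sample and the one N clicks earlier are both present, which is also an assumption.

```python
from collections import deque
from dataclasses import dataclass
from typing import Tuple

CHANNELS = ("X", "Y", "Z", "Rotation", "Thumb", "Index", "Middle", "Ring")


@dataclass(frozen=True)
class Sample:
    values: Tuple[int, ...]  # eight unsigned bytes, 0..255

    def __post_init__(self):
        assert len(self.values) == 8
        assert all(0 <= v <= 255 for v in self.values)


@dataclass
class Recognizable:
    deltas: Tuple[int, ...]  # eight signed bytes, -128..127
    mask: int                # bit i set = channel i is used
    length: int              # gesture length in clicks

    def __post_init__(self):
        assert len(self.deltas) == 8
        assert all(-128 <= d <= 127 for d in self.deltas)
        assert 0 <= self.mask <= 0xFF and self.length > 0


def make_gesture_list(recognizables):
    """GestureList: Recognizables sorted by length."""
    return sorted(recognizables, key=lambda r: r.length)


def make_sample_buffer(gesture_list):
    """SampleBuffer sized for the longest look-back plus the current Sample."""
    longest = max(r.length for r in gesture_list)
    return deque(maxlen=longest + 1)


# Example values are arbitrary; the sign of a "punch" along Z is a guess.
punch = Recognizable((0, 0, -40, 0, 0, 0, 0, 0), mask=0b00000100, length=3)
twist = Recognizable((0, 0, 0, 30, 0, 0, 0, 0), mask=0b00001000, length=5)

gesture_list = make_gesture_list([twist, punch])
sample_buffer = make_sample_buffer(gesture_list)
print([r.length for r in gesture_list])  # [3, 5]
print(sample_buffer.maxlen)              # 6
```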


	The main engine works like so:

1. A new Sample is added to the SampleBuffer.
2. Step through the GestureList:
	a.  Compute a delta vector for the current Recognizable, (if its length is not the 
same as the last one's; Recognizables of equal length share a delta vector.)  This is done by 
subtracting each of the Sample values from N clicks ago (which will be present in the 
SampleBuffer) from the current Sample values, where N is the length of the current Recognizable.
	b. See if the delta vector lies within Epsilon of the current Recognizable's delta 
vector.  Epsilon is a constant vector to allow for "close enough" user input.  I will supply 
the appropriate Epsilon vector for the Power Glove after experimenting.
	c.  If 2b. is true, add the address of the Recognizable to the event queue.  The 
address of the Recognizable is in fact a Gesture.
3.  Continue forever!
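
The loop above can be sketched in a few lines. This is only an illustration under stated assumptions: gestures are plain `(name, deltas, mask, length)` tuples, the delta is taken as the current Sample minus the earlier one, the Epsilon values are made up, and the buffer is an unbounded list for simplicity.

```python
def within_epsilon(delta, target, mask, epsilon):
    """Compare only the channels whose mask bit is set."""
    return all(
        abs(d - t) <= e
        for i, (d, t, e) in enumerate(zip(delta, target, epsilon))
        if mask & (1 << i)
    )


def run_naive_engine(samples, gesture_list, epsilon):
    """Steps 1-3 as first described; returns (click, name) events."""
    buffer = []
    events = []
    for click, sample in enumerate(samples):
        buffer.append(sample)                              # step 1
        for name, target, mask, length in gesture_list:    # step 2
            if len(buffer) <= length:
                continue  # not enough history for this gesture yet
            old = buffer[-1 - length]
            delta = [c - o for c, o in zip(sample, old)]   # step 2a
            if within_epsilon(delta, target, mask, epsilon):  # step 2b
                events.append((click, name))               # step 2c
    return events


def at_x(x):
    """A sample with only the X channel varying."""
    return (x, 0, 0, 0, 0, 0, 0, 0)


samples = [at_x(0), at_x(9), at_x(18), at_x(27), at_x(27)]
gestures = [("move right", (20, 0, 0, 0, 0, 0, 0, 0), 0b00000001, 2)]
epsilon = (5,) * 8

print(run_naive_engine(samples, gestures, epsilon))
# [(2, 'move right'), (3, 'move right')]
```

Note that one steady sweep of the hand is reported on two consecutive clicks.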

	Hmmm.  This isn't quite right.  Depending on how quickly the user makes a 
gesture, it might make it to the queue several times, instead of just once as hoped.  We 
need to amend the data structures.



Recognizable:
	Same as above, plus an 8 byte "work vector" and a 1 bit flag to say whether the 
gesture is currently being sensed (the "sensed bit").

Gesture:
	Same as above, plus a 1 bit value that is TRUE when the gesture is first 
recognized, and FALSE when the gesture has been released.

In case you can't tell, event queue messages now take the form "punch(sense)" followed 
later by "punch(release)".  This is similar to the way mouse "click-hold-and-drag" 
operations work - one message is sent when the gesture is first "seen" and another is sent when 
the gesture is finally released.  Hmmm, rather than setting and resetting a bit to indicate 
whether the gesture is turning on or off, why not just have the client app assume that 
identical queue messages will be sent?  The first one will always be the "turning on" 
message, and it must be followed by another identical "turning off" message at a later 
time.  But the messages may not occur one right after another.  Does that put too much 
responsibility on the client?  Hmmm.  It would be nice to have just a single pointer in the 
queue, but I'm leaning toward keeping the extra bit.  Many client apps will want to ignore 
the "release" message, and they would just have to check that bit to distinguish the 
irrelevant messages.  Otherwise they'll have to keep track...
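
To show what the extra bit buys the client, here is a small sketch of a queue entry and a client that ignores release messages. The `GestureEvent` type, the integer handles, and `sensed_only` are hypothetical names used only for illustration.

```python
from collections import deque
from typing import NamedTuple


class GestureEvent(NamedTuple):
    handle: int       # stands in for the pointer to a Recognizable
    turning_on: bool  # True = sensed, False = released


def sensed_only(queue):
    """Drain the queue, keeping only the 'turning on' messages."""
    kept = []
    while queue:
        event = queue.popleft()
        if event.turning_on:
            kept.append(event.handle)
    return kept


PUNCH, FIST = 0, 1  # hypothetical handles
queue = deque([
    GestureEvent(PUNCH, True),
    GestureEvent(FIST, True),
    GestureEvent(PUNCH, False),
    GestureEvent(FIST, False),
])
print(sensed_only(queue))  # [0, 1]
```

Without the bit, the client would have to remember which gestures are currently "on" in order to tell the second "punch" message from a new one.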
	I'll modify the engine as follows:

2. Step through the GestureList:
	a.  Compute a delta vector for the current Recognizable, (if it is not the same as 
the last one.)  This is done by subtracting each of the Sample values from N clicks ago 
from the current Sample values, where N is the length of the current Recognizable.  If the 
"sensed bit" is set, subtract the work vector from the current values (rather than 
subtracting the sample from N clicks ago.)
	b. See if the delta vector lies within Epsilon of the current Recognizable's delta 
vector.  Epsilon is a constant vector to allow for "close enough" user input.  I will supply 
the appropriate Epsilon vector for the Power Glove after experimenting.
	c.  If 2b. is true, and the "sensed bit" is not set, then set the "sensed bit" and add 
the address of the Recognizable to the event queue, making sure it is a "turning on" 
gesture.  Also, copy the Sample from N clicks ago into the work vector!
	d.  If 2b. is false, and the "sensed bit" is set, then clear the "sensed bit" and send 
a "turning off" queue message.
3.  Continue forever and ever!
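
A sketch of the amended loop, under the same assumptions as before (delta = current minus earlier Sample, made-up Epsilon values, hypothetical names). It additionally assumes that a release clears the "sensed bit" so the gesture can be recognized again.

```python
class Recognizable:
    def __init__(self, name, deltas, mask, length):
        self.name, self.deltas = name, deltas
        self.mask, self.length = mask, length
        self.work = None      # the "work vector"
        self.sensed = False   # the "sensed bit"


def matches(delta, rec, epsilon):
    """Compare only the channels the Recognizable actually uses."""
    return all(
        abs(d - t) <= e
        for i, (d, t, e) in enumerate(zip(delta, rec.deltas, epsilon))
        if rec.mask & (1 << i)
    )


def step(buffer, sample, gesture_list, epsilon, queue):
    """Process one click: buffer the sample, then scan the GestureList."""
    buffer.append(sample)
    for rec in gesture_list:
        if rec.sensed:
            base = rec.work                    # 2a, sensed case
        elif len(buffer) > rec.length:
            base = buffer[-1 - rec.length]     # 2a, normal case
        else:
            continue                           # not enough history yet
        delta = [c - b for c, b in zip(sample, base)]
        hit = matches(delta, rec, epsilon)     # 2b
        if hit and not rec.sensed:             # 2c
            rec.sensed = True
            rec.work = base                    # sample from N clicks ago
            queue.append((rec.name, True))
        elif not hit and rec.sensed:           # 2d
            rec.sensed = False  # assumed: release also clears the bit
            queue.append((rec.name, False))


def at_x(x):
    """A sample with only the X channel varying."""
    return (x, 0, 0, 0, 0, 0, 0, 0)


move_right = Recognizable("move right", (20, 0, 0, 0, 0, 0, 0, 0), 0b1, 2)
buffer, queue = [], []
for sample in [at_x(0), at_x(9), at_x(18), at_x(18), at_x(18), at_x(0)]:
    step(buffer, sample, [move_right], (5,) * 8, queue)

print(queue)  # [('move right', True), ('move right', False)]
```

While the hand rests at the end of the movement nothing new is queued; the release message appears only when the hand moves away from that position.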


	It should be plainly obvious how users can create their own gestures using JUST 
THE GLOVE!  They'll hit one of the glove buttons to start a gesture, make the gesture, 
then hit the glove button again.  All you need to store are the sample deltas and the length 
of the gesture.  What could be easier?
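
Recording could look roughly like this. The sketch assumes the samples captured between the two button presses are available as a list; `record_gesture` and the `min_delta` threshold for deciding which channels to ignore are hypothetical additions.

```python
def record_gesture(samples, min_delta=4):
    """Build (deltas, mask, length) from the samples taken between the
    two button presses.  Channels that moved less than min_delta are
    marked as ignored in the mask.
    """
    first, last = samples[0], samples[-1]
    # Clamp to the signed byte range used by a Recognizable.
    deltas = tuple(max(-128, min(127, b - a)) for a, b in zip(first, last))
    mask = 0
    for i, d in enumerate(deltas):
        if abs(d) >= min_delta:
            mask |= 1 << i
    length = len(samples) - 1  # clicks elapsed between first and last sample
    return deltas, mask, length


recorded = [
    (100, 50, 80, 0, 0, 0, 0, 0),
    (101, 50, 60, 0, 0, 0, 0, 0),
    (99, 51, 40, 0, 0, 0, 0, 0),
]
print(record_gesture(recorded))
# ((-1, 1, -40, 0, 0, 0, 0, 0), 4, 2)
```

Here the small jitter in X and Y falls below the threshold, so only the Z channel (bit 2) is marked as used.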
	Glove gestures as I have defined them are rather "elemental".  It is not possible to 
define a gesture like "move hand up, wiggle your index finger, move toward the screen 
and twist your wrist left".  However, you COULD define 4 or 5 elemental gestures that 
would be "added up" by the client application and interpreted as a single gesture!  I will 
include one or more relatively simple examples of this with the code, but I'm not sure if 
they'll be the combination or globbing type.
	That about covers it.  I'm still not sure about shifting, combination, or globbing.  I 
have some deeper questions about these concepts which I have not really addressed here.  
If you have understood everything I said, you probably have the same questions!  I'll 
probably use Borland's C++ CLASSLIB library to handle the queue, buffer, and array in 
the first cut.  I'll probably release that mainly to show beginning OOP programmers how a 
class library is used.  (I know there's a lot of you out there.)  But the real version will not 
use CLASSLIB for three reasons: 1) Portability, 2) Execution speed, and 3) Memory 
usage.  Good reasons, don't you think?  The only disadvantage is perhaps maintainability.  But 
this is a REAL TIME application.  Compactness in both speed and size is paramount.  I 
will promise you that it will be object oriented as well.  I really want this thing to reach a 
wide audience.  If you doubt my credentials read MTP.BIO in the COMART forum on 
CompuServe.

Afterword:
	The system will be freeware with the provision that you give me credit if you use 
any part of the source code for any purpose.  It will be copyrighted.  It is my opinion that 
good software is only of value to the intellectual community when it is accessible.  I am 
making every effort to put this tool in a position where it will be used.  Please use it!  In 
the spirit of Richard Stallman's work, I am permitting you to take advantage of my ideas.  
If you make any significant improvements to the engine, please let me know.
	If my gesture recognition system sounds good to you, here's how you can pay me 
back.  Think about what gesture sets will be useful to you.  When the code is done, use it 
to create the gesture sets you want.  But I WANT TO SEE WHAT YOU HAVE DONE.  
This is only fair.  You may sell your application if you must, but you still owe it to me to 
let me have your gesture sets.  It's the only way I can see the effect of my effort.  It will 
encourage me to stay focused on this important area.  There's obviously no way I can 
force you to do this without investing big $$, which I don't have.  I just want you to "Be 
Nice to Me" as Todd Rundgren so aptly put it (in 1971).
	Also, if you're willing to hire me to work on Virtual Reality construction tools or 
let me help in constructing actual Virtual Worlds, please contact me (see below).  My 
present job is very boring, and this kind of stuff is one of the few things which stimulate 
me.



			Mark T. Pflaging

Home:
	7651 S. Arbory Lane
	Laurel, MD 20707
	(301)-498-5840

Work:
	Cambridge Scientific Abstracts
	7200 Wisconsin Avenue
	Bethesda, MD  20814