Abstract
1. Introduction
Smart glasses can be part of the Internet of Things (IoT). In intelligent buildings and homes in particular, smart glasses can serve as the interface between a user and the surrounding intelligence. In such an environment, a user of smart glasses can control smart objects, acquire information from them, or change their configuration. However, this requires implementing a method of object identification (“Which object am I controlling?”), generation and processing of graphical user interfaces for objects, suitable data exchange protocols, and so forth. Smart objects can be recognized and identified using different methods that can be implemented on smart glasses. In this paper we focus on image-based identification of objects: using markers and based on the visual appearance of objects.
When the smart object is identified, the related processes can be started that (1) connect the smart glasses to the identified smart object, (2) present the GUI of the smart object, and (3) execute the user's actions. We present the related IoT scenario describing image-based object identification, the system architecture, object representation methods (JSON), and high-level communication (JSON-RPC). We also present the results of short user studies on acceptable response times of the object identification process, together with an analysis of the computational performance of selected feature detection/extraction methods on different smart glasses (Google Glass, Epson Moverio, and eGlasses).
Communication with smart objects and the related architectures are the subject of many papers [1–4]. Many authors underline the need to integrate different smart objects into one network or grid. In [5] the authors proposed a framework that allows users to register their own sensors in a common infrastructure and access the available resources through mobile discovery. The authors of [6] underline the reconfiguration ability of the proposed smart objects, which is based on context received from the Smart Space; additionally, the important role of gateways and of smart object identification is discussed. The role of SOA-based data processing middleware for the IoT is presented in [7]. The authors conclude that middleware is a good foundation for the integration of diverse networks and for better interaction among heterogeneous systems in the future, as it simplifies the integration process. A similar role of middleware was also analyzed in [8].
Interaction with smart objects starts with identification of the smart object. Objects are often identified using RFID tags [9, 10] or head-mounted sensors [11]. Some authors propose object identification methods based on graphical features using cameras [12, 13] and smartphones [14]. Different local detectors and descriptors have been used for object detection in the context of interaction with smart objects. The Attention Responsive Technology (ART) system proposed in [15] uses eye-tracking and monitors the allocation of visual attention with reference to the local environment; it allows interaction when the user's gaze falls on a smart device. The authors of [12] presented another gaze-based system for home automation, comparing “direct” and “mediated” interaction solutions. They underlined that it is often difficult to control complex devices directly, only by gazing at them; therefore, PC-based, menu-driven software can be used for such “mediated” interaction, and the menu-driven interface can also be controlled by gaze. In [14] the authors propose automatic user interface generation on a handheld device using live visual object recognition. Different views (10–15 images) of objects are used in the training phase. The presented system differentiates between 8 categories of objects using the 128-dimensional SURF descriptor. Processing performance tests run on a tablet computer (1.7 GHz, dual-core processor) showed that it takes approximately 150 ms to process a single frame (up to 7 frames/s).
Objects can also be identified using dedicated graphical markers [16, 17] or active markers [18]. Such methods have also been proposed for applications of smart glasses in healthcare [19].
In [20] the authors proposed a solution for visual-attention-driven networking with smart glasses (iGaze). Using eye-tracking, the visual attention is captured and the corresponding gaze vector to the visual target is calculated. The user is asked to make a mild head gesture (e.g., a head nod) as the postfix of the attention. Together with the head gesture, the smart glasses emit an inaudible acoustic signal to adjacent devices. Nearby devices invoke the phase tracking and device direction determination modules to estimate the device vector; the Doppler effect is used for this purpose.
Dedicated architectures were also proposed for bidirectional interaction between smartphone/smart glasses and smart objects [21].
We can assume the following categories of objects suitable for interaction activities:
Passive objects: objects equipped with a wireless interface providing one- or two-directional transmission of data (synchronously or asynchronously); for example, turning power on/off and reading a parameter value. It is not possible to provide any parameters to the object or to query it. Active objects: passive objects extended with the possibility of querying the object or providing some parameters (e.g., set the temperature to a given value).
Additionally, we assume that many “not smart” objects can be turned into smart objects when extended with additional electronics. Such objects can additionally be called augmented objects.
In this paper we propose the use of smart glasses to collaborate with smart objects in an IoT environment. Additionally, we focus on the analysis of acceptable reaction times of the object recognition system using smart glasses. We evaluate this through user studies and experiments with Google Glass, Epson Moverio, and the eGlasses platform developed by us. The eGlasses platform (http://www.eglasses.eu/) is being developed as an open platform with which developers can change some of the electronics, print another smart glasses cover using a 3D printer, add sensors or electrodes, change the display, and so forth. The current prototype uses an OMAP 4460 processor, a 1024 × 768 transparent display from the ELvision Company, 5 MPx cameras, different sensors, and extension slots. Android 4.1 and Linux Ubuntu can be used for experiments.
2. Methods
2.1. System Architecture and Proposed Interaction Methods
Theoretically, smart glasses can directly discover smart objects in the local environment (e.g., via UPnP multicast in the local network). However, this imposes specific implementation constraints on (different) smart glasses (e.g., wireless interfaces and protocols). Therefore, gateways or bridges are often proposed for IoT environments. In this paper we likewise propose an architecture consisting of smart glasses, a bridge, and smart objects. In this IoT scenario, smart objects are connected to a bridge (gateway) in the local area network using different interfaces (e.g., WiFi, ZigBee, Bluetooth, and USB). Each smart object is represented by a JSON object and the actual software library (drivers) that enables control of the object. The library provides the protocol implementation to interact with the smart object at the low and/or high level. The properties stored in the JSON object mainly represent the name/identifier of the object, the wireless interface address, the transport method (e.g., JSON-RPC over WiFi, ASCII stream over ZigBee), the physical location of the object (name and coordinates), the image icon of the object (encoded using BASE64), the object detection method (e.g., VISUAL_APPEARANCE_ORB, QRCODE), a set of descriptors (if required, e.g., ORB descriptors encoded using BASE64), and a collection of available functions, each represented using JSON notation and a JSON-RPC2 description. In the proposed system, an application server on the bridge processes multicast discovery queries (e.g., UPnP) and interacts with the smart glasses using REST-based services. For simplicity, we assume that the smart glasses, using a dedicated service, regularly scan the available WiFi networks looking for known SSID identifiers. In our experiments we used SSID names starting with the letters “eG”. When such a router/access point is detected, the user is notified.
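A smart object's JSON representation, as described above, might look as follows. This is a minimal sketch: all field names and values are illustrative assumptions, not the exact schema used by the system, and the BASE64 payloads are elided.

```json
{
  "name": "e-socket-01",
  "interfaceAddress": "192.168.0.42",
  "transport": "JSON_RPC_OVER_WIFI",
  "location": { "name": "living room", "x": 3.1, "y": 0.4 },
  "icon": "<BASE64-encoded image icon>",
  "detectionMethod": "VISUAL_APPEARANCE_ORB",
  "descriptors": "<BASE64-encoded ORB descriptors>",
  "functions": [
    { "name": "POWER_ON",
      "rpc": { "jsonrpc": "2.0", "method": "POWER_ON", "params": [], "id": 1 } },
    { "name": "READ_POWER",
      "rpc": { "jsonrpc": "2.0", "method": "READ_POWER", "params": [], "id": 2 } }
  ]
}
```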
When connected, the smart glasses transmit a multicast query (e.g., UPnP), looking for the IoT bridge; in effect, the URL of the bridge service is retrieved. In the next step, a GET request is sent to the URL address of the bridge. As a result, a collection of JSON objects is transmitted to the smart glasses (or an error/exception is raised). Each JSON object represents a smart object. The initial interaction is presented in Figure 1.

The initial interaction between smart glasses and the IoT bridge.
The user is presented with the list of available smart objects, shown as a list of buttons with the name of each object and its image icon. He or she can choose an object from the list (in which case the dedicated GUI is generated) or can switch to the mode of automatic identification of objects. The camera is then started, detecting either graphical markers (e.g., QR codes) or visual descriptors (see Figure 2).

(a) The smart object can be identified using graphical markers or visual features. (b) The eGlasses platform was tested in the smart home, iHomeLab, Switzerland.
The related GUI is presented to the user when the object is selected from the list or is detected and identified using the camera. Selected properties from the related JSON object (object name, image icon) are presented. Additionally, the available actions are represented by GUI widgets: a button (with the method name, e.g., POWER_ON) and additional widgets if required (e.g., an ImageView for image data such as frames received from the connected camera of the smart object). An action listener is automatically generated for each button. The implementation of the listener is generated based on the transport mechanism described in the JSON object. For example, if the transport mechanism is set to JSON_RPC_OVER_WIFI, a JSON-RPC2 request is sent to the bridge (Figure 3). The bridge processes the request for the particular smart object, executing the related code.
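The JSON-RPC2 request generated by such a listener can be sketched as follows. The class and method names are illustrative only; the actual implementation is not reproduced here.

```java
// Minimal sketch of building a JSON-RPC 2.0 request for a parameterless
// smart-object method (e.g., POWER_ON), as sent by a generated action listener.
public class JsonRpc2Client {
    // Build the request body; the bridge routes it to the target smart object.
    public static String buildRequest(String method, int id) {
        return String.format(
            "{\"jsonrpc\":\"2.0\",\"method\":\"%s\",\"params\":[],\"id\":%d}",
            method, id);
    }
}
```

In the real system the resulting string would be POSTed to the bridge URL obtained during discovery; a JSON-RPC library (as used on the bridge) would normally replace the hand-built string.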

The bridge can be a middleware between smart glasses and particular smart objects.
As a result, the smart glasses can receive the status of the information processing (e.g., a success code) and results (data). The received information is presented in a dedicated text view widget (e.g., an Android TextView). A special form of the request is the subscribe request: as a result of this request, the smart object automatically and asynchronously sends new data when available (e.g., power consumption values from the smart power e-socket). The described procedure assumes that the remote method (the service of the smart object) does not take any parameters. However, if the method requires a parameter value, CheckBox or EditText widgets can be generated for the GUI.
Smart glasses do not offer typical data entry procedures, so dedicated text/data entry methods must be used. Possible solutions include the application of the accelerometer [22], smart fabrics [23], or an eye-tracker [24]. Other methods and a comparison study were presented in [25]. In this study we verified the fundamental interaction procedure using the eye-tracker, which is part of the eGlasses platform.
2.2. Interaction Using Eye-Tracking
Eye-tracking technology is not new; however, the combination of smart glasses with eye-tracking opens new possibilities for human-system interaction. In this study we verified the possible use of eye-tracking algorithms developed earlier [26] for interaction with graphical user interfaces for the control of smart objects.
In general, the eye-tracking module is designed to operate in one of three main modes, as presented in Figure 4. The primary mode enables controlling a variety of devices and applications of the eGlasses through interaction with the near-to-eye display. The other two modes allow gaze tracking (scene analysis) and communication by gaze with an external computer/display.

The main operation modes of the eye-tracking module.
In this study, the first mode is used. The implemented pupil tracking algorithm provides the position of the pupil center, while the transformation algorithm relates these data to the graphic content displayed on the near-to-eye display (Figure 5).

The point-of-gaze estimation in the near-to-eye display mode.
Proper use of the eye-tracker requires performing a calibration procedure. The subject looks at a series of target (calibration) points while the eye-tracker records the coordinates corresponding to each gaze position. The calibration points are placed at the corners of the near-to-eye display. The eGlasses platform is equipped with a display with a native resolution of 1024 × 768, but it can be used at smaller resolutions, for example, 800 × 600 or 800 × 480. To eliminate the possibility that a user omits some points during the calibration procedure, a 640 × 480 calibration board is displayed in the center of the near-to-eye display, as shown in Figure 6. The 640 × 480 resolution was chosen as the working resolution of the eye-scanning camera; it provides satisfactory gaze-point detection results with limited performance/power requirements [24]. After the calibration procedure, the transformation matrix is calculated with regard to the relative positions of the test board's corners.
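Assuming a simple axis-aligned mapping (the actual implementation computes a transformation matrix from the board corners, which also handles skew), the pupil-to-display mapping after calibration can be sketched as:

```java
// Sketch: map a pupil position (camera coordinates) onto the 640 x 480
// calibration board by linear rescaling between the calibrated corner values.
// All names are illustrative; the real system uses a full transformation matrix.
public class GazeMapper {
    private final double pxMin, pxMax, pyMin, pyMax; // pupil coords at corners
    private final int boardW, boardH;                // calibration board size

    public GazeMapper(double pxMin, double pxMax, double pyMin, double pyMax,
                      int boardW, int boardH) {
        this.pxMin = pxMin; this.pxMax = pxMax;
        this.pyMin = pyMin; this.pyMax = pyMax;
        this.boardW = boardW; this.boardH = boardH;
    }

    // Linearly rescale the pupil position into board (display) coordinates.
    public int[] toDisplay(double px, double py) {
        int x = (int) Math.round((px - pxMin) / (pxMax - pxMin) * boardW);
        int y = (int) Math.round((py - pyMin) / (pyMax - pyMin) * boardH);
        return new int[] { x, y };
    }
}
```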

Example of the calibration board displayed during the calibration procedure. (a) Basic idea, (b) practical use example: the first test point is shown.
Successful calibration allows mapping the pupil position into the near-to-eye display coordinate system. For example, it gives the user the possibility of controlling the mouse cursor by gaze [26]; the user can interact with the GUI simply by looking at it. The idea of interacting with a recognized object is to present graphic content (buttons, widgets) around detected objects so that commands can be sent using the displayed GUI. Therefore, interaction with smart objects requires both selection and confirmation of the desired commands. In this study we decided to relate the selection of a graphical component directly to the gaze position (movements) and to use fixation within the region of interest (dwell time) for confirmation.
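The dwell-time confirmation described above can be sketched as a small state machine that consumes gaze samples and fires a confirmation once the gaze has stayed on one widget long enough. Names and the event interface are illustrative assumptions.

```java
// Sketch: reduce a stream of gaze samples (widget id under gaze, timestamp)
// to "onClick" confirmations after a fixation of at least dwellMs.
public class DwellSelector {
    private final long dwellMs;
    private int currentWidget = -1;  // widget currently under gaze (-1 = none)
    private long enteredAt = -1;     // when the gaze entered that widget
    private boolean fired = false;   // confirmation already fired for this fixation

    public DwellSelector(long dwellMs) { this.dwellMs = dwellMs; }

    // Returns the widget id on confirmation ("onClick"), or -1 otherwise.
    public int onGazeSample(int widgetId, long timestampMs) {
        if (widgetId != currentWidget) {   // gaze moved: restart timing
            currentWidget = widgetId;
            enteredAt = timestampMs;
            fired = false;
            return -1;
        }
        if (!fired && widgetId >= 0 && timestampMs - enteredAt >= dwellMs) {
            fired = true;                  // fire at most once per fixation
            return widgetId;
        }
        return -1;
    }
}
```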
2.3. Implementation
The prototype of the system for interacting with smart objects was implemented mainly in the Java programming language. The software for the smart glasses was implemented using the Android SDK and OpenCV (using both Java and native code via the Java Native Interface, JNI). The services of the bridge were implemented using the Jetty Application Server with JSON and JSON-RPC2 libraries. Additionally, low-level software was used to process stream data from ZigBee (using ZigBee-to-serial conversion). A laptop computer was used as the bridge. Three categories of smart objects were tested: a controllable power socket with a ZigBee interface [21], a Philips Hue system with its proprietary bridge (http://www2.meethue.com), and the Parrot Jumping Sumo robot (http://www.parrot.com/usa/products/jumping-sumo/). The bridge directly controls the power socket. For the Philips Hue system, the bridge provided mapping of JSON messages between the smart glasses and the Philips Hue system. For the Parrot robot, the bridge delegates the connection to the robot; the smart glasses execute the downloaded activity using the Android Intent mechanism. To control the robot we used libraries and code provided by the producers.
The experimental prototype of the eGlasses platform was used to perform experiments on possible interactions with the near-to-eye display using the eye-tracking module (Figure 7). The eye-observing camera was located below the display and was used to track the pupil position with reference to the coordinate system of the display.

(a) The experimental prototype of the smart glasses with the eye-tracker. (b) The eye-observing camera located below the display.
3. Experiments
Three categories of experiments were prepared. In the first, we were interested in the acceptable response time between the moment a user starts looking at a controllable object and the point in time when the object should be identified. Therefore, we asked 22 volunteers (avg. age
In the second group of experiments we verified the computational performance (frames per second, FPS, versus frame size) of different smart glasses to evaluate the potential response time of object detection algorithms. We used Google Glass, Epson Moverio, eGlasses, and a Galaxy Note 2 smartphone as a reference. Three tests were executed for different feature detection/extraction methods (ORB/ORB, FAST/FREAK, FAST/BRIEF): (1) feature detection/extraction time using smart glasses, (2) object detection time (description, matching, and minimal distance classification for
Experiments were performed using two sets of images, one with a resolution of 1280 × 720 and the other with 800 × 480. Each set contained 40 images: 32 resampled images from the reference set described in [27–29] and 8 images of controllable smart objects (e.g., a lamp, a Parrot Jumping Sumo robot, and a TV decoder connected to the e-socket). Since we focused on time analysis, the actual content of the images in this experiment was not very important (apart from the complexity of the features); much more important was the number of descriptors used in the performance analysis. After key point detection, the best key points were sorted and two sets were generated, with the 100 or 500 best key points. For simplicity, the spatial distribution of the points was not analyzed. All further matching/detection experiments were performed with reference to the sets of about 100 and 500 descriptors generated for the chosen key points. The Brute Force-Hamming matching method was used. The smart glasses and the smartphone had different camera preview resolutions (Google Glass 640 × 360, Epson Moverio 640 × 480, eGlasses 1024 × 768 but reduced to 800 × 480 in experiments, and Samsung Galaxy Note 2 1280 × 720), limiting the size of programmatically captured video frames. Therefore, in one experiment, each device captured frames, but predefined images from the reference sets were used for processing. Since the resolution and descriptor size were then known, the results could be compared more reliably.
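The Brute Force-Hamming matching used here can be illustrated with a minimal sketch over binary descriptors (e.g., 32-byte ORB/BRIEF vectors). This is a simplified stand-in for OpenCV's BFMatcher, not the actual experiment code.

```java
// Brute-force nearest-neighbour matching of binary descriptors under
// Hamming distance, as used for ORB/BRIEF/FREAK features.
public class HammingMatcher {
    // Hamming distance between two equal-length binary descriptors.
    public static int hamming(byte[] a, byte[] b) {
        int d = 0;
        for (int i = 0; i < a.length; i++)
            d += Integer.bitCount((a[i] ^ b[i]) & 0xFF);
        return d;
    }

    // Index of the reference descriptor nearest to the query descriptor.
    public static int nearest(byte[] query, byte[][] reference) {
        int best = -1, bestDist = Integer.MAX_VALUE;
        for (int i = 0; i < reference.length; i++) {
            int d = hamming(query, reference[i]);
            if (d < bestDist) { bestDist = d; best = i; }
        }
        return best;
    }
}
```

In the experiments each frame yields up to 500 such descriptors, and minimal-distance classification over the per-object reference sets decides which smart object is in view.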
Additionally, a simple experiment was designed to show the trade-off between runtime and accuracy of the used features. Ten actual and potential smart objects were used in a real-world experiment: 2 driving robots (SO1, SO8), 2 lamps (SO6, SO7), 2 radio sets (SO2, SO3), 2 duster robots (SO4, SO9), a humidifier (SO5), and a home printer (SO0). In Figure 8, pictures of the objects are presented with rendered locations of the key points detected using the ORB algorithm (features were selected inside a manually drawn rectangle during the preprocessing step). Images were acquired from a distance of 1 m (±0.3 m) at a resolution of 800 × 480. Descriptors were calculated using the ORB, FREAK, and BRIEF algorithms. The number of descriptors for each object was as follows: (a) for the ORB algorithm: SO1-500, SO2-500, SO3-500, SO4-500, SO5-326, SO6-401, SO7-500, SO8-497, SO9-409, and SO0-500; (b) for the FREAK algorithm: SO1-477, SO2-500, SO3-485, SO4-476, SO5-264, SO6-463, SO7-443, SO8-490, SO9-328, and SO0-498; (c) for the BRIEF algorithm: SO1-467, SO2-493, SO3-500, SO4-475, SO5-283, SO6-488, SO7-465, SO8-428, SO9-361, and SO0-487.

Smart objects images with key points of ORB/ORB descriptors ((a): SO1, SO2, SO3, SO4, and SO5; (f): SO6, SO7, SO8, SO9, and SO0).
The Android software was modified to measure detection accuracy and detection time. For these experiments only, it was assumed that when the working camera (i.e., continuously capturing frames) is aimed at the object, the operator manually starts the measurement (using a tap event). When the object is detected, the current processing is paused and the related information is presented on the microdisplay: a picture of the detected smart object (to check whether the object was correctly detected), the time period (from the first frame captured after the tap event until detection), and the number of frames (over the same interval). Smart objects were located on the floor (e.g., robots) or on a table (e.g., lamps, printer). The user sat on a swivel chair, observing the objects from a distance similar to that used during generation of the reference set. In this experiment we did not analyze the influence of distance or viewing angle. All previously described feature extraction algorithms were tested. The first tests used 5 objects in the reference set (SO1–SO5) and then 10 objects. For the ORB algorithm, tests were also executed for 20 (10 SO + 10 other) and 40 (10 SO + 30 other) reference pictures. The additional pictures (other) were taken in a room (books on a shelf, etc.), selecting those with 500 features; they were added to simulate a larger reference set for performance tests. In these tests, a false positive result was defined as a wrong object detected within 10 s, and a false negative result as the lack of a proper object detection within 10 s. Five detection attempts were made (325 altogether) for each method (ORB, FREAK, and BRIEF), each object (SO1–SO0), and each configuration (O = 5, O = 10, etc.).
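The precision values reported later follow the standard definition over these TP/FP counts; as a minimal sketch:

```java
// Precision as used for the detection tables: for each object, attempts are
// scored as TP (correct object detected) or FP (wrong object within 10 s).
public class DetectionStats {
    public static double precision(int tp, int fp) {
        return tp + fp == 0 ? 0.0 : (double) tp / (tp + fp);
    }
}
```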
A quantitative analysis of the influence of compression on the quality of features was also performed. Pictures of SO1–SO5 from the reference sets were used to produce compressed versions of the original pictures using 4 different compression/quality factors: 90%, 80%, 70%, and 10%.
In the third group of experiments, we verified the use of the eye-tracking module for interaction between a user and the GUI presented on the near-to-eye display.
The GUI for the tests was constructed using 16 regularly distributed widgets. Each widget was designed to register and notify the user of its current state. This included the “onGazeOver” state (state 1), which corresponds to the event when the gaze is focused on the current widget, and the “onClick” state (state 2). The “onClick” event is fired when the “onGazeOver” state has lasted at least the configured dwell time.

Different layouts of the GUI used in experiments. The active areas are marked using yellow color: (a)
Our intention was to evaluate these designs in terms of the rate of correct selections of GUI elements and to determine the optimal dwell time. To do so, the software randomly highlighted one of the active areas, and the user's task was to select and confirm that element by gaze. A new area was highlighted every time a confirmation signal was received, whether or not the user had focused on the right element. The test finished after 100 confirmation signals had been received. Three dwell times were tested (500 ms, 1000 ms, and 1500 ms) for each prepared layout. Each test was repeated five times, preceded each time by the calibration procedure. Three parameters were recorded for each event: the selected element's name, its ID, and the result of the confirmation (HIT or MISS).
4. Results
The implemented prototype of the system was verified using experiments in smart home laboratories: iHomeLab in Switzerland (Figure 2(b)) and the AAL laboratory in Poland.
In Figure 10(a), an example of an object in a typical user setting is presented. Another example, the graphical user interface generated for a recognized augmented object connected to the power socket, is presented in Figure 10(b). When the smart object (a TV decoder) is identified, the GUI is automatically presented. In this case the user can turn the power on or off and observe the related power consumption parameters.

(a) An example of the object in typical user settings. (b) An example of the GUI for the power socket control.
In Figure 11, an example image from the robot interface is presented. The robot is controlled using the accelerometer of the smart glasses: head up: forward; head down: backward; head left: turn left; head right: turn right; head shake: jump. Additionally, jump control is possible using the magnetometer and a magnetic ring. When the magnetic ring (on the finger/hand) approaches the eGlasses (magnetometer), the change in the magnetic field is detected and the related event is fired. The experiments were performed in the context of monitoring possible falls of elderly persons.
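The head-gesture mapping above can be sketched as follows. The thresholds, command names, and the pitch/roll interface are hypothetical illustrations; the actual implementation uses the producers' robot libraries and raw accelerometer events.

```java
// Hypothetical mapping from head orientation (pitch/roll in degrees, derived
// from the glasses' accelerometer) to robot drive commands.
public class HeadControl {
    public static String command(double pitchDeg, double rollDeg) {
        if (pitchDeg > 15)  return "FORWARD";    // head up
        if (pitchDeg < -15) return "BACKWARD";   // head down
        if (rollDeg < -15)  return "TURN_LEFT";  // head tilted left
        if (rollDeg > 15)   return "TURN_RIGHT"; // head tilted right
        return "STOP";                           // neutral head position
    }
}
```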

Remote control of the robot using smart glasses: the image is transmitted from the robot to smart glasses. The possible application includes the use of robot for remote monitoring of dangerous events, for example, fall detection.
The Philips Hue lamps were controlled based on recognition of the lamp identifiers encoded in QR codes attached to the lamps.
In this paper we mainly focus on the feasibility of image processing for object detection using smart glasses; the related results are therefore presented below.
4.1. Acceptable Response Time
The results of this experiment were interesting. The average acceptable response time was 2.62 s ± 1.6 s. A difference between men and women was observed: the result for females was 3.12 s ± 1.91 s, while for males it was 2.11 s ± 0.99 s. In particular, middle-aged women (>45) accepted longer response times. Both women and men accepted longer responses for the computer screen, probably because of its active content. The shortest acceptable response time was 0.87 s (man, 32 y) and the longest was 6.92 s (woman, 48 y). The average accuracy of time measurement by the operator was 0.103 s ± 0.026 s (the average response for a sequence of 10 fast START/STOP commands).
4.2. Data Processing by Smart Glasses
First, the number of frames per second was calculated for each device without any frame processing tasks. The value was computed after capturing 110 frames in three trials (the first 10 frames were skipped, and 100 was then divided by the time between the first and the last counted frame). The results are as follows: Google Glass: 33.11 FPS (640 × 360; 7.28 Mpx/s); Epson Moverio: 25.0 FPS (640 × 480; 7.32 Mpx/s); eGlasses: 29.89 FPS (800 × 480; 10.95 Mpx/s); and Samsung Note 2: 16.6 FPS (1280 × 720; 14.59 Mpx/s).
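The FPS measurement described above can be sketched as follows. Note that this sketch uses the slightly stricter form in which N counted frames span N − 1 inter-frame intervals; variable names are illustrative.

```java
// Sketch of the FPS measurement: skip the first frames (camera warm-up),
// then divide the counted frames by the elapsed time between the first
// and last counted frame.
public class FpsMeter {
    // timestampsMs holds one capture timestamp per frame, oldest first.
    public static double fps(long[] timestampsMs, int skip, int count) {
        long first = timestampsMs[skip];
        long last = timestampsMs[skip + count - 1];
        // count frames span (count - 1) inter-frame intervals
        return (count - 1) * 1000.0 / (last - first);
    }
}
```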
Frame processing times for key point detection and feature description are presented in Table 1 (simulated frame size 1280 × 720).
Values of FPS for different algorithms and devices (EG: eGlasses; N2: Samsung Note 2; EM: Epson Moverio; GG: Google Glass). The set of key points was reduced to 500. Frame resolution: 1280 × 720.
For smaller frame resolutions, performance results were naturally better. For example, on the eGlasses, processing images with a resolution of 800 × 480 using the FAST/BRIEF descriptor ran at 15.03 FPS (but for ORB only 4.24 FPS). The results in Table 2 were obtained for descriptor matching and actual object detection. Below, only results for the eGlasses (processing using algorithms implemented via JNI) are presented.
FPS for object detection using different algorithms, different number of reference objects (O), and different size of features (F) in the representation of each object (res. 800 × 480).
For a frame resolution of 1280 × 720, the results were up to two times worse. Another approach to object detection in an IoT environment using smart glasses could use the bridge/gateway for the actual data processing. Therefore, we tested the transmission rate for frames acquired by the smart glasses. First, the transmission speed was estimated by sending data packets of different sizes during each
The results of the experiment with 10 smart objects (SO1–SO0) are presented in Table 3. In each cell of the table, the calculated precision value (over 5 attempts, precision = TP/(TP + FP)) and the average detection time in milliseconds are presented.
The results of the experiment with 10 smart objects (SO1–SO0).
In the entire experiment, 3 FP and 1 FN results were observed for the ORB algorithm, 5 FP and 1 FN for the BRIEF algorithm, and 6 FP and 1 FN for the FREAK algorithm. The average number of frames needed to detect the object (excluding FP and FN results) was
In Table 4, some results of the quantitative analysis of the influence of compression on the quality of features are presented.
Results of the quantitative analysis of the influence of compression on the quality of features.
In general, for images compressed with the 90% quality factor, about 50% of all descriptors are almost identical to those of the original image (distance < 3), and ~80% are very similar (distance < 6). For images compressed with the 80% quality factor, about 33% of all descriptors were almost identical to those calculated for the original image (distance < 3), and ~71% were very similar (distance < 6).
In Figure 12, some examples of matched features (distance < 6) between the compressed picture (left) and the original one (right) are presented for the SO1 object. Images with two different compression factors are shown: quality 90% (Figure 12(a)) and quality 10% (Figure 12(b)).

Some examples of the matched features (distance < 6) between the compressed picture (left) and the original one (right) for two different compression factors: (a) quality 90% and (b) quality 10%.
4.3. Interaction Using Eye-Tracking
The possible use of the eye-tracker for interaction with the GUI of the smart glasses was verified using the experiments described in Section 3. The first test was performed with the dwell time set to 1500 ms. The user was asked to calibrate the eye-tracker, and then the first designed layout (
The accuracy of correctly confirmed selections obtained for dwell time = 1500 ms.
The accuracy of correctly confirmed selections obtained for dwell time = 1000 ms.
The accuracy of correctly confirmed selections obtained for dwell time = 500 ms.
A pseudorandom generator was used to highlight widgets (user stimulation) for each test layout. Figure 13 presents the distribution of the generated events in the tested layouts, that is, how often each active field was highlighted for a particular layout.

The distribution of randomly generated events in the tested layouts.
5. Discussion and Conclusions
The average acceptable response time for the volunteers was about 2.6 s. However, the experiments showed that the acceptable time differs between participants. Discussions with participants after the experiments suggested that such acceptance levels may depend on the physical and mental state of a person, the hour of the day, and so forth. Interestingly, many participants underlined that they did not expect the fastest possible response from the system, but that the response time should lie within a preferred range.
In the experiments with object detection algorithms, standard OpenCV implementations of key point detection and feature extraction/description algorithms were used. These implementations sometimes differ from those provided by the authors of the algorithms (which is why, for example, we did not use the BRISK algorithm).
The interaction with GUIs of smart objects presented on the smart glasses display was performed using a mouse wirelessly connected to the eGlasses. Additionally, the eye-tracker was tested.
In this study we did not primarily focus on evaluating the precision and recall of particular object detection algorithms; such evaluations can be found elsewhere (e.g., [28, 30]). However, we performed simple experiments to verify the methodology under optimistic assumptions: a short distance to the objects (up to 1.3 m), one observation angle, and similar lighting conditions. We tested the system on ten smart objects. The results showed good precision for the tested methods, and the analyzed processing times were acceptable (response time < 2.6 s). The worst results were obtained for the FREAK algorithm (6 false positives, 1 false negative, and the longest processing times). Excluding FP and FN cases, objects were usually detected within the first 5 frames; only in some cases were more frames required (up to 21). For all 325 trials, only 3 FN cases were recorded. A false negative result means that the object was not detected within 10 s. Such cases (FN, or many frames to process) were sometimes related to the camera's automatic settings (auto zoom, auto exposure). In this experiment we used preprocessed single images of the smart objects to calculate descriptors and build the reference set. However, the relatively good processing times suggest that more images per smart object could be used, especially when the number of smart objects is limited. In conclusion, when smart glasses are used for the detection and control of a small number of smart objects, it is possible to process frames at a rate that allows object detection below the average acceptable response times specified by almost all participants of the initial study.
In general, the detection accuracy results would not be as good if we took into account different observation angles, different distances, changing illumination, camera properties, and so forth. These are well-known limitations of object detection methods based on the visual appearance of objects. However, this requires a larger study and will be the subject of a separate analysis.
For more objects to be detected (e.g., in a large building), a proper bridge/gateway can provide the object detection service. An important goal of this paper was to analyze the performance of two setups for object detection: on the smart glasses themselves and using an external service (a gateway). Standalone smart glasses (i.e., glasses not connected by a cable to a Raspberry Pi or another external computer) are technologically limited by the acceptable thermal emission of the electronics, the limited power capacity of batteries, and so forth. These technological features constrain the choice of processors. If more processing is required (e.g., for more smart objects), the object detection procedure should be performed mainly using external services. However, the question is how to efficiently provide data for such external analysis. One possibility is to process each frame on the glasses to compute descriptors and transfer those descriptors to the gateway for the further object detection steps (e.g., feature matching). Another possibility is to transfer each frame to the gateway. In this paper we performed many experiments to analyze this problem, comparing different algorithms, different image representation methods (lossless and lossy compression), and different color models (e.g., using only the luminance channel).
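To make the tradeoff between the two setups concrete, a back-of-the-envelope payload comparison can be sketched; the descriptor size (32 bytes, as in ORB-style binary descriptors), the keypoint count, the frame resolution, and the compression ratio below are all illustrative assumptions, not measurements from our experiments:

```python
# Rough per-frame payload estimate for the two gateway strategies
# (all numeric defaults are illustrative assumptions).

def descriptor_payload(n_keypoints: int, desc_bytes: int = 32,
                       coord_bytes: int = 8) -> int:
    """Bytes to send when descriptors (plus x,y keypoint coordinates)
    are computed on the glasses and transferred to the gateway."""
    return n_keypoints * (desc_bytes + coord_bytes)

def frame_payload(width: int, height: int, bytes_per_px: int = 3,
                  compression_ratio: float = 20.0) -> int:
    """Approximate bytes when the (lossy-compressed) frame itself is
    sent and all processing happens at the gateway."""
    return int(width * height * bytes_per_px / compression_ratio)

# Example: 500 keypoints vs a lossy-compressed 640x480 RGB frame.
print(descriptor_payload(500))   # -> 20000
print(frame_payload(640, 480))   # -> 46080
```

Under these assumptions sending descriptors roughly halves the payload, but it shifts the detection/extraction load back onto the glasses; which strategy wins depends on the achievable compression ratio, the number of keypoints, and the processor of the particular glasses.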
There are also other practical aspects important for the future use of such interaction methods. For example, if more than one smart object is present in the camera's field of view, only one object (the nearest neighbor) will be detected. Future solutions could offer a menu with all objects detected in the current view, from which the user could choose the proper one. Another possibility is to use the eye-tracker to select the detected object (e.g., select the object for which the gaze is focused inside the rectangle containing that object). A similar gaze-based procedure can be used if more than one similar object is present in the camera view (e.g., two identical lamps).
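The gaze-based disambiguation described above reduces to a point-in-rectangle test over the detected bounding boxes; the rectangle format (x, y, w, h), the object identifiers, and the sample coordinates in this sketch are hypothetical:

```python
# Sketch of gaze-based disambiguation: among the detected objects,
# pick the one whose bounding rectangle contains the current gaze point.
# Rectangle format (x, y, w, h) and all sample data are hypothetical.

def object_at_gaze(gaze, detections):
    """Return the id of the detected object whose bounding box contains
    the gaze point (x, y), or None if the gaze is on no object."""
    gx, gy = gaze
    for obj_id, (x, y, w, h) in detections.items():
        if x <= gx <= x + w and y <= gy <= y + h:
            return obj_id
    return None

# Two identical lamps detected in the current camera view:
detections = {"lamp-left": (40, 60, 100, 120),
              "lamp-right": (300, 60, 100, 120)}
print(object_at_gaze((350, 100), detections))   # -> 'lamp-right'
print(object_at_gaze((10, 10), detections))     # -> None
```

A real implementation would also have to map the eye-tracker's pupil coordinates into the camera frame, which is the calibration problem discussed next.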
The gaze-based interface for interaction with the near-to-eye display was tested for possible use in simple gaze-based actions. The design of the interface, with the camera and the display built into the frame of the smart glasses, potentially excludes the negative influence of head movements, because the user's head and the utilized hardware remain in a constant geometrical relation. In this study we assumed that the smart glasses are well fixed on the head and that there are no rapid head movements that could change the position of the smart glasses relative to the user's head. The main challenge was accurate pupil detection, which is essential for a reliable transformation from the pupil position to an accurate and precise “cursor” position on the near-to-eye display. The eye-tracking algorithm used here was previously described in [26], and a study dedicated to its accuracy with respect to a near-to-eye display was presented in [24]. Here we analyzed the possible use of eye-movement-related events for interaction with a simple GUI of the smart object. Each tested layout contained eight active GUI components covering approximately half of the available display size. This approach simulates the situation in which a user observes the smart object and can interact with it by sending commands through GUI widgets/buttons displayed around or next to the object icon (or preview). Different confirmation actions can be modeled using the eye-tracking module; examples include an “eye blink” or a “gaze fixation” held for a specified dwell time.
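A dwell-time confirmation of the kind mentioned above can be sketched as a simple pass over the stream of gaze samples; the timestamps, widget identifiers, and the one-second threshold are illustrative assumptions, not parameters from our study:

```python
# Sketch of a dwell-time confirmation event: a GUI widget is "clicked"
# when the gaze stays on the same component for at least dwell_s seconds.
# Sample data, widget ids, and the threshold are illustrative.

def dwell_confirmations(samples, dwell_s=1.0):
    """samples: list of (timestamp_seconds, widget_id or None).
    Returns widget ids confirmed by continuous fixation >= dwell_s."""
    confirmed = []
    start_t, current = None, None
    for t, widget in samples:
        if widget != current:                 # gaze moved to a new target
            start_t, current = t, widget
        elif widget is not None and t - start_t >= dwell_s:
            confirmed.append(widget)
            start_t, current = None, None     # require a fresh fixation
    return confirmed

gaze_log = [(0.0, "btn_on"), (0.4, "btn_on"), (1.1, "btn_on"),
            (1.5, None), (2.0, "btn_off"), (2.3, "btn_off")]
print(dwell_confirmations(gaze_log))          # -> ['btn_on']
```

Resetting the state after each confirmation forces the user to look away and back before the same widget can fire again, which avoids repeated accidental activations during a long fixation.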
In [31] the authors identified two distinct styles of smart object sensing: object-centric and human-centric. Smart objects of the object-centric type are deployed in the real world and can detect changes in their physical status and/or in the surrounding environment. Smart objects of the second category serve as personal companions; smart glasses and smart watches belong to this category [32]. However, as shown in this paper, smart glasses can also cooperate with smart objects located in the user's neighborhood. Using discovery services and a shared protocol, smart glasses can potentially connect to different networks or objects, especially when the user changes his or her location. This opens many interesting application opportunities in smart homes, intelligent buildings, smart cities, and so forth.
