Here is an interesting observation: Ask a child to describe what she sees around her and she will immediately tell you something like “I see a tall man talking to a woman in the driveway in front of a yellow house”. The same task is beyond current computer technology – specifically, feeding a “raw” video clip to a machine and getting back (reasonably quickly) a short textual description of what happens in the clip, is currently pretty much impossible. Images and video are rich sources of information consisting of many different objects (with different shapes and colors) with some relationship to each other, in some environment, possibly moving (in the case of video), etc. – there is a reason that a picture is worth a thousand words. Analyzing images and video to facilitate automatic insights and associated decisions is still incredibly difficult (even offline; doing it in real time is much harder). A further complication is the fact that most of the visual content we view is actually a 2D projection of the real (3D) world. Remarkably, humans are really good at these types of tasks, so one approach could be “Hey, let’s just copy the human visual system” – we’ll get back to this later.
So, what can we do in the area of Video Analytics or Video Content Analysis? Actually, quite a bit (but nothing like what you may have seen in some popular movies). Here are some examples (certainly not an exhaustive list):
- Driven by security and surveillance use cases, many “suspicious” behaviors can be recognized automatically (i.e., with no human in the loop) such as an object that has been left behind, someone crossing a virtual line, people counting, loitering and many others (but probably no more than about 20). Similarly, in the vehicular traffic area, behaviors such as stopped vehicle, someone driving on the shoulder, etc., can be identified.
- Some very specific objects can be recognized – faces, vehicles, license plates and probably a few more, although some only under limited conditions (controlled lighting, controlled pose, minimal occlusions, etc.).
- Specific objects can be tracked within a single camera’s field of view (tracking across multiple cameras, even when there is overlap in successive cameras, is very difficult).
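To make one of the behaviors above a bit more concrete, here is a minimal sketch of a "virtual line crossing" check. It assumes an upstream tracker already gives us an object's centroid in each frame (that is the hard part, and it is not shown here), and it treats the virtual line as infinite for simplicity. The function names are made up for illustration and do not come from any particular product.

```python
from typing import Iterable, Tuple

Point = Tuple[float, float]


def side_of_line(a: Point, b: Point, p: Point) -> float:
    """Signed area test: positive if p is on one side of the line a->b,
    negative if it is on the other side, zero if p is exactly on the line."""
    return (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0])


def count_line_crossings(track: Iterable[Point], a: Point, b: Point) -> int:
    """Count how many times a tracked centroid switches sides of the
    (infinite) virtual line through a and b.

    `track` is assumed to be the per-frame centroid positions of one object,
    as produced by some tracker that is outside the scope of this sketch.
    """
    crossings = 0
    previous_side = None
    for point in track:
        side = side_of_line(a, b, point)
        if side == 0:
            # Exactly on the line: wait for the next frame to decide.
            continue
        if previous_side is not None and (side > 0) != (previous_side > 0):
            crossings += 1
        previous_side = side
    return crossings
```

A real product would also have to restrict the test to a line segment, handle crossing direction and cope with noisy or broken tracks, which is where most of the engineering effort goes.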
If your interest is in some specific items on this limited list – no problem, you can buy them from numerous vendors. However, if you are looking for a different behavior or a different object, you will need some computer vision people to develop a new analytic – the generic object recognizer or the generic “tell me if anything unusual happens in this area” do not exist yet.
But don’t despair – Machine Learning approaches are starting to appear in some commercial products. Basically, the machine is trained, for example, on video that represents normal vehicular traffic flow, and once the learning phase is over, the machine can indicate that something abnormal has happened, such as a traffic slowdown due to some sort of incident further down the road. When I say “machine”, by the way, I mean the computer that ingests the video stream and runs the anomaly detection algorithm (which could, in principle, run in the camera itself or very near to it; see my previous post on Edge Analytics).
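Here is a toy sketch of that "learn what normal looks like, then flag deviations" idea. It assumes the video has already been reduced to a single number per time window (say, average vehicle speed), and it uses a plain z-score against the learned mean and standard deviation. Commercial products will use far richer features and models; the class name and the threshold of 3 are my own illustrative choices.

```python
import statistics
from typing import Sequence


class SpeedAnomalyDetector:
    """Toy anomaly detector: learns mean/std of 'normal' speeds, then flags
    values that deviate by more than `threshold` standard deviations."""

    def __init__(self, threshold: float = 3.0) -> None:
        self.threshold = threshold
        self.mean = None
        self.std = None

    def fit(self, normal_speeds: Sequence[float]) -> None:
        """Learning phase: needs at least two samples of normal traffic."""
        self.mean = statistics.fmean(normal_speeds)
        self.std = statistics.stdev(normal_speeds)

    def is_anomalous(self, speed: float) -> bool:
        """Detection phase: compare a new measurement to the learned model."""
        if self.mean is None or self.std is None:
            raise RuntimeError("call fit() on normal data first")
        if self.std == 0:
            # Training data had no variation: anything different is unusual.
            return speed != self.mean
        z_score = abs(speed - self.mean) / self.std
        return z_score > self.threshold
```

For example, after fitting on speeds that hover around 60, a reading of 15 (traffic crawling because of an incident) is flagged, while 61 is not.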
At this point, you are probably saying, “so what about copying the human visual system?” Well, it turns out the human visual system (HVS) is quite complex and we have not figured out how all of it works yet. A lot of progress has been made over the years and a lot of good research is going on (for example, work at MIT, Penn State and others). One of the exciting developments in this area is Deep Learning (which really deserves its own post…), a Machine Learning approach that does really well on tasks where humans are usually better than machines – for example, object recognition in images. DL usually requires a lot of computational resources, but that is slowly evaporating as a real hurdle, and as a result there have been some really exciting results! Google has recently shown an image with a caption that was created automatically – this is actually getting us closer to the target of automatic video description mentioned at the beginning of this post. Another one comes from Microsoft, where they managed to outdo humans in an image classification task.
As an aside, some of the world’s top academics in the area of Deep Learning have joined Google, Facebook and Baidu in the last year or two – that should tell you something.
Let me also make the following point – humans are equipped with visual hardware (eyes) that can see in 3D. A lot of stuff gets easier when you also have depth information (e.g., which of two visible objects is in front and which is in back) and there actually are cameras that can record depth information – from stereoscopic cameras to Kinect-like sensors to time-of-flight cameras. In fact, if you have a number of cameras covering the same area from different vantage points, you can do on-the-fly 3D reconstruction (check out FreeD for some really cool clips). Now you can run some advanced video analytics on real-time 3D streams to do things that were simply not possible before (or required ridiculous amounts of computational power).
A final thought –
The estimated number of surveillance cameras in the world is about 210 million (obviously, not counting consumer cameras, smart phone cameras, etc.), producing an obscene amount of stored video, most of which has never been viewed by anyone and most likely never will be – there is just too much of it. Only advanced video analytics will be able “to watch the video for us”, letting us know when there is something interesting there.