I’d expected a near-empty Berkman Center for the first luncheon talk of the year, with many of my fellow fellows heading north to New Hampshire to witness the sausage-making that is the democratic process. But Dr. Deb Roy of the MIT Media Lab packed the house with a discussion of his unusual experiment, the Human Speechome Project.
Berkman fellow and MIT professor Judith Donath acknowledged the election obsession at Berkman by suggesting that all the political chatter in New Hampshire should force us to think about the roots of speech. Roy’s work focuses on how people learn speech, and suggests implications for mechanical approaches to speech. Donath notes that there’s an open question in cognitive science and robotics as to whether consciousness is grounded in speech, and whether synthetic speech requires robots to be grounded in a physical environment.
What Roy has undertaken is an unprecedented experiment in recording the speech development of his son. He introduces the experiment in abstract terms, offering a goal, approach, differentiators and potential impact:
Goal – Advance the understanding of how children acquire language in natural contexts, rather than observing in the lab
Approach – “longitudinal, ultra-dense, in-vivo recording” coupled with data mining and interaction analysis
Differentiators – most projects try to minimize observer effects, make generalizations from very small data sets, possibly 1-2 hours of recording of mother/child interaction
Impact – understand child development, behavioral phenotyping, possibly leading to the detection and treatment of developmental disorders. The technology also has implications for video scrapbooking, parenting aids and retail behavior.
These abstract statements don’t do a very good job of capturing the audacity of the experiment. Roy has run 3000 feet of cable in his suburban house, installing a fisheye-lens camera and a powerful microphone in the ceiling of each room. Installed by each light switch is a small handheld controller with four software “buttons” – the controller can turn the cameras and microphones on or off. It includes an “oops!” button, an “anti-TiVo” that lets you permanently erase the previous few minutes. The final button is a “diary” button, allowing parents to annotate a special moment in a recording.
Each camera can be closed with a privacy shutter… but a daylong montage of data from nine cameras shows that Roy’s family chooses to live life under surveillance a great deal of the time – we watch the sun move through rooms, the quality of light change from daylight to incandescent light. He tells us that he’s recorded 28 months of data, including 80,000 hours of video and 120,000 hours of audio on a 200 terabyte disk array. (Seagate is one of his sponsors, and they’re learning a great deal about video encoding from his work.)
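Those figures imply a surprisingly modest average data rate. A quick back-of-the-envelope calculation (my own arithmetic, not Roy’s; it treats the 200 terabytes as decimal TB and averages over all concurrent camera and microphone streams):

```python
# Sanity check of the recording figures: 80,000 hours of video plus
# 120,000 hours of audio stored on a 200 TB disk array.
video_hours = 80_000
audio_hours = 120_000
total_tb = 200  # assuming decimal terabytes (10**12 bytes)

total_hours = video_hours + audio_hours
tb_per_1000_hours = total_tb / total_hours * 1000

# Average sustained rate across all streams, in megabits per second
avg_mbps = total_tb * 1e12 * 8 / (total_hours * 3600) / 1e6

print(tb_per_1000_hours)   # 1.0 TB per thousand recorded hours
print(round(avg_mbps, 1))  # roughly 2.2 Mbit/s averaged across streams
```

That works out to about a terabyte per thousand recorded hours, which helps explain Seagate’s interest in the encoding side of the project.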
The immensity of this data raises some challenges for data analysis. Roy has built a tool called “Total Recall” which visualizes the activity in the house via spectrograms of audio, and “space-time worms” in video. The video analysis is particularly fascinating – the software looks at frames of video and isolates movement between frames. It preserves the areas of movement, captures the moving figure, and overlays the moving areas over a blank field over time. The results are “worms” of video snapshots that show movement through a space over time. The software allows the rapid overview of days or weeks of data at a time, seeking moments of movement or activity.
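The movement-isolation step Roy describes sounds like classic frame differencing: compare consecutive frames, keep only the pixels that changed, and accumulate them onto a blank field. A minimal toy sketch in plain Python (my own illustration of the general technique, not the actual Total Recall code):

```python
def motion_mask(prev, curr, threshold=25):
    """Mark pixels whose intensity changed by more than `threshold`
    between two grayscale frames (lists of rows of int values)."""
    return [[abs(c - p) > threshold for p, c in zip(prev_row, curr_row)]
            for prev_row, curr_row in zip(prev, curr)]

def overlay_worm(frames, threshold=25):
    """Accumulate moving regions from a frame sequence onto a blank
    field - a crude version of the 'space-time worm' idea: only pixels
    that moved are copied, so the composite traces motion over time."""
    h, w = len(frames[0]), len(frames[0][0])
    composite = [[0] * w for _ in range(h)]
    for prev, curr in zip(frames, frames[1:]):
        mask = motion_mask(prev, curr, threshold)
        for y in range(h):
            for x in range(w):
                if mask[y][x]:
                    composite[y][x] = max(composite[y][x], curr[y][x])
    return composite

# Tiny synthetic example: a bright 2x2 "figure" crosses a dark 8x8 room.
frames = [[[0] * 8 for _ in range(8)] for _ in range(4)]
for t in range(4):
    for y in (3, 4):
        for x in (t * 2, t * 2 + 1):
            frames[t][y][x] = 200

worm = overlay_worm(frames)
moving_pixels = sum(v > 0 for row in worm for v in row)
print(moving_pixels)  # 12: the figure's trail across frames 1-3
```

A real system would work on camera frames rather than toy arrays, but the core idea is the same: static furniture and walls cancel out between frames, leaving only the moving figures to be overlaid.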
To get useful behavioral data from this raw information, Roy needs some other pieces of information. A system called BlitzScribe allows for rapid transcription of the 16 million words spoken in his house in the past 2+ years. Using very efficient, offline human transcription, the entire data set has been transcribed for $120,000, which is an amazingly modest figure.
(Roy says that his first transcriber was also his nanny, which gave her a particular facility in transcribing his son’s utterances. It’s unrealistic, he tells us, to expect automated speech transcription to do this work. He tried an experiment using adult-to-adult dialog in his house, which human transcribers were able to notate with 90%+ accuracy. He fed the data into the top speech-recognition engine – a system used by the US intelligence community – and got single-digit success rates.)
Video is more useful for purposes of analyzing speech acquisition when it’s got additional information about the attention and focus of everyone in each frame. In post-processing, Roy’s team adds information on the direction of a subject’s gaze, helping determine interactions between people. Roy is especially interested in moments of joint attention, where a parent and child are looking at the same object, which is likely a key part of language acquisition. He notes that autistic children often don’t display this joint attention, and this system may help serve as an early detection system for autism.
The amount of data generated by the system already matches the size of one of the most important data sets used in language acquisition research, the CHILDES data corpus, assembled by hundreds of research teams. Roy contends that to truly understand speech acquisition, we need huge amounts of fine-grained data over long periods of time. Using his data set, he shows us his son learning to say “ball” and “water” over nine month periods – it’s quite clear why this would be fascinating data for child speech researchers.
While the main audience for this work is scientists who study child development and autism, the technology developed has some interesting implications for retailers as well. Imagine tracking the behavior of all customers in a store via security camera. You could analyze their head movement and see whether they’re seeing the displays you want them to see, the products you want to sell, etc. Unsurprisingly, retailers are some of the supporters of Roy’s research.
A possible future direction for research is building a much simpler technical implementation of his tools. He shows us a floor lamp that holds recording gear in the base, and has the camera and microphone in the head – this device might be deployable in multiple rooms of a house and might let parents record their child’s development and produce useful data for doctors or researchers.
The question and answer session had at least two threads: questions on privacy, and questions on the utility of the research. On the utility side, it’s clear that one limitation of the data set is that it covers only one child in one specific family situation. Roy observes that large, long-term datasets have been rare throughout the history of child development research. Diaries kept by researchers like Piaget are generally more useful for raising questions than for drawing conclusions – conclusions come from much broader sets of experiments.
Questions about privacy were all over the map. I suggested that Roy’s tools might be very useful to the Department of Homeland Security, as they might allow analysis of movement through security cameras. Roy pointed out that he’s got no money from DHS, that other groups are working on video analysis via the VACE (Video Analysis and Content Extraction) program, and that his tools are not real-time, making them less useful for security applications.
Other people were more concerned with privacy issues in the home – do you make people sign a release when they enter your surveillance home? Or just warn them that they’re being recorded? What are the implications of taping interactions of a family that might later go through a divorce? What about the rights of a child being taped?
One of the most interesting questions came at the end of the talk, when Lewis Hyde asked, “What have you learned about language development from this experiment?” Roy’s answer: “Nothing.” The experiment is in the data collection phase at present – the hope is that the analysis would follow, conducted either by Roy’s team or by other researchers.
Excellent piece by Jonathon Keats on Roy’s work.
I also find myself wondering what Hasan Elahi, who is documenting his life moment to moment, as a complex experiment on surveillance, would make of this project.