Kicking neural network design automation into high gear

Algorithm designs optimized machine-learning models up to 200 times faster than traditional methods.

A new area in artificial intelligence involves using algorithms to automatically design machine-learning systems known as neural networks, which are more accurate and efficient than those developed by human engineers. But this so-called neural architecture search (NAS) technique is computationally expensive.

A state-of-the-art NAS algorithm recently developed by Google to run on a squad of graphics processing units (GPUs) took 48,000 GPU hours to produce a single convolutional neural network, which is used for image classification and detection tasks. Google has the wherewithal to run hundreds of GPUs and other specialized hardware in parallel, but that’s out of reach for many others.

In a paper being presented at the International Conference on Learning Representations in May, MIT researchers describe an NAS algorithm that can directly learn specialized convolutional neural networks (CNNs) for target hardware platforms — when run on a massive image dataset — in only 200 GPU hours, which could enable far broader use of these types of algorithms.

Resource-strapped researchers and companies could benefit from the time- and cost-saving algorithm, the researchers say. The broad goal is “to democratize AI,” says co-author Song Han, an assistant professor of electrical engineering and computer science and a researcher in the Microsystems Technology Laboratories at MIT. “We want to enable both AI experts and nonexperts to efficiently design neural network architectures with a push-button solution that runs fast on a specific hardware.”

Han adds that such NAS algorithms will never replace human engineers. “The aim is to offload the repetitive and tedious work that comes with designing and refining neural network architectures,” says Han, who is joined on the paper by two researchers in his group, Han Cai and Ligeng Zhu.

“Path-level” binarization and pruning

In their work, the researchers developed ways to delete unnecessary neural network design components, to cut computing times and use only a fraction of hardware memory to run a NAS algorithm. An additional innovation ensures each outputted CNN runs more efficiently on specific hardware platforms — CPUs, GPUs, and mobile devices — than those designed by traditional approaches. In tests, the researchers’ CNNs ran 1.8 times faster on a mobile phone than traditional gold-standard models with similar accuracy.

A CNN’s architecture consists of layers of computation with adjustable parameters, called “filters,” and the possible connections between those filters. Filters process image pixels in grids of squares — such as 3x3, 5x5, or 7x7 pixels — with each filter covering one such square at a time. The filters essentially move across the image and combine all the colors of their covered grid of pixels into a single pixel. Different layers may have different-sized filters, and connect to share data in different ways. The output is a condensed image — built from the combined information of all the filters — that can be more easily analyzed by a computer.
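As a rough illustration of that sliding-and-combining operation (not the researchers’ code), the sketch below applies a single hand-written 3x3 filter to a toy grayscale image; the filter values and the image are invented for the example.

```python
import numpy as np

def convolve2d(image, kernel):
    """Slide `kernel` across `image`, condensing each covered patch into one value."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    output = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i:i + kh, j:j + kw]      # the square of pixels the filter covers
            output[i, j] = np.sum(patch * kernel)  # combine them into a single value
    return output

image = np.random.rand(28, 28)             # toy 28x28 grayscale image
filter_3x3 = np.array([[-1.0, 0.0, 1.0],
                       [-1.0, 0.0, 1.0],
                       [-1.0, 0.0, 1.0]])  # one hand-picked 3x3 filter
print(convolve2d(image, filter_3x3).shape)  # (26, 26): a condensed feature map
```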

Because the number of possible architectures to choose from — called the “search space” — is so large, applying NAS to create a neural network on massive image datasets is computationally prohibitive. Engineers typically run NAS on smaller proxy datasets and transfer their learned CNN architectures to the target task. This generalization method reduces the model’s accuracy, however. Moreover, the same outputted architecture also is applied to all hardware platforms, which leads to efficiency issues.

The researchers trained and tested their new NAS algorithm on an image classification task directly in the ImageNet dataset, which contains millions of images in a thousand classes. They first created a search space that contains all possible candidate CNN “paths” — meaning how the layers and filters connect to process the data. This gives the NAS algorithm free rein to find an optimal architecture.

This would typically mean all possible paths must be stored in memory, which would exceed GPU memory limits. To address this, the researchers leverage a technique called “path-level binarization,” which stores only one sampled path at a time and saves an order of magnitude in memory consumption. They combine this binarization with “path-level pruning,” a technique that traditionally learns which “neurons” in a neural network can be deleted without affecting the output. Instead of discarding neurons, however, the researchers’ NAS algorithm prunes entire paths, which completely changes the neural network’s architecture.

In training, all paths are initially given the same probability for selection. The algorithm then traces the paths — storing only one at a time — to note the accuracy and loss (a numerical penalty assigned for incorrect predictions) of their outputs. It then adjusts the probabilities of the paths to optimize both accuracy and efficiency. In the end, the algorithm prunes away all the low-probability paths and keeps only the path with the highest probability — which is the final CNN architecture.
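A schematic sketch of that training loop appears below. Holding a single sampled path in memory per step is the binarization idea; nudging path probabilities toward paths that score well on both accuracy and latency, then keeping only the highest-probability path, is the pruning idea. The update rule and the evaluate_path() stand-in are invented for illustration and are not the paper’s actual procedure.

```python
import numpy as np

num_paths = 8
logits = np.zeros(num_paths)                     # all paths start out equally likely

def path_probs(logits):
    e = np.exp(logits - logits.max())
    return e / e.sum()

def evaluate_path(path_id):
    """Placeholder: return (accuracy, latency in ms) for the sampled path."""
    rng = np.random.default_rng(path_id)
    return rng.uniform(0.6, 0.9), rng.uniform(5.0, 20.0)

for step in range(200):
    probs = path_probs(logits)
    path = np.random.choice(num_paths, p=probs)  # binarization: hold ONE path in memory
    accuracy, latency_ms = evaluate_path(path)
    reward = accuracy - 0.01 * latency_ms        # favor accuracy AND efficiency
    logits[path] += 0.1 * reward                 # nudge this path's probability

# Path-level pruning: discard everything except the highest-probability path.
best_path = int(np.argmax(path_probs(logits)))
print("selected path:", best_path)
```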

Hardware-aware

Another key innovation was making the NAS algorithm “hardware-aware,” Han says, meaning it uses the latency on each hardware platform as a feedback signal to optimize the architecture. To measure this latency on mobile devices, for instance, big companies such as Google will employ a “farm” of mobile devices, which is very expensive. The researchers instead built a model that predicts the latency using only a single mobile phone.

For each candidate layer of the network, the algorithm consults that latency-prediction model to estimate how fast the layer would run on the target hardware. It then uses that information to design an architecture that runs as quickly as possible, while achieving high accuracy. In experiments, the researchers’ CNN ran nearly twice as fast as a gold-standard model on mobile devices.
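For intuition, here is a minimal sketch of what such a latency predictor might look like: a simple linear model fit to a handful of (layer configuration, measured latency) pairs collected from one phone, then queried for configurations that were never measured. The features and numbers are invented, and the researchers’ actual predictor may differ substantially.

```python
import numpy as np

# Each row describes a candidate layer: [kernel_size, input_channels, output_channels].
X = np.array([[3, 32, 64],
              [5, 32, 64],
              [7, 32, 64],
              [3, 64, 128],
              [5, 64, 128]], dtype=float)
y = np.array([1.2, 2.1, 3.4, 2.6, 4.3])          # toy latencies (ms) measured on one phone

X1 = np.hstack([X, np.ones((len(X), 1))])        # add a bias column
coef, *_ = np.linalg.lstsq(X1, y, rcond=None)    # least-squares fit

def predict_latency(kernel_size, c_in, c_out):
    """Estimated latency (ms) for a layer that was never measured directly."""
    return float(np.array([kernel_size, c_in, c_out, 1.0]) @ coef)

print(round(predict_latency(7, 64, 128), 2))     # estimate for an unmeasured layer
```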

One interesting result, Han says, was that their NAS algorithm designed CNN architectures that were long dismissed as being too inefficient — but, in the researchers’ tests, they were actually optimized for certain hardware. For instance, engineers have essentially stopped using 7x7 filters, because they’re computationally more expensive than multiple, smaller filters. Yet, the researchers’ NAS algorithm found that architectures with some layers of 7x7 filters ran optimally on GPUs. That’s because GPUs have high parallelization — meaning they compute many calculations simultaneously — so they can process a single large filter at once more efficiently than multiple small filters processed one at a time.

“This goes against previous human thinking,” Han says. “The larger the search space, the more unknown things you can find. You don’t know if something will be better than the past human experience. Let the AI figure it out.”

The work was supported, in part, by the MIT Quest for Intelligence, the MIT-IBM Watson AI Lab, SenseTime, and Xilinx.

“Particle robot” works as a cluster of simple units

Loosely connected disc-shaped “particles” can push and pull one another, moving en masse to transport objects.

Taking a cue from biological cells, researchers from MIT, Columbia University, and elsewhere have developed computationally simple robots that connect in large groups to move around, transport objects, and complete other tasks.

This so-called “particle robotics” system — based on a project by MIT, Columbia Engineering, Cornell University, and Harvard University researchers — comprises many individual disc-shaped units, which the researchers call “particles.” The particles are loosely connected by magnets around their perimeters, and each unit can only do two things: expand and contract. (Each particle is about 6 inches in its contracted state and about 9 inches when expanded.) That motion, when carefully timed, allows the individual particles to push and pull one another in coordinated movement. On-board sensors enable the cluster to gravitate toward light sources.

In a Nature paper published today, the researchers demonstrate a cluster of two dozen real robotic particles and a virtual simulation of up to 100,000 particles moving through obstacles toward a light bulb. They also show that a particle robot can transport objects placed in its midst.

Particle robots can form into many configurations and fluidly navigate around obstacles and squeeze through tight gaps. Notably, none of the particles directly communicate with or rely on one another to function, so particles can be added or subtracted without any impact on the group. In their paper, the researchers show particle robotic systems can complete tasks even when many units malfunction.

The paper represents a new way to think about robots, which are traditionally designed for one purpose, comprise many complex parts, and stop working when any part malfunctions. Robots made up of these simplistic components, the researchers say, could enable more scalable, flexible, and robust systems.

“We have small robot cells that are not so capable as individuals but can accomplish a lot as a group,” says Daniela Rus, director of the Computer Science and Artificial Intelligence Laboratory (CSAIL) and the Andrew and Erna Viterbi Professor of Electrical Engineering and Computer Science. “The robot by itself is static, but when it connects with other robot particles, all of a sudden the robot collective can explore the world and control more complex actions. With these ‘universal cells,’ the robot particles can achieve different shapes, global transformation, global motion, global behavior, and, as we have shown in our experiments, follow gradients of light. This is very powerful.”

Joining Rus on the paper are: first author Shuguang Li, a CSAIL postdoc; co-first author Richa Batra and corresponding author Hod Lipson, both of Columbia Engineering; David Brown, Hyun-Dong Chang, and Nikhil Ranganathan of Cornell; and Chuck Hoberman of Harvard.

At MIT, Rus has been working on modular, connected robots for nearly 20 years, including an expanding and contracting cube robot that could connect to others to move around. But the square shape limited the robots’ group movement and configurations.

In collaboration with Lipson’s lab, where Li was a postdoc until coming to MIT in 2014, the researchers went for disc-shaped mechanisms that can rotate around one another. They can also connect and disconnect from each other, and form into many configurations.

Each unit of a particle robot has a cylindrical base, which houses a battery, a small motor, sensors that detect light intensity, a microcontroller, and a communication component that sends out and receives signals. Mounted on top is a children’s toy called a Hoberman Flight Ring — its inventor is one of the paper’s co-authors — which consists of small panels connected in a circular formation that can be pulled to expand and pushed back to contract. Two small magnets are installed in each panel.

The trick was programming the robotic particles to expand and contract in an exact sequence to push and pull the whole group toward a destination light source. To do so, the researchers equipped each particle with an algorithm that analyzes broadcasted information about light intensity from every other particle, without the need for direct particle-to-particle communication.

The sensors of a particle detect the intensity of light from a light source; the closer the particle is to the light source, the greater the intensity. Each particle constantly broadcasts a signal that shares its perceived intensity level with all other particles. Say a particle robotic system measures light intensity on a scale of levels 1 to 10: Particles closest to the light register a level 10 and those furthest will register level 1. The intensity level, in turn, corresponds to a specific time that the particle must expand. Particles experiencing the highest intensity — level 10 — expand first. As those particles contract, the next particles in order, level 9, then expand. That timed expanding and contracting motion happens at each subsequent level.
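The toy simulation below illustrates that timing rule under simple assumptions (an invented intensity model and 24 particles at random positions): each particle quantizes its sensed intensity into a level from 1 to 10, and the levels determine the order of expansion, highest first.

```python
import numpy as np

rng = np.random.default_rng(0)
light_source = np.array([10.0, 0.0])
positions = rng.uniform(0.0, 10.0, size=(24, 2))      # 24 particles scattered in a plane

distances = np.linalg.norm(positions - light_source, axis=1)
intensity = 1.0 / (1.0 + distances)                    # closer to the light -> brighter

# Each particle broadcasts only its quantized level, 1 (dim) to 10 (bright).
normalized = (intensity - intensity.min()) / (np.ptp(intensity) + 1e-9)
levels = np.ceil(10 * normalized).clip(1, 10).astype(int)

# Expansion schedule: level-10 particles expand first, then level 9, and so on,
# producing the expansion-contraction wave that drags the cluster toward the light.
for slot, level in enumerate(range(10, 0, -1)):
    expanding = np.flatnonzero(levels == level)
    if expanding.size:
        print(f"time slot {slot}: particles {expanding.tolist()} expand, then contract")
```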

“This creates a mechanical expansion-contraction wave, a coordinated pushing and dragging motion, that moves a big cluster toward or away from environmental stimuli,” Li says. The key component, Li adds, is the precise timing from a shared synchronized clock among the particles that enables movement as efficiently as possible: “If you mess up the synchronized clock, the system will work less efficiently.”

In videos, the researchers demonstrate a particle robotic system comprising real particles moving and changing directions toward different light bulbs as they’re flicked on, and working its way through a gap between obstacles. In their paper, the researchers also show that simulated clusters of up to 10,000 particles maintain locomotion, at half their speed, even when up to 20 percent of the units have failed.

“It’s a bit like the proverbial ‘gray goo,’” says Lipson, a professor of mechanical engineering at Columbia Engineering, referencing the science-fiction concept of a self-replicating robot that comprises billions of nanobots. “The key novelty here is that you have a new kind of robot that has no centralized control, no single point of failure, no fixed shape, and its components have no unique identity.”

The next step, Lipson adds, is miniaturizing the components to make a robot composed of millions of microscopic particles.

"The work points toward an innovative new direction in modular and distributed robotics,” says Mac Schwager, an assistant professor of aeronautics and astronautics and director of the Multi-robot Systems Lab at Stanford University. “The authors use collectives of simple stochastic robotic cells, and leverage the statistics of the collective to achieve a global motion. This has some similarity to biological systems, in which the cells of an organism each follow some random process, while the bulk effect of this low-level randomness leads to a predictable behavior for the whole organism. The hope is that such robot collectives will yield robust and adaptable behaviors, similar to the robustness and adaptability we see in nature."

Addressing the promises and challenges of AI

Final day of the MIT Schwarzman College of Computing celebration explores enthusiasm, caution about AI’s rising prominence in society.

A three-day celebration this week for the MIT Stephen A. Schwarzman College of Computing focused on the Institute’s new role in helping society navigate a promising yet challenging future for artificial intelligence (AI), as it seeps into nearly all aspects of society.

On Thursday, the final day of the event, a series of talks and panel discussions by researchers and industry experts conveyed enthusiasm for AI-enabled advances in many global sectors, but emphasized concerns — on topics such as data privacy, job automation, and personal and social issues — that accompany the computing revolution. The day also included a panel called “Computing for the People: Ethics and AI,” whose participants agreed collaboration is key to make sure artificial intelligence serves the public good.

Kicking off the day’s events, MIT President Rafael Reif said the MIT Schwarzman College of Computing will train students in an interdisciplinary approach to AI. It will also train them to take a step back and weigh potential downsides of AI, which is poised to disrupt “every sector of our society.”

“Everyone knows pushing the limits of new technologies can be so thrilling that it’s hard to think about consequences and how [AI] too might be misused,” Reif said. “It is time to educate a new generation of technologists in the public interest, and I’m optimistic that the MIT Schwarzman College [of Computing] is the right place for that job.”

In opening remarks, Massachusetts Governor Charlie Baker gave MIT “enormous credit” for focusing its research and education on the positive and negative impact of AI. “Having a place like MIT … think about the whole picture in respect to what this is going to mean for individuals, businesses, governments, and society is a gift,” he said.

Personal and industrial AI

In a panel discussion titled, “Computing the Future: Setting New Directions,” MIT alumnus Drew Houston ’05, co-founder of Dropbox, described an idyllic future where by 2030 AI could take over many tedious professional tasks, freeing humans to be more creative and productive.

Workers today, Houston said, spend more than 60 percent of their working lives organizing emails, coordinating schedules, and planning various aspects of their job. As computers start refining skills — such as analyzing and answering queries in natural language, and understanding very complex systems — each of us may soon have AI-based assistants that can handle many of those mundane tasks, he said.

“We’re on the eve of a new generation of our partnership with machines … where machines will take a lot of the busy work so people can … spend our working days on the subset of our work that’s really fulfilling and meaningful,” Houston said. “My hope is that, in 2030, we’ll look back on now as the beginning of a revolution that freed our minds the way the industrial revolution freed our hands. My last hope is that … the new [MIT Schwarzman College of Computing] is the place where that revolution is born.”   

Speaking with reporters before the panel discussion “Computing for the Marketplace: Entrepreneurship and AI,” Eric Schmidt, former executive chairman of Alphabet and a visiting innovation fellow at MIT, also spoke of a coming age of AI assistants. Smart teddy bears could help children learn language, virtual assistants could plan people’s days, and personal robots could ensure the elderly take medication on schedule. “This model of an assistant … is at the basis of the vision of how people will see a difference in our lives every day,” Schmidt said.

He noted many emerging AI-based research and business opportunities, including analyzing patient data to predict risk of diseases, discovering new compounds for drug discovery, and predicting regions where wind farms produce the most power, which is critical for obtaining clean-energy funding. “MIT is at the forefront of every single example that I just gave,” Schmidt said.

When asked by panel moderator Katie Rae, executive director of The Engine, what she thinks is the most significant aspect of AI in industry, iRobot co-founder Helen Greiner cited supply chain automation. Robots could, for instance, package goods more quickly and efficiently, and driverless delivery trucks could soon deliver those packages, she said: “Logistics in general will be changed” in the coming years.

Finding an algorithmic utopia

For Institute Professor Robert Langer, another panelist in “Computing for the Marketplace,” AI holds great promise for early disease diagnoses. With enough medical data, for instance, AI models can identify biological “fingerprints” of certain diseases in patients. “Then, you can use AI to analyze those fingerprints and decide what … gives someone a risk of cancer,” he said. “You can do drug testing that way too. You can see [a patient has] a fingerprint that … shows you that a drug will treat the cancer for that person.”

But in the “Computing the Future” section, David Siegel, co-chair of Two Sigma Investments and founding advisor for the MIT Quest for Intelligence, addressed issues with data, which is at the heart of AI. With the aid of AI, Siegel has seen computers go from helpful assistants to “routinely making decisions for people” in business, health care, and other areas. While AI models can benefit the world, “there is a fear that we may move in a direction that’s far from an algorithmic utopia.”

Siegel drew parallels between AI and the popular satirical film “Dr. Strangelove,” in which an “algorithmic doomsday machine” threatens to destroy the world. AI algorithms must be made unbiased, safe, and secure, he said. That involves dedicated research in several important areas, at the MIT Schwarzman College of Computing and around the globe, “to avoid a Strangelove-like future.”

One important area is data bias and security. Data bias, for instance, leads to inaccurate and untrustworthy algorithms. And if researchers can guarantee the privacy of medical data, he added, patients may be more willing to contribute their records to medical research.

Siegel noted a real-world example where, due to privacy concerns, the Centers for Medicare and Medicaid Services years ago withheld patient records from a large research dataset being used to study substance misuse, which is responsible for tens of thousands of U.S. deaths annually. “That omission was a big loss for researchers and, by extension, patients,” he said. “We are missing the opportunity to solve pressing problems because of the lack of accessible data. … Without solutions, the algorithms that drive our world are at high risk of becoming data-compromised.”

Seeking humanity in AI

In a panel discussion earlier in the day, “Computing: Reflections and the Path Forward,” Sherry Turkle, the Abby Rockefeller Mauzé Professor of the Social Studies of Science and Technology, called on people to avoid “friction-free” technologies — those that help people avoid the stress of face-to-face interactions.

AI is now “deeply woven into this [friction-free] story,” she said, noting that there are apps that help users plan walking routes, for example, to avoid people they dislike. “But who said a life without conflict … makes for the good life?” she said.

She concluded with a “call to arms” for the new college to help people understand the consequences of the digital world where confrontation is avoided, social media are scrutinized, and personal data are sold and shared with companies and governments: “It’s time to reclaim our attention, our solitude, our privacy, and our democracy.”

Speaking in the same section, Patrick H. Winston, the Ford Professor of Engineering at MIT, concluded on an equally humanistic — and optimistic — message. After walking the audience through the history of AI at MIT, including his run as director of the Artificial Intelligence Laboratory from 1972 to 1997, he told the audience he was going to discuss the greatest computing innovation of all time.

“It’s us,” he said, “because nothing can think like we can. We don’t know how to make computers do it yet, but it’s something we should aspire to. … In the end, there’s no reason why computers can’t think like we [do] and can’t be ethical and moral like we aspire to be.”

Building site identified for MIT Stephen A. Schwarzman College of Computing

Headquarters would replace Building 44, forming an “entrance to computing” near the intersection of Vassar and Main streets.

MIT has identified a preferred location for the new MIT Stephen A. Schwarzman College of Computing headquarters: the current site of Building 44. The new building, which will require permitting and approvals from the City of Cambridge, will sit in a centralized location that promises to unite the many MIT departments, centers, and labs that integrate computing into their work.

In October, MIT announced a $1 billion commitment to address the global opportunities and challenges presented by the prevalence of computing and the rise of artificial intelligence (AI) — the single largest investment in computing and AI by a U.S. academic institution. At the heart of the initiative is the new college, made possible by a $350 million foundational gift from Mr. Schwarzman, the chairman, CEO and co-founder of Blackstone, a global asset management and financial services firm.

The college aims to: connect advances in computer science and machine learning with advances in MIT’s other academic disciplines; create 50 new faculty positions within the college and jointly with existing academic departments; give MIT’s five schools a shared structure for collaborative education, research, and innovation in computing and artificial intelligence; educate all students to responsibly use and develop computing technologies to address pressing societal and global resource challenges; and focus on public policy and ethical considerations relevant to computing, when applied to human-machine interfaces, autonomous operations, and data analytics.

With those goals in mind, MIT aims to construct a building, large enough to house 50 faculty groups, to replace Building 44, which sits in the center of the Vassar Street block between Main Street and Massachusetts Avenue. Those currently working in Building 44 will be relocated to other buildings on campus.

Scheduled for completion in late 2022, the new building will serve as an interdisciplinary hub for research and innovation in computer science, AI, data science, and related fields that deal with computing advances, including how new computing methods can both address and pose societal challenges. It will stand in close proximity to a cluster of computing- and AI-focused departments, centers, and labs located directly across the street and running up to the intersection of Vassar and Main Streets. All other buildings on campus are about a six-minute walk away.

“You can think of this intersection of Vassar and Main as the ‘entrance to computing,’” says Associate Provost Krystyn Van Vliet, who is responsible for Institute space planning, assignment, and renovation under the direction of the Building Committee, which is chaired by MIT Provost Marty Schmidt and Executive Vice President and Treasurer Israel Ruiz. Van Vliet also oversees MIT’s industrial engagement efforts, including MIT’s Office of Corporate Relations and the Technology Licensing Office.

“The building is intended as a convening space for everyone working to create and shape computing — not just computer scientists, but people who have expertise in the humanities and arts, or science, or architecture and urban planning, or business, or engineering,” Schmidt adds.

Everyone currently located in Building 44 will be moved to their new campus locations by late summer of 2019. Demolition is scheduled to begin in the fall.

While a final design is still months away, a key planned feature for the building will be “convening spaces,” which will include areas set for interdisciplinary seminars and conferences, and potentially an “open office” concept that promotes mixing and mingling. “You can imagine a graduate student from the humanities and a postdoc from EECS working on a project together,” says Dean of the School of Engineering Anantha P. Chandrakasan, the Vannevar Bush Professor of Electrical Engineering and Computer Science. “Such a building can serve as a place for broad community collaboration and research.”

The centralized location is key to the college’s interdisciplinary mission. Building 44 sits directly across the street from Building 38, which houses the Department of Electrical Engineering and Computer Science; the Stata Center, which the Computer Science and Artificial Intelligence Laboratory (CSAIL) calls home; and the Research Laboratory of Electronics in Building 36.

Down the road, on the corner of Main Street, stands the Koch Institute for Integrative Cancer Research and the Broad Institute of MIT and Harvard, both of which incorporate computer science and AI into cancer and medical research. Buildings behind the headquarters on Main Street, in the area known as “Technology Square,” contain many biological engineering, nanotechnology, and biophysics labs.

The new building will also neighbor — and possibly connect to — Building 46, which houses the Department of Brain and Cognitive Sciences, the Picower Institute for Learning and Memory, and the McGovern Institute for Brain Research. “When you think about the work of connecting human intelligence and machine intelligence through computing — which can be physically connected to a building where people are working on understanding human intelligence and cognition — that’s exciting,” Van Vliet says.

The building could thus help “activate” Vassar Street, she adds, because buildings along the street are somewhat visually closed off to the public. The new building, she says, could include windows with displays that visually highlight the research conducted behind the walls, like peering into the labs along the MIT halls.

“Right now, when you walk down Vassar Street, people don’t know what’s happening inside most of these buildings,” she says. “By activation, we mean there’s more community interaction and pedestrian traffic, and more visible displays that draw the public into campus and make them aware of what’s going on at MIT. It will help us show the breadth of MIT’s activities all the way down Vassar Street, for both the growing MIT community and our neighbors.”

A series of launch events for the MIT Schwarzman College of Computing is planned for late February 2019. The search for the college’s dean is ongoing.

Model can more naturally detect depression in conversations

Neural network learns speech patterns that predict depression in clinical interviews.

To diagnose depression, clinicians interview patients, asking specific questions — about, say, past mental illnesses, lifestyle, and mood — and identify the condition based on the patient’s responses.

In recent years, machine learning has been championed as a useful aid for diagnostics. Machine-learning models, for instance, have been developed that can detect words and intonations of speech that may indicate depression. But these models tend to predict that a person is depressed or not, based on the person’s specific answers to specific questions. These methods are accurate, but their reliance on the type of question being asked limits how and where they can be used.

In a paper being presented at the Interspeech conference, MIT researchers detail a neural-network model that can be unleashed on raw text and audio data from interviews to discover speech patterns indicative of depression. Given a new subject, it can accurately predict if the individual is depressed, without needing any other information about the questions and answers.

The researchers hope this method can be used to develop tools to detect signs of depression in natural conversation. In the future, the model could, for instance, power mobile apps that monitor a user’s text and voice for mental distress and send alerts. This could be especially useful for those who can’t get to a clinician for an initial diagnosis, due to distance, cost, or a lack of awareness that something may be wrong.

“The first hints we have that a person is happy, excited, sad, or has some serious cognitive condition, such as depression, is through their speech,” says first author Tuka Alhanai, a researcher in the Computer Science and Artificial Intelligence Laboratory (CSAIL). “If you want to deploy [depression-detection] models in scalable way … you want to minimize the amount of constraints you have on the data you’re using. You want to deploy it in any regular conversation and have the model pick up, from the natural interaction, the state of the individual.”

The technology could still, of course, be used for identifying mental distress in casual conversations in clinical offices, adds co-author James Glass, a senior research scientist in CSAIL. “Every patient will talk differently, and if the model sees changes maybe it will be a flag to the doctors,” he says. “This is a step forward in seeing if we can do something assistive to help clinicians.”

The other co-author on the paper is Mohammad Ghassemi, a member of the Institute for Medical Engineering and Science (IMES).

Context-free modeling

The key innovation of the model lies in its ability to detect patterns indicative of depression, and then map those patterns to new individuals, with no additional information. “We call it ‘context-free,’ because you’re not putting any constraints into the types of questions you’re looking for and the type of responses to those questions,” Alhanai says.

Other models are provided with a specific set of questions, and then given examples of how a person without depression responds and examples of how a person with depression responds — for example, the straightforward inquiry, “Do you have a history of depression?” It uses those exact responses to then determine if a new individual is depressed when asked the exact same question. “But that’s not how natural conversations work,” Alhanai says.   

The researchers, on the other hand, used a technique called sequence modeling, often used for speech processing. With this technique, they fed the model sequences of text and audio data from questions and answers, from both depressed and non-depressed individuals, one by one. As the sequences accumulated, the model extracted speech patterns that emerged for people with or without depression. Words such as, say, “sad,” “low,” or “down,” may be paired with audio signals that are flatter and more monotone. Individuals with depression may also speak slower and use longer pauses between words. These text and audio identifiers for mental distress have been explored in previous research. It was ultimately up to the model to determine if any patterns were predictive of depression or not.
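A minimal sketch of sequence modeling in this spirit is shown below, assuming PyTorch and invented feature dimensions: an LSTM reads one feature vector per question-answer turn (text and audio features combined) and emits a single depression probability for the whole interview. It illustrates the general technique, not the authors’ exact model.

```python
import torch
import torch.nn as nn

class InterviewClassifier(nn.Module):
    def __init__(self, feature_dim=128, hidden_dim=64):
        super().__init__()
        self.lstm = nn.LSTM(feature_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, 1)

    def forward(self, turns):                       # turns: (batch, num_turns, feature_dim)
        _, (h_n, _) = self.lstm(turns)              # summary of the whole conversation
        return torch.sigmoid(self.head(h_n[-1]))    # probability of depression

model = InterviewClassifier()
fake_interview = torch.randn(1, 30, 128)            # 30 question-answer turns, toy features
print(model(fake_interview).item())
```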

“The model sees sequences of words or speaking style, and determines that these patterns are more likely to be seen in people who are depressed or not depressed,” Alhanai says. “Then, if it sees the same sequences in new subjects, it can predict if they’re depressed too.”

This sequencing technique also helps the model look at the conversation as a whole and note differences between how people with and without depression speak over time.

Detecting depression

The researchers trained and tested their model on a dataset of 142 interactions from the Distress Analysis Interview Corpus that contains audio, text, and video interviews of patients with mental-health issues and virtual agents controlled by humans. Each subject is rated for depression on a scale from 0 to 27, using the Personal Health Questionnaire. Subjects scoring above a cutoff between the moderate (10 to 14) and moderately severe (15 to 19) ranges are considered depressed, while all others below that threshold are considered not depressed. Out of all the subjects in the dataset, 28 (20 percent) are labeled as depressed.

In experiments, the model was evaluated using the metrics of precision and recall. Precision measures what fraction of the subjects the model identified as depressed had actually been diagnosed as depressed. Recall measures what fraction of all the subjects diagnosed as depressed in the entire dataset the model correctly detected. The model scored 71 percent on precision and 83 percent on recall. The averaged combined score for those metrics was 77 percent. In the majority of tests, the researchers’ model outperformed nearly all other models.
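For reference, the 77 percent figure is consistent with the harmonic mean of the two reported metrics (the F1 score), assuming that is the combination the paper used:

```python
precision, recall = 0.71, 0.83
f1 = 2 * precision * recall / (precision + recall)   # harmonic mean of the two metrics
print(round(f1, 2))                                  # 0.77
```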

One key insight from the research, Alhanai notes, is that, during experiments, the model needed much more data to predict depression from audio than text. With text, the model can accurately detect depression using an average of seven question-answer sequences. With audio, the model needed around 30 sequences. “That implies that the patterns in words people use that are predictive of depression happen in shorter time span in text than in audio,” Alhanai says. Such insights could help the MIT researchers, and others, further refine their models.

This work represents a “very encouraging” pilot, Glass says. But now the researchers seek to discover what specific patterns the model identifies across scores of raw data. “Right now it’s a bit of a black box,” Glass says. “These systems, however, are more believable when you have an explanation of what they’re picking up. … The next challenge is finding out what data it’s seized upon.”

The researchers also aim to test these methods on additional data from many more subjects with other cognitive conditions, such as dementia. “It’s not so much detecting depression, but it’s a similar concept of evaluating, from an everyday signal in speech, if someone has cognitive impairment or not,” Alhanai says.

Artificial intelligence model “learns” from patient data to make cancer treatment less toxic

Machine-learning system determines the fewest, smallest doses that could still shrink brain tumors.

MIT researchers are employing novel machine-learning techniques to improve the quality of life for patients by reducing toxic chemotherapy and radiotherapy dosing for glioblastoma, the most aggressive form of brain cancer.

Glioblastoma is a malignant tumor that appears in the brain or spinal cord, and prognosis for adults is no more than five years. Patients must endure a combination of radiation therapy and multiple drugs taken every month. Medical professionals generally administer maximum safe drug doses to shrink the tumor as much as possible. But these strong pharmaceuticals still cause debilitating side effects in patients.

In a paper being presented next week at the 2018 Machine Learning for Healthcare conference at Stanford University, MIT Media Lab researchers detail a model that could make dosing regimens less toxic but still effective. Powered by a “self-learning” machine-learning technique, the model looks at treatment regimens currently in use, and iteratively adjusts the doses. Eventually, it finds an optimal treatment plan, with the lowest possible potency and frequency of doses that should still reduce tumor sizes to a degree comparable to that of traditional regimens.

In simulated trials of 50 patients, the machine-learning model designed treatment cycles that reduced the potency to a quarter or half of nearly all the doses while maintaining the same tumor-shrinking potential. Many times, it skipped doses altogether, scheduling administrations only twice a year instead of monthly.

“We kept the goal, where we have to help patients by reducing tumor sizes but, at the same time, we want to make sure the quality of life — the dosing toxicity — doesn’t lead to overwhelming sickness and harmful side effects,” says Pratik Shah, a principal investigator at the Media Lab who supervised this research.

The paper’s first author is Media Lab researcher Gregory Yauney.

Rewarding good choices

The researchers’ model uses a technique called reinforcement learning (RL), a method inspired by behavioral psychology, in which a model learns to favor certain behavior that leads to a desired outcome.

The technique comprises artificially intelligent “agents” that complete “actions” in an unpredictable, complex environment to reach a desired “outcome.” Whenever it completes an action, the agent receives a “reward” or “penalty,” depending on whether the action works toward the outcome. Then, the agent adjusts its actions accordingly to achieve that outcome.

Rewards and penalties are basically positive and negative numbers, say +1 or -1. Their values vary by the action taken, calculated by probability of succeeding or failing at the outcome, among other factors. The agent is essentially trying to numerically optimize all actions, based on reward and penalty values, to get to a maximum outcome score for a given task.
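The sketch below shows that loop in its simplest form, with a made-up environment and three toy actions: the agent tries actions, receives numeric rewards or penalties, and shifts toward the actions with the highest running-average value. It illustrates the general reinforcement-learning idea, not the researchers’ model.

```python
import random

actions = ["withhold", "half_dose", "full_dose"]
value = {a: 0.0 for a in actions}        # running estimate of each action's reward
counts = {a: 0 for a in actions}

def environment(action):
    """Placeholder environment: +1 if the action happened to help, -1 otherwise."""
    return random.choice([1, -1])

for step in range(1000):
    if random.random() < 0.1:                      # occasionally explore a random action
        action = random.choice(actions)
    else:                                          # otherwise exploit the best-looking one
        action = max(actions, key=value.get)
    r = environment(action)
    counts[action] += 1
    value[action] += (r - value[action]) / counts[action]   # running-average update

print(value)
```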

The approach was used to train AlphaGo, the DeepMind computer program that in 2016 made headlines for beating one of the world’s best human players in the game “Go.” It’s also used to train driverless cars in maneuvers, such as merging into traffic or parking, where the vehicle will practice over and over, adjusting its course, until it gets it right.

The researchers adapted an RL model for glioblastoma treatments that use a combination of the drugs temozolomide (TMZ) and procarbazine, lomustine, and vincristine (PVC), administered over weeks or months.

The model’s agent combs through traditionally administered regimens. These regimens are based on protocols that have been used clinically for decades, informed by animal testing and various clinical trials. Oncologists use these established protocols to determine what doses to give patients based on their weight.

As the model explores the regimen, at each planned dosing interval — say, once a month — it decides on one of several actions. It can, first, either initiate or withhold a dose. If it does administer, it then decides if the entire dose, or only a portion, is necessary. At each action, it pings another clinical model — often used to predict a tumor’s change in size in response to treatments — to see if the action shrinks the mean tumor diameter. If it does, the model receives a reward.

However, the researchers also had to make sure the model doesn’t just dish out a maximum number and potency of doses. Whenever the model chooses to administer all full doses, therefore, it gets penalized, so instead chooses fewer, smaller doses. “If all we want to do is reduce the mean tumor diameter, and let it take whatever actions it wants, it will administer drugs irresponsibly,” Shah says. “Instead, we said, ‘We need to reduce the harmful actions it takes to get to that outcome.’”

This represents an “unorthodox RL model, described in the paper for the first time,” Shah says, that weighs potential negative consequences of actions (doses) against an outcome (tumor reduction). Traditional RL models work toward a single outcome, such as winning a game, and take any and all actions that maximize that outcome. The researchers’ model, by contrast, has the flexibility at each action to find a dose that doesn’t solely maximize tumor reduction, but instead strikes a balance between tumor reduction and low toxicity. This technique, he adds, has various medical and clinical trial applications, where actions for treating patients must be regulated to prevent harmful side effects.
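A hedged sketch of that trade-off is below: the reward credits tumor shrinkage but subtracts a penalty that grows with the administered dose, so “always give the full dose” stops being the automatically best choice. The weights and numbers are placeholders, not values from the paper.

```python
def reward(diameter_before, diameter_after, dose_fraction, penalty_weight=0.5):
    """Toy reward: benefit of tumor shrinkage minus a toxicity penalty for the dose given."""
    shrinkage = diameter_before - diameter_after
    toxicity_penalty = penalty_weight * dose_fraction
    return shrinkage - toxicity_penalty

# Same shrinkage achieved with different doses: the smaller dose earns the higher reward.
print(reward(3.0, 2.5, dose_fraction=1.0))    # full dose    -> 0.0
print(reward(3.0, 2.5, dose_fraction=0.25))   # quarter dose -> 0.375
```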

Optimal regimens

The researchers trained the model on 50 simulated patients, randomly selected from a large database of glioblastoma patients who had previously undergone traditional treatments. For each patient, the model conducted about 20,000 trial-and-error test runs. Once training was complete, the model learned parameters for optimal regimens. When given new patients, the model used those parameters to formulate new regimens based on various constraints the researchers provided.

The researchers then tested the model on 50 new simulated patients and compared the results to those of a conventional regimen using both TMZ and PVC. When given no dosage penalty, the model designed nearly identical regimens to human experts. Given small and large dosing penalties, however, it substantially cut the doses’ frequency and potency, while reducing tumor sizes.

The researchers also designed the model to treat each patient individually, as well as in a single cohort, and achieved similar results (medical data for each patient was available to the researchers). Traditionally, the same dosing regimen is applied to groups of patients, but differences in tumor size, medical histories, genetic profiles, and biomarkers can all change how a patient is treated. These variables are not considered during traditional clinical trial designs and other treatments, often leading to poor responses to therapy in large populations, Shah says.

“We said [to the model], ‘Do you have to administer the same dose for all the patients?’ And it said, ‘No. I can give a quarter dose to this person, half to this person, and maybe we skip a dose for this person.’ That was the most exciting part of this work, where we are able to generate precision medicine-based treatments by conducting one-person trials using unorthodox machine-learning architectures,” Shah says.

The model offers a major improvement over the conventional “eye-balling” method of administering doses, observing how patients respond, and adjusting accordingly, says Nicholas J. Schork, a professor and director of human biology at the J. Craig Venter Institute, and an expert in clinical trial design. “[Humans don’t] have the in-depth perception that a machine looking at tons of data has, so the human process is slow, tedious, and inexact,” he says. “Here, you’re just letting a computer look for patterns in the data, which would take forever for a human to sift through, and use those patterns to find optimal doses.”

Schork adds that this work may particularly interest the U.S. Food and Drug Administration, which is now seeking ways to leverage data and artificial intelligence to develop health technologies. Regulations still need to be established, he says, “but I don’t doubt, in a short amount of time, the FDA will figure out how to vet these [technologies] appropriately, so they can be used in everyday clinical programs.”

Helping computers perceive human emotions

Personalized machine-learning models capture subtle variations in facial expressions to better gauge how we feel.

MIT Media Lab researchers have developed a machine-learning model that takes computers a step closer to interpreting our emotions as naturally as humans do.

In the growing field of “affective computing,” robots and computers are being developed to analyze facial expressions, interpret our emotions, and respond accordingly. Applications include, for instance, monitoring an individual’s health and well-being, gauging student interest in classrooms, helping diagnose signs of certain diseases, and developing helpful robot companions.

A challenge, however, is that people express emotions quite differently, depending on many factors. General differences can be seen among cultures, genders, and age groups. But other differences are even more fine-grained: The time of day, how much you slept, or even your level of familiarity with a conversation partner leads to subtle variations in the way you express, say, happiness or sadness in a given moment.

Human brains instinctively catch these deviations, but machines struggle. Deep-learning techniques were developed in recent years to help catch the subtleties, but they’re still not as accurate or as adaptable across different populations as they could be.

The Media Lab researchers have developed a machine-learning model that outperforms traditional systems in capturing these small facial expression variations, to better gauge mood while training on thousands of images of faces. Moreover, by using a little extra training data, the model can be adapted to an entirely new group of people, with the same efficacy. The aim is to improve existing affective-computing technologies.

“This is an unobtrusive way to monitor our moods,” says Oggi Rudovic, a Media Lab researcher and co-author on a paper describing the model, which was presented last week at the Conference on Machine Learning and Data Mining. “If you want robots with social intelligence, you have to make them intelligently and naturally respond to our moods and emotions, more like humans.”

Co-authors on the paper are: first author Michael Feffer, an undergraduate student in electrical engineering and computer science; and Rosalind Picard, a professor of media arts and sciences and founding director of the Affective Computing research group.

Personalized experts

Traditional affective-computing models use a “one-size-fits-all” concept. They train on one set of images depicting various facial expressions, optimizing features — such as how a lip curls when smiling — and mapping those general feature optimizations across an entire set of new images.

The researchers, instead, combined a technique, called “mixture of experts” (MoE), with model personalization techniques, which helped mine more fine-grained facial-expression data from individuals. This is the first time these two techniques have been combined for affective computing, Rudovic says.

In MoEs, a number of neural network models, called “experts,” are each trained to specialize in a separate processing task and produce one output. The researchers also incorporated a “gating network,” which calculates probabilities of which expert will best detect moods of unseen subjects. “Basically the network can discern between individuals and say, ‘This is the right expert for the given image,’” Feffer says.
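A simplified mixture-of-experts sketch appears below (PyTorch, with invented sizes; not the paper’s architecture): several small expert networks each produce a valence-arousal prediction from frame features, and a gating network weights the experts for each input.

```python
import torch
import torch.nn as nn

class MixtureOfExperts(nn.Module):
    def __init__(self, input_dim=512, num_experts=9):    # sizes chosen arbitrarily here
        super().__init__()
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(input_dim, 64), nn.ReLU(), nn.Linear(64, 2))
             for _ in range(num_experts)])                # each expert predicts (valence, arousal)
        self.gate = nn.Linear(input_dim, num_experts)     # which expert suits this input?

    def forward(self, features):                          # features: (batch, input_dim)
        weights = torch.softmax(self.gate(features), dim=-1)            # (batch, num_experts)
        outputs = torch.stack([e(features) for e in self.experts], 1)   # (batch, num_experts, 2)
        return (weights.unsqueeze(-1) * outputs).sum(dim=1)             # gate-weighted blend

model = MixtureOfExperts()
frame_features = torch.randn(4, 512)     # e.g., ResNet features for 4 video frames
print(model(frame_features).shape)       # torch.Size([4, 2])
```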

For their model, the researchers personalized the MoEs by matching each expert to one of 18 individual video recordings in the RECOLA database, a public database of people conversing on a video-chat platform designed for affective-computing applications. They trained the model using nine subjects and evaluated it on the other nine, with all videos broken down into individual frames.

Each expert, and the gating network, tracked facial expressions of each individual, with the help of a residual network (“ResNet”), a neural network used for object classification. In doing so, the model scored each frame based on its level of valence (pleasant or unpleasant) and arousal (excitement) — commonly used metrics to encode different emotional states. Separately, six human experts labeled each frame for valence and arousal on a scale of -1 (low levels) to 1 (high levels), and those labels were also used to train the model.

The researchers then performed further model personalization, where they fed the trained model data from some frames of the remaining videos of subjects, and then tested the model on all unseen frames from those videos. Results showed that, with just 5 to 10 percent of data from the new population, the model outperformed traditional models by a large margin — meaning it scored valence and arousal on unseen images much closer to the interpretations of human experts.
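As a rough sketch of that personalization step (reusing the MixtureOfExperts stand-in and imports from the previous sketch, with invented data): the trained model is briefly fine-tuned on a small labeled sample from the new subjects before being evaluated on their remaining frames.

```python
model = MixtureOfExperts()                       # in practice, the already-trained model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
loss_fn = nn.MSELoss()

few_frames = torch.randn(32, 512)                # small labeled sample from the new subjects
few_labels = torch.rand(32, 2) * 2 - 1           # valence/arousal labels in [-1, 1]

for _ in range(20):                              # brief adaptation rather than full retraining
    optimizer.zero_grad()
    loss = loss_fn(model(few_frames), few_labels)
    loss.backward()
    optimizer.step()
```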

This shows the potential of the models to adapt from population to population, or individual to individual, with very little data, Rudovic says. “That’s key,” he says. “When you have a new population, you have to have a way to account for shifting of data distribution [subtle facial variations]. Imagine a model set to analyze facial expressions in one culture that needs to be adapted for a different culture. Without accounting for this data shift, those models will underperform. But if you just sample a bit from a new culture to adapt our model, these models can do much better, especially on the individual level. This is where the importance of the model personalization can best be seen.”

Currently available data for such affective-computing research isn’t very diverse in skin colors, so the researchers’ training data were limited. But when such data become available, the model can be trained for use on more diverse populations. The next step, Feffer says, is to train the model on “a much bigger dataset with more diverse cultures.”

Better machine-human interactions

Another goal is to train the model to help computers and robots automatically learn from small amounts of changing data to more naturally detect how we feel and better serve human needs, the researchers say.

It could, for example, run in the background of a computer or mobile device to track a user’s video-based conversations and learn subtle facial expression changes under different contexts. “You can have things like smartphone apps or websites be able to tell how people are feeling and recommend ways to cope with stress or pain, and other things that are impacting their lives negatively,” Feffer says.

This could also be helpful in monitoring, say, depression or dementia, as people’s facial expressions tend to subtly change due to those conditions. “Being able to passively monitor our facial expressions,” Rudovic says, “we could over time be able to personalize these models to users and monitor how much deviations they have on daily basis — deviating from the average level of facial expressiveness — and use it for indicators of well-being and health.”

A promising application, Rudovic says, is human-robotic interactions, such as for personal robotics or robots used for educational purposes, where the robots need to adapt to assess the emotional states of many different people. One version, for instance, has been used in helping robots better interpret the moods of children with autism.

Roddy Cowie, professor emeritus of psychology at the Queen’s University Belfast and an affective computing scholar, says the MIT work “illustrates where we really are” in the field. “We are edging toward systems that can roughly place, from pictures of people’s faces, where they lie on scales from very positive to very negative, and very active to very passive,” he says. “It seems intuitive that the emotional signs one person gives are not the same as the signs another gives, and so it makes a lot of sense that emotion recognition works better when it is personalized. The method of personalizing reflects another intriguing point, that it is more effective to train multiple ‘experts,’ and aggregate their judgments, than to train a single super-expert. The two together make a satisfying package.”

Automating molecule design to speed up drug development

Machine-learning model could help chemists make molecules with higher potencies, much more quickly.

Designing new molecules for pharmaceuticals is primarily a manual, time-consuming process that’s prone to error. But MIT researchers have now taken a step toward fully automating the design process, which could drastically speed things up — and produce better results.

Drug discovery relies on lead optimization. In this process, chemists select a target (“lead”) molecule with known potential to interact with a specific biological target, then tweak its chemical properties for higher potency and other factors.

Chemists use expert knowledge to manually tweak the structure of molecules, adding and subtracting functional groups — groups of atoms and bonds with specific properties. Even when they use systems that predict optimal properties, chemists still need to do each modification step themselves. This can take a significant amount of time at each step and still not produce molecules with the desired properties.

Researchers from MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL) and Department of Electrical Engineering and Computer Science (EECS) have developed a model that better selects lead molecule candidates based on desired properties. It also modifies the molecular structure as needed to achieve higher potency, while ensuring the molecule is still chemically valid.

The model basically takes as input molecular structure data and directly creates molecular graphs — detailed representations of a molecular structure, with nodes representing atoms and edges representing bonds. It breaks those graphs down into smaller clusters of valid functional groups that it uses as “building blocks” that help it more accurately reconstruct and better modify molecules.
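For concreteness, a molecular graph of the kind described can be represented with ordinary data structures, as in the tiny sketch below for the heavy atoms of ethanol (C-C-O); this is generic bookkeeping, not the researchers’ representation.

```python
# Nodes are atoms, edges are bonds (pairs of atom indices plus a bond type).
ethanol = {
    "atoms": ["C", "C", "O"],
    "bonds": [(0, 1, "single"),
              (1, 2, "single")],
}

def neighbors(graph, atom_index):
    """Atoms directly bonded to the given atom."""
    return [j if i == atom_index else i
            for i, j, _ in graph["bonds"]
            if atom_index in (i, j)]

print(neighbors(ethanol, 1))   # [0, 2]: the middle carbon bonds to both other atoms
```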

“The motivation behind this was to replace the inefficient human modification process of designing molecules with automated iteration and assure the validity of the molecules we generate,” says Wengong Jin, a PhD student in CSAIL and lead author of a paper describing the model that’s being presented at the 2018 International Conference on Machine Learning in July.

Joining Jin on the paper are Regina Barzilay, the Delta Electronics Professor at CSAIL and EECS, and Tommi S. Jaakkola, the Thomas Siebel Professor of Electrical Engineering and Computer Science in CSAIL, EECS, and at the Institute for Data, Systems, and Society.

The research was conducted as part of the Machine Learning for Pharmaceutical Discovery and Synthesis Consortium between MIT and eight pharmaceutical companies, announced in May. The consortium identified lead optimization as one key challenge in drug discovery.

“Today, it’s really a craft, which requires a lot of skilled chemists to succeed, and that’s what we want to improve,” Barzilay says. “The next step is to take this technology from academia to use on real pharmaceutical design cases, and demonstrate that it can assist human chemists in doing their work, which can be challenging.”

“Automating the process also presents new machine-learning challenges,” Jaakkola says. “Learning to relate, modify, and generate molecular graphs drives new technical ideas and methods.”

Generating molecular graphs

Systems that attempt to automate molecule design have cropped up in recent years, but their problem is validity. Those systems, Jin says, often generate molecules that are invalid under chemical rules, and they fail to produce molecules with optimal properties. This essentially makes full automation of molecule design infeasible.

These systems run on linear notations of molecules, called “simplified molecular-input line-entry systems,” or SMILES, where long strings of letters, numbers, and symbols represent individual atoms or bonds that can be interpreted by computer software. As the system modifies a lead molecule, it expands its string representation symbol by symbol — atom by atom, and bond by bond — until it generates a final SMILES string with a higher potency for a desired property. In the end, the system may produce a final string that looks like valid SMILES but doesn’t correspond to a chemically valid molecule.
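The sketch below illustrates how fragile string-based representations can be, using the open-source RDKit toolkit (assuming it is installed): removing a single ring-closure digit from the SMILES string for cyclohexane leaves text that still looks like a plausible SMILES string but no longer parses to a valid molecule.

```python
from rdkit import Chem

valid = "C1CCCCC1"     # cyclohexane
broken = "C1CCCCC"     # same string with the closing ring label removed

print(Chem.MolFromSmiles(valid) is not None)   # True
print(Chem.MolFromSmiles(broken) is not None)  # False: the parser rejects it
```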

The researchers solve this issue by building a model that runs directly on molecular graphs, instead of SMILES strings, which can be modified more efficiently and accurately.

Powering the model is a custom variational autoencoder — a neural network that “encodes” an input molecule into a vector, which is basically a storage space for the molecule’s structural data, and then “decodes” that vector to a graph that matches the input molecule.

In the encoding phase, the model breaks each molecular graph down into clusters, or “subgraphs,” each of which represents a specific building block. Such clusters are automatically constructed using tree decomposition, a graph-theory technique in which a complex graph is mapped into a tree structure of clusters — “which gives a scaffold of the original graph,” Jin says.
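
A heavily simplified way to picture this clustering, assuming RDKit, is to treat each ring and each bond outside a ring as one building block. The paper’s actual tree-decomposition procedure is more involved, so this is only an illustration of the idea.

```python
# Simplified sketch of carving a molecular graph into clusters: each ring and
# each non-ring bond becomes one "building block." Illustrative only; not the
# paper's exact tree-decomposition algorithm.
from rdkit import Chem

mol = Chem.MolFromSmiles("CC(=O)Oc1ccccc1C(=O)O")  # arbitrary example molecule

# Ring clusters: the sets of atom indices that form each ring.
ring_clusters = [set(ring) for ring in mol.GetRingInfo().AtomRings()]

# Bond clusters: the two atoms of every bond that is not inside a ring.
bond_clusters = [{b.GetBeginAtomIdx(), b.GetEndAtomIdx()}
                 for b in mol.GetBonds() if not b.IsInRing()]

clusters = ring_clusters + bond_clusters
print(f"{len(clusters)} clusters:", clusters)
```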

Both the scaffold tree structure and the molecular graph structure are encoded into their own vectors, in which similar molecules are grouped together. This makes finding and modifying molecules an easier task.

In the decoding phase, the model reconstructs the molecular graph in a “coarse-to-fine” manner — much as a low-resolution image is gradually sharpened into a more refined version. It first generates the tree-structured scaffold, and then assembles the associated clusters (the nodes in the tree) into a coherent molecular graph. This ensures the reconstructed molecular graph is an exact replica of the original structure.

For lead optimization, the model can then modify lead molecules based on a desired property. It does so with the aid of a prediction algorithm that scores each molecule with a potency value for that property. In the paper, for instance, the researchers sought molecules with a combination of two properties — high solubility and synthetic accessibility.

Given a desired property, the model optimizes a lead molecule by using the prediction algorithm to adjust its vector — and, therefore, its structure — editing the molecule’s functional groups to achieve a higher potency score. It repeats this step over multiple iterations, until it finds the highest predicted potency score. The model then decodes a new molecule, with modified structure, from the updated vector by assembling all the corresponding clusters.
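
The loop below is a toy sketch of that idea in PyTorch. An untrained stand-in encoder and potency predictor replace the trained model, and gradient ascent nudges the latent vector toward a higher predicted score before it would be handed to the decoder; none of the names or dimensions here come from the paper.

```python
# Toy sketch of latent-space optimization, assuming PyTorch. The encoder and
# potency predictor below are untrained stand-ins; in the real model they
# would be the trained graph encoder and property-prediction network.
import torch
import torch.nn as nn

latent_dim, feat_dim = 16, 32
encoder = nn.Linear(feat_dim, latent_dim)                      # stand-in encoder
predictor = nn.Sequential(nn.Linear(latent_dim, 8),            # stand-in potency
                          nn.ReLU(), nn.Linear(8, 1))          # predictor

molecule_features = torch.randn(1, feat_dim)   # placeholder molecule representation
z = encoder(molecule_features).detach().requires_grad_(True)

optimizer = torch.optim.Adam([z], lr=0.05)
for step in range(100):
    optimizer.zero_grad()
    score = predictor(z)            # predicted potency of the current latent vector
    (-score.mean()).backward()      # ascend: minimize the negative score
    optimizer.step()

# In the full model, z would now be decoded back into a molecular graph.
print("final predicted score:", predictor(z).item())
```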

Valid and more potent

The researchers trained their model on 250,000 molecular graphs from the ZINC database, a collection of 3-D molecular structures available for public use. They tested the model on tasks to generate valid molecules, find the best lead molecules, and design novel molecules with increased potencies.

In the first test, the researchers’ model generated 100 percent chemically valid molecules from a sample distribution, compared to SMILES models that generated 43 percent valid molecules from the same distribution.

The second test involved two tasks. First, the model searched the entire collection of molecules to find the best lead molecule for the desired properties — solubility and synthetic accessibility. In that task, the model found a lead molecule with 30 percent higher potency than those found by traditional systems. The second task involved modifying 800 molecules for higher potency while keeping them structurally similar to the lead molecule. In doing so, the model created new molecules closely resembling the lead’s structure, with an average improvement in potency of more than 80 percent.

The researchers next aim to test the model on more properties, beyond solubility, which are more therapeutically relevant. That, however, requires more data. “Pharmaceutical companies are more interested in properties that fight against biological targets, but they have less data on those. A challenge is developing a model that can work with a limited amount of training data,” Jin says.

Faster analysis of medical images

Algorithm makes the process of comparing 3-D scans up to 1,000 times faster.

Medical image registration is a common technique that involves overlaying two images, such as magnetic resonance imaging (MRI) scans, to compare and analyze anatomical differences in great detail. If a patient has a brain tumor, for instance, doctors can overlap a brain scan from several months ago onto a more recent scan to analyze small changes in the tumor’s progress.

This process, however, can often take two hours or more, as traditional systems meticulously align each of potentially a million pixels in the combined scans. In a pair of upcoming conference papers, MIT researchers describe a machine-learning algorithm that can register brain scans and other 3-D images more than 1,000 times more quickly using novel learning techniques.

The algorithm works by “learning” while registering thousands of pairs of images. In doing so, it acquires information about how to align images and estimates a set of optimal alignment parameters. After training, it uses those parameters to map all the pixels of one image to another, all at once. This reduces registration time to a minute or two on a normal computer, or to less than a second on a GPU, with accuracy comparable to state-of-the-art systems.

“The tasks of aligning a brain MRI shouldn’t be that different when you’re aligning one pair of brain MRIs or another,” says co-author on both papers Guha Balakrishnan, a graduate student in MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL) and Department of Electrical Engineering and Computer Science (EECS). “There is information you should be able to carry over in how you do the alignment. If you’re able to learn something from previous image registration, you can do a new task much faster and with the same accuracy.”

The papers are being presented at the Conference on Computer Vision and Pattern Recognition (CVPR), held this week, and at the Medical Image Computing and Computer Assisted Interventions Conference (MICCAI), held in September. Co-authors are: Adrian Dalca, a postdoc at Massachusetts General Hospital and CSAIL; Amy Zhao, a graduate student in CSAIL; Mert R. Sabuncu, a former CSAIL postdoc and now a professor at Cornell University; and John Guttag, the Dugald C. Jackson Professor in Electrical Engineering at MIT.

Retaining information

MRI scans are basically hundreds of stacked 2-D images that form massive 3-D images, called “volumes,” containing a million or more 3-D pixels, called “voxels.” Therefore, it’s very time-consuming to align all voxels in the first volume with those in the second. Moreover, scans can come from different machines and have different spatial orientations, meaning matching voxels is even more computationally complex.

“You have two different images of two different brains, put them on top of each other, and you start wiggling one until one fits the other. Mathematically, this optimization procedure takes a long time,” says Dalca, senior author on the CVPR paper and lead author on the MICCAI paper.

This process becomes particularly slow when analyzing scans from large populations. For neuroscientists analyzing variations in brain structures across hundreds of patients with a particular disease or condition, for instance, registration alone could take hundreds of hours.

That’s because those algorithms have one major flaw: They never learn. After each registration, they dismiss all data pertaining to voxel location. “Essentially, they start from scratch given a new pair of images,” Balakrishnan says. “After 100 registrations, you should have learned something from the alignment. That’s what we leverage.”

The researchers’ algorithm, called “VoxelMorph,” is powered by a convolutional neural network (CNN), a machine-learning approach commonly used for image processing. These networks consist of many nodes that process image data and other information across several layers of computation.

In the CVPR paper, the researchers trained their algorithm on 7,000 publicly available MRI brain scans and then tested it on 250 additional scans.

During training, brain scans were fed into the algorithm in pairs. Using a CNN and a modified computation layer called a spatial transformer, the method captures similarities between voxels in one MRI scan and voxels in the other scan. In doing so, the algorithm learns information about groups of voxels — such as anatomical shapes common to both scans — which it uses to calculate optimized parameters that can be applied to any scan pair.

When fed two new scans, a simple mathematical “function” uses those optimized parameters to rapidly calculate the exact alignment of every voxel in both scans. In short, the algorithm’s CNN component gains all the necessary information during training so that, for each new registration, the entire alignment can be executed with a single, easily computable function evaluation.
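
The sketch below, assuming PyTorch, shows what such a single function evaluation can look like in two dimensions (the actual method operates on 3-D volumes): a displacement field, here random rather than predicted by the network, warps every pixel of a stand-in image at once.

```python
# Minimal 2-D sketch of the "one function evaluation" warping step, assuming
# PyTorch. A real VoxelMorph-style network predicts the displacement field from
# the two scans; here a random field and a random image stand in for the
# prediction and the MRI volume.
import torch
import torch.nn.functional as F

H, W = 64, 64
moving = torch.rand(1, 1, H, W)                 # stand-in for the scan to be warped
displacement = 0.05 * torch.randn(1, H, W, 2)   # stand-in displacement field (normalized units)

# Identity sampling grid in normalized [-1, 1] coordinates.
ys, xs = torch.meshgrid(torch.linspace(-1, 1, H),
                        torch.linspace(-1, 1, W), indexing="ij")
identity = torch.stack((xs, ys), dim=-1).unsqueeze(0)   # shape (1, H, W, 2)

# Warp every pixel at once: one cheap, differentiable function evaluation.
warped = F.grid_sample(moving, identity + displacement, align_corners=True)
print(warped.shape)  # torch.Size([1, 1, 64, 64])
```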

The researchers found their algorithm could accurately register all of their 250 test brain scans — those registered after the training set — within two minutes using a traditional central processing unit, and in under one second using a graphics processing unit.

Importantly, the algorithm is “unsupervised,” meaning it doesn’t require additional information beyond image data. Some registration algorithms incorporate CNN models but require a “ground truth,” meaning another traditional algorithm is first run to compute accurate registrations. The researchers’ algorithm maintains its accuracy without that data.

The MICCAI paper develops a refined VoxelMorph algorithm that “says how sure we are about each registration,” Balakrishnan says. It also guarantees the registration “smoothness,” meaning it doesn’t produce folds, holes, or general distortions in the composite image. The paper presents a mathematical model that validates the algorithm’s accuracy using something called a Dice score, a standard metric to evaluate the accuracy of overlapped images. Across 17 brain regions, the refined VoxelMorph algorithm scored the same accuracy as a commonly used state-of-the-art registration algorithm, while providing runtime and methodological improvements.
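
For reference, a Dice score can be computed in a few lines of NumPy; the segmentations below are random stand-ins for real anatomical label maps, used only to show the calculation.

```python
# Dice score between two binary segmentations of the same brain region.
import numpy as np

def dice(a, b):
    """Dice coefficient: 2*|A and B| / (|A| + |B|); 1.0 means perfect overlap."""
    a, b = a.astype(bool), b.astype(bool)
    denom = a.sum() + b.sum()
    return 2.0 * np.logical_and(a, b).sum() / denom if denom else 1.0

seg_a = np.random.rand(64, 64, 64) > 0.5   # stand-in segmentation from scan A
seg_b = np.random.rand(64, 64, 64) > 0.5   # stand-in segmentation from scan B
print(round(dice(seg_a, seg_b), 3))
```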

Beyond brain scans

The speedy algorithm has a wide range of potential applications in addition to analyzing brain scans, the researchers say. MIT colleagues, for instance, are currently running the algorithm on lung images.

The algorithm could also pave the way for image registration during operations. Various scans of different qualities and speeds are currently used before or during some surgeries. But those images are not registered until after the operation. When resecting a brain tumor, for instance, surgeons sometimes scan a patient’s brain before and after surgery to see if they’ve removed all the tumor. If any bit remains, they’re back in the operating room.

With the new algorithm, Dalca says, surgeons could potentially register scans in near real-time, getting a much clearer picture of their progress. “Today, they can’t really overlap the images during surgery, because it will take two hours, and the surgery is ongoing,” he says. “However, if it only takes a second, you can imagine that it could be feasible.”

"There is a ton of work using existing deep learning frameworks/loss functions with little creativity or imagination. This work departs from that mass of research with a very clever formulation of nonlinear warping as a learning problem ... [where] learning takes hours, but applying the network takes seconds," says Bruce Fischl, a professor in radiology at Harvard Medical School and a neuroscientist at Massachusetts General Hospital. "This is a case where a big enough quantitative change [of image registration] — from hours to seconds — becomes a qualitative one, opening up new possibilities such as running the algorithm during a scan session while a patient is still in the scanner, enabling clinical decision making about what types of data needs to be acquired and where in the brain it should be focused without forcing the patient to come back days or weeks later."

Fischl adds that his lab, which develops open-source software tools for neuroimaging analysis, hopes to use the algorithm soon. "Our biggest drawback is the length of time it takes us to analyze a dataset, and by far the most computationally intensive portion of that analysis is nonlinear warping, so these tools are of great interest to me," he says.

The autonomous “selfie drone”

Alumni’s video-capturing drone tracks moving subjects while freely navigating any environment.

If you’re a rock climber, hiker, runner, dancer, or anyone who likes recording themselves while in motion, a personal drone companion can now do all the filming for you — completely autonomously.

Skydio, a San Francisco-based startup founded by three MIT alumni, is commercializing an autonomous video-capturing drone — dubbed by some as the “selfie drone” — that tracks and films a subject, while freely navigating any environment.

Called R1, the drone is equipped with 13 cameras that capture omnidirectional video. It launches and lands through an app — or by itself. On the app, the R1 can also be preset to certain filming and flying conditions or be controlled manually.

The concept for the R1 started taking shape almost a decade ago at MIT, where the co-founders — Adam Bry SM ’12, Abraham Bacharach PhD ’12, and Matt Donahoe SM ’11 — first met and worked on advanced, prize-winning autonomous drones. Skydio launched in 2014 and is releasing the R1 to consumers this week.

“Our goal with our first product is to deliver on the promise of an autonomous flying camera that understands where you are, understands the scene around it, and can move itself to capture amazing video you wouldn’t otherwise be able to get,” says Bry, co-founder and CEO of Skydio.

Deep understanding

Existing drones, Bry says, generally require a human pilot. Some offer pilot-assist features that aid the human controller. But that’s the equivalent of having a car with adaptive cruise control — which automatically adjusts vehicle speed to maintain a safe distance from the cars ahead, Bry says. Skydio, on the other hand, “is like a driverless car with level-four autonomy,” he says, referring to the second-highest level of vehicle automation.

R1’s system integrates advanced algorithm components spanning perception, planning, and control, which give it unique intelligence “that’s analogous to how a person would navigate an environment,” Bry says.

On the perception side, the system uses computer vision to determine the location of objects. Using a deep neural network, it compiles information on each object and identifies each individual by, say, clothing and size. “For each person it sees, it builds up a unique visual identification to tell people apart and stays focused on the right person,” Bry says.

That data feeds into a motion-planning system, which pinpoints a subject’s location and predicts their next move. It also recognizes maneuvering limits in one area to optimize filming. “All information is constantly traded off and balanced … to capture a smooth video,” Bry says.

Finally, the control system takes all information to execute the drone’s plan in real time. “No other system has this depth of understanding,” Bry says. Others may have one or two components, “but none has a full, end-to-end, autonomous [software] stack designed and integrated together.”
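
As a purely generic illustration of how such a pipeline fits together (this is not Skydio's software, and every class and method name here is a made-up placeholder), a perception-planning-control loop can be sketched as follows.

```python
# Generic skeleton of a perception -> planning -> control loop, purely to
# illustrate how the three stages connect. Not Skydio's code; all names and
# values are invented placeholders.
class Perception:
    def locate_subject(self, frames):
        # Would run detection and re-identification on the camera frames.
        return {"position": (0.0, 0.0, 1.5), "velocity": (0.1, 0.0, 0.0)}

class Planner:
    def plan(self, subject_state, obstacles):
        # Would predict the subject's motion and pick a collision-free camera path.
        return {"waypoint": (1.0, 0.5, 2.0)}

class Controller:
    def command(self, plan):
        # Would convert the planned waypoint into low-level flight commands.
        return {"thrust": 0.6, "attitude": (0.0, 0.02, 0.0)}

perception, planner, controller = Perception(), Planner(), Controller()
for frames in [["frame_0"], ["frame_1"]]:          # stand-in camera input
    state = perception.locate_subject(frames)
    plan = planner.plan(state, obstacles=[])
    print(controller.command(plan))
```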

For users, the end result, Bry says, is a drone that’s as simple to use as a camera app: “If you’re comfortable taking pictures with your iPhone, you should be comfortable using R1 to capture video.”

A user places the drone on the ground or in their hand, and swipes up on the Skydio app. (A manual control option is also available.) The R1 lifts off, identifies the user, and begins recording and tracking. From there, it operates completely autonomously, staying anywhere from 10 to 30 feet from a subject on its own, or up to 300 feet away under manual control, depending on Wi-Fi availability.

When batteries run low, the app alerts the user. Should the user not respond, the drone will find a flat place to land itself. After the flight — which can last about 16 minutes, depending on speed and use — users can store captured video or upload it to social media.

Through the app, users can also switch between several cinematic modes. For instance, with “stadium mode,” for field sports, the drone stays above and moves around the action, following selected subjects. Users can also direct the drone where to fly (in front, to the side, or constantly orbiting). “These are areas we’re now working on to add more capabilities,” Bry says.

The lightweight drone can fit into an average backpack and runs about $2,500.

Skydio takes wing

Bry came to MIT in 2009, “when it was first possible to take a [hobby] airplane and put super powerful computers and sensors on it,” he says.

He joined the Robust Robotics Group, led by Nick Roy, an expert in drone autonomy. There, he met Bacharach, now Skydio’s chief technology officer, who that year was on a team that won the Association for Unmanned Vehicles International contest with an autonomous minihelicopter that navigated the aftermath of a mock nuclear meltdown. Donahoe was a friend and graduate student at the MIT Media Lab at the time.

In 2012, Bry and Bacharach helped develop autonomous-control algorithms that could calculate a plane’s trajectory and determine its “state” — its location, physical orientation, velocity, and acceleration. In a series of test flights, a drone running their algorithms maneuvered around pillars in the parking garage under MIT’s Stata Center and through the Johnson Athletic Center.

These experiences were the seeds of Skydio, Bry says: “The foundation of the [Skydio] technology, and how all the technology works and the recipe for how all of it comes together, all started at MIT.”

After graduation, in 2012, Bry and Bacharach took jobs in industry, landing at Google’s Project Wing delivery-drone initiative — a couple years before Roy was tapped by Google to helm the project. Seeing a need for autonomy in drones, in 2014, Bry, Bacharach, and Donahoe founded Skydio to fulfill a vision that “drones [can have] enormous potential across industries and applications,” Bry says.

For the first year, the three co-founders worked out of Bacharach’s dad’s basement, getting “free rent in exchange for helping out with yard work,” Bry says. Working with off-the-shelf hardware, the team built a “pretty ugly” prototype. “We started with a [quadcopter] frame and put a media center computer on it and a USB camera. Duct tape was holding everything together,” Bry says.

But that prototype landed the startup a seed round of $3 million in 2015. Additional funding rounds over the next few years — more than $70 million in total — helped the startup hire engineers from MIT, Google, Apple, Tesla, and other top tech firms.

Over the years, the startup refined the drone and tested it in countries around the world — experimenting with high and low altitudes, heavy snow, fast winds, and extreme high and low temperatures. “We’ve really tried to bang on the system pretty hard to validate it,” Bry says.

Athletes, artists, inspections

Early buyers of Skydio’s first product are primarily athletes and outdoor enthusiasts who record races, training, or performances. For instance, Skydio has worked with Mikel Thomas, Olympic hurdler from Trinidad and Tobago, who used the R1 to analyze his form.

Artists, however, are also interested, Bry adds: “There’s a creative element to it. We’ve had people make music videos. It was themselves in a driveway or forest. They dance and move around and the camera will respond to them and create cool content that would otherwise be impossible to get.”

In the future, Skydio hopes to find other applications, such as inspecting commercial real estate, power lines, and energy infrastructure for damage. “People have talked about using drones for these things, but they have to be manually flown and it’s not scalable or reliable,” Bry says. “We’re going in the direction of sleek, birdlike devices that are quiet, reliable, and intelligent, and that people are comfortable using on a daily basis.”