Biggest Pain in Machine Learning? Dirty Spreadsheet Data

If you imagine the life of a machine learning researcher, you might think it’s quite glamorous. You’ll program self-driving cars, work for the biggest names in tech, and your software could even lead to the downfall of humanity. So cool! But, as a new survey of data scientists and machine learners shows, those expectations need adjusting, because the biggest challenge in these professions is something quite mundane: cleaning dirty data.

This comes from a survey conducted by data science community Kaggle (which was acquired by Google earlier this year). Some 16,700 of the site’s 1.3 million members responded to the questionnaire, and when asked about the biggest barriers faced at work, the most common answer was “dirty data,” followed by a lack of talent in the field.

But what exactly is dirty data, and why is it such a problem?

Al Chen is an Excel aficionado. Watch as he shows you how to clean up raw data for processing in Excel. This is also a great resource for data visualization projects.

It’s axiomatic to say that data is the new oil of the digital economy, but this is especially true in fields like machine learning. Contemporary AI systems generally learn by example, so if you show one a lot of pictures of a cat, over time it’ll start to recognize characteristics that constitute ‘cattyness’. This is why companies like Google and Amazon have been able to build such effective image and speech recognition platforms: they have a ton of data from users.

“There’s the joke that 80 percent of data science is cleaning the data and 20 percent is complaining about cleaning the data. In reality, it really varies. But data cleaning is a much higher proportion of data science than an outsider would expect. Actually, training models is typically a relatively small proportion (less than 10 percent) of what a machine learner or data scientist does.”

— Anthony Goldbloom, Kaggle founder and CEO

But AI systems are still computer programs, which means they’re prone to flipping out if you press the wrong button at the wrong time. This inflexibility extends to the data they can learn from. Think of these programs like fussy infants who refuse to eat unless their bananas are mashed just so. But instead of prepping bananas, workers in the field have to comb through datasets with hundreds of thousands of entries, tracking down missing values and removing any formatting errors. Making aeroplane noises while they do so is optional.
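
To make that chore a little more concrete, here is a minimal sketch of this kind of clean-up in Python, using only the standard library. Everything in it is an assumption made for illustration: the sample spreadsheet, the list of missing-value markers, and the function names (clean_cell, parse_number, clean_rows) are hypothetical and do not come from the Kaggle survey or any particular tool. Real datasets usually need far more rules than this.

```python
import csv
import io

# Strings that often stand in for "no value" in hand-edited spreadsheets.
# Illustrative only; a real project would tune this list to its own data.
MISSING_MARKERS = {"", "na", "n/a", "null", "none", "-", "?"}


def clean_cell(value):
    """Trim stray whitespace and map missing-value markers to None."""
    text = value.strip()
    if text.lower() in MISSING_MARKERS:
        return None
    return text


def parse_number(text):
    """Turn strings like '52,000' into a float; return None if not numeric."""
    if text is None:
        return None
    try:
        return float(text.replace(",", ""))
    except ValueError:
        return None


def clean_rows(csv_text, numeric_columns):
    """Return (cleaned_rows, missing_counts) for CSV text with a header row.

    Values in numeric columns that cannot be parsed are treated as missing.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    cleaned = []
    missing_counts = {}
    for row in reader:
        out = {}
        for column, raw in row.items():
            if column is None:
                # Extra cells beyond the header; skipped in this sketch.
                continue
            name = column.strip()
            value = clean_cell(raw or "")
            if name in numeric_columns:
                value = parse_number(value)
            if value is None:
                missing_counts[name] = missing_counts.get(name, 0) + 1
            out[name] = value
        cleaned.append(out)
    return cleaned, missing_counts


# A made-up spreadsheet export with typical problems: padded names,
# an "N/A" marker, an empty cell, a number typed as a word, and
# thousands separators inside quoted cells.
SAMPLE = """name,age,income
 Alice ,34,"52,000"
Bob,N/A,48000
carol,29,
Dave,forty,"61,500"
"""

rows, missing = clean_rows(SAMPLE, numeric_columns={"age", "income"})
for row in rows:
    print(row)
print("missing values per column:", missing)
```

The sketch deliberately reports missing values instead of filling them in. Whether to drop a row, impute a value, or go back to the source is a judgment call that depends on the dataset, which is a large part of why the work is so time-consuming.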
