To Share or Not to Share: That is the Big Data Question

Between the disclosures this year about Facebook’s lax data sharing policies and the European Union’s GDPR (General Data Protection Regulation), a lot of people are talking about data privacy and consumer rights. How much data should you share as a consumer with companies like Facebook or Google?

But what about businesses?

Enterprise organizations may be facing a data privacy dilemma of their own: should they share their corporate data with partners, vendors, or other organizations? If so, what data is OK to share, and what should they keep private and proprietary? After all, data is the new oil. Amazon, Facebook, and Google have all built multibillion-dollar companies by collecting and leveraging data.

Although data is one of the most valuable assets a company may have, there may be compelling reasons to share it, too. For instance, leading-edge cancer centers could potentially speed up society’s effort to cure cancer if they shared the data that each of them has collected. But sharing it with a competitor could also erode their own competitive edge in the market.

Organizations may also be considering participation in a vendor program, such as one under development at SAP called Data Intelligence, which will anonymize enterprise customer data and allow those customers to benchmark themselves against the rest of the market.

“People are realizing that the data they have has some value, either for internal purposes or selling to a data partner, and that is leading to more awareness of how they can share data anonymously,” Mike Flannagan of SAP told InformationWeek in an interview earlier this year. He said that different companies are at different levels of maturity in terms of how they think about their data.

Even if the data you share to train an algorithm has been anonymized, the question remains whether you are giving away your competitive edge by sharing it. Organizations need to be careful.

“Data is extremely valuable,” said Ali Ghodsi, co-founder and CEO of Databricks (the big data platform that got its start offering hosted Spark) and an adjunct professor at the University of California, Berkeley. In Ghodsi’s experience, organizations don’t want to share their data, but they are willing to sell access to it. For instance, organizations might sell limited access to particular data sets for a finite period of time.

There are also data aggregators, companies that create data sets to sell by scraping the web, Ghodsi said.

Then there are older companies that may have years or decades of data that have not been exposed yet to applied AI and machine learning, Ghodsi said, and those companies may hope to use those gigantic data sets to catch up and gain a competitive edge. For instance, any retailer with a loyalty card may have aggregated data over 10 or 20 years.

In Ghodsi’s experience, organizations want more data but are unwilling to share their own, sometimes even internally. In many organizations, IT controls access to the data and may not always be willing to say yes to all the requests from data scientists in the line-of-business areas. That’s among the topics in a December 2017 paper co-authored by Ghodsi and other researchers from UC Berkeley titled “A Berkeley View of Systems Challenges for AI.” Ghodsi said that the group is doing research to find ways to incentivize companies to share more of their data. One of those ways is through the model itself: a machine learning model is a very compact summary of all the data it was trained on.
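
To make the idea of a model as a compact summary concrete, here is a minimal sketch in Python. It is not taken from the Berkeley paper or from Ghodsi's remarks; the loyalty-card scenario, the made-up numbers, and the helper names fit_line and predict are all assumptions chosen for illustration. A company fits a simple model on 10,000 private records and hands over only the two fitted numbers, not the records themselves.

```python
import random

def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def predict(model, x):
    """Apply a shared (slope, intercept) pair to a new input."""
    slope, intercept = model
    return slope * x + intercept

random.seed(0)

# Synthetic stand-in for private loyalty-card data (hypothetical):
# store visits per year and annual spend for 10,000 customers.
visits = [random.uniform(1, 50) for _ in range(10_000)]
spend = [12.0 * v + 30.0 + random.gauss(0, 5) for v in visits]

# Only these two numbers would leave the company; the raw rows stay private.
shared_model = fit_line(visits, spend)

# A partner can use the shared model without ever seeing the records.
estimate = predict(shared_model, 20)
```

In this toy case 20,000 private values are reduced to a slope and an intercept that a partner can still use to make estimates. Real models are far larger, and a shared model can still reveal information about the data it was trained on, so this sketch illustrates the compression idea rather than a privacy guarantee.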

Read the source article in InformationWeek.