Data Curation is for Big Data what Data Integration is for Small Data
Reporter: Aviva Lev-Ari, PhD, RN
Tamr is an exciting new startup that aims to solve the data curation problem. It was co-founded in Fall 2012 as Data Tamer by two serial entrepreneurs: Michael Stonebraker, a legendary database researcher for whom this is a ninth startup, and Andy Palmer, who has been involved in founding and/or funding over 50 innovative companies. With such founders, the company has attracted a lot of financing, over $16 million from investors including Google Ventures and New Enterprise Associates (NEA), and a lot of attention, including the KDnuggets post "Data Tamer startup from Michael Stonebraker, Still in Stealth Mode".
On May 19th, Data Tamer emerged from stealth mode and renamed itself Tamr.
Last week, I stopped by their offices in the heart of Harvard Square, Cambridge, and received a briefing from Andy Palmer, Tamr CEO, and his young team, including Alan Wagner and Nidhi Aggarwal.
Tamr’s approach to solving the Data Curation problem is designed to scale and to improve with more data. The key ideas are:
1. Scalability through automation: The size of the integration problems precludes a human-centric solution. Machine Learning methods are needed.
2. Data Cleaning: Enterprise data sources are inevitably quite dirty.
3. Non-programmer orientation: Current Extract, Transform and Load (ETL) systems have scripting languages that are appropriate for professional programmers. The scale of next generation problems requires that less skilled employees be able to perform integration tasks.
4. Incremental: New data sources must be integrated incrementally as they are uncovered. Data Curation is never finished!
Tamr also smartly combines automation and human expertise.
It starts by using Machine Learning and Data Analysis algorithms to find relationships between data elements and tries to automate most data curation tasks. When machine learning is not enough, it has well-defined processes and a UI for asking human experts for help, and uses a smart rewards structure to encourage the experts.
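The division of labor described above, where the machine decides the easy cases and people handle the uncertain ones, can be sketched roughly as follows. This is only an illustration under stated assumptions, not Tamr's actual method: the thresholds, the function names (`similarity`, `triage`, `resolve_with_expert`) and the use of plain string similarity from Python's standard `difflib` module in place of a trained model are all hypothetical. The rewards structure for experts is not modeled.

```python
from difflib import SequenceMatcher
from typing import Callable, Iterable, List, Tuple

Pair = Tuple[str, str]

# Hypothetical thresholds, chosen only for illustration.
AUTO_ACCEPT = 0.90  # at or above this score, treat the pair as a match
AUTO_REJECT = 0.50  # below this score, treat the pair as a non-match


def similarity(a: str, b: str) -> float:
    """Score two strings in [0, 1] after trimming whitespace and lowercasing.

    Stand-in for a learned matching model (an assumption of this sketch).
    """
    return SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()


def triage(pairs: Iterable[Pair]) -> Tuple[List[Pair], List[Pair], List[Pair]]:
    """Split candidate pairs into auto-matched, auto-rejected and uncertain."""
    matched: List[Pair] = []
    rejected: List[Pair] = []
    needs_expert: List[Pair] = []
    for left, right in pairs:
        score = similarity(left, right)
        if score >= AUTO_ACCEPT:
            matched.append((left, right))
        elif score < AUTO_REJECT:
            rejected.append((left, right))
        else:
            needs_expert.append((left, right))
    return matched, rejected, needs_expert


def resolve_with_expert(pairs: Iterable[Pair],
                        ask: Callable[[Pair], bool]) -> List[Pair]:
    """Return the uncertain pairs that a human, modeled as a callback, confirms."""
    return [pair for pair in pairs if ask(pair)]


if __name__ == "__main__":
    candidates = [
        ("Acme Corp", "acme corp "),        # identical after normalization
        ("Acme Corporation", "Acme Corp"),  # plausible, but not certain
        ("Acme Corp", "Zebra Ltd"),         # clearly different
    ]
    matched, rejected, needs_expert = triage(candidates)
    # A real system would show each pair to a person; here every pair is confirmed.
    confirmed = resolve_with_expert(needs_expert, ask=lambda pair: True)
    print("auto-matched:", matched)
    print("auto-rejected:", rejected)
    print("confirmed by expert:", confirmed)
```

In a fuller system the expert answers would presumably also be fed back as training data, which is how such an approach could improve with more data.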
SOURCE
http://www.kdnuggets.com/2014/05/tamr-new-frontier-big-data-curation.html
This is very insightful. There is no doubt that there is the bias you refer to. 42 years ago, when I was a postdoc in biochemistry/enzymology before completing my residency in pathology, I knew that there were very influential members of the faculty, who also had large programs, and attracted exceptional students. My mentor, it was said (although he was a great writer), could draft a project on toilet paper and call the NIH. It can’t be true, but it was a time in our history preceding a great explosion. It is bizarre for me to read now about eNOS and iNOS, and about the CaMKII-α, β, γ, δ isoenzymes. They were overlooked during the search for the genome, so intermediary metabolism took a back seat. But the work on protein conformation, and on the mechanism of action of enzymes and ligand and coenzyme was just out there, and became more important with the research on signaling pathways. The work on the mechanism of pyridine nucleotide isoenzymes preceded the work by Burton Sobel on the MB isoenzyme in heart. The Vietnam War cut into the funding, and it has actually declined linearly since.
A few years later, I was an Associate Professor at a new Medical School and I submitted a proposal that was reviewed by the Chairman of Pharmacology, who was a former Director of NSF. He thought it was good enough. I was a pathologist and it went to a Biochemistry Review Committee. It was approved, but not funded. The verdict was that I would not be able to carry out the studies needed, and they would have approached it differently. A thousand young investigators are out there now with similar letters. I was told that the Department Chairmen have to build up their faculty. It’s harder now than then. So I filed for and received 3 patents based on my work at the suggestion of my brother-in-law. When I took it to Boehringer-Mannheim, they were actually clueless.