Institute for Data Science Aims to Democratize Supercomputing With NSF Grant
Ordinary people could soon have greater ability to analyze massive amounts of information, thanks to new algorithms and software tools being designed at NJIT to simplify access to a programming interface developed by data scientists at the Department of Defense.
It's relatively straightforward to analyze data sets of up to several hundred gigabytes, as the required software is readily available to students and small businesses, but there's a higher barrier to entry for working with tens of terabytes, which generally requires extensive training on high-performance computers, Institute for Data Science Director David Bader explained.
Bader anticipates that his team's efforts, supported by an award from the National Science Foundation, will greatly increase the user base for supercomputing, especially among women, high school students and other underrepresented groups in STEM fields. Those groups tend to have the least access to that power today. If the user base increases, users will demand even more tools, which could cause the industry to rethink its design motivations and democratize high-end computing systems.
To lower that barrier to entry, Bader, along with doctoral student Oliver Alvarado Rodriguez and research scientist Zhihui Du, will spend the next year extending Arkouda, a defense-derived open-source code library for Python, an everyday language taught as early as elementary school that's also used for serious applications. They will build new algorithms and software that add capabilities for common data structures such as graphs, lists, strings and trees. The software will be designed for ease of use, which hasn't been a concern of most players in the high-performance computing field.
"We have a large number of data scientists that want to manipulate data sets that are terabytes in size, and that's been a challenging issue, but there hasn't been much thought to the tooling. In the past they may use Apache tools like Hadoop and Spark … but what's different now is we have a framework that will connect to Python, have a supercomputer in the background if you want it, and the data scientist doesn't need to know about it," Bader said.
Still, there's no such thing as a free lunch. "While our work with Arkouda will bridge the frustrating gap between practical data science and [high-performance computing] technology with this application, if you're handling data sets that are tens of terabytes, it will require a high-performance computer on the other side," Bader noted. "To get this magic to work, you would need a backend system that is capable of storing and processing your data set."
But, he continued, "The hard part is solved. A programmer need not learn to program a supercomputer … That's our development work, so they can stay in Python."
Bader added that Arkouda, named after the Greek word for bear, was developed by William Reus and Mike Merrill. Reus is giving a virtual presentation, open to all in the NJIT community, on Wednesday, March 24.
Rodriguez, the doctoral student, received his undergraduate degree in computer science from William Paterson University but chose NJIT for his next step. "Dr. Bader reached out to me the week that everything closed down because of the pandemic. I did some research in undergrad in machine learning, so that started sparking my interest in that area," he said.
Rodriguez wants to become a professor one day. For now, he's enjoying the new research project. "Before I joined NJIT, I was more inclined toward doing cybersecurity research, but over the summer as I've done more work here, it's really sparked my interest in high-performance computing," he said. "Bridging that gap between laypeople and high-performance computing tools is a very important research focus."