Eric Ma is a Principal Data Scientist at Moderna, supporting research data science. Prior to Moderna, he was at the Novartis Institutes for Biomedical Research, conducting biomedical data science research with a focus on using Bayesian statistical methods in the service of making medicines for patients. We recently had the chance to ask him a few questions about his time at Moderna and Novartis, his creation of pyjanitor and nxviz, and about biomedical data science.

Q1: It looks like you’ve worked in a few medical/pharmaceutical companies – what’s trending there? How have things changed because of COVID?

I’m actually only on my second role out of graduate school – I started at the Novartis Institutes for BioMedical Research, and now I’m at Moderna Therapeutics. In both roles I have been a data scientist – though if we dig more granularly, I’ve played multiple roles, including algorithm developer, statistician, database wrangler, and more. I’ll tackle this question from the hiring perspective, because that’s been on my mind most recently.

From my observations on LinkedIn, I have seen growing demand for biomedical data scientists. When I first started at NIBR, I noticed that roles were often defined very generally. Nowadays, however, they are becoming more well-defined within domains: rather than seeing positions open for a “Data Scientist”, you will see titles like “Genomics Data Scientist” or “Oncology Data Scientist”. I believe that comes from a recognition that domain knowledge – in particular, a practical working knowledge of (1) how to handle commonly used data formats and (2) the data-generating processes for a given domain – is a valuable skill in someone joining a team. There’s a much lower probability that they will make elementary mistakes early on, and a much higher probability that they will be able to hit the ground running on important projects.

One thing that I can speak to from personal experience: while the vast majority of data science roles can be performed remotely, in the life sciences – where we may be interfacing with experimental teams that can’t be remote – the interaction always runs much more smoothly in person. In-person whiteboarding is one thing that is very difficult to replicate in a digital setting. Additionally, interfacing with experimentalist teams is where prior knowledge of how experiments are conducted (and hence how the data are generated) can be immensely helpful: a data scientist called in to help with experimental design who has that prior working experience can design experiments that fit within experimentalist constraints, without much guidance needed.

Q2: What are some interesting ways or use cases of you using Bayesian statistics in your job?

For one of the projects I worked on at NIBR (it is under review at a journal at the moment, and as such I am comfortable writing about it), I used hierarchical Bayesian estimation models to estimate true enzyme activity from high-throughput measurement data. We found very little precedent for this in the literature, which struck me as odd. After all, high-throughput measurement data are incredibly noisy, and the best way to account for uncertainty that arises from measurement error is through Bayesian methods!

One awesome feature of hierarchical models is the built-in regularization that we get. It naturally expresses the idea that “we don’t really believe extreme values unless they’re reproducibly measured”. We know that high-throughput measurements are notoriously noisy, meaning that we might get extreme measurements by pure random chance. Hierarchical Bayesian models help us model that uncertainty when we are trying to estimate the true value of a measurement, while also helping us avoid being fooled by those extreme values.
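This shrinkage effect can be sketched in a few lines. The snippet below is an illustrative toy, not the model from the paper: in a normal-normal hierarchical model, the posterior mean for a group has a closed form that blends the group’s observed mean with the grand mean, with the weight on the data growing with the replicate count. All parameter values here (`grand_mean`, `between_var`, `noise_var`) are made-up numbers for illustration.

```python
# Illustrative sketch of partial pooling (not the actual model from the
# paper): the posterior mean of a group's "true activity" is a
# precision-weighted average of the group's observed mean and the grand
# mean. Singleton extreme values get shrunk hard; replicated ones less so.
from statistics import mean

def partial_pool(group_values, grand_mean, between_var, noise_var):
    """Posterior mean for one group in a normal-normal hierarchy.

    Weight on the observed mean = data precision / total precision,
    so it grows with the number of replicates n.
    """
    n = len(group_values)
    obs_mean = mean(group_values)
    w = (n / noise_var) / (n / noise_var + 1 / between_var)
    return w * obs_mean + (1 - w) * grand_mean

# Two hypothetical "enzymes": one extreme value measured once, and one
# measured in triplicate. All variances are made-up for illustration.
grand_mean, between_var, noise_var = 1.0, 4.0, 4.0
lone_extreme = partial_pool([5.0], grand_mean, between_var, noise_var)
replicated = partial_pool([5.0, 4.8, 5.2], grand_mean, between_var, noise_var)
print(lone_extreme)  # → 3.0 (pulled halfway back toward the grand mean)
print(replicated)    # ≈ 4.0 (retains more of the observed signal)
```

The singleton extreme value is shrunk much harder toward the grand mean than the triplicate – exactly the “we don’t believe extremes unless they’re reproducible” behavior described above.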

Q3: You made pyjanitor – what inspired you to create it? What problems were you aiming to solve that other APIs couldn’t?

It really started when an ex-colleague of mine at NIBR, Brant Peterson, swung his laptop around and pointed to the R package janitor. The one function I remember him showing me was called clean_names. My first reaction was, “I can totally make that happen in Python and pandas!” Upon taking a deeper second look at the library, I realized that the idea of collecting a library of dataframe-cleaning functions makes total sense! There are chunks of code that I routinely and commonly use for preparing dataframes for downstream use. That was how the library initially started – like most other libraries, it was scratching an itch that I needed.
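To give a flavor of what a function like clean_names does, here is a minimal pure-Python sketch. This is not the actual pyjanitor implementation (which operates on pandas DataFrame columns and handles more cases); it just illustrates the kind of routine cleaning being collected: lowercase, strip, and snake_case column names.

```python
import re

def clean_names(columns):
    """Minimal sketch of janitor-style column-name cleaning (illustrative
    only, not the real pyjanitor function): lowercase each name, strip
    surrounding whitespace, and collapse runs of non-alphanumeric
    characters into single underscores."""
    cleaned = []
    for name in columns:
        name = name.strip().lower()
        name = re.sub(r"[^\w]+", "_", name)  # non-alphanumerics -> underscore
        cleaned.append(name.strip("_"))      # drop leading/trailing underscores
    return cleaned

print(clean_names(["Sample ID", " % Inhibition ", "Raw Value (a.u.)"]))
# → ['sample_id', 'inhibition', 'raw_value_a_u']
```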

As the project developed, I became more and more interested in the method-chaining paradigm, which was being used in other projects too. There were similar projects, but none of them aimed to provide a collection of well-defined convenience functions. I think that was what really became the pyjanitor project’s biggest attraction. The library of functions really was a library in its truest sense: practical knowledge of data cleaning housed in a collection that anyone can tap into.
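The method-chaining paradigm can be illustrated with a toy example. pyjanitor itself achieves this by registering extra methods on pandas DataFrames (so that calls like `df.clean_names().remove_empty()` read as a pipeline); the tiny class below, with made-up methods `rename_key` and `drop_missing`, just demonstrates the underlying idea that each cleaning step returns a new object, letting steps chain top-to-bottom.

```python
# Toy illustration of method chaining (not pyjanitor's actual API):
# each method returns a new Records object, so cleaning steps compose
# into a readable pipeline.
class Records:
    def __init__(self, rows):
        self.rows = rows

    def rename_key(self, old, new):
        """Rename a key in every row, returning a new Records."""
        return Records([{(new if k == old else k): v for k, v in r.items()}
                        for r in self.rows])

    def drop_missing(self, key):
        """Drop rows where `key` is missing or None, returning a new Records."""
        return Records([r for r in self.rows if r.get(key) is not None])

raw = Records([{"Sample ID": "a1", "value": 3.2},
               {"Sample ID": "b2", "value": None}])
clean = raw.rename_key("Sample ID", "sample_id").drop_missing("value")
print(clean.rows)  # → [{'sample_id': 'a1', 'value': 3.2}]
```

Because each step returns a new object rather than mutating in place, the chain reads as a declarative recipe for the cleaning, which is much of the paradigm’s appeal.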

There was one other inspiration for the project: early on, I realized that software tools were amongst the most directly impactful pieces of work, and I wanted to get practice with software tooling. pyjanitor has been an awesome practice ground for myself and for the rest of the dev team too.

Q4: Same question for nxviz – what inspired you to create it?

nxviz was also a project born out of an itch that needed scratching. I wrote a blog post that includes a history of nxviz, which you can read here. The motivations were similar to pyjanitor’s, except applied to network visualization instead.

Editor’s Note:

Eric will be teaching a tutorial on Network Analysis on the ODSC AI+ platform – learn more about NetworkX, applied network science, and graph visualization there! He will cover the basics of graph theory, how to use NetworkX, and how a variety of problems can be solved using graphs as a central data structure. You’ll walk away with a solid grounding in NetworkX and applied network science; there will surely be seeds of inspiration for your downstream work!

About Eric Ma:

Eric Ma is a Principal Data Scientist at Moderna supporting research data science. Prior to Moderna, he was at the Novartis Institutes for Biomedical Research conducting biomedical data science research with a focus on using Bayesian statistical methods in the service of making medicines for patients. Prior to Novartis, he was an Insight Health Data Fellow in the summer of 2017 and defended his doctoral thesis in the Department of Biological Engineering at MIT in the spring of 2017.

Eric is also an open-source software developer and has led the development of pyjanitor, a clean API for cleaning data in Python, and nxviz, a visualization package for NetworkX. In addition, he gives back to the open-source community through code contributions to multiple projects.