Hacker News

The whole Data Science / Jupyter ecosystem reminds me so much of the old days of "MATLAB to C++": the Python ecosystem was supposed to be the best of both worlds and ease the transition from prototyping to production. Laziness won once again.

Data scientists need to be trained in software engineering skills, and software engineers need to be trained in data science skills. This is the only way we avoid ending up with nonsense like this.



> Data scientists need to be trained with software engineering skills, software engineers need to be trained with data science skills.

It's a nice idea, but it turns out to be a pretty big ask. Particularly at my employer, where a large proportion of new hires come straight out of college. Data scientists have usually studied something like economics, math, statistics, or physics; most of them haven't been introduced to software engineering at all. We try to bring new hires up to speed, but there's only so much we can do with a series of relatively short sessions on Python and git.

Similarly, software engineers don't necessarily have the requisite background to understand the kind of work data scientists do. They'll have had a few semesters of calculus, but it's likely they won't have had much, if any, exposure to data analysis or machine learning. They might not have even had a stats course in college. Further, in my experience they have had little inclination to understand how data scientists work, or how their software products may or may not fit data scientists' needs.

Opining for a moment here ...

I've had the privilege of working for a few years in a position that kind of straddles the line between data scientist and software engineer (though I was technically a data scientist), and part of that job was mentorship and training. Getting good code out of data scientists and software engineers can be tough to do. I've seen nearly as much messy, uncommented, unformatted, unoptimized code from engineers as I have from data scientists; the difference is that when I make recommendations to data scientists, they actually listen.

I'm just glad engineers finally started using the internal libraries I maintain rather than their own questionable alternatives (though if I never have the "why aren't you pinning exact dependencies for your library? My code broke!" discussion again, it'll be too soon.)
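The pinning complaint above usually comes down to the difference between library and application dependency declarations. A sketch of the distinction, with hypothetical package names and version bounds:

```toml
# Library metadata (pyproject.toml): declare compatible *ranges* so that
# consumers' resolvers can pick versions that work with the rest of their
# environment. Exact pins here are what breaks downstream code.
[project]
dependencies = [
    "numpy>=1.24,<3",
    "pandas>=2.0",
]

# Applications, by contrast, pin exact versions -- but in their own lock
# file (pip-compile output, poetry.lock, etc.), not in a library's metadata.
```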


Would you have recommendations / pointers for this?

I straddle both worlds - I’m much more on the tech side, but sometimes interact with scientists / MATLAB codebases.

What data science skills / methodologies would be useful for me to learn?

P.S. And what did you mean by MATLAB to C++? That was a specific time frame (I suppose in the early 2000s ish) when C++ was taught to scientists in the hope they’d be able to productionize their MATLAB code? With not great results (i.e. C++ learning curve + lack of software engineering skills…?) Thanks!


From my experience in the fields of DSP / Data / AI during the last 10 years, issues arise when product teams are segregated into jobs (one guy for initial prototype, then one guy to prep the integration, then one guy for the ops side of things, etc.): people need to be interested and involved in the product they are building end-to-end! Yes this is more demanding, yes this requires perpetual training, but gosh it is rewarding!

My take (non-exhaustive) with the current ecosystem is to apply Agile and DevOps methodologies:

- Use Git everywhere, all the time, always
- Use Jupyter early on: great for quick prototypes & demos, keynotes, training material, articles
- Once the initial prototype is approved, archive Jupyter notebooks as snapshots
- Write functional tests (ideally in a TDD fashion)
- Build and/or integrate the work into a real software product, be it in Python, C++, Java, etc.
- Use tools for deterministic behavior (package managers, Docker, etc.)
- Use CI/CD and GitOps methodologies
- Deliver iteratively, fast and reliably
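The functional-test step above can be sketched as follows: the prototype notebook's logic lives in a plain function, and a pytest-style test pins down its contract before integration. All names here (`clean_prices`, the fill semantics) are illustrative, not from the thread.

```python
def clean_prices(raw, fill=0.0):
    """Drop negative prices, replace missing values with `fill`."""
    return [fill if p is None else p for p in raw if p is None or p >= 0]

def test_clean_prices():
    # Negative entries are dropped, None is imputed with the fill value.
    assert clean_prices([10.0, -1.0, None, 12.5]) == [10.0, 0.0, 12.5]

test_clean_prices()
```

Once a test like this exists, the notebook snapshot can be archived without fear: the behavior it demonstrated is now enforced by CI rather than by re-running cells.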

And "MATLAB to C++" was a reference to a time (the 2010s) when corporations were deeply invested in MATLAB licenses, could not easily afford to switch to Python, and lots of SWEs with applied-math backgrounds had to deal with MATLAB code written by pure-math folks without any consideration for SWE best practices or target product constraints. Nowadays, if the target product is also in Python, there is far less friction, hopefully :)


What’s your recommendation in terms of tooling for cases where it’s not just prototype -> production, but an iterative process? I love notebooks for prototyping, but I find it’s a lot of work to make sure notebook code and prod code are in sync. Maybe just debugging with IPython?


When you've "productionized" a part of your notebook into a Python module, refactor your notebook to use the module instead. Usually, the notebook code will shrink by 80% and will switch to model documentation and demo.
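A sketch of that refactor (module and function names are illustrative): the heavy lifting moves into an importable module, and what remains in the notebook is an import, a demo call, and prose.

```python
# myproject/features.py -- logic extracted from the notebook
def build_features(rows):
    """Add a derived ratio feature to each record."""
    return [{**r, "ratio": r["a"] / r["b"]} for r in rows]

# What remains in the notebook cell after refactoring:
demo = build_features([{"a": 2.0, "b": 4.0}])
print(demo)
```

The notebook now reads as documentation of the module rather than a second, divergent copy of it.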


Yeah, that’s basically what I do, but I often find I need to play around with intermediate data within functions.


I create my own classes for this. (Essentially to do the same thing as sklearn pipelines, but I like creating my own classes just for this debugging/slowly expand functionality reason.) Something like:

    class MyModel:
        ...
        def feature_engineering(self): ...
        def impute_missing(self): ...
        def fit(self): ...
        def predict_proba(self): ...

Then it is pretty trivial to test with new data. And you can parameterize the things you want, e.g. init fit method as random forest or xgboost, or stack on different feature engineering, etc. And for debugging you can extract/step through the individual methods.
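A minimal, dependency-free sketch of that pattern (all names are illustrative, not the commenter's actual code): the estimator and the feature steps are constructor arguments, so swapping random forest for xgboost is one line at init time, and each stage can be called in isolation when debugging.

```python
class MyModel:
    """Hand-rolled pipeline sketch; estimator and steps are injected."""

    def __init__(self, estimator, feature_steps=()):
        self.estimator = estimator          # e.g. a random forest or xgboost model
        self.feature_steps = feature_steps  # ordered callables applied to X

    def feature_engineering(self, X):
        for step in self.feature_steps:
            X = step(X)  # the intermediate X can be inspected here
        return X

    def fit(self, X, y):
        self.estimator.fit(self.feature_engineering(X), y)
        return self

    def predict_proba(self, X):
        return self.estimator.predict_proba(self.feature_engineering(X))
```

Because `feature_engineering` is a public method, "play around with intermediary data" becomes a direct call rather than a debugger session.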


This is a blind guess, but if you need to inspect the inner data of your function after writing it, that might mean its scope is too broad and the function could be split?

This is where standard SWE guidelines could help (function interfaces, contract definitions, etc.)
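For instance (with illustrative names), instead of one broad function whose intermediate data is only visible in a debugger, splitting it makes the intermediate a first-class return value with its own interface:

```python
def normalize(values):
    """The intermediate you wanted to inspect, now a function of its own."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def score(values, weights):
    """Consumes the normalized intermediate through an explicit interface."""
    return sum(n * w for n, w in zip(normalize(values), weights))
```

Now "playing around with the intermediate" is just calling `normalize` directly in a notebook cell.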


> Data scientists need to be trained with software engineering skills, software engineers need to be trained with data science skills

That's kind of like saying the solution to liability issues arising in the practice of medicine is for physicians to learn lawyer skills and lawyers to learn physician skills.

It's a great idea, if you ignore the costs to get people trained and the narrowing of the pool for each affected profession.

Heck, it's hard enough to get software engineers who work almost entirely on systems where a major part of the lifting is done by an RDBMS to learn database skills.


Yeah training data scientists seems like the answer but in reality it’s just not feasible most of the time. Data science is really hard, and good engineering is really hard. Very few people can do both well.



