Andrey Gavrilov: Move from Pandas to Spark. Adaptation of machine learning models to work in a distributed environment
Nowadays, Data Science and Big Data are two of the most frequent buzzwords in the field of working with data. Data Science is mainly devoted to analyzing information and making data-driven decisions, while Big Data is about processing large volumes of data and bringing them into a manageable form. Because the two areas are adjacent, the number of tasks at their intersection naturally grows. In other words, engineers are increasingly faced with the challenge of operationalizing ML models. The objective often consists of adapting ML models to work in a distributed environment.
The report presents approaches to replacing implementations of machine learning algorithms with distributed analogues. In particular, Word2vec, a group of related models used to produce word embeddings, is compared in its Gensim implementation with its analogue from the distributed machine learning library MLlib (PySpark). A comparative analysis of the results of the singular value decomposition procedure is carried out for the implementations in PySpark MLlib and Scikit-learn (TruncatedSVD). The issues of distributed training (in an HDInsight cluster) of neural networks implemented with the Keras library (TensorFlow) are also considered.
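One practical detail when comparing SVD results from two libraries is that singular vectors are defined only up to sign, so a direct element-wise comparison can report large differences even when both decompositions agree. The following is a minimal sketch of a sign-aware comparison, not the method used in the report. It assumes distinct singular values, and the helper names `align_signs` and `compare_svd` are hypothetical. Both "implementations" are simulated with NumPy here; in practice the inputs might come from, for example, the `singular_values_` and `components_` attributes of Scikit-learn's TruncatedSVD and from the `s` and `V` fields returned by MLlib's `RowMatrix.computeSVD`.

```python
import numpy as np


def align_signs(reference, candidate):
    """Flip columns of `candidate` so each points the same way as in `reference`."""
    # Sign of the dot product between matching columns; treat 0 as +1.
    signs = np.sign(np.sum(reference * candidate, axis=0))
    signs[signs == 0] = 1.0
    return candidate * signs


def compare_svd(s_a, v_a, s_b, v_b):
    """Return (max abs diff of singular values, max abs diff of sign-aligned vectors).

    `v_a` and `v_b` hold right singular vectors as columns.
    """
    s_a, s_b = np.asarray(s_a, dtype=float), np.asarray(s_b, dtype=float)
    v_a, v_b = np.asarray(v_a, dtype=float), np.asarray(v_b, dtype=float)
    sigma_diff = float(np.max(np.abs(s_a - s_b)))
    vector_diff = float(np.max(np.abs(v_a - align_signs(v_a, v_b))))
    return sigma_diff, vector_diff


# Stand-in data (an assumption for illustration): a small random matrix.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 6))
k = 3

# "Implementation A": top-k factors from NumPy's SVD.
_, s, vt = np.linalg.svd(X, full_matrices=False)
s_ref, v_ref = s[:k], vt[:k].T  # columns are right singular vectors

# "Implementation B": the same factors with some column signs flipped,
# which is still a valid decomposition of X.
s_other = s_ref.copy()
v_other = v_ref * np.array([1.0, -1.0, -1.0])

sigma_diff, vector_diff = compare_svd(s_ref, v_ref, s_other, v_other)
print("singular value diff:", sigma_diff)
print("aligned vector diff:", vector_diff)
```

Without the alignment step, the flipped columns alone would make the two sets of vectors look different, although they describe the same decomposition.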
Key Words: Data Science, Big Data, Python, Spark, PySpark, MLlib, Word2vec, Scikit-learn, Keras, TensorFlow, Neural networks, SVD
Andrey Gavrilov
St. Petersburg, Russia
Big Data Software Engineer
EPAM
I work with Big Data and Data Science at EPAM. I studied Data Science at Peter the Great St. Petersburg Polytechnic University in the Department of Applied Mathematics. I am interested in Python game development as well as information security.