By: Rachel Brodkin (Praelexis Intern, 2022)
Praelexis runs an exciting internship program. Twice a year (June/July and Dec/Jan) we are joined by bright young minds and the mutual knowledge sharing commences. Interns at Praelexis join our teams and work on real industry problems, while upskilling for their careers as Data Scientists. We have asked Rachel Brodkin to share some of her impressions after joining Praelexis as an intern. Rachel holds a BSc (Hons) degree in Statistics from UCT and it has been tremendous having her in the office. But let’s hear it from Rachel herself:
Practical Tooling and Working in SWAT
Interning at Praelexis provides hands-on exposure to the practical skills needed to become an excellent data scientist. These skills are both technical and non-technical and include:
- Proficient programming in Python,
- Building data pipelines and writing Structured Query Language (SQL) queries,
- Version control using Git,
- Strong communication skills for explaining the value of a model to the client and ensuring alignment (Agile and Test-Driven Development are tools that aid in this),
- Extensive Machine Learning domain knowledge, and
- Working with cloud resources (such as AWS and Azure).
One of the many teams at Praelexis is SWAT. Working in SWAT means being comfortable with constantly switching between new and exciting projects, which requires continuous innovation, creativity and efficient time management. Each project begins with identifying the client's core need and proposing a bespoke end-to-end Machine Learning solution to solve the problem and delight the customer. Explicitly defining the customer's target and setting a goal for each sprint ensures alignment within the team as well as with the client. The culture within Praelexis is warm, collaborative and light-hearted: everyone can feel comfortable reaching out to ask for help, upskilling themselves and learning from others. The rich diversity of people at Praelexis brings different perspectives to complex problems. Collaborative learning is one of the values of Praelexis.
Prior to interning, my experience at university was mostly limited to neatly organized, structured data that was used to train and test Machine Learning models. Real-world data, however, requires extensive cleaning and wrangling: the goal is to ensure the collected data contains no duplicates or missing entries. Preprocessing the data for optimal quality is one of the most important steps in the process.
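The kind of cleaning described above can be sketched with pandas (the toy dataset and column names here are invented for illustration):

```python
import pandas as pd

# Toy dataset with the two problems mentioned above:
# an exact duplicate row and a missing entry.
raw = pd.DataFrame({
    "customer_id": [1, 2, 2, 3],
    "age": [34, 41, 41, None],
})

clean = (
    raw
    .drop_duplicates()           # remove exact duplicate rows
    .dropna(subset=["age"])      # drop rows with missing values
    .reset_index(drop=True)
)
print(len(clean))  # 2 rows survive
```

In practice one would also validate types, ranges and encodings, but deduplication and missing-value handling are usually the first pass.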
Structured Query Language (SQL) is used to perform various operations on data stored in databases. SQL is also essential for data wrangling and forms part of the data pipeline. A data pipeline is the set of processes that convert raw data into simplified, understandable data; pipelines automate this flow of data from start to end to produce precise business solutions.
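A minimal illustration of SQL-based wrangling, using Python's built-in sqlite3 module (the table and data are invented for the example):

```python
import sqlite3

# An in-memory database stands in for a real data store.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("alice", 10.0), ("bob", 5.0), ("alice", 7.5)],
)

# A typical wrangling step: aggregate raw rows into a per-customer summary.
rows = conn.execute(
    "SELECT customer, SUM(amount) FROM orders "
    "GROUP BY customer ORDER BY customer"
).fetchall()
print(rows)  # [('alice', 17.5), ('bob', 5.0)]
```

In a real pipeline the same kind of query would run against a production database, with the results feeding the next stage of processing.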
Version control, also known as source control, is the practice of tracking and managing changes to software code. The most widely used version control system is Git, which lets team members collaborate on code and work in parallel. Git keeps track of all prior versions of the code; this history acts as a backup and makes it possible to revert when a merge conflict or a bug creeps in. The Agile methodology of software development is crucial to successful data science product development. The Praelexian way of work implements Agile methodologies to allow for continuous iteration of development and testing, and continuous collaboration between team members and project stakeholders helps ensure alignment with business goals. Agile ceremonies such as daily standups, retrospective meetings, sprint planning and sprint reviews are some of the tools used. Test-Driven Development (TDD), in which tests are written before the code they verify, fits in well with the Agile methodology and helps keep the code correct and well designed.
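As a toy illustration of the TDD loop, one writes a failing test first and only then the simplest implementation that makes it pass (the function and its behaviour here are invented for the example):

```python
# Step 1: write the test first -- it fails until the code exists.
def test_normalize_name():
    assert normalize_name("  Ada LOVELACE ") == "ada lovelace"

# Step 2: write the simplest implementation that makes the test pass.
def normalize_name(raw: str) -> str:
    """Trim surrounding whitespace and lowercase a name."""
    return raw.strip().lower()

test_normalize_name()  # passes silently
```

The cycle then repeats: add a new failing test, make it pass, and refactor with the tests as a safety net.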
The internship provided an immersive learning experience; the best way to learn is by doing, and sometimes by being thrown in the deep end too. Programming in Python is one of the most important skills for a data scientist. Like many statisticians, my background was primarily in the R language. R is dedicated to statistics and data analysis and is primarily used for data visualization, whereas Python is a more general-purpose language used for development and model deployment. While both languages are equipped with packages for deep learning, Python generally performs faster than R and is the better option for building applications at speed. Coming from an R programming background, the learning curve for Python was manageable: its syntax reads almost like English and is easy to grasp.
Working with cloud resources such as AWS, Azure and Google Cloud is practically useful and a necessary skill for a working data scientist. Constantly upskilling oneself to master these tools is important.
The Machine Learning Project
Once I had grasped Python, the next step was to build a Machine Learning project. The task was to create an Optical Character Recognition (OCR) web app that could receive an uploaded or captured image and 'read' the text embedded in the photo. Natural Language Processing (NLP) techniques are then applied to gain valuable insights from the text extracted from the image. It seemed like a daunting task at the beginning, but once broken up into steps, the project became more manageable.
In order to ensure optimal performance of a Machine Learning application, deploying and testing the model is crucial. One of the simplest ways to do this is by building a web interface. An application programming interface (API) allows two applications to communicate with each other: it exposes data and functionality so that other programs can consume it over the internet.
FastAPI is a high-performance Python framework for building APIs, which makes it easy to test the functionality of models at production level. Bundled with interactive documentation, it has a gentle learning curve. An API endpoint is the point at which an API connects with a software program: clients send requests to an endpoint on a web server or application and receive a response.
The OCR component of the API used two endpoints: a POST request to submit a picture taken with the webcam, and a GET request to retrieve the text predicted by Tesseract. Tesseract is an open-source OCR engine that has been sponsored by Google since 2006 and is based on an LSTM (Long Short-Term Memory) neural network.
Once the endpoint had successfully output the text embedded in the image, the next step was to use Natural Language Processing (NLP) to further analyze the text.
The pre-processing part of NLP can be implemented in the following steps:
- Word tokenization
Word tokenization means breaking up text strings into smaller chunks, such as words or sentences.
- Stop word removal
Stop words are words that carry little information about the text itself and are therefore noise.
- Stemming
Stemming is a process of linguistic normalisation which reduces words to their root form.
- Lemmatization
Lemmatization reduces words to their lemma, the dictionary form of the word.
After preprocessing (using the open-source Python package nltk), the remaining words were used to visualize the text. This was achieved by constructing a frequency plot of the word counts and creating a word cloud of the filtered words.
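The tokenization, stop-word removal and counting steps can be sketched in plain Python (the stop-word list below is a tiny illustrative subset; in the project, nltk supplied the tokenizer and the full list):

```python
from collections import Counter

text = "The model reads the text and the model counts the words"

# Word tokenization: split the string into lowercase word tokens.
tokens = text.lower().split()

# Stop word removal: drop words that carry little information.
stop_words = {"the", "and", "a", "of", "to"}  # tiny illustrative subset
filtered = [t for t in tokens if t not in stop_words]

# Frequency counts, as used for the frequency plot and word cloud.
counts = Counter(filtered)
print(counts.most_common(1))  # [('model', 2)]
```

Stemming or lemmatization would slot in between the filtering and counting steps, so that inflected forms of the same word are counted together.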
I learnt an abundance of technical and practical skills while interning at Praelexis, including using Python for Machine Learning, version control with Git, working with TensorFlow, and building an optical character recognition (OCR) web app from scratch using an API. I learnt that you never truly stop learning as a data scientist! Overall, my internship at Praelexis provided hands-on exposure to real-world data and Machine Learning. The company culture is engaging, collaborative and friendly. Being surrounded by so much knowledge and expertise provided a motivating environment, one which I am very grateful to have been a part of.