Email: info@araniconsulting.com
- Unit 1: Introduction to Data Science
- What Data Science is and what a data scientist does, examples of Data Science across industries and how Python is used for Data Science applications, the steps in the Data Science process such as data wrangling, data exploration and model selection, understanding data visualization, exploratory data analysis and hypothesis building, plotting and other techniques.
- Introduction to the Python programming language, important Python features, how Python differs from other programming languages, Python installation, the Anaconda Python distribution for Windows, Linux and Mac, how to run a sample Python script, how a Python IDE works, running some basic Python commands, Python variables, data types, and keywords.
- Introduction to basic constructs in Python, understanding indentation with tabs and spaces, code comments with the pound (#) character, names and variables, Python built-in data types such as containers (list, set, tuple and dict), numeric (float, complex, int), text sequence (str), constants (True, False, Ellipsis) and others (classes, instances, modules, exceptions and more), basic operators in Python such as logical, bitwise, assignment, comparison and more, slicing and the slice operator, loop and control statements such as if, else, for, break, continue, range() and more.
- Understanding the OOP paradigm, including encapsulation, inheritance, polymorphism and abstraction, access modifiers, classes and objects, instances and class members, function parameters and return types, lambda expressions, connecting to a database to pull data.
- Introduction to mathematical computing in Python, what arrays and matrices are, array indexing, array math, the ndarray object, data types, standard deviation, conditional probability in NumPy, correlation, covariance.
- Introduction to SciPy, which builds on top of NumPy, the characteristics of SciPy, its sub-packages such as signal, integrate, fftpack, cluster, optimize, stats and more, Bayes' theorem with SciPy.
- Introduction to Machine Learning with Python, Python tools used for Machine Learning such as NumPy, Scikit-Learn, Pandas, Matplotlib and more, use cases of Machine Learning, the process flow of Machine Learning, the categories of Machine Learning, understanding Linear Regression and Logistic Regression, what gradient descent is in Machine Learning, introduction to Pandas DataFrames, importing data from JSON, CSV, Excel and SQL databases, converting a NumPy array to a DataFrame, data operations such as selecting, filtering, sorting, viewing, joining and combining, how to handle missing values, time series analysis.
- What is a data object and its basic functionalities, using Pandas library for data manipulation, NumPy dependency of Pandas library, loading and handling data with Pandas, how to merge data objects, concatenation and various types of joins on data objects, exploring and analyzing datasets.
- Using Matplotlib for plotting graphs and charts like Scatter, Bar, Pie, Line, Histogram and more, Matplotlib API, Subplots and Pandas built-in data visualization.
- What is supervised learning, classification, Decision Tree, algorithm for Decision Tree induction, Confusion Matrix, Random Forest, Naïve Bayes, how Naïve Bayes works, how to implement a Naïve Bayes Classifier, Support Vector Machine, how a Support Vector Machine works, what is Hyperparameter Optimization, comparing Random Search with Grid Search, how to implement a Support Vector Machine for classification.
- Introduction to unsupervised learning, use cases of unsupervised learning, what is K-means clustering, understanding the K-means clustering algorithm, optimal clustering, hierarchical clustering and how it works, what is natural language processing, working with NLP on text data, setting up the environment using Jupyter Notebook, analyzing sentences, the Scikit-Learn Machine Learning algorithms, the bag-of-words model, extracting features from text, grid search, model training, multiple parameters and building a pipeline.
- Introduction to web scraping in Python, various web scraping libraries, the BeautifulSoup and Scrapy Python packages, installing BeautifulSoup, installing the Python parser lxml, creating a soup object from input HTML, searching the tree, full or partial parsing, and printing the output.
- Why integrate Python with Hadoop and Spark, the basics of the Hadoop ecosystem, Hadoop Common, the architecture of MapReduce and HDFS and writing Python code for MapReduce jobs on the Hadoop framework, understanding Apache Spark, setting up the Cloudera QuickStart VM, Spark tools, RDDs in Spark, PySpark, integrating PySpark with Jupyter Notebook, introduction to Artificial Intelligence and Deep Learning, deploying Spark code with Python, Spark's Machine Learning library MLlib, deploying Spark MLlib for classification, clustering and regression.
- Industry: General
- Problem Statement: How to analyze the trends and most popular baby names
- Topics: In this Python project you will work with data that the United States Social Security Administration (SSA) has made available on the frequency of baby names from 1880 through 2016. The project requires analyzing the data using different methods. You will visualize the most frequent names, determine naming trends, and identify the most popular names for a given year.
- Highlights:
- Analyzing data using Pandas Library
- Applying DataFrame manipulation
- Bar & box plots with Matplotlib
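The analysis steps in this project (most frequent names, naming trends, most popular name for a given year) can be sketched with Pandas as follows. This is a minimal sketch under stated assumptions: the column names (`name`, `sex`, `count`, `year`), the helper `most_popular`, and the sample rows are all hypothetical and made up for illustration; they are not real SSA figures, and the actual dataset layout may differ.

```python
import pandas as pd

# Tiny made-up sample; the column layout is an assumption, not the
# confirmed SSA format, and the counts are illustrative only.
records = [
    ("Mary", "F", 10, 1880),
    ("Anna", "F", 6, 1880),
    ("John", "M", 9, 1880),
    ("Mary", "F", 8, 1881),
    ("Anna", "F", 9, 1881),
    ("John", "M", 7, 1881),
]
names = pd.DataFrame(records, columns=["name", "sex", "count", "year"])


def most_popular(df, year, sex):
    """Return the name with the highest count for the given year and sex."""
    subset = df[(df["year"] == year) & (df["sex"] == sex)]
    return subset.loc[subset["count"].idxmax(), "name"]


# Total count per name across all years, most frequent first.
totals = names.groupby("name")["count"].sum().sort_values(ascending=False)

# Naming trend: one row per year, one column per name.
trend = names.pivot_table(index="year", columns="name",
                          values="count", aggfunc="sum")

print(totals)
print(trend)
print(most_popular(names, 1881, "F"))
```

With Matplotlib installed, `totals.plot(kind="bar")` would draw the frequency bar chart and `trend.plot()` would draw one trend line per name.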
- In this project you will be introduced to web scraping with Python. It covers installing Beautiful Soup and other web scraping libraries, working with common data and page formats on the web, learning the important kinds of objects such as NavigableString, and searching and navigating the parse tree: parsers, navigation options, and searching by CSS class, list, function and keyword argument.
- Industry: Telecommunications
- Problem Statement: How to increase the profitability of a telecom major by reducing the churn rate
- Topics: In this project, you will work with a telecom company’s customer dataset, which contains details of telephone subscribers. The columns hold data on the phone number, call minutes at various times of the day, the charges incurred, the lifetime account duration, and whether or not the customer has churned by unsubscribing from services. The goal is to predict whether a customer will eventually churn.
- Highlights:
- Deploy Scikit-learn ML library
- Develop code with Jupyter Notebook
- Build a model and evaluate it using performance metrics.
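A churn classifier of the kind this project describes might look like the sketch below, using Scikit-learn. Everything about the data is an assumption: the two features (`day_minutes`, `service_calls`) and the rule that generates the churn label are synthetic stand-ins, not the project's dataset, and a decision tree is only one of several classifiers that could be used. Accuracy and a confusion matrix are shown as example performance measures.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (hypothetical features, not the real dataset).
rng = np.random.default_rng(0)
n_customers = 400
day_minutes = rng.uniform(0, 350, n_customers)
service_calls = rng.integers(0, 8, n_customers)

# Made-up labelling rule: heavy users and frequent callers churn.
churned = ((day_minutes > 250) | (service_calls >= 5)).astype(int)

X = np.column_stack([day_minutes, service_calls])
X_train, X_test, y_train, y_test = train_test_split(
    X, churned, test_size=0.25, random_state=0, stratify=churned
)

# Fit a small decision tree and predict churn for the held-out customers.
model = DecisionTreeClassifier(max_depth=3, random_state=0)
model.fit(X_train, y_train)
predictions = model.predict(X_test)

# Rows of the confusion matrix are true classes, columns are predictions.
accuracy = accuracy_score(y_test, predictions)
cm = confusion_matrix(y_test, predictions)
print(f"accuracy: {accuracy:.2f}")
print(cm)
```

On real data, the feature matrix would come from the customer dataset's columns (for example via Pandas), and the held-out test split keeps the reported scores from being measured on the rows the model was trained on.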
- Objective: This project covers loading server logs into the cluster using Flume. The logs can then be refined using Pig scripts, Ambari and HCatalog, and visualized using Elasticsearch and Excel. This project task includes:
- Server logs
- Potential uses of server log data
- Pig script
- Firewall logs
- Workflow editor