Orientation to Computing — II
Unit 1: Data Science & Big Data
From raw data to actionable dashboards — master the data science lifecycle, Big Data tools, and start earning by building real dashboards for Indian businesses.
⏱️ Time to Complete: 8–10 hours | 💰 Earning Potential: ₹5,000–₹15,000/month | 📝 30 MCQs (Bloom's Mapped)
💼 Jobs this unlocks: Data Analyst (₹4–6 LPA) | Junior Data Scientist (₹6–10 LPA) | MIS Executive (₹3–5 LPA)
Opening Hook — The Data Behind India's Digital Revolution
🏢 How Zomato Knows What You'll Order Before You Do
Every time you open Zomato, a complex data engine fires up. It analyses your past orders, time of day, weather in your city, trending restaurants nearby, and even how long you browse before ordering. This isn't magic — it's data science. Zomato processes over 2 million orders per day across 500+ Indian cities. Their recommendation engine — powered by Big Data analytics — increases order conversion by 35%.
Behind the scenes, teams of data analysts at Zomato's Gurugram HQ use tools like Hadoop, Spark, Tableau, and Python to process terabytes of data daily. They predict delivery times to within about 2 minutes, optimise delivery partner routes, and decide which restaurant banner you see first.
What if YOU had built this? What if you could take raw data — messy, unstructured, massive — and turn it into insights that drive a ₹10,000 crore business? That's exactly what this chapter teaches you.
Learning Outcomes — Bloom's Taxonomy Mapped
| Bloom's Level | Learning Outcome |
|---|---|
| 🔵 Remember | List the 7 stages of the data science lifecycle and define Volume, Velocity, and Variety |
| 🔵 Understand | Explain how Hadoop's HDFS and MapReduce work together to process Big Data, using Indian examples |
| 🟢 Apply | Build a data dashboard in Google Sheets using real Indian census data with charts and filters |
| 🟢 Analyze | Compare Hadoop vs Spark vs cloud-based analytics and determine which suits different Indian business scenarios |
| 🟠 Evaluate | Assess the data privacy and ethical challenges in India's Aadhaar system and propose safeguards |
| 🟠 Create | Design a complete data pipeline proposal for a local Indian business, from data collection to dashboard delivery |
Concept Explanation — Data Science & Big Data from Scratch
1. The Data Science Lifecycle
Think of data science like cooking a meal. You don't just throw ingredients into a pot randomly. You first decide what to cook (problem definition), go to the market (data collection), wash and chop vegetables (data cleaning), taste and adjust (analysis), cook the dish (modelling), serve it (deployment), and get feedback (monitoring). Data science follows the exact same logical flow.
📊 The 7 Stages of the Data Science Lifecycle
Stage 1 — Problem Definition: What business question are we answering? Example: "Why are Swiggy deliveries taking 45+ minutes in Pune on weekends?"
Stage 2 — Data Collection: Gathering raw data from databases, APIs, web scraping, sensors, surveys. Swiggy collects GPS data, order timestamps, restaurant prep times, traffic data.
Stage 3 — Data Cleaning (Wrangling): Real data is messy: missing values, duplicates, wrong formats. By common industry estimates, 60–80% of a data scientist's time goes here. Think of it as removing stones from rice before cooking.
Stage 4 — Exploratory Data Analysis (EDA): Visualise the data. Find patterns. "Aha, weekend delays spike between 7–9 PM in areas with narrow lanes!" Use charts, histograms, scatter plots.
Stage 5 — Modelling: Apply algorithms. Build a prediction model: "Given restaurant location, time, weather, predict delivery time." This is where machine learning often enters.
Stage 6 — Deployment: Put the model into production. Now Swiggy's app shows you accurate ETAs in real-time.
Stage 7 — Monitoring & Maintenance: Track model accuracy. Retrain when data patterns change (festivals, new roads, COVID lockdowns).
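Stages 3 and 4 can be sketched in a few lines of Python. This is a minimal illustration, not Swiggy's actual pipeline: the sample records, field layout, and function names (`clean`, `average_by_hour`) are all invented for this example.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical raw delivery records: (order_id, city, hour_of_day, delivery_minutes)
raw_orders = [
    ("A1", "Pune", 19, "52"),
    ("A1", "Pune", 19, "52"),   # duplicate order ID
    ("A2", "pune ", 20, "48"),  # inconsistent city spelling
    ("A3", "Pune", 14, None),   # missing delivery time
    ("A4", "PUNE", 20, "55"),
    ("A5", "Pune", 11, "28"),
]

def clean(records):
    """Stage 3: drop duplicate IDs and rows with missing values, standardise formats."""
    seen = set()
    cleaned = []
    for order_id, city, hour, minutes in records:
        if order_id in seen or minutes is None:
            continue
        seen.add(order_id)
        cleaned.append((order_id, city.strip().title(), hour, int(minutes)))
    return cleaned

def average_by_hour(records):
    """Stage 4: a first EDA step, the average delivery time per hour of day."""
    by_hour = defaultdict(list)
    for _, _, hour, minutes in records:
        by_hour[hour].append(minutes)
    return {hour: mean(values) for hour, values in sorted(by_hour.items())}

cleaned_orders = clean(raw_orders)
hourly_average = average_by_hour(cleaned_orders)
print(cleaned_orders)
print(hourly_average)
```

Even on six rows, cleaning changes the answer: without it, the duplicate order would be counted twice and the missing value would break the average.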
Now YOU try it → Think of a problem at your college (e.g., "Why is the mess food wasted on Fridays?"). Write down what data you'd collect, how you'd clean it, and what chart you'd make. That's Stages 1–4 already done!
2. The 3Vs of Big Data — Volume, Velocity, Variety
Not all data is "Big Data." Your college attendance register is just regular data. But when we talk about Aadhaar managing biometric records of 1.4 billion Indians, or UPI processing 10 billion transactions per month — that's Big Data. What makes data "big"?
| V | Meaning | Indian Example | Scale |
|---|---|---|---|
| Volume | Sheer amount of data | Aadhaar biometric database | ~30 petabytes (30 million GB) |
| Velocity | Speed of data generation | UPI transactions via PhonePe/GPay | ~3,800 transactions per second |
| Variety | Different formats of data | Social media (text + images + video + location) | Structured + unstructured + semi-structured |
Analogy: Think of Big Data like the Kumbh Mela. Volume = 200 million pilgrims. Velocity = thousands arriving every minute. Variety = they come by train, bus, foot, boat — carrying different languages, needs, and demographics. Managing this crowd requires "Big Crowd" techniques, just like managing Big Data requires specialised tools.
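The scale figures in the table are easy to sanity-check with arithmetic. The sketch below assumes a 30-day month and decimal storage units (1 PB = 1,000,000 GB). It shows that 10 billion UPI transactions a month works out to an average of roughly 3,800 per second.

```python
# Velocity: convert a monthly transaction count into an average per-second rate.
transactions_per_month = 10_000_000_000  # "10 billion per month" from the text
seconds_per_month = 30 * 24 * 60 * 60    # assumption: a 30-day month
average_tps = transactions_per_month / seconds_per_month

# Volume: convert petabytes to gigabytes using decimal units.
aadhaar_petabytes = 30
aadhaar_gigabytes = aadhaar_petabytes * 1_000_000

print(f"Average UPI rate: {average_tps:,.0f} transactions per second")
print(f"Aadhaar storage: {aadhaar_gigabytes:,} GB")
```

Note that this is an average over the whole month; the rate during busy hours would be higher.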
Now YOU try it → List 3 Indian organisations that deal with Big Data and identify which V is their biggest challenge.
3. Big Data Challenges
Having lots of data isn't automatically useful. It brings serious challenges:
| Challenge | What It Means | Indian Example |
|---|---|---|
| Storage | Where do you keep petabytes of data? | UIDAI needed specialised data centres across India for Aadhaar |
| Processing | Traditional computers can't handle it. Need distributed computing. | IRCTC processes 25 million ticket requests on Tatkal day — single servers crash |
| Security | Sensitive data must be protected from breaches | Aadhaar data-leak controversies; Digital Personal Data Protection (DPDP) Act, 2023 |
| Privacy | Ethical use of personal data | Should Paytm know your spending habits and sell this to advertisers? |
| Quality | Garbage in, garbage out | Indian census data has inconsistencies across states due to language differences |
| Talent Gap | India needs 1 million+ data professionals by 2026 (NASSCOM) | Only 11% of Indian engineering graduates are trained in data skills |
4. Tools Deep-Dive: Hadoop & Apache Spark
Hadoop — The Foundation of Big Data Processing
Plain English: Imagine you have a 1,000-page book and need to count how many times the word "India" appears. One person reading alone takes 10 hours. But if you tear the book into 100 chunks and give each chunk to a different person, they all count simultaneously and report back — done in 6 minutes. That's Hadoop.
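The book-counting analogy can be sketched in plain Python. This is a single-machine illustration of the map and reduce steps, not Hadoop itself: the chunks are processed one after another here, whereas a real cluster would process them in parallel on different machines. The function names and sample pages are made up for the example.

```python
from collections import Counter
from functools import reduce

def split_into_chunks(pages, chunk_count):
    """Divide the pages into roughly equal chunks, one per worker."""
    size = -(-len(pages) // chunk_count)  # ceiling division
    return [pages[i:i + size] for i in range(0, len(pages), size)]

def map_chunk(chunk, word):
    """Map step: each worker counts the word in its own chunk only."""
    count = sum(page.lower().split().count(word.lower()) for page in chunk)
    return Counter({word: count})

def reduce_counts(left, right):
    """Reduce step: merge two partial counts into one."""
    return left + right

pages = ["India is diverse", "Data about India", "No match here", "india india"]
chunks = split_into_chunks(pages, chunk_count=2)
partial_counts = [map_chunk(chunk, "India") for chunk in chunks]
total = reduce(reduce_counts, partial_counts, Counter())
print(total["India"])
```

The same arithmetic explains the analogy's numbers: 10 hours of reading split across 100 readers is 600 minutes / 100 = 6 minutes each, plus a little time to add up the results.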
Technical Definition: Hadoop is an open-source framework for distributed storage and processing of large datasets across clusters of commodity hardware. It has three core components:
| Component | What It Does | Analogy |
|---|---|---|
| HDFS (Hadoop Distributed File System) | Stores data across multiple machines with replication | Like storing copies of your notes in 3 different lockers — if one locker breaks, you don't lose anything |
| MapReduce | Processes data in parallel: Map (split & process) → Reduce (combine results) | Like a class test where each row of students grades one question, then the teacher combines all scores |
| YARN (Yet Another Resource Negotiator) | Manages cluster resources — decides which job gets how much CPU/RAM | Like a hostel warden allocating rooms — ensuring everyone gets fair space |
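The "three lockers" idea behind HDFS replication can be sketched as follows. This is a simplified model under stated assumptions: blocks are placed round-robin on distinct nodes, while real HDFS placement also considers racks and free disk space. The node names, block IDs, and function names are hypothetical.

```python
def place_blocks(block_ids, nodes, replication=3):
    """Assign each block to `replication` distinct nodes, round-robin.

    Assumes replication <= len(nodes) so that the copies land on different nodes.
    """
    placement = {}
    for index, block in enumerate(block_ids):
        placement[block] = [
            nodes[(index + offset) % len(nodes)] for offset in range(replication)
        ]
    return placement

def readable_blocks(placement, failed_node):
    """Return the blocks that still have at least one copy after a node fails."""
    return [
        block
        for block, holders in placement.items()
        if any(node != failed_node for node in holders)
    ]

nodes = ["node1", "node2", "node3", "node4"]
placement = place_blocks(["blk_a", "blk_b", "blk_c"], nodes)
survivors = readable_blocks(placement, failed_node="node2")
print(placement)
print(survivors)
```

Because every block lives on three different nodes, losing any single node leaves all blocks readable. The cost is storage: three copies need three times the disk space.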
Apache Spark — The Faster Alternative
Hadoop's MapReduce writes intermediate results to disk (slow). Spark keeps them in memory (RAM), which can make it up to 100× faster for many tasks.
| Feature | Hadoop MapReduce | Apache Spark |
|---|---|---|
| Speed | Slower (disk-based) | Up to 100× faster (in-memory) |
| Ease of Use | Complex Java code | Simple Python/Scala APIs |
| Real-time Processing | ❌ Batch only | ✅ Batch + Streaming |
| Best For | Massive batch jobs (overnight processing) | Real-time analytics, ML pipelines |
| Used By (India) | TCS, Infosys legacy systems | Swiggy, Razorpay, Dream11 |
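The disk-versus-memory difference can be illustrated without either framework. The sketch below is an analogy in plain Python, not Spark or MapReduce code: `run_with_disk` writes every intermediate result to a temporary file and reads it back, while `run_in_memory` passes results straight to the next step. The step functions are invented for the example.

```python
import json
import os
import tempfile

def square(values):
    return [v * v for v in values]

def add_one(values):
    return [v + 1 for v in values]

def run_with_disk(values, steps):
    """MapReduce-style: write each intermediate result to disk, then read it back."""
    with tempfile.TemporaryDirectory() as folder:
        path = os.path.join(folder, "intermediate.json")
        for step in steps:
            values = step(values)
            with open(path, "w") as handle:
                json.dump(values, handle)
            with open(path) as handle:
                values = json.load(handle)
    return values

def run_in_memory(values, steps):
    """Spark-style: keep intermediate results in RAM between steps."""
    for step in steps:
        values = step(values)
    return values

steps = [square, add_one, square]
data = list(range(5))
disk_result = run_with_disk(data, steps)
memory_result = run_in_memory(data, steps)
print(disk_result == memory_result)
```

Both versions give the same answer; the only difference is the extra write and read after every step. On a multi-step job over terabytes, that extra disk traffic is where most of the time goes.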
5. Visualisation & Analysis Tools: Tableau, R, and Excel
Tableau — See Your Data Come Alive
Plain English: Tableau is like Instagram filters for data. You drag-and-drop columns, and it instantly creates beautiful charts, maps, and dashboards. No coding required.
Technical: Tableau is a visual analytics platform that connects to databases (SQL, Excel, CSV, cloud sources) and lets you create interactive dashboards through a drag-and-drop interface.
Free version: Tableau Public is free to use and hosts your dashboards on Tableau's cloud, where anyone can view them. That makes it perfect for students building portfolios, but unsuitable for confidential client data.
R Programming — The Statistician's Weapon
R is a programming language built specifically for statistics and data analysis. Think of it as Excel on steroids — you can do everything Excel does, plus advanced statistical modelling, machine learning, and publication-quality visualisations.
```r
# Load Indian population data and create a visualisation
library(ggplot2)

data <- read.csv("india_census_2011_2024.csv")

ggplot(data, aes(x = Year, y = Population, fill = State)) +
  geom_bar(stat = "identity") +
  labs(title = "India Population Growth by State")
```
Excel — Don't Underestimate the Classic
Every data career starts with Excel. 90% of Indian businesses still use Excel for data analysis. Pivot tables, VLOOKUP, conditional formatting, and charts are non-negotiable skills. Even data scientists at Flipkart use Excel for quick ad-hoc analysis before firing up Python.
6. Big Data on Cloud
Analogy: Setting up your own Hadoop cluster is like buying a diesel generator for your home. It works, but it's expensive, noisy, and you maintain it yourself. Cloud-based Big Data is like using the electricity grid — you just plug in and pay for what you use.
| Cloud Service | Provider | What It Does | Free Tier? |
|---|---|---|---|
| AWS EMR | Amazon | Managed Hadoop/Spark clusters | Limited free tier |
| Google BigQuery | Google | Serverless data warehouse: query petabytes with SQL | ✅ 1 TB of queries/month free |
| Azure HDInsight | Microsoft | Managed Hadoop, Spark, Kafka | Free credits for students |
| Databricks | Databricks | Unified analytics platform (Spark-based) | Community edition free |
7. Job Roles & Career Paths in Data Science
| Role | What They Do | Key Skills | Entry Salary (India) |
|---|---|---|---|
| Data Analyst | Clean data, create dashboards, generate reports | Excel, SQL, Tableau, basic Python | ₹4–6 LPA |
| Data Scientist | Build predictive models, run experiments | Python, R, ML, Statistics, SQL | ₹6–12 LPA |
| Data Engineer | Build data pipelines, manage infrastructure | SQL, Spark, Airflow, AWS/GCP, Python | ₹6–10 LPA |
| ML Engineer | Deploy ML models to production | Python, TensorFlow, Docker, APIs | ₹8–15 LPA |
| Business Analyst | Translate business needs to data requirements | Excel, SQL, PowerBI, Communication | ₹4–7 LPA |
| MIS Executive | Manage information systems, daily/weekly reports | Excel, VBA, SQL basics | ₹3–5 LPA |
Now YOU try it → Go to Naukri.com and search "Data Analyst fresher." Note the top 5 skills mentioned in job descriptions. You'll find that Excel, SQL, and Tableau appear in most of them.
Learn by Doing — 3-Tier Lab Structure
🟢 Tier 1 — GUIDED TASK: Build an India Population Dashboard in Google Sheets
Step 1: Open Google Sheets
Go to sheets.google.com → Click "+ Blank spreadsheet"
Step 2: Enter Indian Census Data
Create the following table (Column A = State, Column B = Population 2011, Column C = Population 2024 Est.):
| State | Pop. 2011 (Cr) | Pop. 2024 (Cr) |
|---|---|---|
| Uttar Pradesh | 19.98 | 23.5 |
| Maharashtra | 11.24 | 12.8 |
| Bihar | 10.41 | 13.1 |
| West Bengal | 9.13 | 10.2 |
| Madhya Pradesh | 7.27 | 8.7 |
| Tamil Nadu | 7.21 | 7.8 |
| Rajasthan | 6.86 | 8.2 |
| Karnataka | 6.11 | 7.0 |
| Gujarat | 6.04 | 7.1 |
| Kerala | 3.34 | 3.5 |
Step 3: Add a Growth Rate Column
In Column D, type the header "Growth %". In cell D2, enter: =(C2-B2)/B2*100. Drag the formula down for all rows.
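To check your sheet, the same growth calculation can be reproduced in Python. This is a small verification sketch using the ten rows from the table in Step 2; the dictionary layout and the `growth_percent` helper are just one way to write it.

```python
# State populations in crore, copied from the Step 2 table: (2011, 2024 estimate)
population = {
    "Uttar Pradesh": (19.98, 23.5),
    "Maharashtra": (11.24, 12.8),
    "Bihar": (10.41, 13.1),
    "West Bengal": (9.13, 10.2),
    "Madhya Pradesh": (7.27, 8.7),
    "Tamil Nadu": (7.21, 7.8),
    "Rajasthan": (6.86, 8.2),
    "Karnataka": (6.11, 7.0),
    "Gujarat": (6.04, 7.1),
    "Kerala": (3.34, 3.5),
}

def growth_percent(old, new):
    """Same calculation as the sheet formula =(C2-B2)/B2*100."""
    return (new - old) / old * 100

growth = {
    state: round(growth_percent(old, new), 1)
    for state, (old, new) in population.items()
}

# Print states from fastest to slowest growth, flagging those above 20%.
for state, pct in sorted(growth.items(), key=lambda item: item[1], reverse=True):
    flag = "  <-- above 20%" if pct > 20 else ""
    print(f"{state:15s} {pct:5.1f}%{flag}")
```

With these figures only Bihar crosses the 20% mark, so it should be the only row highlighted by the "Greater than 20" rule in Step 6.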
Step 4: Create a Bar Chart
- Select columns A, B, and C (all rows)
- Click Insert → Chart
- In the Chart editor, set Chart type to "Bar chart" (with two data columns, Sheets draws the bars grouped side by side)
- Title it: "India Population Growth by State (2011 vs 2024)"
- Under "Customize" → change the 2011 colour to blue and 2024 to orange
Step 5: Create a Growth Rate Pie Chart
- Select column A, then hold Ctrl (Cmd on Mac) and select column D
- Insert → Chart → Pie Chart
- Title: "Population Growth Rate (%) — Top 10 Indian States"
Step 6: Add Filters & Conditional Formatting
- Select all data → Data → Create a filter
- Select Column D → Format → Conditional formatting → "Greater than 20" → Set to red background
Step 7: Name your Dashboard Tab
Right-click the sheet tab → Rename to "India Population Dashboard"
🎉 Congratulations! You've built your first data dashboard. Take a screenshot — this is your first portfolio piece.
🟡 Tier 2 — SEMI-GUIDED TASK: Tableau Public Dashboard with COVID-19 India Data
Your Mission:
Create an interactive COVID-19 India dashboard on Tableau Public showing state-wise case trends.
Hints:
- Data Source: Download India COVID-19 data from api.covid19india.org (archived) or Kaggle (search "COVID-19 India dataset")
- Tool: Download Tableau Public (free) from public.tableau.com
- Connect: Open Tableau Public → Connect → Text File (CSV)
- Dashboard Elements: You need:
  - A line chart showing daily cases over time
  - A map of India showing state-wise total cases (use the "filled map" chart type)
  - A bar chart showing top 10 states by recovery rate
  - A filter for date range
- Publish: Save to Tableau Public → Get a shareable URL for your portfolio
🔴 Tier 3 — OPEN CHALLENGE: Data Pipeline Proposal for a Local Indian Business
The Brief:
Choose a real local business (your college canteen, a neighbourhood kirana store, a local coaching centre, or a small clinic). Design a complete data pipeline proposal covering:
- Problem Statement: What business question will you answer?
- Data Collection Plan: What data do you need? How will you collect it? (Google Forms, manual entry, POS system)
- Data Cleaning Strategy: What will be messy? How will you fix it?
- Analysis Plan: What charts/metrics will you create?
- Dashboard Mockup: Sketch (hand-drawn is fine) showing layout
- Tools: Google Sheets/Excel for analysis, Tableau Public for dashboard
- Business Impact: How will this save money or increase revenue?
- Budget: Total cost (hint: it should be ₹0 if using free tools)
Deliverable: A 3–5 page Google Doc proposal. This becomes a real portfolio piece and can be your first freelance pitch.
Industry Spotlight — A Day in the Life
👩‍💻 Priya Sharma, 26 — Data Analyst at Flipkart, Bangalore
Background: BCA from Chandigarh University. No coding experience before college. Self-taught Excel and SQL in 2nd year. Built 5 Tableau dashboards during internship at a local CA firm. Got placed at Flipkart through campus recruitment.
A Typical Day:
9:00 AM — Morning standup with the marketplace analytics team. Review yesterday's metrics: GMV (Gross Merchandise Value), seller performance, return rates.
10:00 AM — Pull data from Flipkart's data warehouse using SQL queries. "Show me top 20 sellers in electronics category with return rate > 15% in last 30 days."
11:30 AM — Clean the data in Python (pandas). Remove duplicates, handle null values, standardise seller names.
1:00 PM — Lunch at Flipkart's cafeteria. Discuss a new A/B test design with the product team.
2:00 PM — Build a Tableau dashboard showing seller quality scores. Present to the category manager: "These 5 sellers need quality warnings."
4:00 PM — Write a data quality report in Google Sheets. Add conditional formatting for KPIs.
5:30 PM — Learn PySpark on Databricks (company-sponsored learning hour). Working towards Data Scientist role.
| Detail | Info |
|---|---|
| Tools Used Daily | SQL (BigQuery), Python (pandas, matplotlib), Tableau, Excel, Jira |
| Entry Salary (2024) | ₹4–6 LPA + benefits |
| Mid-Level (3–5 yrs) | ₹8–15 LPA |
| Senior (7+ yrs) | ₹18–35 LPA |
| Companies Hiring | Flipkart, Swiggy, Zomato, Paytm, Meesho, TCS, Infosys, Mu Sigma, Fractal Analytics, Tiger Analytics, Latent View |
Earn With It — Freelance & Income Roadmap
💰 Your Earning Path After This Chapter
Portfolio Piece: "India Population Trend Dashboard 2011–2024" — a polished Google Sheets/Tableau dashboard with charts, filters, and growth analysis.
Beginner Gig Ideas:
• Excel/Google Sheets dashboard for local coaching centres (attendance + fee tracking) — ₹2,000–₹5,000
• Sales data dashboard for small retailers — ₹3,000–₹8,000
• Survey data analysis for NGOs/student projects — ₹1,500–₹4,000
• Monthly expense tracker dashboard for small businesses — ₹2,000–₹6,000
| Platform | Best For | Typical Rate |
|---|---|---|
| Internshala | Indian student internships & freelance projects | ₹2,000–₹8,000/project |
| Fiverr | Global clients, quick dashboard gigs | $10–$50/gig (₹800–₹4,000) |
| Upwork | Longer projects, higher rates | $15–$40/hour |
| | Direct outreach to Indian businesses | ₹3,000–₹10,000/project |
| WhatsApp/Local | Nearby shops, schools, clinics | ₹2,000–₹8,000/project |
⏱️ Time to First Earning: 2–3 weeks (if you complete Tier 1 lab and reach out to 10 local businesses)
MCQ Assessment Bank — 30 Questions (Bloom's Mapped)
Remember / Identify (Q1–Q5)
Which of the following is NOT one of the 3Vs of Big Data?
- Volume
- Velocity
- Validity
- Variety
HDFS stands for:
- Hadoop Data File Storage
- Hadoop Distributed File System
- High Data Flow System
- Hadoop Dynamic File Structure
Which stage of the data science lifecycle involves removing duplicates and handling missing values?
- Data Collection
- Modelling
- Data Cleaning
- Deployment
Tableau is primarily used for:
- Database management
- Data visualisation and dashboards
- Machine learning model training
- Web development
YARN in Hadoop stands for:
- Yet Another Resource Negotiator
- Yearly Analysis Resource Node
- YAML-based Application Resource Network
- Yield Adjusted Resource Navigator
Understand / Explain (Q6–Q10)
Why is the data cleaning stage considered the most time-consuming in the data science lifecycle?
- It requires expensive software
- Real-world data has inconsistencies, missing values, and duplicates that must be fixed before analysis
- It needs approval from management
- It involves writing machine learning algorithms
How does HDFS ensure data safety when a node in the cluster fails?
- It compresses all data before storage
- It stores encrypted backups on the cloud
- It replicates each data block across multiple nodes (default: 3 copies)
- It uses RAID arrays on each machine
Which V of Big Data does Aadhaar's 1.4 billion biometric records best represent?
- Velocity
- Variety
- Volume
- Veracity
What is the key advantage of Apache Spark over Hadoop MapReduce?
- Spark is free while Hadoop is paid
- Spark processes data in-memory, making it up to 100× faster
- Spark can only process structured data
- Spark doesn't need a cluster
Explain why Google BigQuery is described as "serverless":
- It runs without electricity
- Users don't manage any server infrastructure — Google handles everything
- It can only be used offline
- It doesn't store any data
Apply / Demonstrate (Q11–Q15)
A Pune coaching centre wants to track student attendance, fee payments, and test scores. Which tool would you recommend for a non-technical owner?
- Apache Hadoop
- Google Sheets with Pivot Tables
- TensorFlow
- MongoDB
You're building a sales dashboard and the "City" column has entries like "Delhi", "delhi", "New Delhi", "DELHI". Which data science lifecycle stage addresses this?
- Deployment
- Modelling
- Data Cleaning
- Problem Definition
To calculate growth rate percentage in Google Sheets where B2=2011 population and C2=2024 population, the correct formula is:
- =(B2-C2)/C2*100
- =(C2-B2)/B2*100
- =C2/B2
- =(C2+B2)/2*100
Swiggy wants to identify which restaurants have the highest cancellation rates in Mumbai. Which visualisation type is most appropriate?
- Pie chart of all restaurants
- Horizontal bar chart — top 20 restaurants sorted by cancellation rate
- Scatter plot of revenue vs profit
- Line chart of daily temperatures
A freelance client asks you to create a dashboard showing monthly expenses across 5 categories. You have data in CSV format. Describe the correct workflow:
- Build an ML model → Deploy on AWS → Send link
- Import CSV into Google Sheets → Clean data → Create pivot table → Build charts → Share link
- Print the CSV and highlight numbers manually
- Upload to GitHub and write Python code
Analyze / Compare (Q16–Q20)
Compare Hadoop MapReduce and Apache Spark. In which scenario would Hadoop MapReduce still be preferred over Spark?
- Real-time fraud detection at Razorpay
- Processing overnight batch jobs on a low-RAM cluster with petabytes of log data
- Building a recommendation engine for Swiggy
- Interactive data exploration in Jupyter notebooks
Differentiate between structured, semi-structured, and unstructured data with Indian examples:
- All data is structured if stored in a computer
- Structured = Excel/SQL tables (Aadhaar records); Semi-structured = JSON/XML (Swiggy API responses); Unstructured = images, videos, tweets (Instagram posts)
- Unstructured data cannot be analysed
- Semi-structured data is always stored in Hadoop
Examine why UPI transaction data represents the "Velocity" V of Big Data rather than "Volume":
- UPI doesn't generate much data
- UPI data is generated at extreme speed (3,800+ transactions per second), requiring real-time processing rather than just large storage
- UPI only processes text data
- UPI uses Hadoop for storage
Compare Excel and Tableau for a freelance data dashboard project. When would you choose Excel over Tableau?
- When the client needs interactive web-based dashboards
- When the client already uses Excel daily and needs a simple solution they can maintain themselves
- When the dataset exceeds 50 million rows
- When geographic map visualisations are required
Analyse the difference between a Data Analyst and a Data Scientist role at an Indian IT company:
- They are the same role with different titles
- Data Analysts focus on descriptive analytics (what happened), while Data Scientists build predictive models (what will happen)
- Data Scientists only use Excel
- Data Analysts earn more than Data Scientists
Evaluate / Judge (Q21–Q25)
UIDAI claims Aadhaar data is completely secure. A journalist discovers that Aadhaar numbers of 100 million Indians were leaked through an unsecured government API. Evaluate the failure:
- The data was too big to secure
- Hadoop was the wrong technology
- The failure was in API security and access control, not in the database technology itself
- Aadhaar should not collect data at all
A startup CEO says: "We don't need data cleaning. Our ML model will learn from noisy data." Judge this statement:
- Correct — modern ML handles everything automatically
- Incorrect — "Garbage in, garbage out" applies. Noisy data leads to inaccurate models with poor predictions
- Partially correct — only deep learning needs clean data
- Correct if they use Hadoop
Paytm uses customer transaction data to show personalised loan offers. Evaluate the ethical implications:
- No ethical issues — it's just marketing
- Potential concerns: user consent, data privacy under DPDP Act, targeting vulnerable users with high-interest loans, algorithmic bias
- Ethical only if Paytm earns profit
- Unethical because Paytm is a private company
An Indian company has 500 GB of sales data and is considering building an on-premise Hadoop cluster vs using Google BigQuery. Justify the better choice:
- Hadoop — because it's open-source and free
- BigQuery — because 500 GB is small, no infrastructure management is needed, the free tier covers the first 1 TB of queries each month, and results come in seconds
- Neither — Excel can handle 500 GB
- Hadoop — because it's faster than cloud
Critique the claim: "India doesn't need data scientists because AI will automate all analysis."
- Valid — AI replaces all human analysis
- Invalid — AI automates routine tasks but humans are needed for problem framing, ethical judgement, domain expertise, and communicating insights to business stakeholders
- Valid for small companies only
- Invalid because AI doesn't work in India
Create / Design (Q26–Q30)
Design a data collection strategy for a college canteen that wants to reduce food waste. Which approach is most comprehensive?
- Ask students to fill a Google Form once a year
- Track daily menu items, quantity prepared, quantity consumed, leftover weight, day of week, weather, and special events — using a simple Google Sheet updated by canteen staff
- Install an AI camera system costing ₹10 lakh
- Count the number of students entering the canteen
Propose a dashboard design for IRCTC to display real-time train booking statistics. What elements should it include?
- Just a table with numbers
- Live counter of bookings/minute, heat map of top routes, bar chart of class-wise bookings (1AC/2AC/SL), alerts for high-demand trains, and a trend line of daily bookings
- A pie chart of all 13,000 trains
- A single number showing total bookings today
You're creating a Fiverr gig for "Excel Dashboard Creation for Indian Small Businesses." Which gig description is most likely to attract clients?
- "I know Excel. Contact me."
- "I will create a professional, interactive Excel dashboard with charts, pivot tables, and conditional formatting for your sales, inventory, or financial data. Delivery in 48 hours. Includes 1 revision. Starting at ₹2,000."
- "Expert in Hadoop and Spark. Can build anything."
- "Free dashboards for everyone!"
Build a mini data pipeline for a neighbourhood kirana store. Which sequence is correct?
- Deploy ML model → Collect data → Build dashboard
- Define problem (which products expire unsold?) → Collect daily sales data via Google Sheets → Clean & organise → Analyse with pivot tables → Create dashboard showing slow-moving products → Present to owner with recommendations
- Buy Hadoop servers → Hire 5 data scientists → Start analysis
- Build a mobile app → Collect biometric data → Use AI
Design a data-driven solution for Zomato to identify restaurants that are likely to shut down within 6 months. What data points would you propose collecting?
- Only the restaurant name
- Monthly order volume trend, average rating trend, customer complaint frequency, delivery time reliability, menu price changes, competitor density in the area, owner response rate to reviews
- Just the current star rating
- GPS coordinates only
Short Answer Questions (2–3 marks each)
Question 1 (2 marks)
Define the 3Vs of Big Data and give one Indian example for each.
Question 2 (3 marks)
Explain the difference between HDFS and MapReduce in Hadoop. How do they work together to process large datasets?
Question 3 (2 marks)
Why is data cleaning considered the most time-consuming step in the data science lifecycle? Give two real-world examples of dirty data.
Question 4 (3 marks)
Compare Apache Spark and Hadoop MapReduce on three parameters: speed, ease of use, and real-time processing capability.
Question 5 (2 marks)
List any four job roles in Data Science and mention one key skill required for each role.
Long Answer & Case Studies (10 marks each)
📋 Case Study 1: Aadhaar — The World's Largest Biometric Database (10 marks)
India's Aadhaar system, managed by UIDAI, stores biometric data (fingerprints, iris scans) and demographic information for over 1.4 billion residents. It processes over 100 million authentication requests daily for services like bank account verification, SIM card activation, and government subsidy disbursement (DBT).
Answer the following:
- (2 marks) Identify which V of Big Data is the most significant challenge for Aadhaar. Justify your answer with data.
- (3 marks) Describe the data pipeline that runs when a citizen uses Aadhaar for eKYC at a bank. Cover data flow from the fingerprint scanner to the authentication response.
- (2 marks) Discuss two data privacy concerns with Aadhaar and how the DPDP Act 2023 addresses them.
- (3 marks) Propose a data analytics dashboard for UIDAI that shows: authentication success rates by state, peak usage hours, and failure categories. Sketch the dashboard layout and specify chart types.
📋 Case Study 2: UPI/NPCI — Real-Time Transaction Analytics at 10 Billion Scale (10 marks)
India's Unified Payments Interface (UPI), managed by NPCI, processed 13.89 billion transactions worth ₹20.64 lakh crore in a single month (March 2024). Apps like PhonePe, Google Pay, and Paytm together handle ~3,800 transactions per second during peak hours.
Answer the following:
- (2 marks) Explain why UPI's primary Big Data challenge is Velocity rather than Volume. How does this affect technology choices?
- (3 marks) Design a real-time fraud detection data pipeline for UPI. Specify: what data points you'd collect per transaction, what analytics you'd run, and what action the system should take when fraud is detected.
- (2 marks) Compare how NPCI could use Hadoop vs Spark for transaction analytics. Which is more appropriate and why?
- (3 marks) Create a data-driven strategy for NPCI to predict UPI downtime before it happens. What historical data would you analyse? What patterns would indicate impending failure?
Chapter Summary — Tweet-Sized Bullet Points
📝 Key Takeaways
- 📊 Data Science lifecycle: Problem → Collect → Clean → Analyse → Model → Deploy → Monitor. Cleaning takes 60-80% of time.
- 📦 Big Data = 3Vs: Volume (Aadhaar = 30PB), Velocity (UPI = 3,800 txn/sec), Variety (text + images + video).
- 🐘 Hadoop = distributed storage (HDFS) + parallel processing (MapReduce) + resource management (YARN).
- ⚡ Spark can be up to 100× faster than MapReduce because it keeps intermediate data in memory instead of writing it to disk.
- 📈 Tableau = drag-and-drop visual analytics. Free version (Tableau Public) is perfect for portfolios.
- ☁️ Cloud Big Data (BigQuery, AWS EMR) eliminates infrastructure hassle. BigQuery's free tier covers 1 TB of queries per month.
- 💼 Fastest career path: Excel + SQL + Tableau → Data Analyst at ₹4-6 LPA. No coding required initially.
- 💰 Start earning NOW: Build Excel dashboards for local businesses. First gig: ₹2,000-₹8,000 on Internshala.
- 🇮🇳 India needs 1 million+ data professionals by 2026 — massive opportunity for BCA/B.Tech students.
- 🏗️ Portfolio piece from this chapter: "India Population Trend Dashboard 2011-2024" on Google Sheets.
My Earning Checkpoint — Self-Assessment
| Skill Learned | Tool Practised | Portfolio Item Added | Gig Ready? |
|---|---|---|---|
| Data Science Lifecycle | Google Sheets | India Population Dashboard | ✅ Yes — can explain lifecycle to clients |
| 3Vs of Big Data | Conceptual | — | ✅ Yes — can discuss in interviews |
| Data Cleaning | Google Sheets / Excel | Cleaned Census Dataset | ✅ Yes — cleaning is a billable skill |
| Dashboard Creation | Google Sheets, Tableau Public | Population Dashboard + COVID Dashboard | ✅ Yes — ₹2,000–₹8,000/project |
| Data Visualisation | Charts, Pivot Tables | Bar charts, Pie charts, Filters | ✅ Yes — ready for Internshala gigs |
| Hadoop/Spark Concepts | Conceptual (no hands-on yet) | — | ⬜ Not yet — need PySpark practice |
| Data Pipeline Design | Google Docs (proposal) | Business Data Pipeline Proposal | ✅ Yes — can pitch to local businesses |
✅ Unit 1 complete. Ready for Unit 2: AI & Machine Learning!