Orientation to Computing — II

Unit 1: Data Science & Big Data

From raw data to actionable dashboards — master the data science lifecycle, Big Data tools, and start earning by building real dashboards for Indian businesses.

⏱️ Time to Complete: 8–10 hours  |  💰 Earning Potential: ₹5,000–₹15,000/month  |  📝 30 MCQs (Bloom's Mapped)

💼 Jobs this unlocks: Data Analyst (₹4–6 LPA)  |  Junior Data Scientist (₹6–10 LPA)  |  MIS Executive (₹3–5 LPA)

Section A

Opening Hook — The Data Behind India's Digital Revolution

🏢 How Zomato Knows What You'll Order Before You Do

Every time you open Zomato, a complex data engine fires up. It analyses your past orders, time of day, weather in your city, trending restaurants nearby, and even how long you browse before ordering. This isn't magic — it's data science. Zomato processes over 2 million orders per day across 500+ Indian cities. Their recommendation engine — powered by Big Data analytics — increases order conversion by 35%.

Behind the scenes, teams of data analysts at Zomato's Gurugram HQ use tools like Hadoop, Spark, Tableau, and Python to process terabytes of data daily. They predict delivery times to within about 2 minutes, optimise delivery partner routes, and decide which restaurant banner you see first.

What if YOU had built this? What if you could take raw data — messy, unstructured, massive — and turn it into insights that drive a ₹10,000 crore business? That's exactly what this chapter teaches you.

🇮🇳 Zomato  |  🇮🇳 Flipkart  |  🇮🇳 Reliance Jio  |  🇮🇳 Paytm  |  🇮🇳 Swiggy  |  🇮🇳 UIDAI (Aadhaar)
India generates 20% of the world's data but has only 5% of the world's data scientists. This means massive demand and fewer competitors. A data-literate student in India has an extraordinary career advantage right now. The Indian data analytics market is expected to reach $118 billion by 2030 (NASSCOM, 2024).
Section B

Learning Outcomes — Bloom's Taxonomy Mapped

Bloom's Level | Learning Outcome
🔵 Remember | List the 7 stages of the data science lifecycle and define Volume, Velocity, and Variety
🔵 Understand | Explain how Hadoop's HDFS and MapReduce work together to process Big Data, using Indian examples
🟢 Apply | Build a data dashboard in Google Sheets using real Indian census data with charts and filters
🟢 Analyze | Compare Hadoop vs Spark vs cloud-based analytics and determine which suits different Indian business scenarios
🟠 Evaluate | Assess the data privacy and ethical challenges in India's Aadhaar system and propose safeguards
🟠 Create | Design a complete data pipeline proposal for a local Indian business, from data collection to dashboard delivery
Section C

Concept Explanation — Data Science & Big Data from Scratch

1. The Data Science Lifecycle

Think of data science like cooking a meal. You don't just throw ingredients into a pot randomly. You first decide what to cook (problem definition), go to the market (data collection), wash and chop vegetables (data cleaning), taste and adjust (analysis), cook the dish (modelling), serve it (deployment), and get feedback (monitoring). Data science follows the exact same logical flow.

📊 The 7 Stages of the Data Science Lifecycle

Stage 1 — Problem Definition: What business question are we answering? Example: "Why are Swiggy deliveries taking 45+ minutes in Pune on weekends?"

Stage 2 — Data Collection: Gathering raw data from databases, APIs, web scraping, sensors, surveys. Swiggy collects GPS data, order timestamps, restaurant prep times, traffic data.

Stage 3 — Data Cleaning (Wrangling): Real data is messy. Missing values, duplicates, wrong formats. 60–80% of a data scientist's time goes here. Think of it as removing stones from rice before cooking.

Stage 4 — Exploratory Data Analysis (EDA): Visualise the data. Find patterns. "Aha, weekend delays spike between 7–9 PM in areas with narrow lanes!" Use charts, histograms, scatter plots.

Stage 5 — Modelling: Apply algorithms. Build a prediction model: "Given restaurant location, time, weather, predict delivery time." This is where machine learning often enters.

Stage 6 — Deployment: Put the model into production. Now Swiggy's app shows you accurate ETAs in real-time.

Stage 7 — Monitoring & Maintenance: Track model accuracy. Retrain when data patterns change (festivals, new roads, COVID lockdowns).
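
To make Stage 3 concrete, here is a minimal sketch of data cleaning in plain Python. The records, field names, and the `clean_orders` helper are all made up for illustration; they are not Swiggy's actual data or code. The sketch assumes three simple rules: drop rows with a missing delivery time, drop repeated order IDs, and standardise city names.

```python
# Hypothetical delivery records: made-up values and illustrative field names.
raw_orders = [
    {"order_id": 101, "city": "Pune", "delivery_min": 42},
    {"order_id": 102, "city": " pune ", "delivery_min": 55},
    {"order_id": 102, "city": " pune ", "delivery_min": 55},   # exact duplicate
    {"order_id": 103, "city": "PUNE", "delivery_min": None},   # missing value
    {"order_id": 104, "city": "Mumbai", "delivery_min": 38},
]


def clean_orders(rows):
    """Return rows with tidy city names, no repeated IDs, no missing times."""
    seen_ids = set()
    cleaned = []
    for row in rows:
        if row["delivery_min"] is None:      # drop rows with a missing value
            continue
        if row["order_id"] in seen_ids:      # drop repeated order IDs
            continue
        seen_ids.add(row["order_id"])
        city = row["city"].strip().title()   # " pune " and "PUNE" become "Pune"
        cleaned.append({**row, "city": city})
    return cleaned


clean = clean_orders(raw_orders)
for row in clean:
    print(row)
```

Five raw rows go in and three usable rows come out. On real projects the same idea is usually expressed with a library such as pandas, but the rules you choose matter more than the tool.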

Flipkart's data science team follows this exact lifecycle. During Big Billion Days, they predict which products will sell most in which pincode, pre-position inventory in nearby warehouses (demand forecasting), and dynamically price items — all using data science pipelines running on Apache Spark clusters.

Now YOU try it → Think of a problem at your college (e.g., "Why is the mess food wasted on Fridays?"). Write down what data you'd collect, how you'd clean it, and what chart you'd make. That's Stages 1–4 already done!
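
If you want to see what the analysis step of this exercise might look like, the sketch below averages food waste by weekday. The `mess_log` numbers are invented placeholders and the function name is hypothetical; swap in whatever you actually record.

```python
from collections import defaultdict

# Hypothetical mess log: (weekday, kg prepared, kg left over). Made-up numbers.
mess_log = [
    ("Mon", 50, 5), ("Tue", 50, 4), ("Wed", 50, 6),
    ("Thu", 50, 5), ("Fri", 50, 15), ("Fri", 50, 17),
]


def waste_percent_by_day(log):
    """Share of prepared food left over, per weekday, in percent."""
    prepared = defaultdict(float)
    leftover = defaultdict(float)
    for day, kg_prepared, kg_left in log:
        prepared[day] += kg_prepared
        leftover[day] += kg_left
    return {day: round(leftover[day] / prepared[day] * 100, 1) for day in prepared}


waste = waste_percent_by_day(mess_log)
for day, pct in waste.items():
    print(f"{day}: {pct}% wasted " + "#" * int(pct))   # crude text bar chart
```

With these placeholder numbers Friday stands out immediately, which is exactly the kind of pattern Stage 4 is meant to surface.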

2. The 3Vs of Big Data — Volume, Velocity, Variety

Not all data is "Big Data." Your college attendance register is just regular data. But when we talk about Aadhaar managing biometric records of 1.4 billion Indians, or UPI processing 10 billion transactions per month — that's Big Data. What makes data "big"?

V | Meaning | Indian Example | Scale
Volume | Sheer amount of data | Aadhaar biometric database | ~30 petabytes (30 million GB)
Velocity | Speed of data generation | UPI transactions via PhonePe/GPay | ~3,800 transactions per second
Variety | Different formats of data | Social media (text + images + video + location) | Structured + unstructured + semi-structured
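
The Scale column can be sanity-checked with a little arithmetic. The sketch below assumes decimal units (1 PB = 1,000,000 GB) and an evenly spread 30-day month; both are simplifying assumptions, and real traffic peaks well above the average.

```python
GB_PER_PB = 1_000_000            # decimal units: 1 PB = 1,000,000 GB
SECONDS_PER_DAY = 24 * 60 * 60


def petabytes_to_gb(petabytes):
    """Convert petabytes to gigabytes using decimal units."""
    return petabytes * GB_PER_PB


def per_second(monthly_count, days_in_month=30):
    """Average events per second, assuming an evenly spread month."""
    return monthly_count / (days_in_month * SECONDS_PER_DAY)


aadhaar_gb = petabytes_to_gb(30)
upi_tps = per_second(10_000_000_000)   # 10 billion UPI transactions a month

print(f"30 PB = {aadhaar_gb:,} GB")
print(f"10 billion/month is about {upi_tps:,.0f} transactions per second")
```

Ten billion transactions a month works out to a little under 3,900 per second on average, in line with the figure in the table.
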
The 3Vs are now 5Vs in industry. Modern frameworks add Veracity (accuracy — is the data trustworthy?) and Value (does the data actually help make decisions?). In interviews, mentioning 5Vs shows you're up-to-date.
Reliance Jio generates 25+ exabytes of data annually from its 450 million subscribers. That's 25 billion GB: at roughly 3 GB per hour of HD video, it is equivalent to streaming Netflix continuously for nearly a million years. They use this data for network optimisation, personalised content recommendations on JioTV, and targeted advertising on JioAds.

Analogy: Think of Big Data like the Kumbh Mela. Volume = 200 million pilgrims. Velocity = thousands arriving every minute. Variety = they come by train, bus, foot, boat — carrying different languages, needs, and demographics. Managing this crowd requires "Big Crowd" techniques, just like managing Big Data requires specialised tools.

Now YOU try it → List 3 Indian organisations that deal with Big Data and identify which V is their biggest challenge.

3. Big Data Challenges

Having lots of data isn't automatically useful. It brings serious challenges:

Challenge | What It Means | Indian Example
Storage | Where do you keep petabytes of data? | UIDAI needed specialised data centres across India for Aadhaar
Processing | Traditional computers can't handle it. Need distributed computing. | IRCTC processes 25 million ticket requests on Tatkal day; single servers crash
Security | Sensitive data must be protected from breaches | Aadhaar data leaks controversy; DPDP Act 2023
Privacy | Ethical use of personal data | Should Paytm know your spending habits and sell this to advertisers?
Quality | Garbage in, garbage out | Indian census data has inconsistencies across states due to language differences
Talent Gap | India needs 1 million+ data professionals by 2026 (NASSCOM) | Only 11% of Indian engineering graduates are trained in data skills
Students confuse "more data" with "better insights." A messy dataset of 10 million rows can give worse results than a clean dataset of 10,000 rows. Quality always trumps quantity. Always clean before you analyse.

4. Tools Deep-Dive: Hadoop & Apache Spark

Hadoop — The Foundation of Big Data Processing

Plain English: Imagine you have a 1,000-page book and need to count how many times the word "India" appears. One person reading alone takes 10 hours. But if you tear the book into 100 chunks and give each chunk to a different person, they all count simultaneously and report back — done in 6 minutes. That's Hadoop.

Technical Definition: Hadoop is an open-source framework for distributed storage and processing of large datasets across clusters of commodity hardware. It has three core components:

Component | What It Does | Analogy
HDFS (Hadoop Distributed File System) | Stores data across multiple machines with replication | Like storing copies of your notes in 3 different lockers: if one locker breaks, you don't lose anything
MapReduce | Processes data in parallel: Map (split & process) → Reduce (combine results) | Like a class test where each row of students grades one question, then the teacher combines all scores
YARN (Yet Another Resource Negotiator) | Manages cluster resources and decides which job gets how much CPU/RAM | Like a hostel warden allocating rooms so that everyone gets fair space
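
The book-counting analogy and the MapReduce row above can be sketched on a single machine. This is only an illustration of the map-then-reduce idea, not Hadoop's API: the function names are hypothetical, the chunks are processed one after another rather than in parallel, and a real cluster would also shuffle data between machines.

```python
from collections import Counter
from functools import reduce


def split_into_chunks(words, n_chunks):
    """Deal the words out into roughly equal chunks, one per 'worker'."""
    size = max(1, -(-len(words) // n_chunks))   # ceiling division
    return [words[i:i + size] for i in range(0, len(words), size)]


def map_step(chunk):
    """Map: each worker counts the words in its own chunk."""
    return Counter(word.lower() for word in chunk)


def reduce_step(left, right):
    """Reduce: merge two partial counts into one."""
    return left + right


text = "India is vast and India is diverse and India is young"
chunks = split_into_chunks(text.split(), n_chunks=3)
partial_counts = [map_step(chunk) for chunk in chunks]   # parallel on a real cluster
totals = reduce(reduce_step, partial_counts, Counter())

print(totals["india"])
```

Each chunk is counted independently, so adding more workers shortens the map step; only the small partial counts need to be combined at the end.
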
Flipkart runs one of India's largest Hadoop clusters: over 5,000 nodes processing 50+ petabytes of data. They use it for product recommendations, search ranking, fraud detection, and supply chain optimisation. When you search "blue shirt" on Flipkart, the ranking you see within milliseconds draws on signals that batch jobs on these clusters computed ahead of time; Hadoop itself is a batch system and does not answer individual queries live.

Apache Spark — The Faster Alternative

Hadoop's MapReduce writes intermediate results to disk (slow). Spark keeps them in memory (RAM), making it up to 100× faster for workloads that reuse the same data, such as iterative machine learning.

Feature | Hadoop MapReduce | Apache Spark
Speed | Slower (disk-based) | Up to 100× faster (in-memory)
Ease of Use | Complex Java code | Simple Python/Scala APIs
Real-time Processing | ❌ Batch only | ✅ Batch + Streaming
Best For | Massive batch jobs (overnight processing) | Real-time analytics, ML pipelines
Used By (India) | TCS, Infosys legacy systems | Swiggy, Razorpay, Dream11
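
Why does keeping data in memory help so much? The toy model below counts how often an iterative job touches slow storage under each approach. `FakeDisk` and both functions are hypothetical stand-ins written for this illustration; they are not Hadoop or Spark code, and they model only the re-reading cost, nothing else.

```python
class FakeDisk:
    """Stands in for slow storage and counts how often it is read."""

    def __init__(self, values):
        self._values = list(values)
        self.reads = 0

    def load(self):
        self.reads += 1
        return list(self._values)


def iterate_from_disk(disk, rounds):
    """MapReduce-style: every round re-reads its input from disk."""
    total = 0
    for _ in range(rounds):
        total += sum(disk.load())
    return total


def iterate_in_memory(disk, rounds):
    """Spark-style: read once, keep the data in RAM, reuse it each round."""
    cached = disk.load()
    total = 0
    for _ in range(rounds):
        total += sum(cached)
    return total


disk_a = FakeDisk(range(1_000))
disk_b = FakeDisk(range(1_000))
result_a = iterate_from_disk(disk_a, rounds=10)
result_b = iterate_in_memory(disk_b, rounds=10)

print(result_a == result_b, disk_a.reads, disk_b.reads)
```

Both strategies give the same answer, but the first reads storage ten times and the second only once. The more rounds a job needs, the bigger that gap becomes, which is why the speed-up is largest for iterative workloads.
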
In 2024 job interviews, Spark skills are more valued than Hadoop MapReduce. Most Indian companies have migrated to Spark or cloud-native solutions. Learn PySpark (Spark with Python) — it's the most in-demand Big Data skill on Naukri.com and LinkedIn India.

5. Visualisation & Analysis Tools: Tableau, R, and Excel

Tableau — See Your Data Come Alive

Plain English: Tableau is like Instagram filters for data. You drag-and-drop columns, and it instantly creates beautiful charts, maps, and dashboards. No coding required.

Technical: Tableau is a visual analytics platform that connects to databases (SQL, Excel, CSV, cloud sources) and lets you create interactive dashboards through a drag-and-drop interface.

Free version: Tableau Public is free to use and hosts your dashboards on Tableau's cloud. Everything you publish there is publicly visible, so it is perfect for students building portfolios but unsuitable for confidential client data.

R Programming — The Statistician's Weapon

R is a programming language built specifically for statistics and data analysis. Think of it as Excel on steroids — you can do everything Excel does, plus advanced statistical modelling, machine learning, and publication-quality visualisations.

R
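# Load ggplot2 for plotting (install once with install.packages("ggplot2"))
library(ggplot2)

# Read the census data; the file must contain Year, Population and State columns
data <- read.csv("india_census_2011_2024.csv")

# Stacked bar chart: one bar per year, coloured by state
ggplot(data, aes(x = factor(Year), y = Population, fill = State)) +
  geom_col() +
  labs(title = "India Population Growth by State", x = "Year", y = "Population")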

Excel — Don't Underestimate the Classic

Every data career starts with Excel. 90% of Indian businesses still use Excel for data analysis. Pivot tables, VLOOKUP, conditional formatting, and charts are non-negotiable skills. Even data scientists at Flipkart use Excel for quick ad-hoc analysis before firing up Python.

Excel dashboard creation is the easiest entry point to earning. Local shops, schools, clinics, and coaching centres in India need dashboards for attendance, sales, and inventory. You can build these in 2–4 hours and charge ₹2,000–₹8,000 per project on Internshala. No advanced coding needed — just Excel + Google Sheets.

6. Big Data on Cloud

Analogy: Setting up your own Hadoop cluster is like buying a diesel generator for your home. It works, but it's expensive, noisy, and you maintain it yourself. Cloud-based Big Data is like using the electricity grid — you just plug in and pay for what you use.

Cloud Service | Provider | What It Does | Free Tier?
AWS EMR | Amazon | Managed Hadoop/Spark clusters | Limited free tier
Google BigQuery | Google | Serverless data warehouse: query petabytes with SQL | ✅ 1 TB of queries/month free
Azure HDInsight | Microsoft | Managed Hadoop, Spark, Kafka | Free credits for students
Databricks | Databricks | Unified analytics platform (Spark-based) | Community edition free
Paytm processes 1.5 billion+ transactions monthly using Google Cloud's BigQuery for real-time fraud detection. When you pay ₹50 for chai via Paytm, BigQuery analyses the transaction pattern against millions of historical transactions in under 200 milliseconds to flag potential fraud.

7. Job Roles & Career Paths in Data Science

Role | What They Do | Key Skills | Entry Salary (India)
Data Analyst | Clean data, create dashboards, generate reports | Excel, SQL, Tableau, basic Python | ₹4–6 LPA
Data Scientist | Build predictive models, run experiments | Python, R, ML, Statistics, SQL | ₹6–12 LPA
Data Engineer | Build data pipelines, manage infrastructure | SQL, Spark, Airflow, AWS/GCP, Python | ₹6–10 LPA
ML Engineer | Deploy ML models to production | Python, TensorFlow, Docker, APIs | ₹8–15 LPA
Business Analyst | Translate business needs to data requirements | Excel, SQL, PowerBI, Communication | ₹4–7 LPA
MIS Executive | Manage information systems, daily/weekly reports | Excel, VBA, SQL basics | ₹3–5 LPA
The fastest path for a BCA/B.Tech student: Start as a Data Analyst (Excel + SQL + Tableau), build a portfolio of 3–5 dashboards, then learn Python for Data Science. This path can land you a ₹4–6 LPA job within 6 months of graduation. Companies hiring: TCS, Infosys, Wipro, Mu Sigma, Fractal Analytics, Tiger Analytics.

Now YOU try it → Go to Naukri.com and search "Data Analyst fresher." Note the top 5 skills mentioned in job descriptions. Expect Excel, SQL, and Tableau to appear in most of them.

Section D

Learn by Doing — 3-Tier Lab Structure

🟢 Tier 1 — GUIDED TASK: Build an India Population Dashboard in Google Sheets

⏱️ 60–90 minutes  |  Beginner  |  Zero prior knowledge assumed

Step 1: Open Google Sheets

Go to sheets.google.com → Click "+ Blank spreadsheet"

Step 2: Enter Indian Census Data

Create the following table (Column A = State, Column B = Population 2011, Column C = Population 2024 Est.):

State | Pop. 2011 (Cr) | Pop. 2024 (Cr)
Uttar Pradesh | 19.98 | 23.5
Maharashtra | 11.24 | 12.8
Bihar | 10.41 | 13.1
West Bengal | 9.13 | 10.2
Madhya Pradesh | 7.27 | 8.7
Tamil Nadu | 7.21 | 7.8
Rajasthan | 6.86 | 8.2
Karnataka | 6.11 | 7.0
Gujarat | 6.04 | 7.1
Kerala | 3.34 | 3.5

Step 3: Add a Growth Rate Column

In Column D, type the header "Growth %". In cell D2, enter: =(C2-B2)/B2*100. Drag the formula down for all rows.
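
To check your sheet, the same calculation can be sketched in Python. The numbers are copied from the Step 2 table; `growth_percent` is just an illustrative helper that mirrors the sheet formula, and the 20% threshold is the one Step 6 uses for highlighting.

```python
# (state, population 2011 in crore, estimated population 2024 in crore)
census = [
    ("Uttar Pradesh", 19.98, 23.5),
    ("Maharashtra", 11.24, 12.8),
    ("Bihar", 10.41, 13.1),
    ("West Bengal", 9.13, 10.2),
    ("Madhya Pradesh", 7.27, 8.7),
    ("Tamil Nadu", 7.21, 7.8),
    ("Rajasthan", 6.86, 8.2),
    ("Karnataka", 6.11, 7.0),
    ("Gujarat", 6.04, 7.1),
    ("Kerala", 3.34, 3.5),
]


def growth_percent(old, new):
    """Same calculation as the sheet formula =(C2-B2)/B2*100."""
    return (new - old) / old * 100


growth = {
    state: round(growth_percent(pop_2011, pop_2024), 2)
    for state, pop_2011, pop_2024 in census
}

# States that the "Greater than 20" rule in Step 6 would highlight
needs_highlight = [state for state, pct in growth.items() if pct > 20]

for state, pct in sorted(growth.items(), key=lambda item: item[1], reverse=True):
    print(f"{state:<15} {pct:5.2f}%")
print("Above 20%:", needs_highlight)
```

If your Column D disagrees with these values, the usual culprit is dividing by the 2024 column instead of the 2011 column.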

Step 4: Create a Bar Chart

  1. Select columns A, B, and C (all rows)
  2. Click Insert → Chart
  3. Choose "Grouped Bar Chart"
  4. Title it: "India Population Growth by State (2011 vs 2024)"
  5. Under "Customize" → change the 2011 colour to blue and 2024 to orange

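Step 5: Create a Growth Rate Bar Chart

  1. Select columns A and D (hold Ctrl, or Cmd on a Mac, to select columns that are not next to each other)
  2. Insert → Chart → Bar Chart. A pie chart is a poor fit here because growth rates are not shares of one whole
  3. Title: "Population Growth Rate (%) — 10 Major Indian States"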

Step 6: Add Filters & Conditional Formatting

  1. Select all data → Data → Create a filter
  2. Select Column D → Format → Conditional formatting → "Greater than 20" → Set to red background

Step 7: Name your Dashboard Tab

Right-click the sheet tab → Rename to "India Population Dashboard"

🎉 Congratulations! You've built your first data dashboard. Take a screenshot — this is your first portfolio piece.

🟡 Tier 2 — SEMI-GUIDED TASK: Tableau Public Dashboard with COVID-19 India Data

⏱️ 90–120 minutes  |  Intermediate  |  Hints provided, you fill the gaps

Your Mission:

Create an interactive COVID-19 India dashboard on Tableau Public showing state-wise case trends.

Hints:

  1. Data Source: Download India COVID-19 data from api.covid19india.org (archived) or Kaggle (search "COVID-19 India dataset")
  2. Tool: Download Tableau Public (free) from public.tableau.com
  3. Connect: Open Tableau Public → Connect → Text File (CSV)
  4. Dashboard Elements: You need:
    • A line chart showing daily cases over time
    • A map of India showing state-wise total cases (use the "filled map" chart type)
    • A bar chart showing top 10 states by recovery rate
    • A filter for date range
  5. Publish: Save to Tableau Public → Get a shareable URL for your portfolio
Stretch Goal: Add a calculated field for "Recovery Rate %" = (Recovered / Confirmed) × 100. Which state has the best recovery rate?
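
Before building the calculated field in Tableau, it can help to test the formula on a few rows. The sketch below uses invented placeholder counts, not real COVID-19 figures, and adds one assumption of its own: a state with zero confirmed cases gets no rate rather than a division error.

```python
def recovery_rate(recovered, confirmed):
    """Recovery Rate % = (Recovered / Confirmed) x 100; None if no cases."""
    if confirmed == 0:
        return None                      # avoid dividing by zero
    return recovered * 100 / confirmed


# Made-up placeholder counts: (recovered, confirmed). NOT real data.
sample = {
    "State A": (9_500, 10_000),
    "State B": (4_400, 5_000),
    "State C": (0, 0),
}

rates = {state: recovery_rate(rec, conf) for state, (rec, conf) in sample.items()}
with_data = [state for state in rates if rates[state] is not None]
best = max(with_data, key=lambda state: rates[state])

print(rates)
print("Best recovery rate:", best)
```

Your Tableau calculated field needs the same guard, otherwise states with no confirmed cases can distort the ranking.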

🔴 Tier 3 — OPEN CHALLENGE: Data Pipeline Proposal for a Local Indian Business

⏱️ 2–3 hours  |  Advanced  |  No instructions, a real-world mini-project

The Brief:

Choose a real local business (your college canteen, a neighbourhood kirana store, a local coaching centre, or a small clinic). Design a complete data pipeline proposal covering:

  1. Problem Statement: What business question will you answer?
  2. Data Collection Plan: What data do you need? How will you collect it? (Google Forms, manual entry, POS system)
  3. Data Cleaning Strategy: What will be messy? How will you fix it?
  4. Analysis Plan: What charts/metrics will you create?
  5. Dashboard Mockup: Sketch (hand-drawn is fine) showing layout
  6. Tools: Google Sheets/Excel for analysis, Tableau Public for dashboard
  7. Business Impact: How will this save money or increase revenue?
  8. Budget: Total cost (hint: it should be ₹0 if using free tools)

Deliverable: A 3–5 page Google Doc proposal. This becomes a real portfolio piece and can be your first freelance pitch.

This exact proposal format is what freelancers send to clients. Polish it well, and you can send it to local businesses on WhatsApp/email offering ₹3,000–₹8,000 dashboard services. Many students have landed their first paying client from this exercise.
Section E

Industry Spotlight — A Day in the Life

👩‍💻 Priya Sharma, 26 — Data Analyst at Flipkart, Bangalore

Background: BCA from Chandigarh University. No coding experience before college. Self-taught Excel and SQL in 2nd year. Built 5 Tableau dashboards during internship at a local CA firm. Got placed at Flipkart through campus recruitment.

A Typical Day:

9:00 AM — Morning standup with the marketplace analytics team. Review yesterday's metrics: GMV (Gross Merchandise Value), seller performance, return rates.

10:00 AM — Pull data from Flipkart's data warehouse using SQL queries. "Show me top 20 sellers in electronics category with return rate > 15% in last 30 days."

11:30 AM — Clean the data in Python (pandas). Remove duplicates, handle null values, standardise seller names.

1:00 PM — Lunch at Flipkart's cafeteria. Discuss a new A/B test design with the product team.

2:00 PM — Build a Tableau dashboard showing seller quality scores. Present to the category manager: "These 5 sellers need quality warnings."

4:00 PM — Write a data quality report in Google Sheets. Add conditional formatting for KPIs.

5:30 PM — Learn PySpark on Databricks (company-sponsored learning hour). Working towards Data Scientist role.
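
The 10:00 AM request can be sketched as a real query using Python's built-in sqlite3 module. Everything here is an assumption made for illustration: the `orders` table, its columns, the rows, and the choice to rank "top" sellers by order count. It is not Flipkart's schema, and the sketch assumes the table already holds only the last 30 days.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (seller TEXT, category TEXT, returned INTEGER)")

# Made-up rows: returned = 1 means the order was sent back.
rows = (
    [("Seller A", "electronics", 1)] * 2 + [("Seller A", "electronics", 0)] * 8    # 20% returns
    + [("Seller B", "electronics", 1)] * 1 + [("Seller B", "electronics", 0)] * 9  # 10% returns
    + [("Seller C", "fashion", 1)] * 5 + [("Seller C", "fashion", 0)] * 5          # other category
)
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)

query = """
    SELECT seller,
           COUNT(*) AS total_orders,
           100.0 * SUM(returned) / COUNT(*) AS return_rate
    FROM orders
    WHERE category = 'electronics'
    GROUP BY seller
    HAVING 100.0 * SUM(returned) / COUNT(*) > 15
    ORDER BY total_orders DESC
    LIMIT 20
"""
flagged = conn.execute(query).fetchall()
print(flagged)
conn.close()
```

Only Seller A is flagged: Seller B is under the 15% threshold and Seller C is in a different category. The WHERE, GROUP BY, and HAVING pattern is the part worth practising, whichever database you use.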

Detail | Info
Tools Used Daily | SQL (BigQuery), Python (pandas, matplotlib), Tableau, Excel, Jira
Entry Salary (2024) | ₹4–6 LPA + benefits
Mid-Level (3–5 yrs) | ₹8–15 LPA
Senior (7+ yrs) | ₹18–35 LPA
Companies Hiring | Flipkart, Swiggy, Zomato, Paytm, Meesho, TCS, Infosys, Mu Sigma, Fractal Analytics, Tiger Analytics, Latent View
Section F

Earn With It — Freelance & Income Roadmap

💰 Your Earning Path After This Chapter

Portfolio Piece: "India Population Trend Dashboard 2011–2024" — a polished Google Sheets/Tableau dashboard with charts, filters, and growth analysis.

Beginner Gig Ideas:

• Excel/Google Sheets dashboard for local coaching centres (attendance + fee tracking) — ₹2,000–₹5,000

• Sales data dashboard for small retailers — ₹3,000–₹8,000

• Survey data analysis for NGOs/student projects — ₹1,500–₹4,000

• Monthly expense tracker dashboard for small businesses — ₹2,000–₹6,000

Platform | Best For | Typical Rate
Internshala | Indian student internships & freelance projects | ₹2,000–₹8,000/project
Fiverr | Global clients, quick dashboard gigs | $10–$50/gig (₹800–₹4,000)
Upwork | Longer projects, higher rates | $15–$40/hour
LinkedIn | Direct outreach to Indian businesses | ₹3,000–₹10,000/project
WhatsApp/Local | Nearby shops, schools, clinics | ₹2,000–₹8,000/project

⏱️ Time to First Earning: 2–3 weeks (if you complete Tier 1 lab and reach out to 10 local businesses)

Start local, go global. Your first client won't come from Fiverr — it'll be a coaching centre owner you know, or your parent's friend who runs a shop. Build 2–3 free dashboards for people you know. Take screenshots. Then create your Fiverr/Internshala gig with real portfolio samples.
Section G

MCQ Assessment Bank — 30 Questions (Bloom's Mapped)

Remember / Identify (Q1–Q5)

Q1

Which of the following is NOT one of the 3Vs of Big Data?

  (A) Volume
  (B) Velocity
  (C) Validity
  (D) Variety
Remember
✅ Answer: (C) Validity — The original 3Vs are Volume, Velocity, and Variety. Validity is not part of the standard 3V framework (Veracity and Value are sometimes added as the 4th and 5th V).
Q2

HDFS stands for:

  (A) Hadoop Data File Storage
  (B) Hadoop Distributed File System
  (C) High Data Flow System
  (D) Hadoop Dynamic File Structure
Remember
✅ Answer: (B) — HDFS = Hadoop Distributed File System. It distributes and replicates data across multiple nodes in a cluster.
Q3

Which stage of the data science lifecycle involves removing duplicates and handling missing values?

  (A) Data Collection
  (B) Modelling
  (C) Data Cleaning
  (D) Deployment
Remember
✅ Answer: (C) Data Cleaning — Also called data wrangling, this stage handles messy, incomplete, and duplicate data. It takes 60–80% of a data scientist's time.
Q4

Tableau is primarily used for:

  (A) Database management
  (B) Data visualisation and dashboards
  (C) Machine learning model training
  (D) Web development
Remember
✅ Answer: (B) — Tableau is a visual analytics platform for creating interactive dashboards and reports through drag-and-drop.
Q5

YARN in Hadoop stands for:

  (A) Yet Another Resource Negotiator
  (B) Yearly Analysis Resource Node
  (C) YAML-based Application Resource Network
  (D) Yield Adjusted Resource Navigator
Remember
✅ Answer: (A) — YARN = Yet Another Resource Negotiator. It manages and schedules resources across the Hadoop cluster.

Understand / Explain (Q6–Q10)

Q6

Why is the data cleaning stage considered the most time-consuming in the data science lifecycle?

  (A) It requires expensive software
  (B) Real-world data has inconsistencies, missing values, and duplicates that must be fixed before analysis
  (C) It needs approval from management
  (D) It involves writing machine learning algorithms
Understand
✅ Answer: (B) — Real-world data is messy. Missing entries, typos, inconsistent formats (e.g., "Mumbai" vs "Bombay" vs "mumbai"), and duplicates must all be resolved before any meaningful analysis.
Q7

How does HDFS ensure data safety when a node in the cluster fails?

  (A) It compresses all data before storage
  (B) It stores encrypted backups on the cloud
  (C) It replicates each data block across multiple nodes (default: 3 copies)
  (D) It uses RAID arrays on each machine
Understand
✅ Answer: (C) — HDFS replicates each block across 3 nodes by default. If one node fails, the data is still available from the other copies. This is called fault tolerance.
Q8

Which V of Big Data does Aadhaar's 1.4 billion biometric records best represent?

  (A) Velocity
  (B) Variety
  (C) Volume
  (D) Veracity
Understand
✅ Answer: (C) Volume — Aadhaar stores ~30 petabytes of fingerprints, iris scans, and demographic data for 1.4 billion people — a massive volume challenge.
Q9

What is the key advantage of Apache Spark over Hadoop MapReduce?

  (A) Spark is free while Hadoop is paid
  (B) Spark processes data in-memory, making it up to 100× faster
  (C) Spark can only process structured data
  (D) Spark doesn't need a cluster
Understand
✅ Answer: (B) — Spark keeps intermediate data in RAM (memory) instead of writing to disk like MapReduce, making iterative algorithms and real-time analytics dramatically faster.
Q10

Explain why Google BigQuery is described as "serverless":

  (A) It runs without electricity
  (B) Users don't manage any server infrastructure; Google handles everything
  (C) It can only be used offline
  (D) It doesn't store any data
Understand
✅ Answer: (B) — "Serverless" means you don't provision, manage, or maintain servers. You just write SQL queries and Google's infrastructure handles execution, scaling, and storage automatically.

Apply / Demonstrate (Q11–Q15)

Q11

A Pune coaching centre wants to track student attendance, fee payments, and test scores. Which tool would you recommend for a non-technical owner?

  (A) Apache Hadoop
  (B) Google Sheets with Pivot Tables
  (C) TensorFlow
  (D) MongoDB
Apply
✅ Answer: (B) — Google Sheets is free, cloud-based, requires no installation, and supports pivot tables, charts, and conditional formatting — perfect for a non-technical user's dashboard needs.
Q12

You're building a sales dashboard and the "City" column has entries like "Delhi", "delhi", "New Delhi", "DELHI". Which data science lifecycle stage addresses this?

  (A) Deployment
  (B) Modelling
  (C) Data Cleaning
  (D) Problem Definition
Apply
✅ Answer: (C) Data Cleaning — Standardising inconsistent text entries (case normalisation, removing duplicates) is a classic data cleaning task.
Q13

To calculate growth rate percentage in Google Sheets where B2=2011 population and C2=2024 population, the correct formula is:

  (A) =(B2-C2)/C2*100
  (B) =(C2-B2)/B2*100
  (C) =C2/B2
  (D) =(C2+B2)/2*100
Apply
✅ Answer: (B) — Growth % = (New - Old) / Old × 100. So (C2-B2)/B2*100 gives the percentage increase from 2011 to 2024.
Q14

Swiggy wants to identify which restaurants have the highest cancellation rates in Mumbai. Which visualisation type is most appropriate?

  (A) Pie chart of all restaurants
  (B) Horizontal bar chart: top 20 restaurants sorted by cancellation rate
  (C) Scatter plot of revenue vs profit
  (D) Line chart of daily temperatures
Apply
✅ Answer: (B) — A sorted horizontal bar chart makes it easy to compare cancellation rates across restaurants and quickly identify the worst performers.
Q15

A freelance client asks you to create a dashboard showing monthly expenses across 5 categories. You have data in CSV format. Describe the correct workflow:

  (A) Build an ML model → Deploy on AWS → Send link
  (B) Import CSV into Google Sheets → Clean data → Create pivot table → Build charts → Share link
  (C) Print the CSV and highlight numbers manually
  (D) Upload to GitHub and write Python code
Apply
✅ Answer: (B) — For a simple client dashboard, Google Sheets is the fastest and most client-friendly approach. Import, clean, pivot, chart, share — done in 2 hours.

Analyze / Compare (Q16–Q20)

Q16

Compare Hadoop MapReduce and Apache Spark. In which scenario would Hadoop MapReduce still be preferred over Spark?

  (A) Real-time fraud detection at Razorpay
  (B) Processing overnight batch jobs on a low-RAM cluster with petabytes of log data
  (C) Building a recommendation engine for Swiggy
  (D) Interactive data exploration in Jupyter notebooks
Analyze
✅ Answer: (B) — MapReduce writes to disk, so it works well on clusters with limited RAM. For massive overnight batch processing where speed isn't critical, MapReduce's disk-based approach is more cost-effective.
Q17

Differentiate between structured, semi-structured, and unstructured data with Indian examples:

  (A) All data is structured if stored in a computer
  (B) Structured = Excel/SQL tables (Aadhaar demographic records); Semi-structured = JSON/XML (Swiggy API responses); Unstructured = images, videos, tweets (Instagram posts)
  (C) Unstructured data cannot be analysed
  (D) Semi-structured data is always stored in Hadoop
Analyze
✅ Answer: (B) — Structured data has a fixed schema (rows/columns), semi-structured has tags/keys but flexible format, and unstructured has no predefined format.
Q18

Examine why UPI transaction data represents the "Velocity" V of Big Data rather than "Volume":

  (A) UPI doesn't generate much data
  (B) UPI data is generated at extreme speed (3,800+ transactions per second), requiring real-time processing rather than just large storage
  (C) UPI only processes text data
  (D) UPI uses Hadoop for storage
Analyze
✅ Answer: (B) — While UPI generates large volumes too, its defining characteristic is speed — thousands of transactions per second that must be validated, processed, and settled in near real-time.
Q19

Compare Excel and Tableau for a freelance data dashboard project. When would you choose Excel over Tableau?

  (A) When the client needs interactive web-based dashboards
  (B) When the client already uses Excel daily and needs a simple solution they can maintain themselves
  (C) When the dataset exceeds 50 million rows
  (D) When geographic map visualisations are required
Analyze
✅ Answer: (B) — If a client (say a local shop owner) already knows Excel, building a dashboard there means they can update and maintain it without your help. Tableau requires a separate tool they may not be familiar with.
Q20

Analyse the difference between a Data Analyst and a Data Scientist role at an Indian IT company:

  (A) They are the same role with different titles
  (B) Data Analysts focus on descriptive analytics (what happened), while Data Scientists build predictive models (what will happen)
  (C) Data Scientists only use Excel
  (D) Data Analysts earn more than Data Scientists
Analyze
✅ Answer: (B) — Analysts answer "what happened?" using SQL, dashboards, and reports. Scientists answer "what will happen?" using ML models, statistical modelling, and experimentation.

Evaluate / Judge (Q21–Q25)

Q21

UIDAI claims Aadhaar data is completely secure. A journalist discovers that Aadhaar numbers of 100 million Indians were leaked through an unsecured government API. Evaluate the failure:

  (A) The data was too big to secure
  (B) Hadoop was the wrong technology
  (C) The failure was in API security and access control, not in the database technology itself
  (D) Aadhaar should not collect data at all
Evaluate
✅ Answer: (C) — The technology (database) was not the issue. The vulnerability was in how APIs exposed data without proper authentication and rate limiting. Security is a design and governance problem, not just a technology problem.
Q22

A startup CEO says: "We don't need data cleaning. Our ML model will learn from noisy data." Judge this statement:

  (A) Correct: modern ML handles everything automatically
  (B) Incorrect: "Garbage in, garbage out" applies. Noisy data leads to inaccurate models with poor predictions
  (C) Partially correct: only deep learning needs clean data
  (D) Correct if they use Hadoop
Evaluate
✅ Answer: (B) — This is a dangerous misconception. Even the best ML algorithm will produce unreliable results if trained on dirty data. Data cleaning is non-negotiable.
Q23

Paytm uses customer transaction data to show personalised loan offers. Evaluate the ethical implications:

  (A) No ethical issues: it's just marketing
  (B) Potential concerns: user consent, data privacy under DPDP Act, targeting vulnerable users with high-interest loans, algorithmic bias
  (C) Ethical only if Paytm earns profit
  (D) Unethical because Paytm is a private company
Evaluate
✅ Answer: (B) — Using financial data for targeted marketing raises consent, privacy, and fairness concerns. The DPDP Act 2023 requires explicit consent. Predatory lending targeting low-income users is an ethical red flag.
Q24

An Indian company has 500 GB of sales data and is considering building an on-premise Hadoop cluster vs using Google BigQuery. Justify the better choice:

  (A) Hadoop, because it's open-source and free
  (B) BigQuery, because 500 GB is small, the free tier covers the first 1 TB of queries each month, no infrastructure management is needed, and results come in seconds
  (C) Neither, because Excel can handle 500 GB
  (D) Hadoop, because it's faster than cloud
Evaluate
✅ Answer: (B) — 500 GB is too small to justify the cost/complexity of a Hadoop cluster (designed for petabytes). BigQuery processes 500 GB in seconds, costs almost nothing, and requires zero infrastructure setup.
Q25

Critique the claim: "India doesn't need data scientists because AI will automate all analysis."

  1. Valid — AI replaces all human analysis
  2. Invalid — AI automates routine tasks but humans are needed for problem framing, ethical judgement, domain expertise, and communicating insights to business stakeholders
  3. Valid for small companies only
  4. Invalid because AI doesn't work in India
Evaluate
✅ Answer: (B) — AI tools like ChatGPT can write SQL or generate charts, but they cannot define the right business problem, ensure ethical data use, or present findings convincingly to a non-technical CEO. The human role evolves; it does not disappear.

Create / Design (Q26–Q30)

Q26

Design a data collection strategy for a college canteen that wants to reduce food waste. Which approach is most comprehensive?

  1. Ask students to fill a Google Form once a year
  2. Track daily menu items, quantity prepared, quantity consumed, leftover weight, day of week, weather, and special events — using a simple Google Sheet updated by canteen staff
  3. Install an AI camera system costing ₹10 lakh
  4. Count the number of students entering the canteen
Create
✅ Answer: (B) — This captures the right variables (what was made, what was wasted, contextual factors) using a free tool. It's practical, low-cost, and gives actionable insights when analysed with pivot tables and charts.
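The sheet described in option 2 can be sketched as a simple record structure plus one calculation. This is only an illustration: the `CanteenLog` field names, the sample rows, and the `waste_rate_by_item` helper are assumptions, not a prescribed format. It shows how leftover weight and quantity prepared combine into a waste percentage per menu item.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class CanteenLog:
    """One row of the daily sheet; field names are illustrative assumptions."""
    day_of_week: str
    item: str
    prepared_kg: float
    leftover_kg: float
    weather: str
    special_event: bool


def waste_rate_by_item(logs):
    """Return leftover as a percentage of quantity prepared, per menu item."""
    prepared = defaultdict(float)
    leftover = defaultdict(float)
    for log in logs:
        prepared[log.item] += log.prepared_kg
        leftover[log.item] += log.leftover_kg
    return {
        item: round(100 * leftover[item] / prepared[item], 1)
        for item in prepared
        if prepared[item] > 0
    }


# Made-up sample rows for illustration.
logs = [
    CanteenLog("Mon", "Rice", 20.0, 2.0, "sunny", False),
    CanteenLog("Tue", "Rice", 20.0, 6.0, "rainy", False),
    CanteenLog("Mon", "Samosa", 10.0, 0.5, "sunny", False),
    CanteenLog("Tue", "Samosa", 10.0, 0.5, "rainy", True),
]

print(waste_rate_by_item(logs))
```

The same grouping could be done with a pivot table in Google Sheets; the code only makes the calculation explicit.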
Q27

Propose a dashboard design for IRCTC to display real-time train booking statistics. What elements should it include?

  1. Just a table with numbers
  2. Live counter of bookings/minute, heat map of top routes, bar chart of class-wise bookings (1AC/2AC/SL), alerts for high-demand trains, and a trend line of daily bookings
  3. A pie chart of all 13,000 trains
  4. A single number showing total bookings today
Create
✅ Answer: (B) — A comprehensive dashboard combines real-time metrics (counters), geographic context (heat maps), categorical breakdowns (bar charts), anomaly detection (alerts), and temporal trends (line charts).
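Each dashboard element above is fed by a small aggregation over booking events. The sketch below is a rough illustration in plain Python: the booking tuples, route labels, and `ALERT_THRESHOLD` are invented for the example and do not reflect IRCTC's actual systems or data.

```python
from collections import Counter
from datetime import datetime

# Hypothetical booking events: (timestamp, route label, travel class).
bookings = [
    (datetime(2024, 1, 1, 10, 0, 5), "NDLS-BCT", "SL"),
    (datetime(2024, 1, 1, 10, 0, 40), "NDLS-BCT", "2AC"),
    (datetime(2024, 1, 1, 10, 1, 10), "HWH-MAS", "SL"),
    (datetime(2024, 1, 1, 10, 1, 30), "NDLS-BCT", "SL"),
    (datetime(2024, 1, 1, 10, 1, 55), "SBC-MAS", "1AC"),
]

# Live counter: bookings grouped by the minute they arrived in.
per_minute = Counter(ts.replace(second=0, microsecond=0) for ts, _, _ in bookings)

# Bar chart input: bookings per travel class.
per_class = Counter(cls for _, _, cls in bookings)

# Heat map input: booking counts per route.
per_route = Counter(route for _, route, _ in bookings)

# Alert rule: flag routes above an assumed demand threshold.
ALERT_THRESHOLD = 2
alerts = [route for route, count in per_route.items() if count > ALERT_THRESHOLD]

print("Bookings per minute:", dict(per_minute))
print("Class-wise bookings:", dict(per_class))
print("High-demand routes:", alerts)
```

A real dashboard tool would refresh these counts continuously; the point here is that every chart is backed by a simple count or group-by.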
Q28

You're creating a Fiverr gig for "Excel Dashboard Creation for Indian Small Businesses." Which gig description is most likely to attract clients?

  1. "I know Excel. Contact me."
  2. "I will create a professional, interactive Excel dashboard with charts, pivot tables, and conditional formatting for your sales, inventory, or financial data. Delivery in 48 hours. Includes 1 revision. Starting at ₹2,000."
  3. "Expert in Hadoop and Spark. Can build anything."
  4. "Free dashboards for everyone!"
Create
✅ Answer: (B) — A good gig description is specific (what you'll do), includes deliverables (charts, pivot tables), sets expectations (48 hours, 1 revision), and has a clear price. This is professional and client-ready.
Q29

Build a mini data pipeline for a neighbourhood kirana store. Which sequence is correct?

  1. Deploy ML model → Collect data → Build dashboard
  2. Define problem (which products expire unsold?) → Collect daily sales data via Google Sheets → Clean & organise → Analyse with pivot tables → Create dashboard showing slow-moving products → Present to owner with recommendations
  3. Buy Hadoop servers → Hire 5 data scientists → Start analysis
  4. Build a mobile app → Collect biometric data → Use AI
Create
✅ Answer: (B) — This follows the data science lifecycle in the right order, simplified for a small business that has no ML model to deploy (Problem → Collect → Clean → Analyse → Visualise → Act), and it uses appropriate tools (Google Sheets, not Hadoop).
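The sequence in option 2 can be sketched end to end in a few lines of Python. Everything here is hypothetical: the sample rows, the function names `clean`, `analyse`, and `slow_movers`, and the threshold of 5 units are assumptions chosen for illustration. In practice the same steps could be done directly in Google Sheets.

```python
from collections import defaultdict

# Collect: hypothetical rows as exported from a Google Sheet.
raw_rows = [
    {"product": "Maggi ", "units_sold": "12"},
    {"product": "maggi", "units_sold": "8"},
    {"product": "Bread", "units_sold": "1"},
    {"product": "Bread", "units_sold": ""},   # missing entry
    {"product": "Paneer", "units_sold": "2"},
    {"product": "Atta", "units_sold": "15"},
]


def clean(rows):
    """Clean: normalise product names and drop rows with missing quantities."""
    cleaned = []
    for row in rows:
        qty = row["units_sold"].strip()
        if not qty.isdigit():
            continue
        cleaned.append(
            {"product": row["product"].strip().title(), "units_sold": int(qty)}
        )
    return cleaned


def analyse(rows):
    """Analyse: total units per product, like a pivot table."""
    totals = defaultdict(int)
    for row in rows:
        totals[row["product"]] += row["units_sold"]
    return dict(totals)


def slow_movers(totals, threshold=5):
    """Dashboard input: products selling below an assumed threshold."""
    return sorted(p for p, units in totals.items() if units < threshold)


totals = analyse(clean(raw_rows))
print("Totals:", totals)
print("Slow-moving products to review with the owner:", slow_movers(totals))
```

Note that the cleaning step merges "Maggi " and "maggi" into one product; without it, the totals would be split across two misspelt rows.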
Q30

Design a data-driven solution for Zomato to identify restaurants that are likely to shut down within 6 months. What data points would you propose collecting?

  1. Only the restaurant name
  2. Monthly order volume trend, average rating trend, customer complaint frequency, delivery time reliability, menu price changes, competitor density in the area, owner response rate to reviews
  3. Just the current star rating
  4. GPS coordinates only
Create
✅ Answer: (B) — Predicting restaurant closure requires multiple signals: declining orders (revenue trend), dropping ratings (quality trend), increasing complaints (service issues), and contextual factors (competition, pricing). This is a real data science problem at Zomato.
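As a rough illustration of how such signals could be combined, the sketch below counts simple warning flags from trend data. This is not Zomato's method: the `trend` and `warning_signals` helpers, the 0.2 response-rate cut-off, and the sample restaurant are all assumptions. A production system would more likely train a model on historical closures.

```python
def trend(values):
    """Average month-over-month change; negative means the series is falling."""
    if len(values) < 2:
        return 0.0
    return (values[-1] - values[0]) / (len(values) - 1)


def warning_signals(restaurant):
    """Return the warning signals raised by simple, assumed rules."""
    signals = []
    if trend(restaurant["monthly_orders"]) < 0:
        signals.append("orders declining")
    if trend(restaurant["avg_rating"]) < 0:
        signals.append("rating dropping")
    if trend(restaurant["complaints"]) > 0:
        signals.append("complaints rising")
    if restaurant["review_response_rate"] < 0.2:  # assumed cut-off
        signals.append("owner rarely responds")
    return signals


# Made-up restaurant record covering four months.
restaurant = {
    "name": "Example Dhaba",
    "monthly_orders": [900, 820, 700, 610],
    "avg_rating": [4.1, 4.0, 3.8, 3.6],
    "complaints": [5, 9, 14, 20],
    "review_response_rate": 0.1,
}

signals = warning_signals(restaurant)
print(f"{restaurant['name']}: {len(signals)} warning signals -> {signals}")
```

The more signals a restaurant raises at once, the stronger the case for flagging it for review.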
Section H

Short Answer Questions (2–3 marks each)

Question 1 (2 marks)

Define the 3Vs of Big Data and give one Indian example for each.

Question 2 (3 marks)

Explain the difference between HDFS and MapReduce in Hadoop. How do they work together to process large datasets?

Question 3 (2 marks)

Why is data cleaning considered the most time-consuming step in the data science lifecycle? Give two real-world examples of dirty data.

Question 4 (3 marks)

Compare Apache Spark and Hadoop MapReduce on three parameters: speed, ease of use, and real-time processing capability.

Question 5 (2 marks)

List any four job roles in Data Science and mention one key skill required for each role.

Section I

Long Answer & Case Studies (10 marks each)

📋 Case Study 1: Aadhaar — The World's Largest Biometric Database (10 marks)

India's Aadhaar system, managed by UIDAI, stores biometric data (fingerprints, iris scans) and demographic information for over 1.4 billion residents. It processes over 100 million authentication requests daily for services like bank account verification, SIM card activation, and government subsidy disbursement (DBT).

Answer the following:

  1. (2 marks) Identify which V of Big Data is the most significant challenge for Aadhaar. Justify your answer with data.
  2. (3 marks) Describe the data pipeline that runs when a citizen uses Aadhaar for eKYC at a bank. Cover data flow from the fingerprint scanner to the authentication response.
  3. (2 marks) Discuss two data privacy concerns with Aadhaar and how the DPDP Act 2023 addresses them.
  4. (3 marks) Propose a data analytics dashboard for UIDAI that shows: authentication success rates by state, peak usage hours, and failure categories. Sketch the dashboard layout and specify chart types.

📋 Case Study 2: UPI/NPCI — Real-Time Transaction Analytics at 10 Billion Scale (10 marks)

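India's Unified Payments Interface (UPI), managed by NPCI, processed 13.89 billion transactions worth ₹20.64 lakh crore in a single month (March 2024). Spread evenly over the month, that is an average of roughly 5,200 transactions per second (13.89 billion ÷ 31 days ÷ 86,400 seconds), with peak-hour loads higher still. Apps like PhonePe, Google Pay, and Paytm handle most of this traffic.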

Answer the following:

  1. (2 marks) Explain why UPI's primary Big Data challenge is Velocity rather than Volume. How does this affect technology choices?
  2. (3 marks) Design a real-time fraud detection data pipeline for UPI. Specify: what data points you'd collect per transaction, what analytics you'd run, and what action the system should take when fraud is detected.
  3. (2 marks) Compare how NPCI could use Hadoop vs Spark for transaction analytics. Which is more appropriate and why?
  4. (3 marks) Create a data-driven strategy for NPCI to predict UPI downtime before it happens. What historical data would you analyse? What patterns would indicate impending failure?
Section J

Chapter Summary — Tweet-Sized Bullet Points

📝 Key Takeaways

  • 📊 Data Science lifecycle: Problem → Collect → Clean → Analyse → Model → Deploy → Monitor. Cleaning takes 60-80% of time.
  • 📦 Big Data = 3Vs: Volume (Aadhaar = 30PB), Velocity (UPI = thousands of txn/sec), Variety (text + images + video).
  • 🐘 Hadoop = distributed storage (HDFS) + parallel processing (MapReduce) + resource management (YARN).
  • ⚡ Spark can be up to 100× faster than MapReduce for in-memory workloads because it keeps intermediate data in memory instead of writing it to disk between steps.
  • 📈 Tableau = drag-and-drop visual analytics. Free version (Tableau Public) is perfect for portfolios.
  • ☁️ Cloud Big Data (BigQuery, AWS EMR) eliminates infrastructure hassle. BigQuery's free tier includes 1 TB of query processing per month.
  • 💼 Fastest career path: Excel + SQL + Tableau → Data Analyst at ₹4-6 LPA. No coding required initially.
  • 💰 Start earning NOW: Build Excel dashboards for local businesses. First gig: ₹2,000-₹8,000 on Internshala.
  • 🇮🇳 India needs 1 million+ data professionals by 2026 — massive opportunity for BCA/B.Tech students.
  • 🏗️ Portfolio piece from this chapter: "India Population Trend Dashboard 2011-2024" on Google Sheets.
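To make the MapReduce takeaway concrete, here is a small simulation in plain Python. It does not use Hadoop's API. The `blocks` list stands in for file blocks that HDFS would store on different machines, and the function names `map_phase`, `shuffle_phase`, and `reduce_phase` are illustrative labels for the three stages. The sample sales lines are made up.

```python
from collections import defaultdict

# Pretend each inner list is a block of one file stored on a different node.
blocks = [
    ["Delhi 200", "Mumbai 150"],
    ["Delhi 100", "Pune 50"],
    ["Mumbai 250", "Pune 75"],
]


def map_phase(block):
    """Map: turn each line into a (city, amount) key-value pair."""
    pairs = []
    for line in block:
        city, amount = line.split()
        pairs.append((city, int(amount)))
    return pairs


def shuffle_phase(mapped_blocks):
    """Shuffle: group all values that share the same key."""
    grouped = defaultdict(list)
    for pairs in mapped_blocks:
        for city, amount in pairs:
            grouped[city].append(amount)
    return grouped


def reduce_phase(grouped):
    """Reduce: combine each key's values into one result."""
    return {city: sum(amounts) for city, amounts in grouped.items()}


# On a real cluster, each block would be mapped in parallel on its own node.
mapped = [map_phase(block) for block in blocks]
result = reduce_phase(shuffle_phase(mapped))
print(result)
```

Here the map step runs once per block, which is what lets a cluster spread the work across machines; only the grouped results need to be brought together for the reduce step.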
Section K

My Earning Checkpoint — Self-Assessment

Skill Learned  |  Tool Practised  |  Portfolio Item Added  |  Gig Ready?
Data Science Lifecycle  |  Google Sheets  |  India Population Dashboard  |  ✅ Yes — can explain lifecycle to clients
3Vs of Big Data  |  Conceptual  |  (none)  |  ✅ Yes — can discuss in interviews
Data Cleaning  |  Google Sheets / Excel  |  Cleaned Census Dataset  |  ✅ Yes — cleaning is a billable skill
Dashboard Creation  |  Google Sheets, Tableau Public  |  Population Dashboard + COVID Dashboard  |  ✅ Yes — ₹2,000–₹8,000/project
Data Visualisation  |  Charts, Pivot Tables  |  Bar charts, Pie charts, Filters  |  ✅ Yes — ready for Internshala gigs
Hadoop/Spark Concepts  |  Conceptual (no hands-on yet)  |  (none)  |  ⬜ Not yet — need PySpark practice
Data Pipeline Design  |  Google Docs (proposal)  |  Business Data Pipeline Proposal  |  ✅ Yes — can pitch to local businesses
Minimum Viable Earning Setup after this chapter: A Google Sheets/Tableau portfolio with 2 dashboards + an Internshala/Fiverr profile with a clear gig description = you can earn ₹5,000–₹15,000/month from dashboard gigs while still in college.

✅ Unit 1 complete. Ready for Unit 2: AI & Machine Learning!
