1
AUTOMATING DATA EXPLORATION
A structured approach to analysing data
A TOOL AGNOSTIC APPROACH
2
METADATA
UNIVARIATE ANALYSIS
BIVARIATE ANALYSIS
LET’S TAKE A DATASET
3
Each row has details about an employee who has left the organization.
Just “reading” the dataset is quite informative.
DESCRIBE THE DATA IN A STRUCTURED WAY
4
CATEGORICAL COLUMNS YIELD VERY LITTLE DATA
6
There’s not much information in one column.
The values are not quantitative, so a distribution is not meaningful.
The values are not even ordered.
In fact, the only thing we have is the list of values and their count.
... or is there more to this?
Region Count
India 10780
Headstrong 1554
China 1130
Philippines 1030
US 792
Romania 788
Mexico 324
Guatemala 233
Poland 124
Brazil 45
Hungary 41
Colombia 38
Netherlands 33
South Africa 30
UK 18
UAE 15
GMS India 15
Japan 11
CZECH Republic 10
Kenya 9
... BUT RANK FREQUENCY IS STILL POSSIBLE
7
The rank of the row provides additional information.
With this, we can explore the distribution of the rank against the count.
These distributions are called rank-frequency distributions.
Rank Region Count
1 India 10780
2 Headstrong 1554
3 China 1130
4 Philippines 1030
5 US 792
6 Romania 788
7 Mexico 324
8 Guatemala 233
9 Poland 124
10 Brazil 45
11 Hungary 41
12 Colombia 38
13 Netherlands 33
14 South Africa 30
15 UK 18
16 UAE 15
17 GMS India 15
18 Japan 11
19 CZECH Republic 10
20 Kenya 9
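The rank-frequency idea can be sketched in a few lines of Python. This is a minimal illustration, not the speaker's tooling: the helper names `rank_frequency` and `loglog_slope` are hypothetical, and the only data used is the Region table above. A roughly straight line on log-log axes (a roughly constant slope) is what one would expect if the column follows a power law.

```python
import math

# Region counts copied from the table above.
region_counts = [
    ("India", 10780), ("Headstrong", 1554), ("China", 1130),
    ("Philippines", 1030), ("US", 792), ("Romania", 788),
    ("Mexico", 324), ("Guatemala", 233), ("Poland", 124),
    ("Brazil", 45), ("Hungary", 41), ("Colombia", 38),
    ("Netherlands", 33), ("South Africa", 30), ("UK", 18),
    ("UAE", 15), ("GMS India", 15), ("Japan", 11),
    ("CZECH Republic", 10), ("Kenya", 9),
]


def rank_frequency(counts):
    """Return (rank, label, count) tuples; rank 1 is the most frequent value."""
    ordered = sorted(counts, key=lambda item: item[1], reverse=True)
    return [(rank, label, count)
            for rank, (label, count) in enumerate(ordered, start=1)]


def loglog_slope(ranked):
    """Least-squares slope of log(count) against log(rank)."""
    xs = [math.log(rank) for rank, _, _ in ranked]
    ys = [math.log(count) for _, _, count in ranked]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var


ranked = rank_frequency(region_counts)
slope = loglog_slope(ranked)
print(f"log-log slope of count against rank: {slope:.2f}")
```

The slope alone does not prove a power law; it is only a quick check before plotting the points on log-log axes.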
REGION SHOWS A POWER LAW DISTRIBUTION
8
[Figure: Region count plotted against rank, with rank on a log scale (x-axis) and frequency on a log scale (y-axis).]
COST CODE SHOWS A POWER LAW DISTRIBUTION
9
Cost Code Count
105 9542
121 1757
125 875
122 796
3001 654
3310 635
124 435
131 415
115 336
nan 207
101 205
127 173
109 148
116 91
126 66
...
LE SHOWS A POWER LAW DISTRIBUTION
10
LE Count
D84 11487
GPL 853
RM1 789
LC2 565
GMR 323
D95 247
GUT 233
ML1 223
CTK 184
AXE 127
A38 98
A21 79
EMP 61
BRL 45
A66 43
...
11
WHAT CAUSES POWER LAW DISTRIBUTIONS?
PREFERENTIAL ATTACHMENT
EXPONENTIAL GROWTH
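Preferential attachment can be illustrated with a small simulation. The sketch below is an assumption-laden toy, not something taken from the slides: each new member either starts a new group (with a hypothetical probability `p_new`) or joins an existing group with probability proportional to that group's current size. The function name, the parameters, and the seed are all illustrative.

```python
import random
from collections import Counter


def simulate_preferential_attachment(steps, p_new=0.05, seed=0):
    """Return group sizes, largest first, after `steps` new members arrive."""
    rng = random.Random(seed)
    memberships = [0]      # one entry per member; the value is the group id
    n_groups = 1
    for _ in range(steps):
        if rng.random() < p_new:
            # A new group is founded by this member.
            memberships.append(n_groups)
            n_groups += 1
        else:
            # Copying a random existing member's group makes the chance of
            # joining a group proportional to its current size.
            memberships.append(rng.choice(memberships))
    return sorted(Counter(memberships).values(), reverse=True)


sizes = simulate_preferential_attachment(5000)
print("largest groups:", sizes[:5])
print("number of groups:", len(sizes))
```

Feeding `sizes` into a rank-frequency plot on log-log axes is one way to compare the simulated shape with the tables in this section.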
NO. OF FOLLOWERS ON GITHUB
12
Username Count
slidenerd 1700
astaxie 1320
MugunthKumar 1081
honcheng 870
arunoda 827
csjaba 670
cheeaun 658
timoxley 600
karlseguin 600
hemanth 514
arvindr21 400
yuvipanda 335
mbrochh 330
anandology 330
sayanee 314
zz85 314
sanand0 309
captn3m0 300
sameersbn 300
...
NO. OF MOVIES ACTED IN BY BOLLYWOOD PEOPLE
13
Person Count
Lata Mangeshkar 824
Asha Bhosle 810
Shakti Kapoor 589
Kishore Kumar 585
Mohammed Rafi 527
Sunidhi Chauhan 515
Alka Yagnik 451
Udit Narayan 435
Kader Khan 430
Sonu Nigam 405
Sameer 398
Asrani 397
Helen 395
Shaan 377
Aruna Irani 375
Anupam Kher 367
Shreya Ghoshal 357
Gulshan Grover 341
...
PARTIES IN PARLIAMENT ELECTIONS
14
Name Count
IND 44704
INC 7213
BJP 3354
BSP 2628
SP 1311
CPI 1102
JD 943
CPM 914
DDP 716
JNP 676
BJS 657
JP 563
NOTA 543
PSP 538
INC(I) 492
SHS 467
AAP 432
SWA 410
...
CANDIDATE NAMES IN ASSEMBLY ELECTIONS
15
Name Count
NONE OF THE ABOVE 629
OM PRAKASH 478
ASHOK KUMAR 411
RAM SINGH 362
RAJ KUMAR 294
ANIL KUMAR 271
AMAR SINGH 248
MOHAN LAL 235
RAM KUMAR 224
BABU LAL 218
RAM PRASAD 213
JAGDISH 210
VIJAY KUMAR 207
RAJENDRA SINGH 196
VINOD KUMAR 195
SHYAM LAL 193
RAJESH KUMAR 186
SITA RAM 186
RAM LAL 171
...
STUDENT NAMES IN SSA SURVEY
16
Name Count
M.MANIKANDAN 99
S.PAVITHRA 84
S.MANIKANDAN 84
R.RAMYA 82
S.SANGEETHA 70
R.MANIKANDAN 69
S.DIVYA 68
M.PAVITHRA 68
S.SANTHIYA 67
S.VIGNESH 67
M.PRIYA 67
M.MAHALAKSHMI 64
S.SARANYA 63
S.SURYA 60
K.MANIKANDAN 60
P.PAVITHRA 56
S.GAYATHRI 56
P.MANIKANDAN 55
...
[Figure: graphic of name labels: Jain, Harini, Shweta, Sneha, Pooja, Ashwin, Shah, Deepti, Sanjana, Varshini, Ezhumalai, Venkatesan, Silambarasan, Pandiyan, Kumaresan, Manikandan, Thirupathi, Agarwal, Kumar, Priya]
NOT EVERYTHING IS POWER-LAW, THOUGH
18
We need to understand, from the underlying behaviours, what drives these distributions.
ORDERED CATEGORICALS HAVE MORE INFORMATION
19
CORPORATE BAND
20
Band Count
5 12247
4 4449
3 205
2 63
Not Mapped 24
1 22
SVP 10
LOCAL BAND
21
Band Count
5A 7483
5B 4764
4A 1683
4B 1612
4C 747
4D 407
3 205
2 63
Not Mapped 24
1 22
SVP 10
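Because bands are ordered, the counts can be read in band order rather than by frequency, which allows cumulative shares. The sketch below uses the local band counts above. The ordering in `BAND_ORDER` (5A as one end, SVP as the other) is an assumption made for illustration, as is leaving out "Not Mapped", which has no place in the order.

```python
# Local band counts copied from the table above.
local_band = {
    "5A": 7483, "5B": 4764, "4A": 1683, "4B": 1612, "4C": 747,
    "4D": 407, "3": 205, "2": 63, "Not Mapped": 24, "1": 22, "SVP": 10,
}

# Assumed ordering of the bands; "Not Mapped" is deliberately excluded.
BAND_ORDER = ["5A", "5B", "4A", "4B", "4C", "4D", "3", "2", "1", "SVP"]


def cumulative_share(counts, order):
    """Return (band, count, cumulative share) tuples following `order`."""
    total = sum(counts[band] for band in order)
    running = 0
    rows = []
    for band in order:
        running += counts[band]
        rows.append((band, counts[band], running / total))
    return rows


shares = cumulative_share(local_band, BAND_ORDER)
for band, count, share in shares:
    print(f"{band:>3} {count:>6} {share:6.1%}")
```

A cumulative view like this is only meaningful for ordered categories; for an unordered column such as Region it would depend entirely on an arbitrary sort.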
QUANTITIES HAVE EVEN MORE INFORMATION
22
AGE DISTRIBUTION IS LOG-NORMAL
23
DETECTING FRAUD
“We know meter readings are incorrect, for various reasons. We don’t, however, have the concrete proof we need to start the process of meter reading automation. Part of our problem is the volume of data that needs to be analysed. The other is the inexperience in tools or analyses to identify such patterns.”
ENERGY UTILITY
24
This plot shows the frequency of all meter readings from Apr-2010 to Mar-2011. An unusually large number of readings are aligned with the tariff slab boundaries.
This clearly shows collusion of some form with the customers.
This happens with specific customers, not randomly. Here are such customers’ meter readings.
Apr-10 May-10 Jun-10 Jul-10 Aug-10 Sep-10 Oct-10 Nov-10 Dec-10 Jan-11 Feb-11 Mar-11
217 219 200 200 200 200 200 200 200 350 200 200
250 200 200 200 201 200 200 200 250 200 200 150
250 150 150 200 200 200 200 200 200 200 200 150
150 200 200 200 200 200 200 200 200 200 200 50
200 200 200 150 180 150 50 100 50 70 100 100
100 100 100 100 100 100 100 100 100 100 110 100
100 150 123 123 50 100 50 100 100 100 100 100
0 111 100 100 100 100 100 100 100 100 50 50
0 100 27 100 50 100 100 100 100 100 70 100
1 1 1 100 99 50 100 100 100 100 100 100
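One way to flag such customers is to measure how often their readings land exactly on a slab boundary. The sketch below is illustrative only: the source does not list the actual tariff slabs, so `SLAB_BOUNDARIES` is a hypothetical set, and `share_on_boundaries` is a made-up helper name. The two customers are rows copied from the table above.

```python
# Hypothetical slab boundaries, chosen only for illustration; the actual
# tariff slabs are not given in the source.
SLAB_BOUNDARIES = (50, 100, 150, 200)


def share_on_boundaries(readings, boundaries=SLAB_BOUNDARIES):
    """Fraction of readings that fall exactly on a slab boundary."""
    if not readings:
        return 0.0
    hits = sum(1 for reading in readings if reading in boundaries)
    return hits / len(readings)


# Monthly readings (Apr-10 to Mar-11) for two customers from the table above.
customer_a = [217, 219, 200, 200, 200, 200, 200, 200, 200, 350, 200, 200]
customer_b = [100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 110, 100]

for name, readings in [("A", customer_a), ("B", customer_b)]:
    print(f"customer {name}: {share_on_boundaries(readings):.0%} on a boundary")
```

Ranking all customers by this share, and comparing it with the share seen across the whole population, would surface the accounts worth auditing first.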
Section Apr-10 May-10 Jun-10 Jul-10 Aug-10 Sep-10 Oct-10 Nov-10 Dec-10 Jan-11 Feb-11 Mar-11
Section 1 70% 97% 136% 65% 110% 116% 121% 107% 114% 88% 74% 109%
Section 2 66% 92% 66% 87% 70% 64% 63% 50% 58% 38% 41% 54%
Section 3 90% 46% 47% 43% 28% 31% 50% 32% 19% 38% 8% 34%
Section 4 44% 24% 36% 39% 21% 18% 24% 49% 56% 44% 31% 14%
Section 5 4% 63% -27% 20% 41% 82% 26% 34% 43% 2% 37% 15%
Section 6 18% 23% 30% 21% 28% 33% 39% 41% 39% 18% 0% 33%
Section 7 36% 51% 33% 33% 27% 35% 10% 39% 12% 5% 15% 14%
Section 8 22% 21% 28% 12% 24% 27% 10% 31% 13% 11% 22% 17%
Section 9 19% 35% 14% 9% 16% 32% 37% 12% 9% 5% -3% 11%
If we define the “extent of fraud” as the percentage excess of the 100 unit meter reading, the value varies considerably across sections and time … with some explainable anomalies.
New section manager arrives … and is transferred out.
Why would these happen?
25
PREDICTING MARKS
“What determines a child’s marks?
Do girls score better than boys?
Does the choice of subject matter?
Does the medium of instruction matter?
Does community or religion matter?
Does their birthday matter?
Does the first letter of their name matter?”
EDUCATION
26
TN CLASS X: ENGLISH
[Figure: distribution chart; x-axis from 0 to 100 in steps of 5, y-axis from 0 to 40,000 in steps of 5,000.]
27
TN CLASS X: SOCIAL SCIENCE
[Figure: distribution chart; x-axis from 0 to 100 in steps of 5, y-axis from 0 to 40,000 in steps of 5,000.]
28
TN CLASS X: LANGUAGE
[Figure: distribution chart; x-axis from 0 to 100 in steps of 5, y-axis from 0 to 40,000 in steps of 5,000.]
29
TN CLASS X: SCIENCE
[Figure: distribution chart; x-axis from 0 to 100 in steps of 5, y-axis from 0 to 40,000 in steps of 5,000.]
30
TN CLASS X: MATHEMATICS
[Figure: distribution chart; x-axis from 0 to 100 in steps of 5, y-axis from 0 to 40,000 in steps of 5,000.]
31
ICSE 2013 CLASS XII: TOTAL MARKS
32
CBSE 2013 CLASS XII: ENGLISH MARKS
33
CBSE 2013 CLASS XII: PHYSICS MARKS
34
LET’S TAKE ONE DAY CRICKET DATA
Country Player Runs ScoreRate MatchDate Ground Versus
Australia Michael J Clarke 99* 93.39 30-06-2010 The Oval England
Australia Dean M Jones 99* 128.57 28-01-1985 Adelaide Oval Sri Lanka
Australia Bradley J Hodge 99* 115.11 04-02-2007 Melbourne Cricket Ground New Zealand
India Virender Sehwag 99* 99 16-08-2010 Rangiri Dambulla International Stad. Sri Lanka
New Zealand Bruce A Edgar 99* 72.79 14-02-1981 Eden Park India
Pakistan Mohammad Yousuf 99* 95.19 15-11-2007 Captain Roop Singh Stadium India
West Indies Richard B Richardson 99* 70.21 15-11-1985 Sharjah CA Stadium Pakistan
West Indies Ramnaresh R Sarwan 99* 95.19 15-11-2002 Sardar Patel Stadium India
Zimbabwe Andrew Flower 99* 89.18 24-10-1999 Harare Sports Club Australia
Zimbabwe Alistair D R Campbell 99* 79.83 01-10-2000 Queens Sports Club New Zealand
Zimbabwe Malcolm N Waller 99* 133.78 25-10-2011 Queens Sports Club New Zealand
Australia David C Boon 98* 82.35 08-12-1994 Bellerive Oval Zimbabwe
Australia Graeme M Wood 98* 63.22 11-01-1981 Melbourne Cricket Ground India
England Ian J L Trott 98* 84.48 20-10-2011 Punjab Cricket Association Stadium India
India Yuvraj Singh 98* 89.09 01-08-2001 Sinhalese Sports Club Ground Sri Lanka
Ireland Kevin J O'Brien 98* 94.23 10-07-2010 VRA Ground Scotland
Kenya Collins O Obuya 98* 75.96 13-03-2011 M.Chinnaswamy Stadium Australia
Netherlands Ryan N ten Doeschate 98* 73.68 01-09-2009 VRA Ground Afghanistan
New Zealand James E C Franklin 98* 142.02 07-12-2010 M.Chinnaswamy Stadium India
Pakistan Ijaz Ahmed 98* 112.64 28-10-1994 Iqbal Stadium South Africa
South Africa Jacques H Kallis 98* 74.24 06-02-2000 St George's Park Zimbabwe
36
Against which countries are higher averages scored?
Which countries’ players score more per match?
37
Which player scores the most per ball?
The player with the highest strike rate is an obscure South African whose name most of us have never heard of.
In fact, this list is filled with players we have never heard of.
38
Most analysis answers the question “Which are the top 10 X?”
Which are my top products?
Which are my top branches?
Who are my best sales people?
Which vendors have the highest cost per unit?
Which divisions are spending the most money?
In which hours does the under 12 segment watch TV most?
Which customer segment has the highest revenue per user?
39
THIS QUESTION CAN BE ANSWERED SYSTEMATICALLY
Take every column in the data
Find the top value by that column
Country South Africa has the highest strike rate of 76%
Player Johann Louw has the highest strike rate of 329%
Runs 164 runs has the highest strike rate of 156%
MatchDate 12-03-2006 has the highest strike rate of 136%
Ground AC-VDCA Stadium has the highest strike rate of 98%
Versus United States has the highest strike rate of 104%
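The "take every column, find the top value" loop can be sketched as follows. This is a minimal illustration under stated assumptions: `top_by_column` is a hypothetical helper, only five rows from the table in this section are used, and each group's strike rate is taken as a simple mean of ScoreRate (the slides may instead compute total runs over total balls, which would give different numbers).

```python
from collections import defaultdict

# A few rows copied from the one-day cricket table in this section.
rows = [
    {"Country": "Australia", "Player": "Michael J Clarke",
     "ScoreRate": 93.39, "Versus": "England"},
    {"Country": "Australia", "Player": "Dean M Jones",
     "ScoreRate": 128.57, "Versus": "Sri Lanka"},
    {"Country": "India", "Player": "Virender Sehwag",
     "ScoreRate": 99.0, "Versus": "Sri Lanka"},
    {"Country": "Zimbabwe", "Player": "Malcolm N Waller",
     "ScoreRate": 133.78, "Versus": "New Zealand"},
    {"Country": "New Zealand", "Player": "James E C Franklin",
     "ScoreRate": 142.02, "Versus": "India"},
]


def top_by_column(rows, metric, columns):
    """For each column, return the value with the highest mean of `metric`."""
    result = {}
    for column in columns:
        groups = defaultdict(list)
        for row in rows:
            groups[row[column]].append(row[metric])
        means = {value: sum(vals) / len(vals) for value, vals in groups.items()}
        best = max(means, key=means.get)
        result[column] = (best, means[best])
    return result


top = top_by_column(rows, "ScoreRate", ["Country", "Player", "Versus"])
for column, (value, mean_rate) in top.items():
    print(f"{column}: {value} has the highest mean score rate of {mean_rate:.2f}")
```

Groups with very few rows tend to win this kind of ranking, which is consistent with the earlier observation that the top of the list is filled with little-known players; a minimum group size is a common guard.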
40
What do children in schools know, and what can they do, at different stages of elementary education?
Have the inputs made into the elementary education system had a beneficial effect or not?
41
HAVING BOOKS IMPROVES READING ABILITY
Having more books at home improves the performance of children when it comes to reading. (But children typically have only 1-10 books at home.)
Number of students sampled
What is the impact? How many more marks can having more books fetch?
Circle size indicates number of students with this response. Few students have no books.
Is this response (“25+ books”) good or bad? Small red bars indicate low marks. Large green bars indicate high marks. Students having 25+ books tend to score high marks.
The most common response is marked in blue. This is also the circle.
The graphic is summarized in words
Indicates whether the best response is the most popular. Blue means that it is not. Green means that it is. Red means that the worst level is the most popular response.
42
CHILDREN LIKE GAMES, AND THEY’RE GOOD
… but playing daily hurts reading ability
43
WATCHING TV OCCASIONALLY IS GOOD
Children who watch TV every day don’t do as well as children who watch TV only once a week.
But children who never watch TV fare the worst.
Watching TV every day helps improve children’s reading ability a little bit more … but mathematical abilities fall dramatically at that point.
44
WE HAVE A WEBSITE THAT YOU CAN EXPLORE
GRAMENER.COM/NAS
45
Telemetry Solution for Gaming (AWS Summit'24)
GeorgiiSteshenko
 
Salesforce AI + Data Community Tour Slides - Canarias
Salesforce AI + Data Community Tour Slides - CanariasSalesforce AI + Data Community Tour Slides - Canarias
Salesforce AI + Data Community Tour Slides - Canarias
davidpietrzykowski1
 
Bangalore Call Girls ♠ 9079923931 ♠ Beautiful Call Girls In Bangalore
Bangalore Call Girls  ♠ 9079923931 ♠ Beautiful Call Girls In BangaloreBangalore Call Girls  ♠ 9079923931 ♠ Beautiful Call Girls In Bangalore
Bangalore Call Girls ♠ 9079923931 ♠ Beautiful Call Girls In Bangalore
yashusingh54876
 
High Profile Call Girls Navi Mumbai ✅ 9833363713 FULL CASH PAYMENT
High Profile Call Girls Navi Mumbai ✅ 9833363713 FULL CASH PAYMENTHigh Profile Call Girls Navi Mumbai ✅ 9833363713 FULL CASH PAYMENT
High Profile Call Girls Navi Mumbai ✅ 9833363713 FULL CASH PAYMENT
ranjeet3341
 

Recently uploaded (20)

CAP Excel Formulas & Functions July - Copy (4).pdf
CAP Excel Formulas & Functions July - Copy (4).pdfCAP Excel Formulas & Functions July - Copy (4).pdf
CAP Excel Formulas & Functions July - Copy (4).pdf
 
SAP BW4HANA Implementagtion Content Document
SAP BW4HANA Implementagtion Content DocumentSAP BW4HANA Implementagtion Content Document
SAP BW4HANA Implementagtion Content Document
 
Call Girls Goa (india) ☎️ +91-7426014248 Goa Call Girl
Call Girls Goa (india) ☎️ +91-7426014248 Goa Call GirlCall Girls Goa (india) ☎️ +91-7426014248 Goa Call Girl
Call Girls Goa (india) ☎️ +91-7426014248 Goa Call Girl
 
🔥College Call Girls Kolkata 💯Call Us 🔝 8094342248 🔝💃Top Class Call Girl Servi...
🔥College Call Girls Kolkata 💯Call Us 🔝 8094342248 🔝💃Top Class Call Girl Servi...🔥College Call Girls Kolkata 💯Call Us 🔝 8094342248 🔝💃Top Class Call Girl Servi...
🔥College Call Girls Kolkata 💯Call Us 🔝 8094342248 🔝💃Top Class Call Girl Servi...
 
202406 - Cape Town Snowflake User Group - LLM & RAG.pdf
202406 - Cape Town Snowflake User Group - LLM & RAG.pdf202406 - Cape Town Snowflake User Group - LLM & RAG.pdf
202406 - Cape Town Snowflake User Group - LLM & RAG.pdf
 
Call Girls Lucknow 8923113531 Independent Call Girl Service in Lucknow
Call Girls Lucknow 8923113531 Independent Call Girl Service in LucknowCall Girls Lucknow 8923113531 Independent Call Girl Service in Lucknow
Call Girls Lucknow 8923113531 Independent Call Girl Service in Lucknow
 
Bangalore ℂall Girl 000000 Bangalore Escorts Service
Bangalore ℂall Girl 000000 Bangalore Escorts ServiceBangalore ℂall Girl 000000 Bangalore Escorts Service
Bangalore ℂall Girl 000000 Bangalore Escorts Service
 
Health care analysis using sentimental analysis
Health care analysis using sentimental analysisHealth care analysis using sentimental analysis
Health care analysis using sentimental analysis
 
Difference in Differences - Does Strict Speed Limit Restrictions Reduce Road ...
Difference in Differences - Does Strict Speed Limit Restrictions Reduce Road ...Difference in Differences - Does Strict Speed Limit Restrictions Reduce Road ...
Difference in Differences - Does Strict Speed Limit Restrictions Reduce Road ...
 
Call Girls Goa👉9024918724👉Low Rate Escorts in Goa 💃 Available 24/7
Call Girls Goa👉9024918724👉Low Rate Escorts in Goa 💃 Available 24/7Call Girls Goa👉9024918724👉Low Rate Escorts in Goa 💃 Available 24/7
Call Girls Goa👉9024918724👉Low Rate Escorts in Goa 💃 Available 24/7
 
Call Girls In Tirunelveli 👯‍♀️ 7339748667 🔥 Safe Housewife Call Girl Service ...
Call Girls In Tirunelveli 👯‍♀️ 7339748667 🔥 Safe Housewife Call Girl Service ...Call Girls In Tirunelveli 👯‍♀️ 7339748667 🔥 Safe Housewife Call Girl Service ...
Call Girls In Tirunelveli 👯‍♀️ 7339748667 🔥 Safe Housewife Call Girl Service ...
 
Independent Call Girls In Bangalore 9024918724 Just CALL ME Book Beautiful Gi...
Independent Call Girls In Bangalore 9024918724 Just CALL ME Book Beautiful Gi...Independent Call Girls In Bangalore 9024918724 Just CALL ME Book Beautiful Gi...
Independent Call Girls In Bangalore 9024918724 Just CALL ME Book Beautiful Gi...
 
PCI-DSS-Data Security Standard v4.0.1.pdf
PCI-DSS-Data Security Standard v4.0.1.pdfPCI-DSS-Data Security Standard v4.0.1.pdf
PCI-DSS-Data Security Standard v4.0.1.pdf
 
saps4hanaandsapanalyticswheretodowhat1565272000538.pdf
saps4hanaandsapanalyticswheretodowhat1565272000538.pdfsaps4hanaandsapanalyticswheretodowhat1565272000538.pdf
saps4hanaandsapanalyticswheretodowhat1565272000538.pdf
 
Optimizing Feldera: Integrating Advanced UDFs and Enhanced SQL Functionality ...
Optimizing Feldera: Integrating Advanced UDFs and Enhanced SQL Functionality ...Optimizing Feldera: Integrating Advanced UDFs and Enhanced SQL Functionality ...
Optimizing Feldera: Integrating Advanced UDFs and Enhanced SQL Functionality ...
 
Pune Call Girls <BOOK> 😍 Call Girl Pune Escorts Service
Pune Call Girls <BOOK> 😍 Call Girl Pune Escorts ServicePune Call Girls <BOOK> 😍 Call Girl Pune Escorts Service
Pune Call Girls <BOOK> 😍 Call Girl Pune Escorts Service
 
Telemetry Solution for Gaming (AWS Summit'24)
Telemetry Solution for Gaming (AWS Summit'24)Telemetry Solution for Gaming (AWS Summit'24)
Telemetry Solution for Gaming (AWS Summit'24)
 
Salesforce AI + Data Community Tour Slides - Canarias
Salesforce AI + Data Community Tour Slides - CanariasSalesforce AI + Data Community Tour Slides - Canarias
Salesforce AI + Data Community Tour Slides - Canarias
 
Bangalore Call Girls ♠ 9079923931 ♠ Beautiful Call Girls In Bangalore
Bangalore Call Girls  ♠ 9079923931 ♠ Beautiful Call Girls In BangaloreBangalore Call Girls  ♠ 9079923931 ♠ Beautiful Call Girls In Bangalore
Bangalore Call Girls ♠ 9079923931 ♠ Beautiful Call Girls In Bangalore
 
High Profile Call Girls Navi Mumbai ✅ 9833363713 FULL CASH PAYMENT
High Profile Call Girls Navi Mumbai ✅ 9833363713 FULL CASH PAYMENTHigh Profile Call Girls Navi Mumbai ✅ 9833363713 FULL CASH PAYMENT
High Profile Call Girls Navi Mumbai ✅ 9833363713 FULL CASH PAYMENT
 

Automating Data Exploration SciPy 2016

  • 1. AUTOMATING DATA EXPLORATION. A structured approach to analysing data. A TOOL AGNOSTIC APPROACH
  • 2. AUTOMATING DATA EXPLORATION. A structured approach to analysing data. METADATA, UNIVARIATE ANALYSIS, BIVARIATE ANALYSIS
  • 3. LET’S TAKE A DATASET. Each row has details about an employee who has left the organization. Just “reading” the dataset is quite informative.
  • 4. DESCRIBE THE DATA IN A STRUCTURED WAY
  • 5. AUTOMATING DATA EXPLORATION. A structured approach to analysing data. METADATA, UNIVARIATE ANALYSIS, BIVARIATE ANALYSIS
  • 6. CATEGORICAL COLUMNS YIELD VERY LITTLE DATA. There’s not much information in one column. The values are not quantitative, so a distribution is not meaningful. The values are not even ordered. In fact, the only thing we have is the list of values and their count. ... or is there more to this? Region, Count: India 10780; Headstrong 1554; China 1130; Philippines 1030; US 792; Romania 788; Mexico 324; Guatemala 233; Poland 124; Brazil 45; Hungary 41; Colombia 38; Netherlands 33; South Africa 30; UK 18; UAE 15; GMS India 15; Japan 11; CZECH Republic 10; Kenya 9
  • 7. ... BUT RANK FREQUENCY IS STILL POSSIBLE. The rank of the row provides additional information. With this, we can explore the distribution of the rank against the count. These distributions are called rank-frequency distributions. Rank, Region, Count: 1 India 10780; 2 Headstrong 1554; 3 China 1130; 4 Philippines 1030; 5 US 792; 6 Romania 788; 7 Mexico 324; 8 Guatemala 233; 9 Poland 124; 10 Brazil 45; 11 Hungary 41; 12 Colombia 38; 13 Netherlands 33; 14 South Africa 30; 15 UK 18; 16 UAE 15; 17 GMS India 15; 18 Japan 11; 19 CZECH Republic 10; 20 Kenya 9
  • 8. REGION SHOWS A POWER LAW DISTRIBUTION. The Region counts from slide 6 are shown again beside a chart. Chart axes: rank on a log scale, frequency on a log scale.
  • 9. COST CODE SHOWS A POWER LAW DISTRIBUTION. Cost Code, Count: 105 9542; 121 1757; 125 875; 122 796; 3001 654; 3310 635; 124 435; 131 415; 115 336; nan 207; 101 205; 127 173; 109 148; 116 91; 126 66; ...
  • 10. LE SHOWS A POWER LAW DISTRIBUTION. LE, Count: D84 11487; GPL 853; RM1 789; LC2 565; GMR 323; D95 247; GUT 233; ML1 223; CTK 184; AXE 127; A38 98; A21 79; EMP 61; BRL 45; A66 43; ...
  • 11. WHAT CAUSES POWER LAW DISTRIBUTIONS? PREFERENTIAL ATTACHMENT, EXPONENTIAL GROWTH
  • 12. NO. OF FOLLOWERS ON GITHUB. Username, Count: slidenerd 1700; astaxie 1320; MugunthKumar 1081; honcheng 870; arunoda 827; csjaba 670; cheeaun 658; timoxley 600; karlseguin 600; hemanth 514; arvindr21 400; yuvipanda 335; mbrochh 330; anandology 330; sayanee 314; zz85 314; sanand0 309; captn3m0 300; sameersbn 300; ...
  • 13. NO. OF MOVIES ACTED IN BY BOLLYWOOD PEOPLE. Person, Count: Lata Mangeshkar 824; Asha Bhosle 810; Shakti Kapoor 589; Kishore Kumar 585; Mohammed Rafi 527; Sunidhi Chauhan 515; Alka Yagnik 451; Udit Narayan 435; Kader Khan 430; Sonu Nigam 405; Sameer 398; Asrani 397; Helen 395; Shaan 377; Aruna Irani 375; Anupam Kher 367; Shreya Ghoshal 357; Gulshan Grover 341; ...
  • 14. PARTIES IN PARLIAMENT ELECTIONS. Name, Count: IND 44704; INC 7213; BJP 3354; BSP 2628; SP 1311; CPI 1102; JD 943; CPM 914; DDP 716; JNP 676; BJS 657; JP 563; NOTA 543; PSP 538; INC(I) 492; SHS 467; AAP 432; SWA 410; ...
  • 15. CANDIDATE NAMES IN ASSEMBLY ELECTIONS. Name, Count: NONE OF THE ABOVE 629; OM PRAKASH 478; ASHOK KUMAR 411; RAM SINGH 362; RAJ KUMAR 294; ANIL KUMAR 271; AMAR SINGH 248; MOHAN LAL 235; RAM KUMAR 224; BABU LAL 218; RAM PRASAD 213; JAGDISH 210; VIJAY KUMAR 207; RAJENDRA SINGH 196; VINOD KUMAR 195; SHYAM LAL 193; RAJESH KUMAR 186; SITA RAM 186; RAM LAL 171; ...
  • 16. STUDENT NAMES IN SSA SURVEY. Name, Count: M.MANIKANDAN 99; S.PAVITHRA 84; S.MANIKANDAN 84; R.RAMYA 82; S.SANGEETHA 70; R.MANIKANDAN 69; S.DIVYA 68; M.PAVITHRA 68; S.SANTHIYA 67; S.VIGNESH 67; M.PRIYA 67; M.MAHALAKSHMI 64; S.SARANYA 63; S.SURYA 60; K.MANIKANDAN 60; P.PAVITHRA 56; S.GAYATHRI 56; P.MANIKANDAN 55; ...
  • 18. NOT EVERYTHING IS POWER-LAW, THOUGH. Need to understand what drives these distributions from their behaviours.
  • 19. ORDERED CATEGORICALS HAVE MORE INFORMATION
  • 20. CORPORATE BAND. LE, Count: 5 12247; 4 4449; 3 205; 2 63; Not Mapped 24; 1 22; SVP 10
  • 21. LOCAL BAND. LE, Count: 5A 7483; 5B 4764; 4A 1683; 4B 1612; 4C 747; 4D 407; 3 205; 2 63; Not Mapped 24; 1 22; SVP 10
  • 22. QUANTITIES HAVE EVEN MORE INFORMATION
  • 23. AGE DISTRIBUTION IS LOG-NORMAL
  • 24. DETECTING FRAUD. “We know meter readings are incorrect, for various reasons. We don’t, however, have the concrete proof we need to start the process of meter reading automation. Part of our problem is the volume of data that needs to be analysed. The other is the inexperience in tools or analyses to identify such patterns.” ENERGY UTILITY
  • 25. This plot shows the frequency of all meter readings from Apr-2010 to Mar-2011. An unusually large number of readings are aligned with the tariff slab boundaries. This clearly shows collusion of some form with the customers. This happens with specific customers, not randomly. Here are such customers’ meter readings, one customer per row, for the months Apr-10, May-10, Jun-10, Jul-10, Aug-10, Sep-10, Oct-10, Nov-10, Dec-10, Jan-11, Feb-11, Mar-11: 217 219 200 200 200 200 200 200 200 350 200 200; 250 200 200 200 201 200 200 200 250 200 200 150; 250 150 150 200 200 200 200 200 200 200 200 150; 150 200 200 200 200 200 200 200 200 200 200 50; 200 200 200 150 180 150 50 100 50 70 100 100; 100 100 100 100 100 100 100 100 100 100 110 100; 100 150 123 123 50 100 50 100 100 100 100 100; 0 111 100 100 100 100 100 100 100 100 50 50; 0 100 27 100 50 100 100 100 100 100 70 100; 1 1 1 100 99 50 100 100 100 100 100 100. If we define the “extent of fraud” as the percentage excess of the 100 unit meter reading, the value varies considerably across sections and time. Extent of fraud by section for the same twelve months: Section 1: 70% 97% 136% 65% 110% 116% 121% 107% 114% 88% 74% 109%; Section 2: 66% 92% 66% 87% 70% 64% 63% 50% 58% 38% 41% 54%; Section 3: 90% 46% 47% 43% 28% 31% 50% 32% 19% 38% 8% 34%; Section 4: 44% 24% 36% 39% 21% 18% 24% 49% 56% 44% 31% 14%; Section 5: 4% 63% -27% 20% 41% 82% 26% 34% 43% 2% 37% 15%; Section 6: 18% 23% 30% 21% 28% 33% 39% 41% 39% 18% 0% 33%; Section 7: 36% 51% 33% 33% 27% 35% 10% 39% 12% 5% 15% 14%; Section 8: 22% 21% 28% 12% 24% 27% 10% 31% 13% 11% 22% 17%; Section 9: 19% 35% 14% 9% 16% 32% 37% 12% 9% 5% -3% 11%. Chart annotations: New section manager arrives … and is transferred out … with some explainable anomalies. Why would these happen?
  • 26. PREDICTING MARKS. “What determines a child’s marks? Do girls score better than boys? Does the choice of subject matter? Does the medium of instruction matter? Does community or religion matter? Does their birthday matter? Does the first letter of their name matter?” EDUCATION
  • 27. TN CLASS X: ENGLISH. [Chart: one axis runs from 0 to 100 in steps of 5, the other from 0 to 40,000 in steps of 5,000]
  • 28. TN CLASS X: SOCIAL SCIENCE. [Chart: one axis runs from 0 to 100 in steps of 5, the other from 0 to 40,000 in steps of 5,000]
  • 29. TN CLASS X: LANGUAGE. [Chart: one axis runs from 0 to 100 in steps of 5, the other from 0 to 40,000 in steps of 5,000]
  • 30. TN CLASS X: SCIENCE. [Chart: one axis runs from 0 to 100 in steps of 5, the other from 0 to 40,000 in steps of 5,000]
  • 31. TN CLASS X: MATHEMATICS. [Chart: one axis runs from 0 to 100 in steps of 5, the other from 0 to 40,000 in steps of 5,000]
  • 32. ICSE 2013 CLASS XII: TOTAL MARKS
  • 33. CBSE 2013 CLASS XII: ENGLISH MARKS
  • 34. CBSE 2013 CLASS XII: PHYSICS MARKS
  • 35. AUTOMATING DATA EXPLORATION. A structured approach to analysing data. METADATA, UNIVARIATE ANALYSIS, BIVARIATE ANALYSIS
  • 36. LET’S TAKE ONE DAY CRICKET DATA. Columns: Country, Player, Runs, ScoreRate, MatchDate, Ground, Versus. Rows: Australia, Michael J Clarke, 99*, 93.39, 30-06-2010, The Oval, England; Australia, Dean M Jones, 99*, 128.57, 28-01-1985, Adelaide Oval, Sri Lanka; Australia, Bradley J Hodge, 99*, 115.11, 04-02-2007, Melbourne Cricket Ground, New Zealand; India, Virender Sehwag, 99*, 99, 16-08-2010, Rangiri Dambulla International Stad., Sri Lanka; New Zealand, Bruce A Edgar, 99*, 72.79, 14-02-1981, Eden Park, India; Pakistan, Mohammad Yousuf, 99*, 95.19, 15-11-2007, Captain Roop Singh Stadium, India; West Indies, Richard B Richardson, 99*, 70.21, 15-11-1985, Sharjah CA Stadium, Pakistan; West Indies, Ramnaresh R Sarwan, 99*, 95.19, 15-11-2002, Sardar Patel Stadium, India; Zimbabwe, Andrew Flower, 99*, 89.18, 24-10-1999, Harare Sports Club, Australia; Zimbabwe, Alistair D R Campbell, 99*, 79.83, 01-10-2000, Queens Sports Club, New Zealand; Zimbabwe, Malcolm N Waller, 99*, 133.78, 25-10-2011, Queens Sports Club, New Zealand; Australia, David C Boon, 98*, 82.35, 08-12-1994, Bellerive Oval, Zimbabwe; Australia, Graeme M Wood, 98*, 63.22, 11-01-1981, Melbourne Cricket Ground, India; England, Ian J L Trott, 98*, 84.48, 20-10-2011, Punjab Cricket Association Stadium, India; India, Yuvraj Singh, 98*, 89.09, 01-08-2001, Sinhalese Sports Club Ground, Sri Lanka; Ireland, Kevin J O'Brien, 98*, 94.23, 10-07-2010, VRA Ground, Scotland; Kenya, Collins O Obuya, 98*, 75.96, 13-03-2011, M.Chinnaswamy Stadium, Australia; Netherlands, Ryan N ten Doeschate, 98*, 73.68, 01-09-2009, VRA Ground, Afghanistan; New Zealand, James E C Franklin, 98*, 142.02, 07-12-2010, M.Chinnaswamy Stadium, India; Pakistan, Ijaz Ahmed, 98*, 112.64, 28-10-1994, Iqbal Stadium, South Africa; South Africa, Jacques H Kallis, 98*, 74.24, 06-02-2000, St George's Park, Zimbabwe
  • 37. Against which countries are higher averages scored? Which countries’ players score more per match?
  • 38. Which player scores the most per ball? The player with the highest strike rate is an obscure South African whose name most of us have never heard. In fact, this list is filled with players we have never heard of.
  • 39. Most analysis answers the question “Which are the top 10 X?” Which are my top products? Which are my top branches? Who are my best sales people? Which vendors have the highest cost per unit? Which divisions are spending the most money? In which hours does the under 12 segment watch TV most? Which customer segment has the highest revenue per user?
  • 40. THIS QUESTION CAN BE ANSWERED SYSTEMATICALLY. The one-day cricket table from slide 36 is shown again. Take every column in the data. Find the top value by that column. Country: South Africa has the highest strike rate of 76%. Player: Johann Louw has the highest strike rate of 329%. Runs: 164 runs has the highest strike rate of 156%. MatchDate: 12-03-2006 has the highest strike rate of 136%. Ground: AC-VDCA Stadium has the highest strike rate of 98%. Versus: United States has the highest strike rate of 104%.
  • 41. What do children in schools know, and what can they do, at different stages of elementary education? Have the inputs made into the elementary education system had a beneficial effect or not?
  • 42. HAVING BOOKS IMPROVES READING ABILITY. Having more books at home improves the performance of children when it comes to reading. (But children typically have only 1-10 books at home.) Chart annotations: Number of students sampled. What is the impact? How many more marks can having more books fetch? Circle size indicates the number of students with this response. Few students have no books. Is this response (“25+ books”) good or bad? Small red bars indicate low marks. Large green bars indicate high marks. Students having 25+ books tend to score high marks. The most common response is marked in blue. This is also the largest circle. The graphic is summarized in words. A colour indicates whether the best response is the most popular: blue means that it is not, green means that it is, and red means that the worst level is the most popular response.
  • 43. CHILDREN LIKE GAMES, AND THEY’RE GOOD … but playing daily hurts reading ability
  • 44. WATCHING TV OCCASIONALLY IS GOOD. Children who watch TV every day don’t do as well as children who watch TV only once a week. But children who never watch TV fare the worst. Watching TV every day helps improve children’s reading ability a little bit more… … but mathematical abilities fall dramatically at that point
  • 45. WE HAVE A WEBSITE THAT YOU CAN EXPLORE: GRAMENER.COM/NAS
  • 46. AUTOMATING DATA EXPLORATION. A structured approach to analysing data. METADATA, UNIVARIATE ANALYSIS, BIVARIATE ANALYSIS
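
The recipe on slide 40 ("take every column in the data, find the top value by that column") can be sketched in a few lines. This is only an illustration, not the authors' tool: the function name `top_value_per_column` is hypothetical, the rows are a small sample copied from the cricket table on slide 36, and the score for a group is assumed to be the plain mean of `ScoreRate`, which may differ from how the slide's strike rates were aggregated.

```python
from collections import defaultdict
from statistics import mean

# A few rows copied from the one-day cricket table on slide 36.
ROWS = [
    {"Country": "Australia", "Player": "Michael J Clarke", "Versus": "England", "ScoreRate": 93.39},
    {"Country": "Australia", "Player": "Dean M Jones", "Versus": "Sri Lanka", "ScoreRate": 128.57},
    {"Country": "India", "Player": "Virender Sehwag", "Versus": "Sri Lanka", "ScoreRate": 99.0},
    {"Country": "New Zealand", "Player": "Bruce A Edgar", "Versus": "India", "ScoreRate": 72.79},
    {"Country": "Zimbabwe", "Player": "Malcolm N Waller", "Versus": "New Zealand", "ScoreRate": 133.78},
]


def top_value_per_column(rows, metric, columns):
    """For each column, return (value, mean metric) of the group with the highest mean.

    Assumption: groups are compared by the unweighted mean of `metric`.
    """
    result = {}
    for column in columns:
        # Collect the metric values seen for each distinct value of this column.
        groups = defaultdict(list)
        for row in rows:
            groups[row[column]].append(row[metric])
        averages = {value: mean(scores) for value, scores in groups.items()}
        best = max(averages, key=averages.get)
        result[column] = (best, averages[best])
    return result


if __name__ == "__main__":
    summary = top_value_per_column(ROWS, "ScoreRate", ["Country", "Player", "Versus"])
    for column, (value, average) in summary.items():
        print(f"{column}: {value} has the highest mean ScoreRate of {average:.2f}")
```

Because the loop treats every column the same way, the same function answers each "which are the top X?" question from slide 39 without writing a separate query per column. With so few rows, a group of one can easily come out on top, which is the same effect the slides point out for little-known players.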

Editor's Notes

  1. We did the simplest possible thing: plot the number of customers who had meter readings of 0, 1, 2, 3, and so on, all the way up to 300 and beyond. (Effectively, we drew a histogram.) As expected, it was log-normal, with relatively few users at low meter readings and few at high meter readings. What was striking were the spikes at 50 units, 100 units, 200 units and 300 units, precisely at the slab boundaries.

     Given the metering system, there is a strong economic incentive to stay at or within a slab boundary, because exceeding it increases the unit rate. There are two ways this could happen. Either the consumer watches their meter carefully and, the instant it hits 100, stops using their lights and fans, or a certain amount of money changes hands.

     It was easy to see from this that fraud was happening, but what stumped us were the spikes at 10, 20, 30, 40, etc. Here, there is no economic incentive: there is no significant difference between a meter reading of 10 and one of 11, so there was no reason to commit fraud. We later learnt that we were looking at this the wrong way. This was not a case of fraud, but of laziness. These were meter readings taken by staff who never visited the premises and were cooking up numbers. When people cook up numbers, they cook up round numbers. (An official said that he had to let go of one person who had not taken readings in a colony of houses for as long as six months. “Sir, there’s a pack of dogs in the colony” was his official statement.)

     The other question is the nature of this fraudulent contract. Is it monthly, with the meter reader appearing and charging a small sum to adjust the reading? Or is it an annual contract that is paid upfront? We looked at the meter readings of some of the people who were consistently at the slab boundaries. For example, the table in the middle has the readings of 10 customers, one per row. In the first row, the readings are at 200 for 9 of the 12 months, but there is a spike to 350 units in Jan-11. This suggested a monthly contract with a failure to pay in just one month. However, we later learnt that many of the people on this list were famous personalities. In fact, the lady in the first row had an event at her place in Jan-11, and the actual reading was expected to be well over a thousand units. Since the electricity board has a policy of not often auditing those in the highest slab (above 300), a more likely explanation was collusion between the lineman and the customer to place her in the highest slab for just this month, to avoid scrutiny.

     Lastly, we examined the level at which fraud can be controlled. The last table above shows the extent of fraud of each section in one city, month on month. (The extent of fraud can be measured by the relative height of the spikes compared to the expected value.) Sections vary in the level of fraud, with Section 1 having significantly more fraud than Section 9. We also observe that fraud generally decreases in the winter season (December to February), when the need for cooling is less. But what is most striking is the negative fraud in Section 5 in Jun-10. It stays low for a couple of months and then, as if to compensate, shoots up to 82% in Sep-10. We learnt that this coincided with the appointment and transfer of a new section manager, under whose “regime” fraud seems to have been dramatically controlled. It appears that a good organisational level at which to control fraud is the section manager level (about 5,000 people), rather than the staff level (about 100,000 people).
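
The spike check described in this note can be sketched as follows. The readings are the ten customer rows from slide 25, and the slab boundaries are the ones named above (50, 100, 200 and 300 units). The note does not say how the "expected value" was computed, so the baseline used here (the mean count of the readings within 10 units on either side) is an assumption, as are the function names `reading_counts`, `spike_excess` and `boundary_share`.

```python
from collections import Counter

# Monthly readings (Apr-10 to Mar-11) for the ten customers on slide 25, one row each.
READINGS = [
    [217, 219, 200, 200, 200, 200, 200, 200, 200, 350, 200, 200],
    [250, 200, 200, 200, 201, 200, 200, 200, 250, 200, 200, 150],
    [250, 150, 150, 200, 200, 200, 200, 200, 200, 200, 200, 150],
    [150, 200, 200, 200, 200, 200, 200, 200, 200, 200, 200, 50],
    [200, 200, 200, 150, 180, 150, 50, 100, 50, 70, 100, 100],
    [100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 110, 100],
    [100, 150, 123, 123, 50, 100, 50, 100, 100, 100, 100, 100],
    [0, 111, 100, 100, 100, 100, 100, 100, 100, 100, 50, 50],
    [0, 100, 27, 100, 50, 100, 100, 100, 100, 100, 70, 100],
    [1, 1, 1, 100, 99, 50, 100, 100, 100, 100, 100, 100],
]

SLAB_BOUNDARIES = (50, 100, 200, 300)  # boundaries named in the note


def reading_counts(rows):
    """Histogram of readings: how many times each exact value occurs."""
    return Counter(value for row in rows for value in row)


def spike_excess(counts, value, window=10):
    """Percentage excess of the count at `value` over a neighbour-based baseline.

    Assumption: the expected count is the mean count of the other readings
    within `window` units of `value`. This is not the authors' stated method.
    """
    neighbours = [counts.get(v, 0) for v in range(value - window, value + window + 1) if v != value]
    expected = sum(neighbours) / len(neighbours)
    observed = counts.get(value, 0)
    if expected == 0:
        # No neighbouring readings at all: any observation is an unbounded excess.
        return float("inf") if observed else 0.0
    return 100.0 * (observed - expected) / expected


def boundary_share(counts, boundaries=SLAB_BOUNDARIES):
    """Fraction of all readings that sit exactly on a slab boundary."""
    total = sum(counts.values())
    return sum(counts.get(b, 0) for b in boundaries) / total


if __name__ == "__main__":
    counts = reading_counts(READINGS)
    for boundary in SLAB_BOUNDARIES:
        print(boundary, counts.get(boundary, 0), spike_excess(counts, boundary))
    print("share of readings on a boundary:", boundary_share(counts))
```

These ten customers were picked because they sit on the boundaries, so the numbers this produces only illustrate the mechanics. On the full set of readings, the same per-value comparison would be run for every section and month to build a table like the one on slide 25.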