
Nikhil Agrawal and Vishwanand Vyas

Every environment can be characterized by factors such as medical concerns, resource allocation, mortality, and various others. Analyzing causes of death yields information about these factors and helps distinguish deaths caused by lack of vaccination, by environmental factors, and by natural causes. By categorizing the data by age, we can also determine the major causes of death in a particular age group, which in turn helps hospitals and doctors prepare accordingly. It also helps clinical practitioners make accurate health care decisions and public policy makers make better policy. Statistical methods have been widely used in studies of public health. Although useful in clinical research and public health policy making, these methods cannot find correlations among health conditions automatically or correctly capture the temporal evolution of causes of death. To cope with these two challenges, we implement the unsupervised machine learning method known as the “topic model” to study the state death data.
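
The age-based categorization described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the age brackets, the record format `(age, cause)`, and the sample records are all hypothetical and do not come from the project's data set.

```python
from collections import Counter, defaultdict

# Hypothetical age brackets (inclusive bounds); the source does not define any.
AGE_GROUPS = [(0, 14, "0-14"), (15, 44, "15-44"), (45, 64, "45-64"), (65, 200, "65+")]


def age_group(age):
    """Return the label of the bracket that contains `age`."""
    for low, high, label in AGE_GROUPS:
        if low <= age <= high:
            return label
    raise ValueError(f"age out of range: {age}")


def leading_causes(records, top_n=1):
    """Map each age group to its `top_n` most frequent causes of death."""
    counts = defaultdict(Counter)
    for age, cause in records:
        counts[age_group(age)][cause] += 1
    return {group: counter.most_common(top_n) for group, counter in counts.items()}


# Made-up records (age at death, cause), for illustration only.
records = [
    (3, "measles"), (5, "measles"), (7, "accident"),
    (30, "accident"), (35, "accident"), (40, "tuberculosis"),
    (70, "heart disease"), (82, "heart disease"), (75, "stroke"),
]

result = leading_causes(records)
print(result)
```

Each age group maps to a list of `(cause, count)` pairs, so a hospital could read off the dominant cause for the group it serves.
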

Data mining (also known as knowledge discovery from databases) is the process of extracting hidden, previously unknown, and potentially useful information from databases. The extracted information can be analyzed for future planning and development perspectives. With the help of data mining techniques such as regression, clustering, and prediction, we will be able to devise a pattern and then analyze the given data set from the census. The exponentially increasing amount of data generated each year makes getting useful information from that data more and more critical. The information is frequently stored in a data warehouse, a repository of data gathered from various sources, including corporate databases, summarized information from internal systems, and data from external sources. Analysis of the data includes simple query and reporting, statistical analysis, more complex multidimensional analysis, and data mining.

Data Mining: Data mining can be defined as the process of extracting data, analyzing it from many dimensions or perspectives, and then producing a summary of the information in a useful form that identifies relationships within the data. There are two types of data mining: descriptive, which gives information about existing data; and predictive,
which makes forecasts based on the data.

For all analytic tools, it is important to keep business goals in mind, both in selecting and deploying tools and in using them. In putting these tools to use, it is helpful to look at where they fit into the decision-making process. The five steps in decision-making can be identified as follows:
• Develop standard reports.
• Identify exceptions; unusual
situations and outcomes that
indicate potential problems or
• Identify causes of the exceptions.
• Develop models for possible
• Track effectiveness.


There can be multiple causes of death, which opens a wide field for analysis. Mohammad Hossein Saraee et al. studied deaths of children caused by accidents and suggested how death rates due to accidents can be brought down. They used techniques such as feature selection, the C&R tree algorithm, blank handling, pruning, and Bayesian networks to analyze the causes. The data mining methods in use are decision trees and Bayes’ theorem. The authors concluded that the mortality rate of patients who receive primary treatment before arriving at the hospital emergency ward is much lower than that of patients who receive no care before reaching the hospital. This point emphasizes the value of pre-hospital care, the necessity of ambulance equipment, and the importance of teaching first aid through public media.

James K.
Tamgno et al. developed a mobile app to collect and store data on medical autopsies. The system automates the collection, sending, and insertion of the data into a centralized database. Verbal autopsies allow the identification of major health problems, the comparison of local and national differences in mortality ratios, the monitoring of trends over time, and the evaluation of interventions and health programs.

Hanyu Jiang, Hang
Wu et al. used techniques such as trend analysis, the topic model, the biterm topic model, and the coherence score to analyze the causes of death in the United States of America between 1999 and 2014. The different topics are first clustered, and the coherence score is then used to measure the quality of the discovered topics. They collected the mortality data from death certificates, so their analysis is limited by the availability of death certificates. Their aim is to provide feedback from their analysis to clinical practitioners and public health policymakers so that better health care services can be provided.

Hesham Abdo Ahmed
Aqlan et al. used web mining techniques to predict the likelihood of future death and disease events of interest. They used large-scale digital histories, captured over a duration of 18 years from news reports in the Queensland Government archive, to make real-time predictions. Death prediction is performed using various classifiers, and the results are analyzed using error as a metric. They modeled the data and forecast events, which is also known as time series analysis and time series forecasting. The paper proposes death prediction and analysis of deaths and uses four different methods to predict deaths from large-scale data.

Ogochukwu C. Okeke et al.
performed census analysis to give a geospatial distribution for Nigeria, which is a very populous country. Their effort is directed towards harnessing the power of data mining techniques to develop a mining model applicable to the analysis of census data. They aim to provide the government with intelligence for strategic planning, tactical decision-making, better policy formulation, and better-informed business decisions. The techniques used were the decision tree algorithm and the structured system analysis and design methodology. With the proposed system, their aim is to provide better facilities to all the citizens of Nigeria who might not be getting proper facilities and exposure.

Munaza Ramzan classified and characterized the treatment strategies for critical diseases such as cancer, cardiovascular diseases, and diabetes, which are leading causes of death worldwide. The medical data was analyzed with the WEKA data mining tool, using the J48, naive Bayes, and random forest classifiers. The work evaluates disease categorization using these three machine learning algorithms and reports that random forest is the best classifier for disease categorization because it runs efficiently on large data sets.

Du
Zhang et al. mined vital statistics data on causes of death in the state of California. They used a data mining tool called Cubist to build predictive models from two million cases over a nine-year period. They used techniques such as the committee model, the Cubist model, and data selection strategies to carry out the analysis. The objective of their study is to discover knowledge that can be used to gain insight into various aspects of mortality in California, to predict health issues related to the causes of death, to aid the decision-making or policy-making process, and to provide useful information services to customers.
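
Among the techniques reviewed above, the biterm topic model works on "biterms", that is, unordered pairs of words that occur together in the same short text. The sketch below shows only this extraction step, not the full model or its inference. The sample documents are invented, and the helper name `biterms` is a hypothetical choice made for this illustration.

```python
from collections import Counter
from itertools import combinations


def biterms(tokens):
    """Return every unordered word pair (biterm) from one short document."""
    # Sorting each pair makes ("heart", "disease") and ("disease", "heart") identical.
    return [tuple(sorted(pair)) for pair in combinations(tokens, 2)]


# Invented short "documents", e.g. tokenized cause-of-death descriptions.
docs = [
    ["heart", "disease", "diabetes"],
    ["heart", "disease"],
    ["lung", "cancer"],
]

# Corpus-level biterm counts, the raw input a biterm topic model would learn from.
biterm_counts = Counter(b for doc in docs for b in biterms(doc))
print(biterm_counts.most_common(3))
```

Counting biterms over the whole corpus, rather than words per document, is what lets this family of models cope with very short texts.
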


To build the proposed system for analyzing the statistics, we require knowledge in the following fields.


1. Data Mining
2. Statistics
3. Python or R programming
4. Data Analysis Tools

Data analysis tools are used to represent the statistical data in the form of graphical diagrams for better understanding and prediction.

4.1 DATA MINING: Knowledge of data mining is crucial for this project, as it involves various techniques such as regression, classification, and clustering.

Regression is a technique used to model and analyze the relationships between variables and, often, how they contribute and relate to producing a particular outcome together. A linear regression refers to a regression model that is made up entirely of linear variables. Beginning with the simple case, single-variable linear regression is a technique used to model the relationship between a single input independent variable (feature variable) and an output dependent variable using a linear model.
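
A minimal sketch of single-variable linear regression, fitted with the ordinary least-squares closed form, is given below. The function name `fit_line` and the yearly death counts are hypothetical and serve only to illustrate the technique.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x, and sum of cross-deviations of x and y.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical; the slope is undefined")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up data: year index (feature) versus number of recorded deaths (target).
years = [0, 1, 2, 3, 4]
deaths = [100, 110, 120, 130, 140]

slope, intercept = fit_line(years, deaths)
forecast = slope * 5 + intercept  # extrapolate one year ahead
print(slope, intercept, forecast)
```

Here the independent variable is the year index and the dependent variable is the death count; the fitted line can then be used to extrapolate to the next year.
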

Classification is a data mining function that assigns items in a collection to target categories or classes. The goal of classification is to accurately predict the target class for each case in the data. Classes are discrete and do not imply order. Continuous, floating-point values would indicate a numerical, rather than a categorical, target. A predictive model with a numerical target uses a regression algorithm, not a classification algorithm.
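
To make the idea of predicting a discrete target class concrete, here is a small categorical naive Bayes classifier with add-one (Laplace) smoothing. The source does not name a specific classifier at this point; naive Bayes is used here only because it appears among the methods in the reviewed work. The feature names, labels, and training rows are invented.

```python
import math
from collections import Counter, defaultdict


def train(rows, labels):
    """Count class frequencies and per-class feature-value frequencies."""
    class_counts = Counter(labels)
    # feature_counts[label][feature_index][value] -> number of occurrences
    feature_counts = defaultdict(lambda: defaultdict(Counter))
    values = defaultdict(set)  # distinct values seen for each feature index
    for row, label in zip(rows, labels):
        for i, value in enumerate(row):
            feature_counts[label][i][value] += 1
            values[i].add(value)
    return class_counts, feature_counts, values


def predict(model, row):
    """Return the class with the highest smoothed log-posterior for `row`."""
    class_counts, feature_counts, values = model
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in class_counts.items():
        score = math.log(count / total)  # log prior
        for i, value in enumerate(row):
            numerator = feature_counts[label][i][value] + 1  # add-one smoothing
            denominator = count + len(values[i])
            score += math.log(numerator / denominator)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Invented training data: (age group, region) -> category of cause of death.
rows = [("child", "rural"), ("child", "rural"), ("child", "urban"),
        ("adult", "urban"), ("adult", "urban"), ("adult", "rural")]
labels = ["infection", "infection", "accident",
          "accident", "accident", "infection"]

model = train(rows, labels)
print(predict(model, ("child", "rural")), predict(model, ("adult", "urban")))
```

The predicted value is always one of the class labels seen in training, which is what distinguishes this task from regression with a numerical target.
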

Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense) to each other than to those in other groups (clusters).
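
As an illustration of grouping similar objects, the sketch below runs k-means on one-dimensional data. k-means is one common clustering algorithm; the source does not state which algorithm the project uses, so this choice, the starting centroids, and the sample ages are assumptions.

```python
def kmeans_1d(points, centroids, max_iter=100):
    """Cluster 1-D points with k-means, starting from the given centroids."""
    centroids = list(centroids)
    clusters = [[] for _ in centroids]
    for _ in range(max_iter):
        # Assignment step: put each point in the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = [
            sum(cluster) / len(cluster) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
        if new_centroids == centroids:  # converged, assignments are stable
            break
        centroids = new_centroids
    return centroids, clusters


# Made-up ages at death; the two starting centroids are arbitrary guesses.
ages = [2, 4, 6, 60, 70, 80]
centroids, clusters = kmeans_1d(ages, centroids=[0, 100])
print(centroids, clusters)
```

Points in the same resulting cluster are closer to their own centroid than to any other, which matches the similarity criterion in the definition above.
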


The mathematical techniques applied to the data set, such as topic models, average estimation, and segregation, all require knowledge of statistics. Mathematical knowledge is required to formulate and standardize the data set into different classes, on which the data mining algorithms are then carried out.

4.3 Python and R programming

R and Python are both open-source programming languages with large communities. New libraries and tools are added continuously to their respective catalogs. R is mainly used for statistical analysis, while Python provides a more general approach to data science.

R and Python are state-of-the-art programming languages oriented towards data science. Learning both of them is, of course, the ideal solution. However, R and Python each require a time investment, and such a luxury is not available to everyone. Python is a general-purpose language with a readable syntax. R, however, was built by statisticians and encompasses their specific language.


Weka tool: An advantage of using Weka is that it is easy to learn. Although it is a machine learning tool, its interface is intuitive enough for you to get the job done quickly. It provides options for data pre-processing, classification, regression, clustering, association rules, and visualization. Most of the steps you think of while building a model can be achieved using Weka. It is built on Java.

Datawrapper Tool: Datawrapper is an online data-visualization tool for making interactive charts. Once you upload the data from a CSV, PDF, or Excel file or paste it directly into the field, Datawrapper will generate a bar chart, line chart, map, or any other related visualization. Datawrapper graphs can be embedded into any website or CMS with ready-to-use embed codes. Many reporters and news organizations use Datawrapper to embed live charts into their articles. It is very easy to use and produces effective graphics.


In this project we are going to analyze the environment based on the causes of death. There can be numerous causes of death in an area, and we are going to use data mining to find the major ones. We are going to take the data set of a developing country that is also very populous. In very populous regions with high death rates, a cause of death sometimes goes unnoticed until it has become a large problem. This project is going to help identify such causes much earlier. The project is based on analyzing the data set from the W.H.O. and formulating a model using classification, clustering, regression, and the biterm topic model. To understand the data set, we are going to represent the refined data graphically using visualization tools such as WEKA.

When data is first collected, it may contain many redundant and improper entries. A feature selection algorithm is used to remove the redundant data. We are then going to classify the data with a classification technique and use a clustering technique to group the data into different data sets (trend analysis). Once the data is segregated by cause of death, we can represent it using bar graphs and pie charts. The topics that are clustered together will be analyzed using the coherence score, which measures the quality of the topics, that is, how influential each topic is. The graphical representation of the data will show us the major factors of death, and we can then suggest methods to bring down the deaths due to the major causes. The aim of our project is to raise the standard of living in highly populated areas by reducing death rates and providing better facilities.
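
The coherence score mentioned above can be computed in several ways, and the source does not say which variant is intended. The sketch below assumes the UMass coherence, which scores a topic by how often its top words appear together in the same documents. The documents and topics are invented for illustration.

```python
import math


def umass_coherence(top_words, documents):
    """UMass coherence of one topic: sum of log((D(wm, wl) + 1) / D(wl)) over word pairs.

    `top_words` should be ordered from most to least probable in the topic.
    D(...) counts the documents that contain all of the given words.
    """
    doc_sets = [set(doc) for doc in documents]

    def doc_freq(*words):
        return sum(1 for d in doc_sets if all(w in d for w in words))

    score = 0.0
    for m in range(1, len(top_words)):
        for l in range(m):
            d_l = doc_freq(top_words[l])
            if d_l == 0:
                raise ValueError(f"word never occurs in the corpus: {top_words[l]}")
            score += math.log((doc_freq(top_words[m], top_words[l]) + 1) / d_l)
    return score


# Invented tokenized documents, e.g. cause-of-death descriptions.
documents = [
    ["heart", "disease", "stroke"],
    ["heart", "disease"],
    ["lung", "cancer"],
    ["heart", "cancer"],
]

coherent = umass_coherence(["heart", "disease"], documents)  # words co-occur often
mixed = umass_coherence(["heart", "lung"], documents)        # words never co-occur
print(coherent, mixed)
```

Scores closer to zero indicate topics whose words tend to occur together, so topics can be ranked by this value before they are interpreted.
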

1. The analysis will provide results based on which we can make suggestions to the government to improve the management and services in terms of medical supplies provided to the affected area.
2. We will be able to understand the major factors responsible for major diseases and deaths and their effects in the area.
3. Our analysis will also help to reflect upon the crime rates and deaths due to accidents in the area, so that the relevant authorities can be informed immediately in order to bring down the crime rate and improve the traffic system in the area.
4. After the necessary action has been taken on a problem, we can see the results of the steps taken in the next survey and suggest the necessary changes to the management system, making it more efficient.


The analysis will reveal loopholes in the present system. Once these problems are rectified, we can focus on the overall development of the medical standards of the area. Making the environment healthy will later increase the economic value of the area, and its residents will gain benefits such as increased property values and a decrease in diseases and rodent infestation. The project will help not only economically but will also improve the area's environmental value by decreasing contaminants and disease-spreading animals and insects.

From the management point of view, it will be more efficient to distribute medical supplies to areas based on the results, rather than providing some areas with more supplies and others with less. The result of our project will explain the need for medical supplies and specify the diseases for which they are needed.


In this report, we have studied how to extract and analyze the important information from the data set. We have also learned how to represent the vast data set graphically by using certain influence factors procured by analyzing the data set. Based on this representation and what can be derived from it, we can then take the necessary steps to ensure that the efficiency of the system is increased and the target objective is achieved, with some room for improvements and future upgrades.


The proposed model of analysis is intended to be improved in the future and has great scope given the upcoming technological advancements, with everything becoming digitalized and increasing the ease with which management of the area can be

1. Hanyu Jiang, Hang Wu, and May Dongmei Wang, “Causes of Death in the United States, 1999 to 2017,” IEEE, 2017.

2. A. Jemal, E. Ward, Y. Hao, and M. J. Thun, “Trends in the leading causes of death in the United States, 1970-2002,” JAMA, vol. 294, no. 10, pp. 1255–1259, 2005.

3. Hesham Abdo Ahmed Aqlan, Shoiab Ahmed, and Ajit Danti, “Death Prediction and Analysis Using Web Mining Techniques,” 2017 International Conference on Advanced Computing and Communication Systems (ICACCS-2015), Jan. 06-07, 2017, Coimbatore, India.

4. Gerami Farzad, Bartashak Masoumeh, and Kourosh Honarmand, “Prediction of Workplace Accidents with Knowledge Discovery Approach Using Weka software,” Nova Explore Publications, Nova Journal of Engineering and Applied Sciences, vol. 2(5), May 2014.

5. Du Zhang, Quoc Luan Ha, and Meiliu Lu, “Mining California Vital Statistics Data,” IEEE, 2001.

6. Munaza Ramzan, “Comparing and Evaluating the Performance of WEKA Classifiers on Critical Diseases,” IEEE, 2016.

7. Chintan Shah and Anjali G. Jivani, “Comparison of Data Mining Classification Algorithms for health
