Statistics is the science of data. The term statistics is derived from the New Latin statisticum collegium (“council of state”) and the Italian word statista (“statesman”). In a statistical investigation, constraints of time or cost often mean that one cannot study every individual element of a population.
Statistics deals with the collection, classification, analysis, and interpretation of data, and it provides us with an objective approach to doing so. Several statistical techniques are available for learning from data. One should note that the scope of statistical methods is much wider than only statistical…
Data acquisition is all about obtaining the artifacts that contain the input
data from a variety of sources, extracting the data from the artifacts, and
converting it into representations suitable for further processing.
The three main sources of data are the Internet (namely, the World Wide Web), databases, and local files (possibly previously downloaded by hand or using additional software).
Not All Data Is Created Equal
Although we’d like to believe in the veracity and quality of every dataset we see, not all datasets will measure up to our expectations. Even datasets we currently use could prove to be ineffective and inefficient sources after further research. As we explore automated solutions to the data wrangling problems we face, we will find that Python’s tools can help us distinguish good data from bad and assess the viability of our data.
Readability, Cleanliness, and Longevity
We can use Python to help us read illegible data, but the illegibility may mean the data…
Time Series Analysis — Introduction
Weather, stock markets, and heartbeats. They all form time series. If you’re interested in diverse data and forecasting the future, you’re interested in time series analysis.
Time series data spans a wide range of disciplines and use cases. It can be anything from customer purchase histories to conductance measurements of a nano-electronic system to digital recordings of human language. One point we discuss throughout the book is that time series analysis applies to a surprisingly diverse set of data.
Time series analysis is the endeavor of extracting meaningful summary and statistical information from points…
In this discussion, we look at a particular and very important type of choice in data modeling. In fact, it is so important that we introduce a special convention, subtyping, to allow our E-R diagrams to show several different options at the same time. We will also find subtyping useful for concisely representing rules and constraints, and for managing complexity. Our emphasis in this discussion is on the conceptual modeling phase, and we touch only lightly on logical modeling issues.
Different Levels of Generalization
It is important to recognize that our choice of level of generalization will have a…
The focus of this discussion is on ensuring that the data meets business requirements.
Much of the discussion is devoted to the correct use of terminology and diagramming conventions, which provide a bridge between technical and business views of data requirements.
A Diagrammatic Representation
The fact that each operation can be performed by only one surgeon (because each row of the Operation table allows only one surgeon number) is an important constraint imposed by the data model, but is not immediately apparent.
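As a small sketch of what that looks like in practice (the column names here are invented for illustration, not taken from the text), the single surgeon-number column per row is exactly what enforces the one-surgeon rule:

```python
# Each row of the Operation table holds exactly one surgeon number,
# so the structure itself prevents recording a second surgeon
# for the same operation.
operation_table = [
    # (operation_no, surgeon_no, procedure)
    (101, "S01", "appendectomy"),
    (102, "S01", "coronary bypass"),
    (103, "S02", "cataract removal"),
]

# Every operation therefore maps to one and only one surgeon.
surgeon_for = {op_no: surgeon for op_no, surgeon, _ in operation_table}
print(surgeon_for[101])  # S01
```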
Process modelers solve this sort of problem by using diagrams, such as data flow diagrams and activity…
The principal tool is normalization, a set of rules for allocating data to tables in such a way as to eliminate certain types of redundancy and incompleteness.
Normalization is usually one of the later activities in a data modeling project, as we cannot start normalizing until we have established what columns (data items) are required.
Normalization is used in the logical database design stage, following requirements analysis and conceptual modeling.
An Informal Example of Normalization
Normalization is essentially a two-step process:
1. Put the data into a tabular form (by removing repeating groups).
2. Remove duplicated data to separate…
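As a toy illustration of these two steps (the customer/order data below is invented for the demonstration, not an example from the text), the same process can be sketched in Python:

```python
# An un-normalized record: customer details plus a repeating group of orders.
customer = {
    "cust_id": 1, "name": "Ann", "city": "Oslo",
    "orders": [("O1", 30.0), ("O2", 45.0)],  # repeating group
}

# Step 1: remove the repeating group by flattening it into tabular rows.
flat_rows = [
    (customer["cust_id"], customer["name"], customer["city"], order_no, amount)
    for order_no, amount in customer["orders"]
]

# Step 2: remove the duplicated customer data by splitting the rows into
# two tables linked by cust_id.
customer_table = {(customer["cust_id"], customer["name"], customer["city"])}
order_table = [(cid, order_no, amount)
               for cid, _, _, order_no, amount in flat_rows]

print(customer_table)  # {(1, 'Ann', 'Oslo')}
print(order_table)     # [(1, 'O1', 30.0), (1, 'O2', 45.0)]
```

Note how the customer's name and city now appear only once, while cust_id links each order back to its customer.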
What is a data model?
Data Modeling refers to the practice of documenting software and business system design. The “modeling” of these various systems and processes often involves the use of diagrams, symbols, and textual references to represent the way the data flows through a software application or the Data Architecture within an enterprise.
Why Is the Data Model Important?
When designing programs or report layouts (for example), we generally settle for a design that “does the job” even though we recognize that with more time and effort we might be able to develop a more elegant solution.
Apache Spark is a lightning-fast cluster computing technology. It extends the Hadoop MapReduce model to support a wider range of computations efficiently, including interactive queries and stream processing.
For this exercise, I want to show how convenient Apache Spark is.
1. Import the libraries and create a Spark session.
from pyspark.sql import SparkSession
import psycopg2  # needed for the PostgreSQL connection in step 2

spark = SparkSession.builder.getOrCreate()
2. Create Connection to PostgreSQL.
conn = psycopg2.connect(host='localhost', database='postgres', user='postgres', password='postgres')
cur = conn.cursor()
3. Load the data.
# header=True assumes the first row of the CSV holds column names.
pop_data = spark.read.csv('ratings.csv', header=True, inferSchema=True)
4. Create table in PostgreSQL.
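The snippet stops here, so the following is only a sketch of what this step might look like. The column names assume a MovieLens-style ratings.csv (userId, movieId, rating, timestamp) and should be adjusted to match the actual file:

```python
# Hypothetical DDL for the ratings data; adjust names/types to your CSV.
create_stmt = """
CREATE TABLE IF NOT EXISTS ratings (
    user_id  INTEGER,
    movie_id INTEGER,
    rating   NUMERIC(2,1),
    ts       BIGINT
)
"""
# Executed with the cursor opened in step 2:
# cur.execute(create_stmt)
# conn.commit()
```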