Data analysis dates back to before 1000 BC, in the Egyptian era, and is more relevant than ever in 2018. That longevity shows how important data analysis is and will continue to be.
Data analysis refers to the technique of collecting raw data, analysing it and transforming it into information that can be used to reach a specific conclusion.
Let’s walk through the stages of data analysis using an example: a student has to analyse whether private or government schools are better.
- The first phase is setting the objective: you should know why you are collecting and analysing the data. In this case, our objective is to determine which schooling system is better.
- Once the objective is set, the next phase has two parts: deciding what data to analyse, and deciding how to analyse it.
For “what data to analyse”, in our example this would be the quality of education provided, pass rates, grade-wise basic knowledge of subjects, etc.
For “how to analyse the data”, this means fixing parameters such as which benchmark to use for checking the quality of education and which questions to include to test basic subject knowledge.
- With the objective and the parameters set, the next phase is to gather all the available information relevant to the second phase. Not all of it will be useful, but at this stage the aim is simply to collect as much data as possible.
- In the final phase, the collected data is sorted: unnecessary information is discarded and the useful data is organized. The best way to do this is by asking the following questions:
1. Does the collected data provide satisfactory answers to the question about schools?
2. Will the data collected prove useful to defend counter questions?
Once all the stages are completed, you will have well-organized data that helps you answer the question we took as an example.
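To make the last stage concrete, here is a minimal Python sketch of discarding incomplete records and organizing the rest. The records, the field names (school_type, pass_rate, avg_test_score) and the helper functions are all made up for illustration; they are not real survey results.

```python
from statistics import mean

# Made-up survey records, for illustration only; not real results.
collected = [
    {"school_type": "private", "pass_rate": 0.91, "avg_test_score": 72},
    {"school_type": "government", "pass_rate": 0.88, "avg_test_score": 75},
    {"school_type": "private", "pass_rate": None, "avg_test_score": 64},  # incomplete
    {"school_type": "government", "pass_rate": 0.79, "avg_test_score": 68},
]

# Fields a record must have to be useful for the comparison (an assumption).
REQUIRED = ("school_type", "pass_rate", "avg_test_score")


def organize(records):
    """Discard incomplete records and group the rest by school type."""
    grouped = {}
    for record in records:
        if any(record.get(field) is None for field in REQUIRED):
            continue  # unnecessary or unusable information is dropped
        grouped.setdefault(record["school_type"], []).append(record)
    return grouped


def summarize(grouped, metric):
    """Average one metric per school type."""
    return {
        school_type: mean(row[metric] for row in rows)
        for school_type, rows in grouped.items()
    }
```

Calling summarize(organize(collected), "avg_test_score") gives one average per school type, which is the kind of organized result the two questions above are meant to check.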
How do you improve your data analysis skills on a daily basis?
A few basic habits can make a big difference to how you develop your skills:
- Gather: Go through materials related to data analytics. There are various tutorials available online, and you can also read books, watch videos, etc.
- Documentation: Once you gather information, it is important to retain it, and what’s better than making your own notes? They can be just scribbles about the highlights of a particular topic in data analytics; they will help you go back to ideas that arose during the tutorial.
- Art of Application: Whatever information you have, it is important to apply it in practice. Try to come up with scenarios and find solutions to them.
One of the most important tools that analysts overlook is the use of proxies.
Proxies can help in the following ways:
– Automating the process of extracting the data without worrying about the IP getting blocked
– Spoofing location and getting geo-specific data
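As a rough sketch of the first point, the Python below rotates requests across a list of proxies using only the standard library. The proxy addresses are placeholders from a range reserved for documentation, and the helper names (proxy_cycle, opener_for, fetch) are our own; substitute the endpoints your proxy provider gives you.

```python
import itertools
import urllib.request

# Placeholder proxy endpoints (documentation-only addresses); replace with real ones.
PROXIES = [
    "http://203.0.113.10:8080",
    "http://203.0.113.11:8080",
]


def proxy_cycle(proxies):
    """Return an iterator that yields the proxies in round-robin order, forever."""
    return itertools.cycle(proxies)


def opener_for(proxy_url):
    """Build a urllib opener that routes HTTP and HTTPS traffic through proxy_url."""
    handler = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    return urllib.request.build_opener(handler)


def fetch(url, rotation, timeout=10):
    """Fetch url through the next proxy in the rotation and return the raw body."""
    opener = opener_for(next(rotation))
    with opener.open(url, timeout=timeout) as response:
        return response.read()
```

Each call to fetch uses the next proxy in the list, so consecutive requests do not all come from the same IP address.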
Big Data, as the name suggests, is a collection of large data sets, which can be structured or unstructured.
This data can give an organization insights that help it make data-driven decisions, improving the overall progress of the company.
Big Data was originally characterized by 3 Vs (volume, variety and velocity), but the list has since expanded to 6 Vs.
1. Volume: the amount of data collected from the different sources that are relevant to the organization.
2. Variety: information comes in a variety of formats, ranging from organized, conventional databases to unstructured content such as email, video and sound.
3. Velocity: the speed at which the data is collected.
4. Veracity: one of the additional characteristics, meaning the level of authenticity of the collected data.
5. Value: once the data is collected and structured, the next characteristic to measure is the value that it holds for the organization.
6. Variability: how much the data and its meaning change over time, which affects the various ways the collected data can be used.
How does Big Data leave an impact?
The size or volume of the data collected matters less than how the collected information is used.
The collected data can be analyzed to answer questions about cost effectiveness, time management and data-driven decisions. When big data is combined with such analytics, you can:
- Determine the exact cause of a system failure, the pressing issues and other effects, almost in real time
- Analyze the behavior of potential buyers and so improve or personalize the sign-up process
- Detect fraud patterns in the massive amount of data and block them before they deal a massive blow
Who uses Big Data:
1. Banking Sector: In banking, a large amount of data flows in all the time. Banks usually use this information to improve security and the customer experience, but they need powerful analytics tools to take real advantage of it.
2. Schooling and overall education: The need for big data analysis is usually played down in education, but it can be a crucial step towards improving the overall education system. Big data analysis can readily show key results related to students’ performance, the average pass percentage and what other improvements can be made to the system.
3. Healthcare Industry: This industry, or more rightly “service”, benefits from its patients’ data, such as previous diagnoses, past and current medication and any other patient records. Analysing this data can prove crucial for effective treatment.
4. Manufacturing: Analysing large volumes of manufacturing data helps improve product quality and customer service, and helps companies recognize market trends and what customers want.
5. Administration: An administration is a large body with many subdivisions, and the data generated by each department can be enormous. Analysing it helps the administration understand people’s issues, respond to them quickly and provide a better way of life to the people of the region.
Hadoop is software that allows Big Data to be processed across many computers using a simple programming model.
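Hadoop’s processing model is MapReduce. As a rough illustration of what such a simple program looks like, the sketch below counts words using map, shuffle and reduce steps in plain Python. It runs in a single process and does not use any Hadoop API; the function names are our own.

```python
from collections import defaultdict


def map_phase(line):
    """Map step: emit a (word, 1) pair for every word in one line of input."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle step: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce step: combine all values for one key into a single count."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle and reduce over an iterable of text lines."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())
```

On a real cluster the map and reduce steps run on many machines at once, each working on its own slice of the data; here they run one after another purely to show the shape of the program.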
The problems with Big Data and how Hadoop solves them
1. Storage: The volume of an organization’s big data keeps growing, and putting resources into a single high-capacity storage server is not cost effective.
Resolution: Hadoop uses HDFS, the Hadoop Distributed File System, which stores data across different machines and lets the same data be processed in parallel.
2. High volume of varied data: This creates an issue because the data received can be structured, unstructured or semi-structured, and can arrive in different formats.
Resolution: This issue is also resolved by the Hadoop Distributed File System. There is no schema validation before data is dumped into it, so whenever new data is put into HDFS there is no need to define a schema first.
3. Computing power: Data nowadays usually arrives in terabytes, and processing all of it takes a great deal of time. For example, if you have 2 TB of data and can process it at 1 GB per second, processing everything takes around 34 minutes, and this grows significantly as the volume of data increases.
Resolution: Hadoop runs on a cluster of computers, so the data is processed in parallel, which significantly reduces the computing time.
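The 34-minute figure can be checked with a little arithmetic. The sketch below assumes the 2 TB is read at about one gigabyte per second, using binary units (at one gigabit per second it would take eight times longer), and it assumes the work splits perfectly evenly across nodes with no coordination overhead, which a real cluster will not quite achieve. The function name and the 10-node cluster are our own choices.

```python
TIB = 2**40  # bytes in one tebibyte
GIB = 2**30  # bytes in one gibibyte


def processing_minutes(data_bytes, bytes_per_second, nodes=1):
    """Idealized minutes needed to stream data_bytes, split evenly across nodes."""
    seconds = data_bytes / (bytes_per_second * nodes)
    return seconds / 60


single_machine = processing_minutes(2 * TIB, 1 * GIB)             # about 34.1 minutes
ten_node_cluster = processing_minutes(2 * TIB, 1 * GIB, nodes=10)  # about 3.4 minutes
```

Under these assumptions a single machine needs 2048 seconds, a little over 34 minutes, while ten machines working in parallel need about a tenth of that.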
Let’s dive a bit into the technical side of Hadoop configuration. The following are the steps for a single-node installation.
– Hadoop needs a working Java 1.5+ (aka Java 5) installation. Update the source list first:
user@ubuntu:~$ sudo apt-get update # Update the source list
If Java is not already installed, install it:
user@ubuntu:~$ sudo apt-get install sun-java6-jdk
Adding a dedicated Hadoop system user
user@ubuntu:~$ sudo addgroup hadoop_group
user@ubuntu:~$ sudo adduser --ingroup hadoop_group hdusert
This will add the user hdusert and the group hadoop_group to the local machine.
Add hdusert to the sudo group
user@ubuntu:~$ sudo adduser hdusert sudo
An SSH key has to be generated for the hdusert user
user@ubuntu:~$ su - hdusert
hdusert@ubuntu:~$ ssh-keygen -t rsa -P ""
NOTE: -P "" sets an empty passphrase, so this creates an RSA key pair with no passphrase.
Now SSH access to your local machine has to be enabled with the newly created key, using the following command.
hdusert@ubuntu:~$ cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
The first step is to switch to hdusert
user@ubuntu:~$ su - hdusert
The next step is to download and extract Hadoop 1.2.0 (these steps assume it ends up in /usr/local/hadoop), and then set up the environment variables for Hadoop
export HADOOP_HOME=/usr/local/hadoop
# Add Hadoop bin/ directory to PATH
export PATH=$PATH:$HADOOP_HOME/bin
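In the file conf/hadoop-env.sh, uncomment the JAVA_HOME line and point it at your JDK. The paths below are for OpenJDK 6; use the directory of the JDK you actually installed (look under /usr/lib/jvm).
export JAVA_HOME=/usr/lib/jvm/java-6-openjdk-amd64 (for 64 bit)
export JAVA_HOME=/usr/lib/jvm/java-6-openjdk-i386 (for 32 bit)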
Create the directory and set the required ownership and permissions
hdusert@ubuntu:~$ sudo mkdir -p /app/hadoop/tmp
hdusert@ubuntu:~$ sudo chown hdusert:hadoop_group /app/hadoop/tmp
hdusert@ubuntu:~$ sudo chmod 750 /app/hadoop/tmp
Paste the following properties between the <configuration> and </configuration> tags of each file.
In file conf/core-site.xml
<property>
  <name>hadoop.tmp.dir</name>
  <value>/app/hadoop/tmp</value>
  <description>A base for other temporary directories.</description>
</property>
<property>
  <name>fs.default.name</name>
  <value>hdfs://localhost:54310</value>
  <description>The name of the default file system. A URI whose scheme and authority determine the FileSystem implementation. The uri's scheme determines the config property (fs.SCHEME.impl) naming the FileSystem implementation class. The uri's authority is used to determine the host, port, etc. for a filesystem.</description>
</property>
In file conf/mapred-site.xml
<property>
  <name>mapred.job.tracker</name>
  <value>localhost:54311</value>
  <description>The host and port that the MapReduce job tracker runs at. If "local", then jobs are run in-process as a single map and reduce task.</description>
</property>
In file conf/hdfs-site.xml
<property>
  <name>dfs.replication</name>
  <value>1</value>
  <description>Default block replication. The actual number of replications can be specified when the file is created. The default is used if replication is not specified in create time.</description>
</property>
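Each of these files wraps its properties in a single <configuration> element. As an illustration of the resulting structure, this Python sketch builds the contents of such a file with the standard library. build_configuration is a hypothetical helper, not part of Hadoop, and it leaves out the <description> elements for brevity.

```python
import xml.etree.ElementTree as ET


def build_configuration(properties):
    """Return a <configuration> document with one <property> per name/value pair."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = str(value)
    return ET.tostring(root, encoding="unicode")


# The two core-site.xml properties from this walkthrough.
core_site = build_configuration({
    "hadoop.tmp.dir": "/app/hadoop/tmp",
    "fs.default.name": "hdfs://localhost:54310",
})
```

Printing core_site shows both properties nested inside one <configuration> element, which is the shape each file should have after the properties are pasted in.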
Now format the HDFS filesystem via the NameNode
hdusert@ubuntu:~$ /usr/local/hadoop/bin/hadoop namenode -format
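Finally, start the single-node cluster. First open up the permissions on the Hadoop directory
hdusert@ubuntu:~$ sudo chmod -R 777 /usr/local/hadoop
Then run Hadoop’s start-all.sh script
hdusert@ubuntu:~$ /usr/local/hadoop/bin/start-all.sh
This will start up a NameNode, DataNode, JobTracker and a TaskTracker on the machine.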
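One way to confirm that the daemons came up is the JDK’s jps tool, which lists running Java processes as a process ID followed by a class name. The Python sketch below parses that output and reports which of the four daemons are missing. The helper names are our own, and it assumes jps is on the PATH of the user running it.

```python
import subprocess

# The four daemons the single-node cluster is expected to start.
EXPECTED = {"NameNode", "DataNode", "JobTracker", "TaskTracker"}


def missing_daemons(jps_output):
    """Return the expected daemon names that do not appear in jps output."""
    running = set()
    for line in jps_output.splitlines():
        parts = line.split()
        if len(parts) >= 2:  # each useful line looks like "<pid> <class name>"
            running.add(parts[1])
    return EXPECTED - running


def check_cluster():
    """Run jps and report which expected daemons are not running."""
    result = subprocess.run(["jps"], capture_output=True, text=True, check=True)
    return missing_daemons(result.stdout)
```

check_cluster() returns an empty set when all four daemons are running; any names it does return are the ones to look for in the Hadoop logs.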
Secure big data analysis:
As tons of data are downloaded every day, the security challenges also increase manyfold. A few of them are listed below:
- Cloud storage: Cloud storage makes it possible to store large amounts of data, but it comes with privacy and security issues. Choose your cloud provider carefully and get the details of its security setup before purchase.
- Access control: This important security feature is often overlooked. It involves deciding not only which users get access to the data but also how much access each user gets.
- Sensitive data: When data is collected on such a massive scale, there is a chance that sensitive information gets mixed in and leaked.
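The last two points can be sketched in a few lines of Python: a check of how much access each user has, and masking of sensitive fields before a record is shared. The user names, access levels and field names are hypothetical; a real system would load them from its own policy store.

```python
# Hypothetical access levels and users, for illustration only.
ACCESS_LEVELS = {"none": 0, "read": 1, "write": 2}
USER_ACCESS = {"analyst": "read", "engineer": "write"}

# Hypothetical names of fields that must never leave the system unmasked.
SENSITIVE_FIELDS = {"email", "account_number"}


def can(user, action):
    """Return True if the user's access level covers the requested action."""
    granted = ACCESS_LEVELS[USER_ACCESS.get(user, "none")]
    return granted >= ACCESS_LEVELS[action]


def mask_sensitive(record):
    """Return a copy of the record with sensitive field values replaced."""
    return {
        key: ("***" if key in SENSITIVE_FIELDS else value)
        for key, value in record.items()
    }
```

Unknown users fall back to the "none" level, so access has to be granted explicitly rather than being open by default.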
Big data is a boon in so many ways: it can help an organization evaluate its performance, make corrections and optimize its delivery.
Organizations should also secure the data against breaches and other security vulnerabilities. When data is collected there should be real-time monitoring, and since the stored information runs into terabytes, proper measures should be in place to safeguard it. These are a few of the steps that can be taken for secure Big Data analysis.