How to Perform Secure Data Analysis at Scale?


The history of data analysis dates back to ancient Egypt, more than 1000 years BC, and it is more relevant than ever in 2018. That longevity alone shows how important data analysis is and will continue to be.

Data analysis refers to the process of collecting raw data, analysing it and transforming it into information that can be used to reach a specific conclusion.

Let’s look at the various stages of data analysis, using an example where a student has to analyse whether private or government schools are better.

  • The first phase is setting the objective: you should know the reason behind collecting and analysing the data. In this case, our objective is to determine which schooling system is better.
  • Once the objective is set, the next phase can be divided into two parts: deciding what data has to be analysed, and deciding how it has to be analysed.
    For “what data has to be analysed”, in our example this would be the quality of education provided, the pass rate, basic subject knowledge grade-wise, and so on.
    For “how the data has to be analysed”, this covers parameters such as which benchmark to select for checking the quality of education, which questions to include to test basic subject knowledge, and so on.
  • With the objective and the parameters set, the next phase is to collect or gather all the available information relevant to them. Not all of it will turn out to be useful; in this phase, the aim is simply to collect as much data as possible.
  • In the final phase, all the collected data is segregated: unnecessary information is discarded and the useful parts are organised. The best way to do this is by asking the following questions:
    1. Does the collected data provide satisfactory answers to the question about schools?
    2. Will the collected data prove useful for defending against counter-questions?

Once all the stages are completed, you will have a well-organised data set that will help you find an answer to the question in our example.

How do you improve your data analysis skills on a daily basis?

A few basic habits can actually make a big difference. Following the habits below can have a huge impact on how you develop your skills:

  • Gather: Go through materials related to data analytics; there are various tutorials available online, so read books, watch videos, and so on.
  • Documentation: Once you gather information, it is important to retain it, and what’s better than making your own notes? They can be simple scribblings about the highlights of a particular topic in data analytics, and they will help you go back to ideas that arose during a tutorial.
  • Art of Application: Whatever information you have, it is important to apply it practically; try to come up with scenarios and find solutions to them.

One of the most important things that analysts overlook is the use of proxies.

Proxies can help in the following ways:

– Automating the process of extracting data without worrying about your IP getting blocked

– Spoofing location and getting geo-specific data
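
As a minimal sketch of how this fits into a data-collection workflow (the proxy address, credentials and target URL below are placeholders, not a real service), a request can be routed through a proxy with curl:

user@ubuntu:~$ # Route the request through a proxy so the target site sees the proxy's IP, not yours
user@ubuntu:~$ curl --proxy http://username:password@proxy.example.com:8080 "https://example.com/data?page=1" -o page1.html

Rotating across several such proxy endpoints spreads the traffic over many IPs to avoid blocks, and choosing a proxy located in a specific country returns the geo-specific version of a page.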

Big Data, as the name suggests, is a collection of very large data sets; the collection can consist of structured or unstructured data.
This data can help an organization gain insights that in turn help it take data-driven decisions, improving the overall progress of the company.

Earlier, Big Data was often characterized by 3 Vs, but the list has expanded and it can now be characterized by 6 Vs:

1. Volume: the amount of data collected from the different sources relevant to the organization.

2. Variety: information comes in a variety of formats, ranging from organized, conventional databases to unstructured content such as email, video and audio.

3. Velocity: the speed at which the data is collected.

4. Veracity: the level of authenticity of the collected data.

5. Value: once the data is collected and structured, the next important characteristic to measure is the value it holds for the organization.

6. Variability: the various ways in which the collected data can be used.

How does Big Data leave an impact?

The size or volume of the data collected is not what is important; what matters is how the collected information is utilized.

The collected data can be analyzed for answers about cost effectiveness, proper time management and data-driven decisions, and when Big Data and such analytics are combined, you can easily determine:

  • The exact reason for a system failure and the most pressing issues, almost in real time
  • The behavior of potential buyers, which helps improve or personalize the sign-up process
  • Fraud patterns which, with such a massive amount of data, can be detected and blocked before they deal a massive blow

 

Who uses Big Data?

1. Banking sector: When it comes to banking, a large amount of data flows in all the time. The banking sector usually uses this information to improve security and better the customer experience, but banks have to make use of powerful analytics tools to take real advantage of this data.

2. Schooling and overall education: The need for big data analysis is usually played down when it comes to education, but it can turn out to be a very crucial step towards improving the overall education system. Big data analysis can easily show key results related to students’ performance and the average passing percentage, and what other improvements can be made in the system.

3. Healthcare industry: This industry, or more rightly this “service”, benefits from its patients’ data: previous diagnoses, medication taken or currently in use, and other patient records. Analysing these data can prove crucial for effective treatment.

4. Manufacturing: When large volumes of data from the manufacturing industry are analysed, they help improve product quality and customer service, and help companies recognize market trends and what customers want.

5. Administration: An administration can be seen as a large body with many subdivisions, and the data generated by each department can be humongous. Analysing this data is essential to provide a better way of life to the people of a region, to understand their issues and to be able to implement solutions quickly.

What is Hadoop?

Hadoop is open-source software that allows Big Data to be processed across many computers using simple programming models.

The problems with Big Data and how Hadoop solves them

1. Storage: Nowadays the size of an organization’s Big Data grows exponentially, and it is not cost-effective to keep putting resources into ever-larger high-end storage servers.
Resolution: Hadoop uses HDFS, the Hadoop Distributed File System, which can store data across many ordinary machines and process the same data in parallel.

2. High variety of data: This creates an issue because the data received can be structured, unstructured or semi-structured, and it arrives in many different formats.
Resolution: This issue is again resolved by the Hadoop Distributed File System: there is no schema validation before dumping data, so whenever new data is put into HDFS there is no need to define a schema first (the example after this list shows this in action).

3. Computing power: The data that comes in nowadays is measured in terabytes, and processing it all on one machine takes a great amount of time. For example, if you have 2 TB of data and a machine that reads it at 1 GB/s, just reading it all takes about 2,000 seconds, i.e. around 34 minutes, and this increases significantly as the volume of data grows.

Resolution: Hadoop uses a cluster of computers for its functioning, so the processing of data runs in parallel across the machines, which significantly reduces the computing time.
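
As a minimal sketch of both resolutions in practice (assuming the single-node cluster configured below is running; the input files logs.txt and sales.csv are hypothetical), files of any format can be dumped into HDFS with no schema declared up front, and the stock WordCount job that ships with Hadoop 1.2.0 can then process them in parallel:

hduser@ubuntu:~$ hadoop fs -mkdir /user/hduser/input
hduser@ubuntu:~$ # No schema is defined before loading; structured and unstructured files go in as-is
hduser@ubuntu:~$ hadoop fs -put logs.txt sales.csv /user/hduser/input
hduser@ubuntu:~$ # The job is split into map and reduce tasks that run in parallel across the cluster
hduser@ubuntu:~$ hadoop jar /usr/local/hadoop/hadoop-examples-1.2.0.jar wordcount /user/hduser/input /user/hduser/output
hduser@ubuntu:~$ hadoop fs -cat /user/hduser/output/part*

On a single node the parallelism is limited to the local task slots, but the same commands run unchanged on a multi-machine cluster.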

Configuring Hadoop

Let’s dive a bit into the technical aspects of Hadoop configuration. The following are the steps for a single-node installation.

Prerequisites

– Hadoop needs a working Java 1.5+ (aka Java 5) installation; Java 6 is recommended and is what this guide installs.

user@ubuntu:~$ sudo apt-get update # Update the source list

Then install Java:

user@ubuntu:~$ sudo apt-get install sun-java6-jdk

Adding a dedicated Hadoop system user

user@ubuntu:~$ sudo addgroup hadoop_group
user@ubuntu:~$ sudo adduser --ingroup hadoop_group hduser

This will add the user hduser and the group hadoop_group to the local machine.

Add hduser to the sudo group

user@ubuntu:~$ sudo adduser hduser sudo

Configuring SSH

A key has to be generated for the hduser user

user@ubuntu:~$ su - hduser
hduser@ubuntu:~$ ssh-keygen -t rsa -P ""

NOTE: -P "" here indicates an empty password

This will create an RSA key pair with an empty password.

Now SSH access to your local machine has to be enabled with the newly created key, which is done with the following command:

hduser@ubuntu:~$ cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
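
Though not spelled out in the original steps, it is worth testing at this point that key-based SSH to the local machine works:

hduser@ubuntu:~$ ssh localhost # should log in without asking for a password
hduser@ubuntu:~$ exit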

Installation

The first step is to switch to hduser

user@ubuntu:~$ su - hduser

The next step is to download and extract Hadoop 1.2.0.
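
A typical sequence for this step might look as follows; this is a sketch in which the archive URL is an assumption (any Apache mirror carrying the 1.2.0 release will do) and the target directory matches the HADOOP_HOME used below:

hduser@ubuntu:~$ wget https://archive.apache.org/dist/hadoop/core/hadoop-1.2.0/hadoop-1.2.0.tar.gz
hduser@ubuntu:~$ tar -xzf hadoop-1.2.0.tar.gz
hduser@ubuntu:~$ sudo mv hadoop-1.2.0 /usr/local/hadoop
hduser@ubuntu:~$ sudo chown -R hduser:hadoop_group /usr/local/hadoop

Next, set up the environment variables for Hadoop: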

# Add these lines to $HOME/.bashrc so they persist across sessions
export HADOOP_HOME=/usr/local/hadoop
# Add Hadoop bin/ directory to PATH
export PATH=$PATH:$HADOOP_HOME/bin

Configuration

In the file conf/hadoop-env.sh, change the line

#export JAVA_HOME=/usr/lib/j2sdk1.5-sun

to point at your Java installation, removing the leading # so that it takes effect:

export JAVA_HOME=/usr/lib/jvm/java-6-openjdk-amd64 # for 64 bit
export JAVA_HOME=/usr/lib/jvm/java-6-openjdk-i386 # for 32 bit

(Keep only the line that matches your system.)

Create the directory and set the required ownerships and permissions

hduser@ubuntu:~$ sudo mkdir -p /app/hadoop/tmp

hduser@ubuntu:~$ sudo chown hduser:hadoop_group /app/hadoop/tmp

hduser@ubuntu:~$ sudo chmod 750 /app/hadoop/tmp

In the file conf/core-site.xml, paste the following between the <configuration> and </configuration> tags:

<property>
    <name>hadoop.tmp.dir</name>
    <value>/app/hadoop/tmp</value>
    <description>A base for other temporary directories.</description>
</property>

<property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:54310</value>
    <description>The name of the default file system.  A URI whose
    scheme and authority determine the FileSystem implementation.  The
    uri's scheme determines the config property (fs.SCHEME.impl) naming
    the FileSystem implementation class.  The uri's authority is used to
    determine the host, port, etc. for a filesystem.</description>
</property>

In the file conf/mapred-site.xml, again between the <configuration> tags:

<property>
    <name>mapred.job.tracker</name>
    <value>localhost:54311</value>
    <description>The host and port that the MapReduce job tracker runs
    at. If "local", then jobs are run in-process as a single map
    and reduce task.
    </description>
</property>

In the file conf/hdfs-site.xml, likewise between the <configuration> tags:

<property>
    <name>dfs.replication</name>
    <value>1</value>
    <description>Default block replication.
    The actual number of replications can be specified when the file is created.
    The default is used if replication is not specified in create time.
    </description>
</property>

Now format the HDFS filesystem via the NameNode:

hduser@ubuntu:~$ /usr/local/hadoop/bin/hadoop namenode -format

Finally, start the single-node cluster. First make sure hduser has the required permissions on the Hadoop directory:

hduser@ubuntu:~$ sudo chmod -R 777 /usr/local/hadoop # very permissive; chown -R hduser:hadoop_group is a tighter alternative

Run the following command

hduser@ubuntu:~$ /usr/local/hadoop/bin/start-all.sh

This will start up a NameNode, a DataNode, a JobTracker and a TaskTracker on the machine. You can check which Hadoop processes are running with jps:

hduser@ubuntu:/usr/local/hadoop$ jps

Secure Big Data analysis:

As tons of data are downloaded every day, the challenges related to security also increase manifold. A few of them are listed below:

Data in the Cloud

Cloud storage has made it easy to store a large amount of data, but it comes with privacy and security issues. You should be careful about whom you select as your cloud provider and get the details of their security setup before purchase.


Access Control

This is a very important security feature that is often overlooked. It involves not only deciding which users get access to the data, but also how much access each user is granted, as sketched below.
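
As a minimal sketch of what this looks like in the Hadoop setup above (the group and paths here are hypothetical), HDFS supports per-file owners, groups and permission bits much like a POSIX filesystem:

hduser@ubuntu:~$ # Give a hypothetical analysts group read access and lock everyone else out
hduser@ubuntu:~$ hadoop fs -chown -R hduser:analysts /user/hduser/reports
hduser@ubuntu:~$ hadoop fs -chmod -R 750 /user/hduser/reports
hduser@ubuntu:~$ hadoop fs -ls /user/hduser # verify the new owner, group and mode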

Data Protection

When data is collected on such a massive scale, there is a chance that sensitive information gets mixed in and leaked, so sensitive records should be protected, for example by encrypting them at rest; see the sketch below.
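
As a minimal sketch of one such measure (the filename is hypothetical, and interactive passphrase entry stands in for the proper key management a real deployment would need), a sensitive extract can be encrypted with OpenSSL before it is stored or loaded into HDFS:

hduser@ubuntu:~$ # Encrypt a sensitive extract with AES-256 before storing it; prompts for a passphrase
hduser@ubuntu:~$ openssl enc -aes-256-cbc -salt -in patients.csv -out patients.csv.enc
hduser@ubuntu:~$ # Decrypt it again only when an authorised analysis needs the plaintext
hduser@ubuntu:~$ openssl enc -d -aes-256-cbc -in patients.csv.enc -out patients.csv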

Conclusion:

Big Data is a boon in so many ways: it can help an organization evaluate its performance, make corrections and optimize its delivery.

Organizations should also secure the data against breaches and other security vulnerabilities. The collected data should be monitored in real time and, as the information stored runs into terabytes, proper measures should be in place to safeguard it. These are a few steps that can be undertaken for secure Big Data analysis.

About the author

Rachael Chapman

A complete gamer and a tech geek who brings out all her thoughts and love in writing techie blogs.
