What we are learning at ThoughtWorks Data University

Lorena de Souza
4 min read · Nov 19, 2019

Hello,
I’m participating in the ThoughtWorks Data University program in Bangalore, India, and I promised some people from the Belo Horizonte office that I would send an overview of what we are studying here. But as time went on, I couldn’t find enough time to share it. Because of that, I will do a different kind of article (short and fast, written during this week). It will be just questions that we’ve been trying to answer here; I will add some answers over time. Next time, I promise to detail each answer better and translate it to Portuguese. Let me know what you think and give some feedback. Every question below is about the fundamentals of the Data Engineering process we’ve been learning in this program so far.

What is Big Data? Is everything Big Data? When did it start?

What are the 3 V’s that became famous around Big Data?
Volume, velocity and variety.

Why did volume, velocity and variety become part of the definition of Big Data?

There are 4 areas that influence the Data Engineering process. What are they? DBA, BI, Development and Infrastructure.

As a Data Scientist, what should you care about in your daily work?

As a Data Engineer, what should we care about in our daily work?
Creating processes that are fault tolerant, scalable, resilient, maintainable, debuggable and performant.

How can we guarantee the resilience, fault tolerance and scalability of such a process?

Name and explain the 4 important layers of a Big Data system.
Infrastructure; Storage; Data Integration and Processing; Workflow Management and Orchestration.

Why are those 4 layers an important part of the Big Data process?

There are a lot of file formats, many more than we usually work with as software developers. Name some examples.
JSON, CSV, Parquet, Avro, ORC, etc.

Name some examples of column-oriented and row-oriented storage formats.
Parquet and Avro, respectively.
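
To make the distinction concrete, here is a toy sketch in plain Python (the data is invented) contrasting the two layouts:

```python
# Row-oriented: whole records live together, like Avro or a CSV line.
rows = [
    {"name": "Ana", "age": 28},
    {"name": "Lorena", "age": 30},
]

# Column-oriented: all values of one column live together, like Parquet.
columns = {
    "name": ["Ana", "Lorena"],
    "age": [28, 30],
}

# Reading one full record is natural in the row layout...
print(rows[0])
# ...while scanning a single column is natural in the column layout.
print(sum(columns["age"]) / len(columns["age"]))
```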

Which mechanism allows you to read the data in Parquet and Avro files?
The serialisation and deserialisation process.
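
As an illustration, here is a minimal serialise/deserialise round trip using the fastavro library (the library is my choice, and the schema and data are invented):

```python
import io
from fastavro import writer, reader

# An Avro schema describes the record structure up front.
schema = {
    "type": "record",
    "name": "User",
    "fields": [
        {"name": "name", "type": "string"},
        {"name": "age", "type": "int"},
    ],
}

records = [{"name": "Lorena", "age": 30}]

# Serialisation: records -> binary Avro bytes.
buffer = io.BytesIO()
writer(buffer, schema, records)

# Deserialisation: binary Avro bytes -> records again.
buffer.seek(0)
for record in reader(buffer):
    print(record)  # {'name': 'Lorena', 'age': 30}
```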

When you need to search some data, why is the Parquet file format faster than the Avro file format?
Parquet uses a column-oriented approach, so a query can read only the columns it needs instead of scanning whole records, which gives it better performance than a row-oriented approach.
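
For example, with the pyarrow library (and a hypothetical users.parquet file) you can read a single column without touching the rest of the file:

```python
import pyarrow.parquet as pq

# Because Parquet stores each column contiguously, the reader can pull
# just the columns a query needs and skip everything else on disk.
table = pq.read_table("users.parquet", columns=["age"])
print(table.to_pydict())
```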

What is column-oriented storage?

What is the main difference between the Avro and JSON file formats?
Even though both have a similar structure, Avro adds mechanisms such as binary serialisation, schema evolution and schema discovery. Also, an Avro file can be partitioned between machines, while a JSON file can’t.
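
Here is a quick sketch of schema evolution, again with fastavro (my choice of library; the schemas are invented): the reader’s newer schema adds a field with a default, so files written with the old schema stay readable:

```python
import io
from fastavro import writer, reader

# Version 1 of the schema, used when the file was written.
writer_schema = {
    "type": "record",
    "name": "User",
    "fields": [{"name": "name", "type": "string"}],
}

# Version 2 adds a field with a default, so old files remain readable.
reader_schema = {
    "type": "record",
    "name": "User",
    "fields": [
        {"name": "name", "type": "string"},
        {"name": "country", "type": "string", "default": "unknown"},
    ],
}

buffer = io.BytesIO()
writer(buffer, writer_schema, [{"name": "Lorena"}])

buffer.seek(0)
for record in reader(buffer, reader_schema):
    print(record)  # {'name': 'Lorena', 'country': 'unknown'}
```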

When should we use Parquet, and when Avro?
Avro is more appropriate for stream processing. On the other hand, Parquet is more efficient for batch processing. Why?

Name at least 3 important aspects of the infrastructure of a data process. Monitoring the data process, doing infrastructure as code and encrypting data in motion.

What is the Big Data Paradigm shift?

Which basic metrics are important for monitoring that process?
At least basic metrics like saturation, latency, traffic and errors.
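
As a sketch of how these four signals might be exposed, here is a minimal example with the prometheus_client Python library (the library choice and all metric names are my own, not from the course):

```python
from prometheus_client import Counter, Gauge, Histogram, start_http_server

# Traffic: how many records the pipeline handles.
records_total = Counter("pipeline_records_total", "Records processed")
# Errors: how many of them fail.
errors_total = Counter("pipeline_errors_total", "Failed records")
# Latency: how long each unit of work takes.
latency_seconds = Histogram("pipeline_latency_seconds", "Processing time")
# Saturation: how "full" the service is, e.g. queue depth.
queue_depth = Gauge("pipeline_queue_depth", "Items waiting in the queue")

@latency_seconds.time()
def process(record):
    records_total.inc()
    # ... real processing would happen here ...

if __name__ == "__main__":
    start_http_server(8000)  # exposes /metrics for Prometheus to scrape
    process({"id": 1})
```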

What does the saturation metric care about? And latency?
Saturation cares about how "full" your service is, and latency cares about the time it takes to service a request.

Why did Spark replace MapReduce?
Spark is available in many languages and has a powerful API that offers much more than map and reduce operations. It also keeps intermediate results in memory instead of writing them to disk between steps, and it can still connect to the Hadoop Distributed File System.
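
As a taste of that API, here is the canonical MapReduce example, a word count, in a few lines of PySpark (a sketch, assuming a local Spark installation and a hypothetical input.txt file):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("word-count").getOrCreate()

# The same job that takes pages of boilerplate in classic MapReduce:
counts = (
    spark.sparkContext.textFile("input.txt")
    .flatMap(lambda line: line.split())   # map phase: split into words
    .map(lambda word: (word, 1))          # emit (word, 1) pairs
    .reduceByKey(lambda a, b: a + b)      # reduce phase: sum per word
)
print(counts.collect())
spark.stop()
```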

What are schema evolution and schema discovery?

When should we use Hadoop in standalone mode?
When we want to test while we are still building the process.

In general, where do the transformation operations happen in the Hadoop architecture? And how do Hadoop and Spark connect?

What do you think we should do to make a data process really secure?
Encrypt data in motion, and check the regulations of the regions your data passes through and of the region where it lives.

What is the advantage of doing infrastructure as code?
It lets you grow and expand the infrastructure easily, and it expands the potential scope of changes.

What is Spark? What is a DAG in Spark?

What is the mechanism that generates a new stage in Spark?
Shuffle operations.

What is the shuffle process?
Shuffling is the process of redistributing data across partitions; it is the data transfer that happens between stages.
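
Here is a small PySpark sketch (the data is invented) where reduceByKey forces a shuffle, so Spark starts a new stage at that boundary:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("shuffle-demo").getOrCreate()
sc = spark.sparkContext

rdd = sc.parallelize([("a", 1), ("b", 2), ("a", 3)], numSlices=2)

# reduceByKey must bring all values for the same key to the same
# partition, so Spark shuffles the data and begins a new stage here.
totals = rdd.reduceByKey(lambda x, y: x + y)

print(totals.toDebugString().decode())  # shows the stage/shuffle boundary
print(totals.collect())                 # [('a', 4), ('b', 2)] (order may vary)
spark.stop()
```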

What are the two kinds of transformation operations in Spark?
Wide and narrow.

What is the difference between a narrow transformation and a wide transformation?
A narrow transformation doesn’t need data from another partition: input and output live in the same partition. Examples of narrow transformations: filter and map. A wide transformation, on the other hand, needs data from other partitions, so it usually has to shuffle the data (see the sketch below).
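
A hedged PySpark sketch of the contrast (data invented):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("narrow-vs-wide").getOrCreate()
sc = spark.sparkContext

numbers = sc.parallelize(range(10), numSlices=2)

# Narrow: each output partition depends on exactly one input partition,
# so filter and map run without moving any data between machines.
evens_doubled = numbers.filter(lambda n: n % 2 == 0).map(lambda n: n * 2)

# Wide: grouping needs values from every partition, so Spark
# shuffles the data across the cluster.
by_remainder = evens_doubled.groupBy(lambda n: n % 3)

print(evens_doubled.collect())
print([(k, sorted(v)) for k, v in by_remainder.collect()])
spark.stop()
```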

Name the 4 important aspects of Data Quality.
Consistency, timeliness, accuracy and completeness.
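
As an illustration of the completeness aspect, here is a sketch that counts missing values per column in a PySpark DataFrame (the column names and data are invented):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("completeness-check").getOrCreate()

df = spark.createDataFrame(
    [("Lorena", 30), ("Ana", None)], ["name", "age"]
)

# Completeness: how many values are missing in each column?
missing = df.select(
    [F.count(F.when(F.col(c).isNull(), c)).alias(c) for c in df.columns]
)
missing.show()  # name: 0, age: 1
spark.stop()
```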

What is the main purpose of Data Integrity?
Guarding against data corruption.

What is Data Locality related to?
It’s related to moving the computation to where the data lives, instead of moving the data to the computation.

I will be back with new answers and questions.

See ya.
