The core theme is providing guidance and concrete project ideas for individuals struggling to start their data engineering projects, addressing common roadblocks like tool selection, data sourcing, and defining project goals.
what's going on guys welcome back to
another video with me ben rogojan aka
the seattle data guy today we're gonna
talk about the poll i ran recently on my
youtube community asking what problems
do you guys struggle with on your data
engineering projects most of you guys
just answered that you are struggling
with even starting your data engineering
project so let's talk about it first
let's talk about some of the reasons
people are struggling with starting data
engineering projects personally for me i
think one of the major causes was often
finding the right data sets another
common issue is the right tools you know
which tools should you be using and
another problem is well why are you even
doing this like what are you going to do
after maybe even ingesting the data
so let's start there and start trying to
figure out some of these problems let's
start with figuring out what tools you
should be using and in order to answer
this i think the best way to do it and
i'm going to kind of steal this from thu vu's
data analytics project video that she
recently put out why not just look at a
job description and figure out what
tools people are asking for so in this
case we're going to pull up a
smartsheet data engineering job
description and look at it and see well
what are they exactly asking for and
what skills can you deliver so just
scrolling through their data engineering
job description what you'll see is they
one have some sort of need for aws or
cloud computing and this is often why i
reference in terms of skill that cloud
should be learned early because at this
point most of our work is not on-prem it
really is in the cloud so you're going
to likely want to use something like gcp
aws azure whichever tool you prefer it's
not that important it's just important
that you pick one of these because i
think people are starting to understand
that more than likely if you've used one
you can use another after that you'll
see two of my personal favorites which
are airflow and snowflake so you will
likely want to use those tools as well
again you can also use bigquery or if
you want to be a little more maybe
contrarian or a little more out there
you can try dagster or prefect those are
also popular options at least according
to this twitter poll airflow still won
on the youtube poll but hey you know it
seems like there's a lot of different
tools that you guys can try out
personally i'd say just stick with
airflow to start out with you can look
at dagster and prefect maybe more as you
kind of get along there are a ton more
tools listed on this job description but
i personally wouldn't worry about it too
much let's just start with these few
tools because honestly we're more
focused on ingestion storage and then
somehow displaying everything you've
done you know visualization developing
an api
because honestly even these three things
can take a lot of time i know i've got a
video series that's only about two parts
because i've never finished my data
engineering project video series
thank you all for being patient with
that as a quick update for that more
than likely what i'm going to do is
completely do a different project and
actually finish it first and then record
it rather than what i was trying to do
back then which was working full time at
facebook consulting making content and
trying to finish a data engineering
project so now you kind of have a good
idea of your tools pick whichever cloud
provider you prefer i kind of like aws
or gcp they're just easy to spin up
pick some sort of tool to manage a lot
of the orchestration and unavoidably you
might end up picking this tool to kind
of do your etl although airflow
itself is technically an orchestration
tool i do often see it being used as
everything you know people will just
create um you know either python
operators or will use some of the
operators that already exist to do the
elt or etl
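
Here is a minimal sketch of what "airflow as everything" tends to look like in practice: three plain Python functions for extract, transform and load. The record shape and function names are made up for illustration, and the Airflow wiring is left out so this runs with only the standard library. In a real DAG, each function would typically be handed to a PythonOperator through its python_callable argument.

```python
import csv
import io


def extract():
    """Stand-in for an API call; returns raw records (hypothetical shape)."""
    return [
        {"flight": "ab123", "origin": "sea", "dest": "sfo"},
        {"flight": "cd456", "origin": "sea", "dest": "jfk"},
    ]


def transform(records):
    """Keep only the fields we care about and normalise them to upper case."""
    return [
        {"flight": r["flight"].upper(), "dest": r["dest"].upper()}
        for r in records
    ]


def load(rows):
    """Render rows as CSV text; a real task would copy this into the warehouse."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=["flight", "dest"])
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


def run_pipeline():
    """Run the three steps in order, the way an orchestrator would chain them."""
    return load(transform(extract()))
```

Keeping each step as its own function is a design choice that makes it easy to test the logic locally before it ever touches an orchestrator.
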
obviously there's a ton of other tools
you can do that with now you can throw
in airbyte or dbt or a whole host of
other things
but let's again just focus on this as a
basic project let's not get too crazy
because we want to actually deliver this
and if you have to learn uh six tools
just to do your first project you're
gonna have a hard time just finishing it so
let's focus on delivering this project
so we've got airflow for orchestration
and kind of again some ingestion in a
way we've got snowflake or bigquery
for storage and for your final product
we'll talk about that later first let's
find the most important thing or at
least some of the things that stop a lot
of people which is your data set data
sets are hard because there are data
sets that kind of come in all different
shapes and forms
i think it's important to understand
that first of all you are not a data
scientist meaning using a pre-clean
pre-processed data set is not your goal
your goal is likely to figure out how to
develop a pipeline so you need to figure
out how to set one up and actually pull
data from a raw source
and here's some great raw sources that
you can pull from one is flight radar 24
which will give you kind of all the
information about where flights are
currently their latitude longitude
and a lot of just other interesting
information and another one is
spacetrack.org which again i'm showing
here so you guys can go to it and that
will show you information about where
satellites are in space some other great
data sets that i'll talk about later um
include the new york times movie review
data set also predictit which i've
talked about before they both have great
data sets but we're gonna talk about
that at the end of this video we'll give
you some project ideas the important
thing when you pick these data sets
personally is that likely you find
something that has some form of api this
is because a lot of our work still
involves pulling data
via some sort of code or api yes you're
gonna do plenty of work where you just
extract data from csvs but more than
likely you're gonna have to pull that
data from an api into a csv or parquet file
load it somehow into your data warehouse
and again now do the last part we're
gonna talk about which is figuring out
what in the world you're actually doing
with this data
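
The "pull from an api into a csv" step could be sketched roughly like this. It assumes the API returns a JSON list of flat records; the URL, field names and file path are whatever your chosen source uses, so treat all of them as placeholders. Only the standard library is used.

```python
import csv
import json
import urllib.request


def fetch_json(url):
    """GET a URL and decode the JSON body (the URL is a placeholder for your API)."""
    with urllib.request.urlopen(url, timeout=30) as resp:
        return json.loads(resp.read().decode("utf-8"))


def write_csv(records, path, fieldnames):
    """Write a list of dicts to a CSV file.

    Fields not listed in fieldnames are ignored, and missing fields are
    written as empty strings, so slightly messy API records still load.
    """
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames, extrasaction="ignore")
        writer.writeheader()
        writer.writerows(records)
    return len(records)
```

From there the file can be staged and loaded into the warehouse with whatever bulk load command your warehouse provides.
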
now the easiest thing to do is create
some sort of basic visualization you
know with the flight radar 24 you could
do something as simple as you know how
many flights uh exist or happen in a day
and then you can maybe cut that or
segment that by you know where they're
going or where they're landing you could
segment it by flight maybe time or uh
length of travel or distance of travel
there's a couple different things you
could do there you know just figure out
some very basic metrics again it could
be as simple as count and just show that
on a graph you know even doing baseline
things like that just get you
comfortable with figuring out your why
like what are you actually doing all
this work for personally you should
answer this question prior to doing any
of the actual work but for some of us we
just kind of do this work in a stream
answering the why earlier is great
practice because then when you're
actually in industry you get used to
asking your analysts your data
scientists whoever your stakeholders are
why are we building this dashboard why
are we building this metric why are we
building you know this model etc all of
that needs to happen because then you
know if your pipeline's worth building
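
The basic flight metrics mentioned above (flights per day, then the same count segmented by destination) could be sketched like this. The record fields "date" and "dest" are assumptions for illustration, not the actual shape of any flight data source.

```python
from collections import Counter


def flights_per_day(flights):
    """Count how many flights happened on each date."""
    return Counter(f["date"] for f in flights)


def flights_per_day_by_destination(flights):
    """Same count, segmented by (date, destination)."""
    return Counter((f["date"], f["dest"]) for f in flights)


# Hypothetical sample records, standing in for rows pulled from the warehouse.
SAMPLE_FLIGHTS = [
    {"date": "2022-03-01", "dest": "SFO"},
    {"date": "2022-03-01", "dest": "JFK"},
    {"date": "2022-03-01", "dest": "SFO"},
    {"date": "2022-03-02", "dest": "JFK"},
]
```

The resulting counts are already in a shape most charting tools can plot directly as a bar or line graph.
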
but let's look at a real project that
someone did that actually got in the
news using flight data which was if you
haven't heard of it recently a cs student
decided to pull data about flights and
specifically started to pull data about
billionaires flights and where they were
and started posting them
on twitter now obviously there's some
questionable security issues here but it
was an interesting use case i think this
is a fascinating way of going beyond
just a dashboard you know this person
decided hey people are interested in
knowing where elon musk is
why not just tweet about it whenever his
jet lands or wherever his jet is
obviously personally terrifying if i was
a billionaire because you know that
definitely puts a target on your back but
just an interesting use case in terms of
how you can use this data so you can do
the standard uh dashboard or if you want
to have more fun personally use d3.js
that's still something i occasionally
like to play around with or if you can
find something more practical one great
example is again this flight twitter bot
another kind of fun example that we saw
uh thanks to women's day was the pay gap
app where this person essentially
created a bot to track down
organizations talking positively about
women's day you know posting about it
and used uh public data to post how much
they actually had a pay gap
despite their pro-women stance again i
think it's just a fascinating use case
it's something that got a lot of
traction now it's got 256 uh twitter
followers only after one year of
existing which i just think shows the
interest that people have in tools like this
there are a lot of fun ways you can use
data and i think one way is just
creating a bot on twitter that kind of
shares some of this information so don't
think you're limited to dashboards you
can create again api endpoints that
maybe create some sort of model you can
create a dashboard or you can create
some sort of twitter bot that calls
companies out for their unfair wage gap
now that's enough high level you know
how would you do a project let's
actually dig down into ideas of projects
you can do
because that's what a lot of us need we
just sometimes need to hear concrete
ideas to get started and i'm gonna break
this project into three different ideas
you know beginner mid and more advanced
now for the beginner project i'd say go
look to use cloud composer or mwaa
because those
tools let you avoid having to set up
airflow from scratch and so you can
focus more on learning how to use
airflow and less on just setting up
airflow and then you can say you know
you've used gcp and aws so again two
birds one stone from there you can use
an api like predictit's api that returns
data back in xml kind of scrape that
down and maybe start trying to look for
possibly interesting trades i'd look for
trades that have maybe massive swings
maybe if you could create some sort of
model where you feel like
there is some sort of edge in certain uh
swings or maybe like the craziest swings
the biggest swings day over day and then
you could start reporting on that it could
be like hey you know the question which
party will take the house in 2022
there's just been a massive swing you
know in the last 24 hours and you can
post about it and maybe after that if
you want to get even more creative in
whatever display tool you use you know
flask twitter etc
you could track some articles down that
maybe you could discuss why now that
that's a little more tricky but again
part of this could just be you creating
a twitter bot that posts out this
information that says hey here were the
most massive swings in the last day that
happened on predictit and i can see
people finding that valuable again this
is very simple
you're using cloud composer to just
ingest this data you're storing it in
something like bigquery or snowflake
and then after that you create some sort
of extra layer on top of that that posts
outputs to twitter again probably just
via airflow but i think it's kind of fun
and very basic because again you're not
having to do too much in terms of like
setting up your infrastructure and
you're focusing more on the actual work
itself the next project is inspired by
start data engineering if you haven't
gone to that blog it's great has a lot
of great ideas on projects one of their
project ideas is on movie reviews and it
seems like they were just referencing
any kind of just bland csv about movie reviews
but instead what i'd recommend is you go
check out the new york times developers
portal and go pull movie reviews live
from their site via their api and then
kind of just use their project that
they've provided as a framework what's
great about these projects is they
really kind of broke it down like a
recipe you can just kind of look through
these slides and see exactly what you're
going to need they essentially tell you
the prerequisites you're going to need
everything from the fact you're going to
need docker
aws account and so forth um they're
going to kind of show you how to set up
apache airflow from scratch rather than
again having it preset up so i think
that's kind of great they're going to
have you set up several tools including
aws s3 aws redshift um you know iam etc
just a lot of different important
components that again when you're trying
to show off to uh future hiring managers
this will show a lot of great skill and
so that's what i think i really liked
about this project is there's so much
that they show here they show you
exactly some of the ideas in terms of
tasks you could build in airflow and
even some of the ways you could write
queries so again this is really a
full-blown project i will say they don't
really talk about data visualization or
anything of that nature so you will need
to kind of consider how are you going to
display this data before starting this
project maybe again you can either pick
a very easy tool like tableau to use or
d3.js and just do some basic
visualizations with this tool this is a
little more straightforward i think as a
project because most of these tools are
very well supported have a lot of
community behind them
and that's what i think makes it
more mid-level again there's a lot of
components but it will be something that
if you've got some experience with
airflow and a little experience with aws
you should be able to do it pretty
easily now finally for people who are
more advanced we're going to go like the
complete almost 100% open source route in
tools for this example i'd
recommend going to sspaeti's website they
do a lot of great projects in general in
this project they end up using uh
everything from s3 spark delta lake uh
some data science stuff
druid dagster it's pretty much just an
amalgamation of every open source fancy
tool you can think of they say it's
building it in 20 minutes i don't know
if that's true
i think it's going to take you longer
than 20 minutes but essentially with
this project you're going to scrape real
estate data from the actual site itself so
part of it's going to be api but some of
it's going to kind of be based off the
site and you'll be scraping a ton of
information cleaning up html again
something else that i think is very
important uh just a great practice
you're gonna be implementing you know
change data capture and then you can do
some data science work uh some basic uh
visualizations with superset it's it's
honestly a full-blown project this is
going to take you a ton of time i think
it takes more than 20 minutes personally
i mean i guess if you copied it you know
word for word then maybe it's going to
take you 20 minutes but i imagine you
don't want to just copy this that would
be a bad idea i'd say try it for
yourself use this as a framework or an
idea and kind of push from there you
know i i look at these as challenges
more than anything else like if you're
going to build this you should just be
like i want to use you know
dagster druid superset you know pick
maybe two or three more tools and then
just try it it doesn't have to look
exactly like this person's in the end it
just needs to be close so hopefully if
you're out there and you're struggling
to start a project this honestly just
helps you do it because that's what
needs to happen
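
One piece of the advanced project mentioned above, change data capture, can be approximated by comparing two snapshots of scraped listings. This is only a sketch of the snapshot-diff approach, not necessarily how the referenced project does it, and the listing ids and fields are hypothetical.

```python
def diff_snapshots(previous, current):
    """Compare two snapshots keyed by listing id and classify the changes.

    Both arguments are dicts of {listing_id: record}. Returns the rows that
    are new, the rows whose content changed, and the rows that disappeared.
    """
    inserts = {k: v for k, v in current.items() if k not in previous}
    deletes = {k: v for k, v in previous.items() if k not in current}
    updates = {
        k: v
        for k, v in current.items()
        if k in previous and previous[k] != v
    }
    return {"insert": inserts, "update": updates, "delete": deletes}
```

Only the inserts, updates and deletes then need to be written downstream, rather than reloading every scraped listing on each run.
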
the biggest thing that holds most of us
back is a combination of ambiguity and a
lack of commitment i know sometimes i
fail to commit on ideas and then i just
never execute and so that's what's
important if you actually want to do
this project you need to hold yourself accountable
you need to be the one that says like
i'm going to do this project pick some
tools pick some data sets and just set
that end goal even if it's as simple as a
twitter bot that just posts the total
number of movie reviews that occur in a
week something that simple can still
just be a fun example or a great
starting point for you to iterate off of
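
That simplest version, a bot that posts the number of movie reviews in a week, could start from something like this. The "publication_date" field name and the wording of the post are assumptions for illustration; the actual posting call is left out.

```python
from datetime import date, timedelta


def reviews_in_week(reviews, week_end):
    """Count reviews published in the 7 days ending on week_end (inclusive)."""
    week_start = week_end - timedelta(days=6)
    return sum(
        1
        for r in reviews
        if week_start <= date.fromisoformat(r["publication_date"]) <= week_end
    )


def weekly_post(reviews, week_end):
    """Build the text for the weekly post."""
    n = reviews_in_week(reviews, week_end)
    return f"{n} movie reviews were published in the week ending {week_end.isoformat()}"
```

Once this works, it is easy to iterate: segment by critic, compare to the previous week, and so on.
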
thank you guys so much for watching this
video this week i will see you guys next