Daniel Chen: Cleaning and Tidying Data in Pandas | PyData DC 2018

PyData DC 2018

Most of your time is going to involve processing/cleaning/munging data. How do you know your data is clean? Sometimes you know what you need beforehand, but other times you don't. We'll cover the basics of looking at your data and getting started with the Pandas Python library, and then focus on how to "tidy" and reshape data. We'll finish with applying customized processing functions on our data.
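As a rough illustration of the workflow described above (look at the data, tidy it, then apply a custom function), here is a minimal sketch. The table, its values, and the helper `is_low_income` are made up for illustration and are not taken from the talk's datasets.

```python
import pandas as pd

# Hypothetical wide table: the income brackets in the header are values,
# not variable names. All numbers here are invented.
wide = pd.DataFrame({
    "religion": ["A", "B"],
    "<$10k": [1, 2],
    "$10-20k": [3, 4],
})

# Look at the data before reshaping it.
print(wide.head())
print(wide.shape)

# Tidy: one row per (religion, income) observation.
tidy = pd.melt(wide, id_vars="religion", var_name="income", value_name="count")


def is_low_income(bracket: str) -> bool:
    """Hypothetical custom function: flag the lowest income bracket."""
    return bracket.startswith("<")


# Apply the custom function to every value in one column.
tidy["low_income"] = tidy["income"].apply(is_low_income)
print(tidy)
```

`pd.melt()` keeps the `id_vars` columns fixed and stacks every other column into a name/value pair, which is why the two income columns become four rows.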
===
www.pydata.org

PyData is an educational program of NumFOCUS, a 501(c)3 non-profit organization in the United States. PyData provides a forum for the international community of users and developers of data analysis tools to share ideas and learn from each other. The global PyData network promotes discussion of best practices, new approaches, and emerging technologies for data management, processing, analytics, and visualization. PyData communities approach data science using many languages, including (but not limited to) Python, Julia, and R. 

PyData conferences aim to be accessible and community-driven, with novice to advanced level presentations. PyData tutorials and talks bring attendees the latest project features along with cutting-edge use cases. 

00 Introduction
18 Setup: Github Repo, Jupyter Setup
35 Loading Datasets - pd.read_csv()
43 Dataset / Dataframe At A Glance
53 Get First Rows: df.head()
58 Get Columns: df.columns
15 Get Index: df.index
37 Get Body: df.values
46 Get Shape: df.shape
04 Get Summary Information: df.info()
12 Filtering, Slicing a Dataset / Dataframe
25 Extract a Single Column: df['col_name']
12 Dataframe vs Series
41 Extract N Columns: df[['col1_name', 'col2_name']]
51 Pandas Version: pd.__version__
26 Extract Rows: df.iloc
30 Extract Rows: df.loc vs df.iloc vs df.ix
45 Extract Rows: df.iloc
37 Extract Rows: df.ix - Deprecated
38 Extract Multiple Rows and Columns
00 Extract Rows using Boolean Subsetting
24 Extract Rows using Multiple Boolean Subsetting
55 Cleaning a Dataset / Dataframe
38 General Issues according to a "Tidy Data" Research Paper
45 Issue 1: Column Headers are Values and not Variable Names
19 Load Pew Dataset
55 Transform Columns into Rows: pd.melt()
59 Load Billboard Dataset
05 Transform Columns into Rows: pd.melt()
00 Issue 2: Multiple Variables are Stored in 1 Column
06 Load Ebola Dataset
22 Transform Columns into Rows: pd.melt()
14 Split Column using String Manipulation through Accessors
19 Extract Column / Series from Accessor Split: accessor.get()
13 Add Column to Dataframe
13 Contracted Form for pd.melt() and Accessor String Manipulation: pd.merge()
10 Issue 3: Variables Stored in Rows And Columns
25 Load Weather Dataset
30 Transform Columns into Rows: pd.melt()
1:00 Transform Rows into Columns
2:00 Transform Rows into Columns: pd.pivot() vs pd.pivot_table()
4:30 Transform Rows into Columns: pd.pivot_table()
6:19 Flatten nested / hierarchical table: df.reset_index()
7:42 Issue 4: Multiple Types of Observational Unit in Same Table (i.e., Denormalized Table)
9:43 Extract Type Observational Unit in new Dataframe, Drop Duplicates
11:30 Create "key" for extracted observational unit dataframe
12:11 Save new dataframe: df.to_csv()
13:22 Merge / Join dataframe on common columns
16:25 Randomly Sample a dataframe
17:15 Note on Memory Consumption between all 3 dataframes
18:25 Summary from "Tidy Data" Research Paper
20:06 Q&A
21:21 Q&A 1: Simulating R's Chaining in Python
24:49 Q&A 2: Best Practices on Bracket Notation vs Chaining
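
The reshaping steps listed in the outline (melt, split through the string accessor, pivot, reset the index) can be sketched roughly as follows. This is a minimal sketch under assumptions: the table only imitates the style of the Ebola example, with invented numbers and column names, and it is not the code shown in the talk.

```python
import pandas as pd

# Hypothetical data: each column name packs two variables
# ("status_country") into a single label. All numbers are invented.
ebola = pd.DataFrame({
    "Date": ["1/5/2015", "1/4/2015"],
    "Cases_Guinea": [10, 8],
    "Deaths_Guinea": [4, 3],
})

# Columns into rows.
ebola_long = pd.melt(ebola, id_vars="Date", var_name="variable", value_name="count")

# Split the combined label through the .str accessor, then pull out each piece.
parts = ebola_long["variable"].str.split("_")
ebola_long["status"] = parts.str.get(0)
ebola_long["country"] = parts.str.get(1)

# Rows back into columns: one column per status value.
# pivot_table aggregates (mean by default), so the counts come back as floats.
ebola_wide = ebola_long.pivot_table(
    index=["Date", "country"], columns="status", values="count"
).reset_index()  # flatten the hierarchical index into ordinary columns

# Drop the leftover "status" label on the column axis.
ebola_wide.columns.name = None
print(ebola_wide)
```

`pivot_table()` is used here rather than `pivot()` because it tolerates duplicate index/column pairs by aggregating them; with unique pairs, as in this toy table, either one would produce the same layout.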

Huge s/o to https://github.com/KMurphs for the video timestamps!

Want to help add timestamps to our YouTube videos to help with discoverability? Find out more here: https://github.com/numfocus/YouTubeVideoTimestamps
