0
0
0
1. 云栖社区>
2. 博客>
3. 正文

## 数据科学的原理与技巧 三、处理表格数据

apachecn_飞龙 2018-06-05 17:33:05 浏览1001

# 三、处理表格数据

## 索引、切片和排序

### 起步

``````# pd is a common shorthand for pandas
import pandas as pd``````

``````baby = pd.read_csv('babynames.csv')
baby``````
Name Sex Count Year
0 Mary F 9217
1 Anna F 3860
2 Emma F 2587
1891891 Verna M 5
1891892 Winnie M 5
1891893 Winthrop M 5

1891894 行 × 4 列

``````ls
# babynames.csv                  indexes_slicing_sorting.ipynb``````

`DataFrame`的标签称为`DataFrame`的索引，并使许多数据操作更容易。

### 索引、切片和排序

2016 年的五个最受欢迎的婴儿名字是？

#### 拆分问题

• 分割出 2016 年的行。
• 按照计数对行降序排序。

#### 使用`.loc`切片

``baby``
Name Sex Count Year
0 Mary F 9217
1 Anna F 3860
2 Emma F 2587
1891891 Verna M 5
1891892 Winnie M 5
1891893 Winthrop M 5

1891894 行 × 4 列

``````baby.loc[1, 'Name'] # Row labeled 1, Column labeled 'Name'
# 'Anna'``````

``````# Get rows 1 through 5, columns Name through Count inclusive
baby.loc[1:5, 'Name':'Count']``````
Name Sex Count
1 Anna F
2 Emma F
3 Elizabeth F
4 Minnie F
5 Margaret F

``````baby.loc[:, 'Year']
'''
0          1884
1          1884
2          1884
...
1891891    1883
1891892    1883
1891893    1883
Name: Year, Length: 1891894, dtype: int64
'''``````

``````baby.loc[:, 'Year'] * 2
'''
0          3768
1          3768
2          3768
...
1891891    3766
1891892    3766
1891893    3766
Name: Year, Length: 1891894, dtype: int64
'''``````

``````# This is a DataFrame again
baby.loc[:, ['Name', 'Year']]``````
Name Year
0 Mary
1 Anna
2 Emma
1891891 Verna
1891892 Winnie
1891893 Winthrop

1891894 行 × 2 列

``````# Shorthand for baby.loc[:, 'Name']
baby['Name']
'''
0              Mary
1              Anna
2              Emma
...
1891891       Verna
1891892      Winnie
1891893    Winthrop
Name: Name, Length: 1891894, dtype: object
'''``````
``````# Shorthand for baby.loc[:, ['Name', 'Count']]
baby[['Name', 'Count']]``````
Name Count
0 Mary
1 Anna
2 Emma
1891891 Verna
1891892 Winnie
1891893 Winthrop

1891894 行 × 2 列

``````# Series of years
baby['Year']
'''
0          1884
1          1884
2          1884
...
1891891    1883
1891892    1883
1891893    1883
Name: Year, Length: 1891894, dtype: int64
'''``````
``````# Compare each year with 2016
baby['Year'] == 2016
'''
0          False
1          False
2          False
...
1891891    False
1891892    False
1891893    False
Name: Year, Length: 1891894, dtype: bool
'''``````

``````# We are slicing rows, so the boolean Series goes in the first
# argument to .loc
baby_2016 = baby.loc[baby['Year'] == 2016, :]
baby_2016``````
Name Sex Count Year
1850880 Emma F 19414
1850881 Olivia F 19246
1850882 Ava F 16237
1883745 Zyahir M 5
1883746 Zyel M 5
1883747 Zylyn M 5

32868 行 × 4 列

### 对行排序

``````sorted_2016 = baby_2016.sort_values('Count', ascending=False)
sorted_2016``````
Name Sex Count Year
1850880 Emma F 19414
1850881 Olivia F 19246
1869637 Noah M 19015
1868752 Mikaelyn F 5
1868751 Miette F 5
1883747 Zylyn M 5

32868 行 × 4 列

``````# Get the value in the zeroth row, zeroth column
sorted_2016.iloc[0, 0]
# Get the first five rows
sorted_2016.iloc[0:5]``````
Name Sex Count Year
1850880 Emma F 19414
1850881 Olivia F 19246
1869637 Noah M 19015
1869638 Liam M 18138
1850882 Ava F 16237

## 分组和透视

``````baby = pd.read_csv('babynames.csv')
# the .head() method outputs the first five rows of the DataFrame``````
Name Sex Count Year
0 Mary F 9217
1 Anna F 3860
2 Emma F 2587
3 Elizabeth F 2549
4 Minnie F 2243

### 拆分问题

• `baby`表按`'Year'``'Sex'`分组。
• 对于每一组，计算最流行的名称。

• 遍历每个特定的年份。
• 对于每一年，遍历每个特定的性别。
• 对于每一个特定年份和性别，找到最常见的名字。

#### 分组

``````baby.groupby('Year')
# <pandas.core.groupby.DataFrameGroupBy object at 0x1a14e21f60>``````

`.groupby()`返回一个奇怪的`DataFrameGroupBy`对象。 我们可以使用聚合函数，在该对象上调用`.agg()`来获得熟悉的输出：

``````# The aggregation function takes in a series of values for each group
# and outputs a single value
def length(series):
return len(series)

# Count up number of values for each year. This is equivalent to
# counting the number of rows where each year appears.
baby.groupby('Year').agg(length)``````
Name Sex Count
Year
1880 2000 2000
1881 1935 1935
1882 2127 2127
2014 33206 33206
2015 33063 33063
2016 32868 32868

137 行 × 3 列

``baby.groupby('Year').agg(len)``
Name Sex Count
Year
1880 2000 2000
1881 1935 1935
1882 2127 2127
2014 33206 33206
2015 33063 33063
2016 32868 32868

137 行 × 3 列

``````year_rows = baby[['Year', 'Count']].groupby('Year').agg(len)
year_rows

# A further shorthand to accomplish the same result:
#
# year_counts = baby[['Year', 'Count']].groupby('Year').count()
#
# pandas has shorthands for common aggregation functions, including
# count, sum, and mean.``````
Count
Year
1880
1881
1882
2014
2015
2016

137 行 × 1 列

``````# Every twentieth year starting at 1880
year_rows.loc[1880:2016:20, :]``````
Count
Year
1880
1900
1920
1940
1960
1980
2000

### 多个列的分组

``````grouped_counts = baby.groupby(['Year', 'Sex']).sum()
grouped_counts``````
Count
Year Sex
1880 F
M 110491
1881 F
2015 M
2016 F
M 1880674

274 行 × 1 列

``````# The most popular name is simply the first one that appears in the series
def most_popular(series):
return series.iloc[0]

baby_pop = baby.groupby(['Year', 'Sex']).agg(most_popular)
baby_pop``````
Name Count
Year Sex
1880 F Mary
M John 9655
1881 F Mary
2015 M Noah
2016 F Emma
M Noah 19015

274 行 × 2 列

``````baby_pop.loc[(2000, 'F'), 'Name']
# 'Emily'``````

`.iloc`的行为与往常一样，因为它使用索引而不是标签：

``baby_pop.iloc[10:15, :]``
Name Count
Year Sex
1885 F Mary
M John 8756
1886 F Mary
M John 9026
1887 F Mary

#### 透视

``````pd.pivot_table(baby,
index='Year',         # Index for rows
columns='Sex',        # Columns
values='Name',        # Values in table
aggfunc=most_popular) # Aggregation function``````
Sex F M
Year
1880 Mary John
1881 Mary John
1882 Mary John
2014 Emma Noah
2015 Emma Noah
2016 Emma Noah

137 行 × 2 列

``baby_pop``
Name Count
Year Sex
1880 F Mary
M John 9655
1881 F Mary
2015 M Noah
2016 F Emma
M Noah 19015

274 行 × 2 列

## 应用、字符串和绘图

``````baby = pd.read_csv('babynames.csv')
# the .head() method outputs the first five rows of the DataFrame``````
Name Sex Count Year
0 Mary F 9217
1 Anna F 3860
2 Emma F 2587
3 Elizabeth F 2549
4 Minnie F 2243

### 拆解问题

• 计算每个名称的最后一个字母。
• 按照最后一个字母和性别分组，使用计数来聚合。
• 绘制每个性别和字母的计数。

### 应用

`pandas`序列包含`.apply()`方法，它接受一个函数并将其应用于序列中的每个值。

``````names = baby['Name']
names.apply(len)
'''
0          4
1          4
2          4
..
1891891    5
1891892    6
1891893    8
Name: Name, Length: 1891894, dtype: int64
'''``````

``````def last_letter(string):
return string[-1]

names.apply(last_letter)
'''
0          y
1          a
2          a
..
1891891    a
1891892    e
1891893    p
Name: Name, Length: 1891894, dtype: object
'''``````

### 字符串操作

`pandas`通过序列的`.str`属性，提供字符串操作函数。

``````names = baby['Name']
names.str.len()
'''
0          4
1          4
2          4
..
1891891    5
1891892    6
1891893    8
Name: Name, Length: 1891894, dtype: int64
'''``````

``````names.str[-1]
'''
0          y
1          a
2          a
..
1891891    a
1891892    e
1891893    p
Name: Name, Length: 1891894, dtype: object
'''``````

``````baby['Last'] = names.str[-1]
baby``````
Name Sex Count Year Last
0 Mary F 9217 1884
1 Anna F 3860 1884
2 Emma F 2587 1884
1891891 Verna M 5 1883
1891892 Winnie M 5 1883
1891893 Winthrop M 5 1883

1891894 行 × 5 列

### 分组

``````# Shorthand for baby.groupby(['Last', 'Sex']).agg(np.sum)
baby.groupby(['Last', 'Sex']).sum()``````
Count Year
Last Sex
a F 58079486
M 1931630 53566324
b F 17376
y M 18569388
z F 142023
M 120123 9649274

52 行 × 2 列

``````# When lines get long, you can wrap the entire expression in parentheses
# and insert newlines before each method call
letter_dist = (
baby[['Last', 'Sex', 'Count']]
.groupby(['Last', 'Sex'])
.sum()
)
letter_dist``````
Count
Last Sex
a F
M 1931630
b F
y M
z F
M 120123

52 行 × 1 列

### 绘图

`pandas`为大多数基本绘图提供了内置的绘图函数，包括条形图，直方图，折线图和散点图。 为了从`DataFrame`中绘制图形，请使用`.plot`属性：

``````# We use the figsize option to make the plot larger
letter_dist.plot.barh(figsize=(10, 10))
# <matplotlib.axes._subplots.AxesSubplot at 0x1a17af4780>``````

``````letter_pivot = pd.pivot_table(
baby, index='Last', columns='Sex', values='Count', aggfunc='sum'
)
letter_pivot``````
Sex F M
Last
a 58079486 1931630
b 17376 1435939
c 30262 1672407
x 37381 644092
y 24877638 18569388
z 142023 120123

26 行 × 2 列

``````letter_pivot.plot.barh(figsize=(10, 10))
# <matplotlib.axes._subplots.AxesSubplot at 0x1a17c36978>``````

``````total_for_each_letter = letter_pivot['F'] + letter_pivot['M']

letter_pivot['F prop'] = letter_pivot['F'] / total_for_each_letter
letter_pivot['M prop'] = letter_pivot['M'] / total_for_each_letter
letter_pivot``````
Sex F M F prop M prop
Last
a 58079486 1931630 0.967812 0.032188
b 17376 1435939 0.011956 0.988044
c 30262 1672407 0.017773 0.982227
x 37381 644092 0.054853 0.945147
y 24877638 18569388 0.572597 0.427403
z 142023 120123 0.541771 0.458229

26 行 × 4 列

``````(letter_pivot[['F prop', 'M prop']]
.sort_values('M prop') # Sorting orders the plotted bars
.plot.barh(figsize=(10, 10))
)
# <matplotlib.axes._subplots.AxesSubplot at 0x1a18194b70>``````

apachecn_飞龙
+ 关注