写在前面:
这是我见过的最严肃的数据集,几乎每一行数据背后都是生命和鲜血的代价。这次探索分析并不妄图说明什么,仅仅是对数据处理能力的锻炼。因此本次的探索分析只会展示数据该有的样子而不会进行太多的评价。有一句话叫“因为珍爱和平,我们回首战争”。这里也是,因为珍爱生命,所以回首空难。现在安全的飞行是10万多无辜的人通过性命换来的,向这些伟大的探索者致敬。

1
2
3
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

导入数据集

1
2
crash = pd.read_csv("./Airplane_Crashes_and_Fatalities_Since_1908.csv")
crash.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5268 entries, 0 to 5267
Data columns (total 13 columns):
Date            5268 non-null object
Time            3049 non-null object
Location        5248 non-null object
Operator        5250 non-null object
Flight #        1069 non-null object
Route           3562 non-null object
Type            5241 non-null object
Registration    4933 non-null object
cn/In           4040 non-null object
Aboard          5246 non-null float64
Fatalities      5256 non-null float64
Ground          5246 non-null float64
Summary         4878 non-null object
dtypes: float64(3), object(10)
memory usage: 535.1+ KB
1
2
crash = crash.drop(["Summary","cn/In","Flight #","Route","Location"],axis=1)
crash.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5268 entries, 0 to 5267
Data columns (total 8 columns):
Date            5268 non-null object
Time            3049 non-null object
Operator        5250 non-null object
Type            5241 non-null object
Registration    4933 non-null object
Aboard          5246 non-null float64
Fatalities      5256 non-null float64
Ground          5246 non-null float64
dtypes: float64(3), object(5)
memory usage: 329.3+ KB
1
print(crash[2200:2205])
            Date   Time                      Operator              Type  \
2200  03/06/1968   8:00     Military - U.S. Air Force  Fairchild C-123K   
2201  03/08/1968  19:18                    Air Manila    Fairchild F-27   
2202  03/09/1968  23:20   Military - French Air Force      Douglas DC6B   
2203  03/19/1968  19:37     Viking Airways - Air Taxi        Cessna 182   
2204  03/23/1968  13:00  Fortaire Aviation - Air Taxi       Brantly 305   

     Registration  Aboard  Fatalities  Ground  
2200      54-0590    49.0        49.0     0.0  
2201      PI-C871    14.0        14.0     0.0  
2202        43748    20.0        19.0     0.0  
2203       N2623F     2.0         2.0     0.0  
2204       N2224U     5.0         3.0     0.0  

伤亡分析

伤亡排序

1
2
3
4
print(crash["Fatalities"].sum())
fatal_crash = crash[crash["Fatalities"].notnull()]
fatal_crash = fatal_crash.sort_values(by="Fatalities")
print(fatal_crash[-5:])
105479.0
            Date   Time                                    Operator  \
3562  06/23/1985   7:15                                   Air India   
2726  03/03/1974  11:41                      Turkish Airlines (THY)   
4455  11/12/1996  18:40  Saudi Arabian Airlines / Kazastan Airlines   
3568  08/12/1985  18:56                             Japan Air Lines   
2963  03/27/1977  17:07            Pan American World Airways / KLM   

                                      Type    Registration  Aboard  \
3562                     Boeing B-747-237B          VT-EFO   329.0   
2726            McDonnell Douglas DC-10-10          TC-JAV   346.0   
4455  Boeing B-747-168B / Ilyushin IL-76TD  HZAIH/UN-76435   349.0   
3568                     Boeing B-747-SR46          JA8119   524.0   
2963  Boeing B-747-121 / Boeing B-747-206B   N736PA/PH-BUF   644.0   

      Fatalities  Ground  
3562       329.0     0.0  
2726       346.0     0.0  
4455       349.0     0.0  
3568       520.0     0.0  
2963       583.0     0.0  
  • 内特里费空难:两架波音-747相撞,死亡583人,又称世纪大空难
  • 日航123空难:波音747撞富士山,单架飞机失事最高死亡记录
  • 恰尔基达德里撞机事件,最严重的的空中撞机事件
  • 土耳其航空981号班机空难:货舱门未锁定导致爆炸性施压
  • 印度航空182号班机:恐怖袭击

伤亡概率

1
print(fatal_crash["Fatalities"].sum() / fatal_crash["Aboard"].sum())
0.729700936002
1
print(fatal_crash[fatal_crash["Fatalities"] != 0]["Fatalities"].sum() / fatal_crash[fatal_crash["Fatalities"] != 0]["Aboard"].sum())
0.754191781605    

机型处理

处理函数

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
type_crash = fatal_crash["Type"]
def type_handle(x):
x = str(x)
if "McDonnell Douglas" in x:
return "McDonnell Douglas"
elif "Douglas" in x:
return "Douglas"
elif "Boeing" in x:
return "Boeing"
elif "Airbus" in x:
return "Airbus"
elif "Embraer" in x:
return "Embraer"
elif "Ilyushin" in x:
return "Ilyushin"
else:
return "other"
company_crash = type_crash.map(type_handle)
print(pd.value_counts(company_crash))
other                3581
Douglas               984
Boeing                376
McDonnell Douglas     123
Ilyushin               96
Embraer                61
Airbus                 35
Name: Type, dtype: int64
1
2
fatal_crash["company"] = company_crash
print(fatal_crash[:2])
            Date   Time          Operator               Type Registration  \
108   10/21/1926  13:15  Imperial Airways  Handley Page W-10       G-EBMS   
5178  11/08/2007   8:00    Juba Air Cargo         Antonov 12       ST-JUA   

      Aboard  Fatalities  Ground  year month company  
108     12.0         0.0     0.0  1926    10   other  
5178     4.0         0.0     2.0  2007    11   other  

处理结果

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
def airplane_counte(x):
fatal_ratio = x["Fatalities"].sum() / x["Aboard"].sum()
crash_time = x.shape[0]
fatal_num = x["Fatalities"].sum()
return pd.Series({"fatal_num":fatal_num,"crash_time":crash_time,"fatal_ratio":fatal_ratio})

company = fatal_crash.groupby(['company']).apply(airplane_counte)
print(company)
plt.close()
plt.figure(figsize=(16,4))
plt.subplot(1,3,1)
company['crash_time'].drop("other").plot(kind='bar',title="time")
plt.subplot(1,3,2)
company['fatal_num'].drop("other").plot(kind='bar',title="fatal_num")
plt.subplot(1,3,3)
company['fatal_ratio'].plot(kind='bar',title="fatal_ratio")
plt.show()
                   crash_time  fatal_num  fatal_ratio
company                                              
Airbus                   35.0     2980.0     0.510711
Boeing                  376.0    18705.0     0.649434
Douglas                 984.0    16899.0     0.794350
Embraer                  61.0      644.0     0.779661
Ilyushin                 96.0     4547.0     0.883084
McDonnell Douglas       123.0     6827.0     0.531946
other                  3581.0    54877.0     0.785854

时间分析

1
2
3
4
5
def get_year(x):
return x.split("/")[-1]
fatal_crash['year'] = fatal_crash["Date"].map(get_year)
year_fatal = fatal_crash[fatal_crash["year"] != np.NaN][["year","Fatalities","Aboard"]]
year_fatal.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 5256 entries, 108 to 2963
Data columns (total 3 columns):
year          5256 non-null object
Fatalities    5256 non-null float64
Aboard        5246 non-null float64
dtypes: float64(2), object(1)
memory usage: 164.2+ KB
1
2
3
4
5
def year_analysis(x):
return pd.Series({"fatal_num":x["Fatalities"].sum(),"time":x.shape[0],"fatal_ratio":x["Fatalities"].sum() / x["Aboard"].sum()})
year = year_fatal.groupby(["year"]).apply(year_analysis)
year = year.sort_index()
year.info()
<class 'pandas.core.frame.DataFrame'>
Index: 98 entries, 1908 to 2009
Data columns (total 3 columns):
fatal_num      98 non-null float64
fatal_ratio    98 non-null float64
time           98 non-null float64
dtypes: float64(3)
memory usage: 3.1+ KB
1
2
3
4
5
6
7
8
9
plt.close()
plt.figure(figsize=(16,4))
plt.subplot(1,3,1)
year["fatal_num"].plot(title="fatal_num")
plt.subplot(1,3,2)
year["time"].plot(title="crash_time")
plt.subplot(1,3,3)
year["fatal_ratio"].plot(title="fatal_ratio")
plt.show()

1
2
3
4
5
def get_month(x):
return x.split("/")[0]
fatal_crash['month'] = fatal_crash["Date"].map(get_month)
month_fatal = fatal_crash[fatal_crash["month"] != np.NaN][["month","Fatalities","Aboard"]]
month_fatal.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 5256 entries, 108 to 2963
Data columns (total 3 columns):
month         5256 non-null object
Fatalities    5256 non-null float64
Aboard        5246 non-null float64
dtypes: float64(2), object(1)
memory usage: 164.2+ KB
1
2
3
4
5
def month_analysis(x):
return pd.Series({"fatal_num":x["Fatalities"].sum(),"time":x.shape[0],"fatal_ratio":x["Fatalities"].sum() / x["Aboard"].sum()})
month = month_fatal.groupby(["month"]).apply(year_analysis)
month = month.sort_index()
print(month)
       fatal_num  fatal_ratio   time
month                               
01        8425.0     0.768354  494.0
02        7966.0     0.693057  395.0
03        8708.0     0.787057  453.0
04        6769.0     0.711852  378.0
05        7130.0     0.731807  370.0
06        7909.0     0.681399  385.0
07        9232.0     0.700349  427.0
08       10174.0     0.729162  474.0
09       10286.0     0.760349  458.0
10        8388.0     0.778758  452.0
11       10033.0     0.766522  454.0
12       10459.0     0.668478  516.0
1
2
3
4
5
6
7
8
9
plt.close()
plt.figure(figsize=(16,4))
plt.subplot(1,3,1)
month["fatal_num"].plot(kind="bar",title="fatal_num")
plt.subplot(1,3,2)
month["time"].plot(kind="bar",title="crash_time")
plt.subplot(1,3,3)
month["fatal_ratio"].plot(kind="bar",title="fatal_ratio")
plt.show()

小时

1
2
3
4
5
6
7
8
9
10
11
12
13
14
def get_hour(x):
hour = x.split(":")[0]
try:
hour = float(hour)
if int(hour) == hour and hour < 24:
return hour
else:
return np.nan
except:
return np.nan

time_fatal = fatal_crash[fatal_crash["Time"].isnull() == False]
time_fatal["hour"] = time_fatal["Time"].map(get_hour)
time_fatal.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 3049 entries, 108 to 2963
Data columns (total 13 columns):
Date            3049 non-null object
Time            3049 non-null object
Operator        3046 non-null object
Type            3048 non-null object
Registration    2952 non-null object
Aboard          3049 non-null float64
Fatalities      3049 non-null float64
Ground          3046 non-null float64
airplane        3049 non-null object
company         3049 non-null object
year            3049 non-null object
month           3049 non-null object
hour            3036 non-null float64
dtypes: float64(4), object(9)
memory usage: 333.5+ KB


c:\users\qiank\appdata\local\programs\python\python35\lib\site-packages\ipykernel_launcher.py:13: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  del sys.path[0]
1
2
3
def hour_analysis(x):
return pd.Series({"fatal_num":x["Fatalities"].sum(),"time":x.shape[0],"fatal_ratio":x["Fatalities"].sum() / x["Aboard"].sum()})
hour = time_fatal.groupby(["hour"]).apply(hour_analysis)
1
2
3
4
5
6
7
8
9
plt.close()
plt.figure(figsize=(16,4))
plt.subplot(1,3,1)
hour["fatal_num"].plot(title="fatal_num")
plt.subplot(1,3,2)
hour["time"].plot(title="crash_time")
plt.subplot(1,3,3)
hour["fatal_ratio"].plot(title="fatal_ratio")
plt.show()