Creating a Series

In [1]:

import pandas as pd 
s = pd.Series(['banana', 42]) # create a Series by passing a list
print(s)

0    banana
1        42
dtype: object

In [2]:

# A Series can take strings as its index; without one, a default integer index (0, 1, ...) is assigned

s = pd.Series(['Wes Mckinney', 'Creator of Pandas'])
print(s)

0         Wes Mckinney
1    Creator of Pandas
dtype: object

In [3]:

s = pd.Series(['Wes Mckinney', 'Creator of Pandas'], index = ['Person', 'Who'])
print(s)

Person         Wes Mckinney
Who       Creator of Pandas
dtype: object
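
As a small sketch of why a string index is useful, the following rebuilds the same Series and looks values up both by label and by position. The variable names `person` and `who` are illustrative and do not appear in the original notebook.

```python
import pandas as pd

s = pd.Series(['Wes Mckinney', 'Creator of Pandas'], index=['Person', 'Who'])

# Label-based lookup uses the string index
person = s.loc['Person']
# Position-based lookup is still available through iloc
who = s.iloc[1]

print(person)
print(who)
print(list(s.index))
```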

Creating a DataFrame

In [5]:

# Pass a dictionary to the DataFrame class

scientists = pd.DataFrame({
    'Name' : ['Rosaline Franklin', 'Wiliam Gosset'],
    'Occupation' : ['Chemist', 'Statistician'],
    'Born' : ['1920-07-25', '1876-06-13'],
    'Died' : ['1958-04-19', '1937-10-16'],
    'Age' : [37,61]})

print(scientists)

                Name    Occupation        Born        Died  Age
0  Rosaline Franklin       Chemist  1920-07-25  1958-04-19   37
1      Wiliam Gosset  Statistician  1876-06-13  1937-10-16   61

In [6]:

# The index and columns arguments set the row labels and the column order

scientists = pd.DataFrame({
    'Occupation' : ['Chemist', 'Statistician'],
    'Born' : ['1920-07-25', '1876-06-13'],
    'Died' : ['1958-04-19', '1937-10-16'],
    'Age' : [37,61]},
    index = ['Rosaline Franklin', 'Wiliam Gosset'],
    columns = ['Occupation', 'Born', 'Age', 'Died'])

print(scientists)

                     Occupation        Born  Age        Died
Rosaline Franklin       Chemist  1920-07-25   37  1958-04-19
Wiliam Gosset      Statistician  1876-06-13   61  1937-10-16

In [9]:

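# Before Python 3.7, a plain dictionary did not guarantee key order, so OrderedDict was used to fix the column order.
# From Python 3.7 on, dict preserves insertion order, so OrderedDict is optional here.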

from collections import OrderedDict 

scientists = pd.DataFrame(OrderedDict([
    ('Name' , ['Rosaline Franklin', 'Wiliam Gosset']),
    ('Occupation' , ['Chemist', 'Statistician']),
    ('Born' , ['1920-07-25', '1876-06-13']),
    ('Died' , ['1958-04-19', '1937-10-16']),
    ('Age' , [37,61])
]))

print(scientists)

                Name    Occupation        Born        Died  Age
0  Rosaline Franklin       Chemist  1920-07-25  1958-04-19   37
1      Wiliam Gosset  Statistician  1876-06-13  1937-10-16   61
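
A minimal sketch of the ordering point, assuming Python 3.7 or later: a plain dictionary keeps its insertion order, and the `columns` argument can reorder columns explicitly. The reduced three-column data set is only for illustration.

```python
import pandas as pd

data = {
    'Name': ['Rosaline Franklin', 'William Gosset'],
    'Occupation': ['Chemist', 'Statistician'],
    'Age': [37, 61],
}

# On Python 3.7+ the columns follow the dict's insertion order
df = pd.DataFrame(data)
print(list(df.columns))

# The columns argument selects and reorders columns explicitly
reordered = pd.DataFrame(data, columns=['Age', 'Name', 'Occupation'])
print(list(reordered.columns))
```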

Selecting a Series from a DataFrame

In [11]:

scientists = pd.DataFrame({
    'Occupation' : ['Chemist', 'Statistician'],
    'Born' : ['1920-07-25', '1876-06-13'],
    'Died' : ['1958-04-19', '1937-10-16'],
    'Age' : [37,61]},
    index = ['Rosaline Franklin', 'Wiliam Gosset'],
    columns = ['Occupation', 'Born', 'Age', 'Died'])

print(scientists)

                     Occupation        Born  Age        Died
Rosaline Franklin       Chemist  1920-07-25   37  1958-04-19
Wiliam Gosset      Statistician  1876-06-13   61  1937-10-16

In [13]:

# To select a row as a Series, pass its index label (here, the scientist's name) to loc

first_row = scientists.loc['Wiliam Gosset']
print(type(first_row))

<class 'pandas.core.series.Series'>

In [14]:

# Although the Age column was built from a list of integers, the row Series mixes strings and an integer, so its dtype is object

print(first_row)

Occupation    Statistician
Born            1876-06-13
Age                     61
Died            1937-10-16
Name: Wiliam Gosset, dtype: object
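
The dtype difference between a row and a column can be sketched as follows. This is a reduced two-column version of the data above; `iloc` is shown only as the position-based counterpart of `loc`.

```python
import pandas as pd

scientists = pd.DataFrame(
    {'Occupation': ['Chemist', 'Statistician'], 'Age': [37, 61]},
    index=['Rosaline Franklin', 'William Gosset'],
)

row = scientists.loc['William Gosset']   # select a row by index label
same_row = scientists.iloc[1]            # select the same row by integer position
column = scientists['Age']               # select a column

print(row.dtype)     # a row mixes a string and an integer
print(column.dtype)  # a column holds only integers
```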

Using index, values, and keys

In [15]:

# Using the index attribute: the index of the Series

print(first_row.index)

Index(['Occupation', 'Born', 'Age', 'Died'], dtype='object')

In [16]:

# Using the values attribute: the data of the Series

print(first_row.values)

['Statistician' '1876-06-13' 61 '1937-10-16']

In [17]:

# Using the keys method: it is a method, not an attribute, and returns the same result as the index attribute

print(first_row.keys())

Index(['Occupation', 'Born', 'Age', 'Died'], dtype='object')

In [18]:

# Applying the index attribute: take the first index label

print(first_row.index[0])

Occupation

In [19]:

# Applying the keys method: take the first index label

print(first_row.keys()[0])

Occupation

Using the mean, min, max, and std methods of a Series

In [20]:

ages = scientists['Age']
print(ages)

Rosaline Franklin    37
Wiliam Gosset        61
Name: Age, dtype: int64

In [22]:

print(ages.mean())
print(ages.min())
print(ages.max())
print(ages.std())

49.0
37
61
16.97056274847714

In [23]:

print(ages.describe())

count     2.000000
mean     49.000000
std      16.970563
min      37.000000
25%      43.000000
50%      49.000000
75%      55.000000
max      61.000000
Name: Age, dtype: float64
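
The `std` value of about 16.97 can be checked by hand. The sketch below assumes only that pandas uses the sample standard deviation (division by n - 1, `ddof=1`) by default, which is its documented behavior.

```python
import math
import pandas as pd

ages = pd.Series([37, 61])

mean = ages.mean()                                       # (37 + 61) / 2 = 49.0
squared_dev = ((ages - mean) ** 2).sum()                 # 144 + 144 = 288
sample_std = math.sqrt(squared_dev / (len(ages) - 1))    # divide by n - 1

print(sample_std)
print(ages.std())        # pandas default, ddof=1
print(ages.std(ddof=0))  # population standard deviation, divides by n
```

With only two values, the two definitions differ noticeably, which is why the result is 16.97 rather than 12.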

Boolean extraction with a Series

In [24]:

# You often do not know the index in advance; boolean extraction selects only the values that satisfy a condition

scientists = pd.read_csv('C:/Users/USER/Doit/data/scientists.csv')

In [25]:

ages = scientists['Age']
print(ages.max())

90

In [26]:

print(ages.mean())

59.125

In [27]:

# Extract the people whose age is above the mean age

print(ages[ages > ages.mean()])

1    61
2    90
3    66
7    77
Name: Age, dtype: int64

In [28]:

print(ages > ages.mean()) # shows True or False for each index; passing this result to ages[...] keeps only the True rows

0    False
1     True
2     True
3     True
4    False
5    False
6    False
7     True
Name: Age, dtype: bool

In [30]:

# Boolean extraction: passing a list of True/False values keeps only the positions marked True

manual_bool_values = [True, True, False, False, True, True, False, True]
print(ages[manual_bool_values]) # rows at the True positions are selected

0    37
1    61
4    56
5    45
7    77
Name: Age, dtype: int64
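
The two steps above (build a boolean mask, then index with it) can be sketched without the CSV file. The ages are copied from the output shown in this post; the combined condition with `&` is an added illustration, not part of the original notebook.

```python
import pandas as pd

ages = pd.Series([37, 61, 90, 66, 56, 45, 41, 77], name='Age')

mask = ages > ages.mean()      # boolean Series with the same index as ages
older = ages[mask]             # keeps the rows where mask is True

# Conditions combine with & (and), | (or), ~ (not); each side needs parentheses
middle = ages[(ages >= 45) & (ages <= 66)]

print(older.tolist())
print(middle.tolist())
```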

Series and broadcasting

In [33]:

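# Several values come back because the operation is applied to every element of the Series or DataFrame at once; this is called broadcasting
# Data that holds multiple values, like a Series = vector
# Data that holds a single magnitude = scalar

print(ages + ages) # vector-with-vector operation: the result is a vector of the same length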
print(ages * ages)

0     74
1    122
2    180
3    132
4    112
5     90
6     82
7    154
Name: Age, dtype: int64
0    1369
1    3721
2    8100
3    4356
4    3136
5    2025
6    1681
7    5929
Name: Age, dtype: int64

In [35]:

print(ages + 100) # vector-with-scalar operation: the scalar is broadcast to every value
print(ages * 2)

0    137
1    161
2    190
3    166
4    156
5    145
6    141
7    177
Name: Age, dtype: int64
0     74
1    122
2    180
3    132
4    112
5     90
6     82
7    154
Name: Age, dtype: int64

In [36]:

# Operating on vectors of different lengths: indexes without a partner produce the missing value NaN

print(pd.Series([1,100]))

0      1
1    100
dtype: int64

In [37]:

print(ages + pd.Series([1,100]))

0     38.0
1    161.0
2      NaN
3      NaN
4      NaN
5      NaN
6      NaN
7      NaN
dtype: float64
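
A short sketch of where the NaN values come from and one way to avoid them. `Series.add` with `fill_value` is a standard pandas method; treating a missing partner as 0 is an assumption made only for this illustration.

```python
import pandas as pd

ages = pd.Series([37, 61, 90])
short = pd.Series([1, 100])

plain = ages + short                     # index 2 has no partner, so the result is NaN
filled = ages.add(short, fill_value=0)   # treat the missing partner as 0

print(plain)
print(filled)
```

The result becomes float64 in both cases because the alignment step has to represent a missing value on one side.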

In [38]:

# sort_index method
# ascending=False sorts the index in descending order, which reverses the data

rev_ages = ages.sort_index(ascending = False)
print(rev_ages)

7    77
6    41
5    45
4    56
3    66
2    90
1    61
0    37
Name: Age, dtype: int64

In [39]:

# Values are combined by matching index labels, not by position, so rev_ages + ages equals ages * 2

print(ages * 2)
print(rev_ages + ages)

0     74
1    122
2    180
3    132
4    112
5     90
6     82
7    154
Name: Age, dtype: int64
0     74
1    122
2    180
3    132
4    112
5     90
6     82
7    154
Name: Age, dtype: int64
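
To make the label-based alignment visible, the sketch below contrasts it with a purely positional addition. Converting one operand with `to_numpy()` drops its index; that contrast is an added illustration and not part of the original notebook.

```python
import pandas as pd

ages = pd.Series([37, 61, 90])
rev_ages = ages.sort_index(ascending=False)   # index order 2, 1, 0

by_label = ages + rev_ages                 # aligned on index labels
by_position = ages + rev_ages.to_numpy()   # a NumPy array has no index to align on

print(by_label.tolist())
print(by_position.tolist())
```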

Boolean extraction with a DataFrame

In [40]:

print(scientists[scientists['Age'] > scientists['Age'].mean()])

                   Name        Born        Died  Age     Occupation
1        William Gosset  1876-06-13  1937-10-16   61   Statistician
2  Florence Nightingale  1820-05-12  1910-08-13   90          Nurse
3           Marie Curie  1867-11-07  1934-07-04   66        Chemist
7          Johann Gauss  1777-04-30  1855-02-23   77  Mathematician

In [49]:

scientists = pd.read_csv('C:/Users/USER/Doit/data/scientists.csv')

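# A bool vector is a list of True/False values; its length should equal the number of rows.
# Older pandas versions used only as many rows as the list covered, while recent versions raise an error on a length mismatch.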

print(scientists.loc[[True, True, False, True, False, True, False, True]])

                Name        Born        Died  Age     Occupation
0  Rosaline Franklin  1920-07-25  1958-04-16   37        Chemist
1     William Gosset  1876-06-13  1937-10-16   61   Statistician
3        Marie Curie  1867-11-07  1934-07-04   66        Chemist
5          John Snow  1813-03-15  1858-06-16   45      Physician
7       Johann Gauss  1777-04-30  1855-02-23   77  Mathematician
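
A hedged sketch of the length rule, using a hypothetical three-row DataFrame (the names 'A', 'B', 'C' are placeholders). It assumes a recent pandas release, where a boolean list of the wrong length is rejected rather than truncated; the `try` block records what happens instead of assuming it.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['A', 'B', 'C'], 'Age': [37, 61, 90]})

# A mask derived from the data always has the right length
mask = df['Age'] > 50
selected = df.loc[mask]
print(selected)

# A hand-written list that is too short: record whether pandas accepts it
try:
    df.loc[[True, False]]
    outcome = 'accepted'
except (IndexError, ValueError):
    outcome = 'rejected'
print(outcome)
```

Deriving the mask from a column comparison avoids the length problem entirely, which is why it is usually preferred over hand-written lists.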

DataFrame and broadcasting

In [51]:

# Applying a scalar operation to a DataFrame

print(scientists * 2) # integers are multiplied by 2; strings are repeated twice

                                       Name                  Born  \
0        Rosaline FranklinRosaline Franklin  1920-07-251920-07-25   
1              William GossetWilliam Gosset  1876-06-131876-06-13   
2  Florence NightingaleFlorence Nightingale  1820-05-121820-05-12   
3                    Marie CurieMarie Curie  1867-11-071867-11-07   
4                Rachel CarsonRachel Carson  1907-05-271907-05-27   
5                        John SnowJohn Snow  1813-03-151813-03-15   
6                    Alan TuringAlan Turing  1912-06-231912-06-23   
7                  Johann GaussJohann Gauss  1777-04-301777-04-30   

                   Died  Age                            Occupation  
0  1958-04-161958-04-16   74                        ChemistChemist  
1  1937-10-161937-10-16  122              StatisticianStatistician  
2  1910-08-131910-08-13  180                            NurseNurse  
3  1934-07-041934-07-04  132                        ChemistChemist  
4  1964-04-141964-04-14  112                    BiologistBiologist  
5  1858-06-161858-06-16   90                    PhysicianPhysician  
6  1954-06-071954-06-07   82  Computer ScientistComputer Scientist  
7  1855-02-231855-02-23  154            MathematicianMathematician
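
Repeating every string is rarely what you want. One possible approach, sketched here on a hypothetical two-row DataFrame, is to restrict the scalar operation to numeric columns with `select_dtypes`.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['John Snow', 'Alan Turing'], 'Age': [45, 41]})

# Find the numeric columns and apply the scalar only to them
numeric_cols = df.select_dtypes(include='number').columns
scaled = df.copy()
scaled[numeric_cols] = df[numeric_cols] * 2

print(scaled)
```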

Changing a column's data type and adding new columns

In [52]:

print(scientists['Born'].dtype)

object

In [53]:

print(scientists['Died'].dtype)

object

In [54]:

# Dates stored as strings are better converted to the datetime type so that time-related operations become possible
# pd.to_datetime performs the conversion
born_datetime = pd.to_datetime(scientists['Born'],format = '%Y-%m-%d')
print(born_datetime)

0   1920-07-25
1   1876-06-13
2   1820-05-12
3   1867-11-07
4   1907-05-27
5   1813-03-15
6   1912-06-23
7   1777-04-30
Name: Born, dtype: datetime64[ns]

In [55]:

died_datetime = pd.to_datetime(scientists['Died'],format = '%Y-%m-%d')
print(died_datetime)

0   1958-04-16
1   1937-10-16
2   1910-08-13
3   1934-07-04
4   1964-04-14
5   1858-06-16
6   1954-06-07
7   1855-02-23
Name: Died, dtype: datetime64[ns]

In [59]:

scientists['born_dt'], scientists['died_dt'] = (born_datetime, died_datetime)
print(scientists.head())
print(scientists.shape)

                   Name        Born        Died  Age    Occupation    born_dt  \
0     Rosaline Franklin  1920-07-25  1958-04-16   37       Chemist 1920-07-25   
1        William Gosset  1876-06-13  1937-10-16   61  Statistician 1876-06-13   
2  Florence Nightingale  1820-05-12  1910-08-13   90         Nurse 1820-05-12   
3           Marie Curie  1867-11-07  1934-07-04   66       Chemist 1867-11-07   
4         Rachel Carson  1907-05-27  1964-04-14   56     Biologist 1907-05-27   

     died_dt  
0 1958-04-16  
1 1937-10-16  
2 1910-08-13  
3 1934-07-04  
4 1964-04-14  
(8, 7)

In [60]:

scientists['age_days_dt'] = (scientists['died_dt'] - scientists['born_dt'])
print(scientists)

                   Name        Born        Died  Age          Occupation  \
0     Rosaline Franklin  1920-07-25  1958-04-16   37             Chemist   
1        William Gosset  1876-06-13  1937-10-16   61        Statistician   
2  Florence Nightingale  1820-05-12  1910-08-13   90               Nurse   
3           Marie Curie  1867-11-07  1934-07-04   66             Chemist   
4         Rachel Carson  1907-05-27  1964-04-14   56           Biologist   
5             John Snow  1813-03-15  1858-06-16   45           Physician   
6           Alan Turing  1912-06-23  1954-06-07   41  Computer Scientist   
7          Johann Gauss  1777-04-30  1855-02-23   77       Mathematician   

     born_dt    died_dt age_days_dt  
0 1920-07-25 1958-04-16  13779 days  
1 1876-06-13 1937-10-16  22404 days  
2 1820-05-12 1910-08-13  32964 days  
3 1867-11-07 1934-07-04  24345 days  
4 1907-05-27 1964-04-14  20777 days  
5 1813-03-15 1858-06-16  16529 days  
6 1912-06-23 1954-06-07  15324 days  
7 1777-04-30 1855-02-23  28422 days
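
The conversion and subtraction above can be sketched end to end on two rows taken from the output. The `age_days` and `age_years` columns are illustrative names, and dividing by 365 is only a rough estimate that ignores leap days.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Rosaline Franklin', 'William Gosset'],
    'Born': ['1920-07-25', '1876-06-13'],
    'Died': ['1958-04-16', '1937-10-16'],
})

df['born_dt'] = pd.to_datetime(df['Born'], format='%Y-%m-%d')
df['died_dt'] = pd.to_datetime(df['Died'], format='%Y-%m-%d')

lifespan = df['died_dt'] - df['born_dt']   # Series of timedelta values
df['age_days'] = lifespan.dt.days          # plain integer day counts
df['age_years'] = df['age_days'] // 365    # rough estimate, ignores leap days

print(df[['Name', 'age_days', 'age_years']])
```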

Shuffling the data of a Series or DataFrame

In [61]:

print(scientists['Age'])

0    37
1    61
2    90
3    66
4    56
5    45
6    41
7    77
Name: Age, dtype: int64

In [62]:

# Shuffling this way requires the random library from the standard library

import random

random.seed(42)
random.shuffle(scientists['Age'])
print(scientists['Age'])

0    66
1    56
2    41
3    77
4    90
5    45
6    37
7    61
Name: Age, dtype: int64

C:\Users\USER\anaconda3\lib\random.py:362: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  x[i], x[j] = x[j], x[i]
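
`random.shuffle` swaps elements in place on the column object, which is what triggers the SettingWithCopyWarning shown above. A possible alternative, sketched here with the same ages, is the pandas `sample` method; the resulting order will differ from the `random.shuffle` output above because a different random generator is used.

```python
import pandas as pd

df = pd.DataFrame({'Age': [37, 61, 90, 66, 56, 45, 41, 77]})

# sample(frac=1) returns every row in random order;
# random_state makes the order reproducible
shuffled = df['Age'].sample(frac=1, random_state=42)

# reset_index(drop=True) discards the old labels so the values stay shuffled
# when assigned back; otherwise index alignment would undo the shuffle
df['Age'] = shuffled.reset_index(drop=True)

print(df['Age'])
```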

Deleting a column from a DataFrame

In [63]:

print(scientists.columns)

Index(['Name', 'Born', 'Died', 'Age', 'Occupation', 'born_dt', 'died_dt',
       'age_days_dt'],
      dtype='object')

In [64]:

# Use the drop method
# Pass the column names in a list as the first argument and axis=1 as the second

scientists_dropped = scientists.drop(['Age'],axis = 1)
print(scientists_dropped.columns)

Index(['Name', 'Born', 'Died', 'Occupation', 'born_dt', 'died_dt',
       'age_days_dt'],
      dtype='object')
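
For comparison, `drop` also accepts a `columns` keyword that names the axis explicitly. The sketch below uses a hypothetical one-row DataFrame and shows that both forms give the same result and leave the original untouched.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['John Snow'], 'Age': [45], 'Occupation': ['Physician']})

by_axis = df.drop(['Age'], axis=1)       # the form used above
by_keyword = df.drop(columns=['Age'])    # equivalent, names the axis explicitly

print(list(by_keyword.columns))
print(list(df.columns))  # the original is unchanged; drop returns a new DataFrame
```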

Saving in pickle format

In [ ]:

# Pickle stores an object by serializing it into binary form
# It can store data in less space than a spreadsheet
# Both Series and DataFrames can be saved in pickle format

names = scientists['Name']
# names.to_pickle('')
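
A minimal round-trip sketch, assuming a temporary directory and a hypothetical file name `names.pickle` (the original leaves the path empty). It saves a Series with `to_pickle` and restores it with `pd.read_pickle`.

```python
import os
import tempfile

import pandas as pd

names = pd.Series(['Rosaline Franklin', 'William Gosset'], name='Name')

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'names.pickle')  # hypothetical file name
    names.to_pickle(path)            # serialize the Series to disk
    restored = pd.read_pickle(path)  # load it back

print(restored.equals(names))
```

Pickle files should only be loaded from sources you trust, since unpickling can execute arbitrary code.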

Saving to CSV and Excel

In [ ]:

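# to_csv saves a Series or DataFrame as a CSV file; to_excel saves a DataFrame as an Excel file

names_df = names.to_frame()  # assumption: names_df is not defined in the original notebook

import xlwt  # legacy .xls writer, used only by older pandas versions
# names_df.to_excel('')  # output path not given in the original

import openpyxl  # .xlsx writer
# names_df.to_excel('')  # output path not given in the original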
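
Since this cell names `to_csv` without showing a call, here is a hedged round-trip sketch. The DataFrame, the temporary directory, and the file name `scientists.csv` are assumptions made for illustration only.

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({'Name': ['John Snow', 'Alan Turing'], 'Age': [45, 41]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'scientists.csv')  # hypothetical file name
    df.to_csv(path, index=False)   # index=False leaves out the row labels
    loaded = pd.read_csv(path)     # read the file back into a DataFrame

print(loaded)
```

Without `index=False`, the row labels are written as an extra unnamed column, which then shows up again when the file is read back.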
