pandas study notes 2

Source: Python for Data Analysis [Wes McKinney, translated by Tang Xuetao et al.]

1. Aggregate and Calculate Descriptive Statistics

pandas objects provide a set of common mathematical and statistical methods. Most are reductions or summary statistics: they extract a single value (such as the sum or mean) from a Series, or a Series of values from the rows or columns of a DataFrame.

1. Related examples

(1) Common reduction methods (sum(), mean(), etc.)

Common options for reduction methods:

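Option    Description
axis      The axis to reduce over (0 for the rows of a DataFrame, 1 for the columns)
skipna    Exclude missing values; the default is True
level     If the axis is hierarchically indexed (a MultiIndex), reduce grouped by level (removed from the reduction methods in pandas 2.0; use groupby(level=...) instead)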
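A minimal sketch of how axis and skipna change a reduction. The frame and its values are made up for illustration and are not taken from the book.

```python
import numpy as np
import pandas as pd

# Illustrative data with some missing values.
df = pd.DataFrame(
    [[1.4, np.nan], [7.1, -4.5], [np.nan, np.nan], [0.75, -1.3]],
    index=["a", "b", "c", "d"],
    columns=["one", "two"],
)

col_sums = df.sum()          # default axis=0: one value per column, NaN skipped
row_sums = df.sum(axis=1)    # axis=1: one value per row, NaN skipped
# With skipna=False, a single NaN in a row makes that row's result NaN.
row_means_strict = df.mean(axis=1, skipna=False)
```

Note that the sum of an all-NaN row is 0.0 when skipna is left at its default.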

Indirect statistics (idxmax())
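As a sketch, idxmax() and idxmin() return the index label where the extreme value occurs rather than the value itself. The data below is illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [[1.4, np.nan], [7.1, -4.5], [np.nan, np.nan], [0.75, -1.3]],
    index=["a", "b", "c", "d"],
    columns=["one", "two"],
)

max_labels = df.idxmax()   # row label of each column's maximum
min_labels = df.idxmin()   # row label of each column's minimum
```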

(2) Cumulative methods [such as cumsum()]
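A small sketch of an accumulation, using made-up values. Unlike a reduction, the result has the same length as the input; a missing value stays missing but does not reset the running total.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 2.0, np.nan, 4.0])

running_total = s.cumsum()   # 1.0, 3.0, NaN, 7.0
running_max = s.cummax()     # 1.0, 2.0, NaN, 4.0
```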

(3) Summary statistics [such as describe()]
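A sketch of describe() on illustrative data. For numeric data it returns count, mean, std, min, quartiles, and max; for non-numeric data it returns count, unique, top, and freq.

```python
import pandas as pd

numbers = pd.Series([1.0, 2.0, 3.0, 4.0])
labels = pd.Series(["a", "a", "b", "c"])

num_summary = numbers.describe()    # count, mean, std, min, 25%, 50%, 75%, max
text_summary = labels.describe()    # count, unique, top, freq
```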

(4) Common summary descriptive statistics

Method            Description
count             Number of non-NA values
describe          Summary statistics for a Series or for each DataFrame column
min, max          Minimum and maximum values
argmin, argmax    Integer index positions at which the minimum and maximum values occur
idxmin, idxmax    Index labels at which the minimum and maximum values occur
quantile          Sample quantile (from 0 to 1)
sum               Sum of values
mean              Mean of values
median            Median (50% quantile) of values
mad               Mean absolute deviation from the mean (removed in pandas 2.0)
var               Sample variance of values
std               Sample standard deviation of values
skew              Sample skewness (third moment) of values
kurt              Sample kurtosis (fourth moment) of values
cumsum            Cumulative sum of values
cummin, cummax    Cumulative minimum and cumulative maximum of values
cumprod           Cumulative product of values
diff              First difference of values (useful for time series)
pct_change        Percent change

2. Correlation coefficient and covariance
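A minimal sketch of the correlation and covariance methods, using a made-up frame in which y is exactly 2*x and z decreases as x increases. corr() and cov() work on a pair of Series or on a whole DataFrame; corrwith() correlates every column with one Series.

```python
import pandas as pd

df = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0],
    "y": [2.0, 4.0, 6.0, 8.0],
    "z": [4.0, 3.0, 2.0, 1.0],
})

r_xy = df["x"].corr(df["y"])       # Pearson correlation of two Series
cov_xy = df["x"].cov(df["y"])      # sample covariance of two Series
corr_matrix = df.corr()            # full correlation matrix
cov_matrix = df.cov()              # full covariance matrix
corr_with_x = df.corrwith(df["x"]) # each column against x
```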

3. Unique values, value counts, and membership

(1) Unique value (.unique())

The returned array is not sorted; values appear in order of first occurrence. If you need them sorted, sort the returned array afterwards, for example by calling its sort() method.
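A sketch with illustrative data. The built-in sorted() is used here because it works regardless of the array type that unique() returns; when the result is a NumPy array, calling its sort() method sorts it in place.

```python
import pandas as pd

obj = pd.Series(["c", "a", "d", "a", "a", "b", "b", "c", "c"])

uniques = obj.unique()             # order of first appearance
sorted_uniques = sorted(uniques)   # a sorted Python list
```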

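(2) Value counts (.value_counts() counts the number of occurrences of each value)

value_counts is also available as a top-level pandas function, pandas.value_counts(), which accepts any array or sequence (deprecated in recent pandas releases in favor of the Series method)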
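A sketch of value_counts() on illustrative data. The result is a Series indexed by the distinct values and sorted by descending frequency.

```python
import pandas as pd

obj = pd.Series(["c", "a", "d", "a", "a", "b", "b", "c", "c"])

counts = obj.value_counts()             # most frequent values first
shares = obj.value_counts(normalize=True)  # relative frequencies instead of counts
```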

(3) Membership (isin() performs a vectorized set-membership test)
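A sketch of isin() on illustrative data. It returns a boolean Series, which is typically used to filter the original data.

```python
import pandas as pd

obj = pd.Series(["c", "a", "d", "a", "a", "b", "b", "c", "c"])

mask = obj.isin(["b", "c"])   # True where the value is "b" or "c"
subset = obj[mask]            # keep only the matching entries
```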

(4) Summary of the three methods

Method          Description
isin            Returns a boolean array indicating whether each value of the Series is contained in the passed sequence of values
unique          Returns an array of the unique values in the Series, in order of first appearance
value_counts    Returns a Series whose index holds the unique values and whose values are their frequencies, sorted by frequency in descending order

2. Handling missing data

1. Understanding of missing values

pandas uses the floating-point value NaN (Not a Number) to represent missing data in both floating-point and non-floating-point arrays. Python's built-in None is also treated as missing.
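A short sketch of detecting missing values; the strings are illustrative. isna() (also available as isnull()) flags both NaN and None.

```python
import numpy as np
import pandas as pd

s = pd.Series(["aardvark", None, np.nan, "avocado"])

missing = s.isna()      # True for both None and NaN
present = s.notna()     # the complement
```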

2. How to deal with missing values

(1) Filter out missing data ( .dropna() )

On a Series, dropna returns a new Series containing only the non-null values and their index labels

1) Missing value filtering of Series
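A sketch with made-up values. dropna() on a Series gives the same result as boolean indexing with notna().

```python
import numpy as np
import pandas as pd

data = pd.Series([1.0, np.nan, 3.5, np.nan, 7.0])

cleaned = data.dropna()            # drops the NaN entries, keeps original labels
same_thing = data[data.notna()]    # equivalent boolean-mask form
```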

2) Missing value filtering of DataFrame

By default, dropna drops any row that contains at least one missing value

Passing how='all' drops only the rows in which every value is missing

Passing axis=1 applies the same rules to columns instead of rows
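The three DataFrame cases above can be sketched as follows. The frame and the column name "empty" are made up for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [[1.0, 6.5, 3.0],
     [1.0, np.nan, np.nan],
     [np.nan, np.nan, np.nan],
     [np.nan, 6.5, 3.0]],
    columns=["x", "y", "z"],
)

complete_rows = df.dropna()              # drop rows with any NaN
not_all_missing = df.dropna(how="all")   # drop rows that are entirely NaN

# Add an all-NaN column, then drop columns that are entirely NaN.
with_empty = df.assign(empty=np.nan)
cols_kept = with_empty.dropna(axis=1, how="all")
```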

(2) Fill in missing data ( .fillna() )

1) fillna() with a constant: every missing value is replaced by that constant

2) fillna() with a dictionary: each listed column is filled with its own value
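A sketch of filling with a constant and with a per-column dictionary; the data and fill values are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [np.nan, 5.0, np.nan]})

filled_const = df.fillna(0)                     # same value everywhere
filled_by_col = df.fillna({"a": 0.5, "b": -1})  # different value per column
```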

3) fillna() with the inplace keyword (the default is inplace=False, which returns a new object and leaves the original data unchanged; with inplace=True the original object is modified)
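A sketch contrasting the two modes on illustrative data. The in-place call is written as a bare statement because its result is the modified df itself, not a new object.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1.0, np.nan, 3.0]})

filled_copy = df.fillna(0)                   # inplace=False: df is untouched
still_missing = int(df["a"].isna().sum())    # df still has its NaN here

df.fillna(0, inplace=True)                   # now df itself is modified
```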

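4) fillna() with method='ffill' and limit: each missing value is filled from the previous valid value, and limit caps how many consecutive missing values are filled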
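A sketch of forward filling with a limit on illustrative data. It uses the ffill() method, which does the same thing as fillna(method='ffill'); the method keyword of fillna is deprecated in recent pandas releases, so ffill() is the safer spelling.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1.0, np.nan, np.nan, np.nan, 5.0]})

ffilled = df.ffill()                 # propagate the last valid value forward
ffilled_limited = df.ffill(limit=2)  # fill at most 2 consecutive NaN values
```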

5) fillna() with a computed statistic (for example, the mean or median of the data)
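A sketch of filling with the mean, on made-up data. Passing the Series returned by df.mean() fills each column with its own mean.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.5, np.nan, 7.0])
s_filled = s.fillna(s.mean())        # mean ignores NaN by default

df = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [np.nan, 4.0, 6.0]})
df_filled = df.fillna(df.mean())     # per-column means
```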

6) fillna function parameters

Parameter    Description
value        Scalar value or dictionary-like object used to fill missing values
method       Fill method: 'ffill' fills each missing value from the previous valid value, 'bfill' from the next one. The default is None, so either value or method must be given (deprecated in recent pandas in favor of the ffill() and bfill() methods)
axis         The axis to fill along; the default is axis=0
inplace      Modify the calling object in place instead of returning a new object
limit        For forward and backward filling, the maximum number of consecutive missing values to fill

3. Hierarchical Index

Hierarchical indexing is an important feature of pandas: it lets you have multiple (two or more) index levels on a single axis.

1. Relevant knowledge

(1) Series related examples

1) Create

2) Select values
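A sketch of creating a Series with a two-level index and selecting from it. The labels and values are illustrative; passing a list of two label lists as the index produces a MultiIndex.

```python
import numpy as np
import pandas as pd

data = pd.Series(
    np.arange(10.0),
    index=[
        ["a", "a", "a", "b", "b", "b", "c", "c", "d", "d"],
        [1, 2, 3, 1, 2, 3, 1, 2, 2, 3],
    ],
)

outer_b = data["b"]              # partial indexing on the outer level
outer_b_to_c = data.loc["b":"c"] # label slice on the outer level
inner_2 = data.loc[:, 2]         # select on the inner level
```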

3) Hierarchical index reshaping (stack(), unstack())
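A sketch of reshaping with unstack() and stack() on illustrative data. unstack() pivots the inner index level into columns, leaving NaN where a combination does not exist; stack() is the inverse. The dropna() call is added so the round trip gives the same values whether or not the installed pandas version keeps those NaN entries when stacking.

```python
import numpy as np
import pandas as pd

data = pd.Series(
    np.arange(10.0),
    index=[
        ["a", "a", "a", "b", "b", "b", "c", "c", "d", "d"],
        [1, 2, 3, 1, 2, 3, 1, 2, 2, 3],
    ],
)

wide = data.unstack()                 # rows a..d, columns 1..3
long_again = wide.stack().dropna()    # back to a two-level Series
```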

(2) DataFrame related examples [each axis can have a hierarchical index]

1) Create

2) Name the row and column levels, and select values by level name
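A sketch of a DataFrame with hierarchical indexes on both axes. The level names key1, key2, state, and color and all the labels are illustrative choices.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(
    np.arange(12).reshape(4, 3),
    index=[["a", "a", "b", "b"], [1, 2, 1, 2]],
    columns=[["Ohio", "Ohio", "Colorado"], ["Green", "Red", "Green"]],
)

# Name the levels on each axis.
frame.index.names = ["key1", "key2"]
frame.columns.names = ["state", "color"]

ohio = frame["Ohio"]                    # partial column indexing
key2_is_1 = frame.xs(1, level="key2")   # cross-section by level name
```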

2. Reorder and sort levels (swaplevel(), sort_index())

(1) swaplevel() [Takes two level numbers or names and returns a new object with those levels interchanged; the data itself is unchanged]
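A sketch of swaplevel() on an illustrative frame with row levels named key1 and key2.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(
    np.arange(12).reshape(4, 3),
    index=pd.MultiIndex.from_arrays(
        [["a", "a", "b", "b"], [1, 2, 1, 2]], names=["key1", "key2"]
    ),
    columns=["x", "y", "z"],
)

swapped = frame.swaplevel("key1", "key2")   # new object; frame is unchanged
```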

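(2) sort_index(level=...) [Sorts the data by the values in a single level (stable sort)]

Tip: older pandas releases used sortlevel() for this. It was deprecated and later removed in favor of sort_index(level=...); the change depends on the pandas version, not on Python 2 versus Python 3.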
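A sketch of sorting by one level, and of the common combination of swaplevel() followed by sort_index(). The frame is illustrative.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(
    np.arange(12).reshape(4, 3),
    index=pd.MultiIndex.from_arrays(
        [["a", "a", "b", "b"], [1, 2, 1, 2]], names=["key1", "key2"]
    ),
    columns=["x", "y", "z"],
)

by_key2 = frame.sort_index(level="key2")                  # sort rows by key2
swapped_sorted = frame.swaplevel(0, 1).sort_index(level=0)  # key2 outermost, sorted
```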

3. Summary statistics by level

Many descriptive and summary statistics on DataFrame and Series have a level option that specifies the level to aggregate by on a given axis. In pandas 2.0 and later this option was removed from the reduction methods; groupby(level=...) followed by the reduction gives the same result.

(1) Aggregate by a named level of the row index

When no axis is passed, the default axis=0 applies, so the aggregation runs over the row index. [Note: the axis must match the axis that holds the named level; key2 here is a level of the row index.]
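A sketch of summing by a row level. It uses groupby(level=...), which works in current pandas; older releases also accepted frame.sum(level="key2"). The frame and level names are illustrative.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(
    np.arange(12).reshape(4, 3),
    index=pd.MultiIndex.from_arrays(
        [["a", "a", "b", "b"], [1, 2, 1, 2]], names=["key1", "key2"]
    ),
    columns=["x", "y", "z"],
)

# Rows sharing the same key2 value are added together.
by_key2 = frame.groupby(level="key2").sum()
```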

(2) Aggregate by a named level of the column index
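A sketch of summing by a column level. Grouping is done on the transposed frame and the result is transposed back, which avoids the axis=1 argument of groupby that recent pandas releases deprecate. The frame and the level names state and color are illustrative.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(
    np.arange(12).reshape(4, 3),
    index=pd.MultiIndex.from_arrays(
        [["a", "a", "b", "b"], [1, 2, 1, 2]], names=["key1", "key2"]
    ),
    columns=pd.MultiIndex.from_arrays(
        [["Ohio", "Ohio", "Colorado"], ["Green", "Red", "Green"]],
        names=["state", "color"],
    ),
)

# Columns sharing the same color are added together, row by row.
by_color = frame.T.groupby(level="color").sum().T
```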

4. Use DataFrame columns as the index

(1) set_index() [Converts one or more columns into the row index and returns a new DataFrame]
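A sketch of set_index() on an illustrative frame. By default the columns moved into the index are removed from the frame; drop=False keeps them.

```python
import pandas as pd

frame = pd.DataFrame({
    "a": range(7),
    "b": range(7, 0, -1),
    "c": ["one", "one", "one", "two", "two", "two", "two"],
    "d": [0, 1, 2, 0, 1, 2, 3],
})

indexed = frame.set_index(["c", "d"])               # c and d become index levels
indexed_keep = frame.set_index(["c", "d"], drop=False)  # also keep them as columns
```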

(2) reset_index() [The inverse of set_index: the levels of the row index are moved back into columns]
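A sketch of reset_index() undoing a set_index() call on illustrative data. The restored columns are placed first, and the row index becomes a default integer range.

```python
import pandas as pd

frame = pd.DataFrame({
    "a": range(4),
    "c": ["one", "one", "two", "two"],
    "d": [0, 1, 0, 1],
})

indexed = frame.set_index(["c", "d"])
restored = indexed.reset_index()    # c and d are columns again
```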

 

References:

[1] McKinney, W. Python for Data Analysis [M]. Translated by Tang Xuetao et al. Beijing: China Machine Press, 2013.