pandas study notes 2
Note source: Python for Data Analysis [Wes McKinney, translated by Tang Xuetao et al.]
1. Aggregate and Calculate Descriptive Statistics
pandas objects come with a set of common mathematical and statistical methods. Most of them are reductions or summary statistics: they extract a single value (such as the sum or mean) from a Series, or a Series of values from the rows or columns of a DataFrame.
1. Related examples
(1) Common reduction methods (sum(), mean(), etc.)
Common options for reduction methods:
Option | Description |
axis | The axis to reduce over (0 for the rows of a DataFrame, 1 for the columns) |
skipna | Exclude missing values. The default is True |
level | If the axis is hierarchically indexed (i.e. a MultiIndex), reduce grouped by level (removed in pandas 2.0; use groupby(level=...) instead) |
Indirect statistics (idxmin(), idxmax()) return the index label at which the minimum or maximum occurs, not the value itself
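The options above can be tried on a small frame. This is a minimal sketch: the data and the variable names are made up for illustration and are not taken from the book.

```python
import numpy as np
import pandas as pd

# Illustrative data with a few missing values.
df = pd.DataFrame(
    [[1.4, np.nan], [7.1, -4.5], [np.nan, np.nan], [0.75, -1.3]],
    index=["a", "b", "c", "d"],
    columns=["one", "two"],
)

col_sums = df.sum()                         # reduce over rows: one value per column
row_sums = df.sum(axis=1)                   # reduce over columns: one value per row
row_means = df.mean(axis=1, skipna=False)   # NaN as soon as a row has a missing value
top_labels = df.idxmax()                    # index label of each column's maximum
```

With skipna left at its default of True, the missing entries are simply ignored by sum() and idxmax().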
(2) Cumulative methods [such as cumsum()]
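A short sketch of cumulative methods on an illustrative Series (the values are assumptions chosen for the example). Missing values stay missing in the result, while the accumulation continues past them.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 2.0, 4.0])

running_total = s.cumsum()   # 1.0, NaN, 3.0, 7.0
running_max = s.cummax()     # 1.0, NaN, 2.0, 4.0
```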
(3) Summary statistics [such as describe()]
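As a rough illustration, describe() produces several summary statistics in one call, and the set of statistics depends on the data type. The two Series below are made-up examples.

```python
import pandas as pd

numeric = pd.Series([1, 2, 3, 4, 5])
num_summary = numeric.describe()     # count, mean, std, min, quartiles, max

labels = pd.Series(["a", "a", "b", "c"])
label_summary = labels.describe()    # count, unique, top, freq
```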
(4) Common descriptive and summary statistics methods
Method | Description |
count | Count the number of non-NA values in the sample values |
describe | Calculate summary statistics for Series or individual DataFrame columns |
min, max | Calculate the minimum and maximum of the sample values |
argmin, argmax | Calculate the integer position at which the minimum or maximum occurs |
idxmin, idxmax | Calculate the index label at which the minimum or maximum occurs |
quantile | Calculate the sample quantile for a fraction between 0 and 1 |
sum | Calculate the sum of the sample values |
mean | Calculate the mean of the sample values |
median | Calculate the median (50% quantile) of sample values |
mad | Calculate the mean absolute deviation from the mean (removed in pandas 2.0) |
var | Calculate the variance of the sample values |
std | Calculate the standard deviation of the sample values |
skew | Calculate the skewness (third moment) of the sample values |
kurt | Calculate the kurtosis (fourth moment) of the sample values |
cumsum | Calculate the cumulative sum of sample values |
cummin, cummax | Calculate the cumulative minimum and cumulative maximum of the sample values |
cumprod | Calculate the cumulative product of the sample values |
diff | Calculate first difference of sample values (useful for time series) |
pct_change | Calculate percent change |
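A few of the methods in the table, applied to an illustrative Series (the values are assumptions for this sketch). Because mad() is not available in recent pandas, the mean absolute deviation is computed by hand here.

```python
import pandas as pd

s = pd.Series([2.0, 4.0, 4.0, 10.0])

variance = s.var()                           # sample variance (divides by n - 1)
std_dev = s.std()                            # square root of the variance
median = s.median()
upper_quartile = s.quantile(0.75)
mean_abs_dev = (s - s.mean()).abs().mean()   # manual replacement for mad()
first_diff = s.diff()                        # first element is NaN
pct = s.pct_change()                         # first element is NaN
running_product = s.cumprod()
```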
2. Correlation coefficient and covariance
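A minimal sketch of the correlation and covariance methods, using a made-up frame in which y rises exactly with x and z falls exactly with x. The column names are assumptions for the example.

```python
import pandas as pd

df = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0],
    "y": [2.0, 4.0, 6.0, 8.0],
    "z": [4.0, 3.0, 2.0, 1.0],
})

xy_corr = df["x"].corr(df["y"])       # correlation between two Series
xy_cov = df["x"].cov(df["y"])         # covariance between two Series
corr_matrix = df.corr()               # pairwise correlation matrix
cov_matrix = df.cov()                 # pairwise covariance matrix
corr_with_x = df.corrwith(df["x"])    # each column's correlation with x
```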
3. Unique values, value counts, and membership
(1) Unique value (.unique())
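The returned array is not sorted; values appear in the order in which they are first observed. If you need a sorted result, sort the returned array afterwards, for example by calling .sort() on it (which sorts in place)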
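A small sketch of unique() on an illustrative Series. The name uniques is an assumption for this example, and sorted() is used here because it returns a new sorted list instead of modifying the array.

```python
import pandas as pd

s = pd.Series(["c", "a", "d", "a", "a", "b", "b", "c", "c"])

uniques = s.unique()               # order of first appearance: c, a, d, b
sorted_uniques = sorted(uniques)   # a new, sorted list
```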
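(2) Value counts (.value_counts() counts how many times each value occurs)
value_counts() is also available as a top-level pandas function, pandas.value_counts(), which accepts any array or sequence. Recent pandas releases deprecate the top-level function in favor of the Series method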
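A brief sketch of value_counts() using the Series method, with made-up data. Passing sort=False is shown as an option for when the descending order is not wanted.

```python
import pandas as pd

s = pd.Series(["c", "a", "d", "a", "a", "b", "b", "c", "c"])

counts = s.value_counts()               # most frequent values first
unsorted = s.value_counts(sort=False)   # same counts, not ordered by frequency
```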
(3) Membership (isin() performs a vectorized set-membership check)
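A sketch of how isin() can be used to filter a Series down to a chosen set of values. The data and the set ["b", "c"] are illustrative.

```python
import pandas as pd

s = pd.Series(["c", "a", "d", "a", "a", "b", "b", "c", "c"])

mask = s.isin(["b", "c"])   # boolean Series, True where the value is b or c
subset = s[mask]            # keep only the matching entries
```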
(4) Summary of the three methods
Method | Description |
isin | Computes a boolean array indicating whether each value of the Series is contained in the passed sequence of values |
unique | Computes an array of the unique values in the Series, returned in the order observed |
value_counts | Returns a Series whose index is the unique values and whose values are their frequencies, sorted by frequency in descending order |
2. Handling missing data
1. Understanding of missing values
pandas uses the floating-point value NaN (Not a Number) to represent missing data in both floating-point and non-floating-point arrays
2. How to deal with missing values
(1) Filter out missing data ( .dropna() )
On a Series, dropna() returns a new Series containing only the non-null values and their index labels
1) Filtering missing values from a Series
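A minimal sketch with an illustrative Series. Boolean indexing with notnull() is shown as an equivalent way to get the same result.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.5, np.nan, 7.0])

cleaned = s.dropna()      # keeps index labels 0, 2, 4
same = s[s.notnull()]     # equivalent filter via a boolean mask
```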
2) Filtering missing values from a DataFrame
By default, dropna() drops every row that contains at least one missing value
Passing how='all' drops only the rows in which every value is missing
Passing axis=1 drops columns instead of rows
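The three variants can be compared on a small made-up frame. The extra all-NaN column is added only to give the axis=1 case something to drop.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([
    [1.0, 6.5, 3.0],
    [1.0, np.nan, np.nan],
    [np.nan, np.nan, np.nan],
    [np.nan, 6.5, 3.0],
])

complete_rows = df.dropna()              # only rows without any NaN
not_all_missing = df.dropna(how="all")   # drop rows that are entirely NaN

df[3] = np.nan                           # add a column that is entirely NaN
without_empty_cols = df.dropna(axis=1, how="all")   # drop that column again
```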
(2) Fill in missing data ( .fillna() )
1) Passing a constant to fillna() replaces every missing value with that constant
2) Passing a dictionary to fillna() fills each named column with its own value
3) Passing inplace=True modifies the original object (the default is inplace=False, which leaves the original data unchanged and returns a new object)
4) Passing method='ffill' together with limit fills forward from the previous valid value, for at most limit consecutive missing values
5) fillna() can also be given a computed statistic, such as the mean or median of the data
6) fillna function parameters
Parameter | Description |
value | Scalar value or dictionary-like object used to fill missing values |
method | Interpolation method, such as 'ffill' (forward fill: each missing value takes the previous valid value) or 'bfill' (backward fill). The default is None, in which case value must be given. Recent pandas releases deprecate this keyword in favor of the ffill() and bfill() methods |
axis | The axis to fill along; the default is axis=0 |
inplace | Modify the calling object in place instead of returning a new object; the default is False |
limit | For forward and backward filling, the maximum number of consecutive missing values to fill |
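The fill strategies listed above, sketched on a made-up frame. This sketch assumes a recent pandas and therefore uses ffill(limit=...) where the notes use fillna(method='ffill', limit=...); the column names and fill values are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": [1.0, np.nan, np.nan, np.nan],
    "b": [np.nan, 2.0, np.nan, 4.0],
})

filled_zero = df.fillna(0)                         # one constant everywhere
filled_by_col = df.fillna({"a": 0.5, "b": -1})     # a different value per column
forward = df.ffill(limit=1)                        # forward fill, at most 1 in a row
filled_mean = df.fillna(df.mean())                 # fill with each column's mean

in_place = df.copy()
result = in_place.fillna(0, inplace=True)          # modifies in_place, returns None
```

Note that the original df is untouched by every call except the one made with inplace=True, which is why that call is run on a copy here.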
3. Hierarchical Index
Hierarchical indexing is an important feature of pandas that lets an axis carry multiple (two or more) index levels.
1. Relevant knowledge
(1) Series related examples
1) Create
2) Selecting values
3) Hierarchical index reshaping (stack(), unstack())
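The three Series steps above can be sketched as follows. The labels and values are illustrative. Whether stack() drops the NaN placeholders created by unstack() depends on the pandas version, so the sketch calls dropna() explicitly.

```python
import pandas as pd

# 1) Create: a list of two arrays becomes a two-level MultiIndex.
s = pd.Series(
    range(9),
    index=[
        ["a", "a", "a", "b", "b", "c", "c", "d", "d"],
        [1, 2, 3, 1, 3, 1, 2, 2, 3],
    ],
)

# 2) Select: by outer label, by outer slice, or by inner label.
outer_b = s["b"]          # inner labels 1 and 3
outer_b_to_c = s["b":"c"]
inner_2 = s.loc[:, 2]     # entries whose inner label is 2

# 3) Reshape: inner level to columns and back again.
wide = s.unstack()                  # 4 x 3 DataFrame, NaN where a pair is absent
restacked = wide.stack().dropna()   # back to a Series with a two-level index
```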
(2) DataFrame related examples [each axis can have a hierarchical index]
1) Create
2) Name the row and column levels, then select values by those names
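A sketch of a DataFrame with hierarchical indexes on both axes. The level names key1, key2, state, and color and all labels are illustrative choices for this example.

```python
import numpy as np
import pandas as pd

# 1) Create: two levels on the rows and two levels on the columns.
frame = pd.DataFrame(
    np.arange(12).reshape(4, 3),
    index=[["a", "a", "b", "b"], [1, 2, 1, 2]],
    columns=[["Ohio", "Ohio", "Colorado"], ["Green", "Red", "Green"]],
)

# 2) Name the levels on each axis.
frame.index.names = ["key1", "key2"]
frame.columns.names = ["state", "color"]

ohio = frame["Ohio"]                          # select a group of columns
single = frame.loc[("b", 2), ("Ohio", "Red")] # one cell, addressed by full keys
key2_is_1 = frame.xs(1, level="key2")         # select rows by a named level
```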
2. Reordering and sorting levels (swaplevel(), sort_index())
(1) swaplevel() [takes two level numbers or names and returns a new object with those two levels interchanged; the data itself is unchanged]
(2) sort_index(level=...) [sorts the data using the values in a single level (stable)]
Tip: the book uses sortlevel(), which belongs to older pandas releases. It was deprecated in later versions of pandas (this is unrelated to Python 2 versus Python 3) and has been replaced by sort_index()
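A minimal sketch of both operations on a made-up frame; the level names key1 and key2 and the column name value are assumptions for the example.

```python
import pandas as pd

frame = pd.DataFrame(
    {"value": [0, 3, 6, 9]},
    index=pd.MultiIndex.from_arrays(
        [["a", "a", "b", "b"], [1, 2, 1, 2]], names=["key1", "key2"]
    ),
)

swapped = frame.swaplevel("key1", "key2")       # levels interchanged, rows unchanged
by_key2 = frame.sort_index(level="key2")        # rows reordered by the key2 level
swapped_sorted = swapped.sort_index(level=0)    # sort by the new outer level
```

Swapping and then sorting on the outer level is a common combination, since selection on a hierarchical index generally works best when the index is sorted from the outermost level inward.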
3. Summary statistics by level
Many descriptive and summary statistics on DataFrame and Series accept a level option that specifies the level to aggregate by on a given axis. pandas 2.0 removed this option; groupby(level=...) followed by the statistic gives the same result.
(1) Summarize by a named level of the row index
When no axis is passed, the default axis=0 applies, so values are aggregated down the rows within each column. [Note: the axis must be the one that holds the named level; key2 here is a level of the row index]
(2) Summarize by a named level of the column index (pass axis=1)
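A sketch of both cases using groupby(level=...), which also works on pandas versions where sum(level=...) is no longer available. For the column level, the frame is transposed, grouped, and transposed back; this is one possible approach, not the book's. All labels and level names are illustrative.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(
    np.arange(12).reshape(4, 3),
    index=pd.MultiIndex.from_arrays(
        [["a", "a", "b", "b"], [1, 2, 1, 2]], names=["key1", "key2"]
    ),
    columns=pd.MultiIndex.from_arrays(
        [["Ohio", "Ohio", "Colorado"], ["Green", "Red", "Green"]],
        names=["state", "color"],
    ),
)

# (1) Sum rows that share the same key2 label.
by_key2 = frame.groupby(level="key2").sum()

# (2) Sum columns that share the same color label.
by_color = frame.T.groupby(level="color").sum().T
```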
4. Using the columns of a DataFrame as the index
(1) set_index() [creates a new DataFrame that uses one or more of its columns as the row index]
(2) reset_index() [the opposite of set_index(): the levels of the row index are moved back into columns]
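A short sketch of the round trip between columns and index levels. The frame and its column names are made up for illustration.

```python
import pandas as pd

frame = pd.DataFrame({
    "a": range(7),
    "b": range(7, 0, -1),
    "c": ["one", "one", "one", "two", "two", "two", "two"],
    "d": [0, 1, 2, 0, 1, 2, 3],
})

indexed = frame.set_index(["c", "d"])                 # c and d become index levels
kept = frame.set_index(["c", "d"], drop=False)        # also keep them as columns
restored = indexed.reset_index()                      # index levels back to columns
```

By default set_index() removes the chosen columns from the frame; drop=False keeps them in place as well.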
References:
[1] McKinney, W. Python for Data Analysis [M]. Translated by Tang Xuetao et al. Beijing: China Machine Press, 2013.9.