pandas study notes 2

Posted on 2022-04-02 by Admin

Note source: Python for Data Analysis [Wes McKinney, translated by Tang Xuetao and others]

1. Aggregate and Calculate Descriptive Statistics

pandas objects come with a set of common mathematical and statistical methods. Most of them are reductions or summary statistics: they extract a single value (such as the sum or mean) from a Series, or a Series of values from the rows or columns of a DataFrame.

1. Related examples

(1) Common reduction methods (sum(), mean(), etc.)

Common options for reduction methods:

Option	Description
axis	The axis to reduce over: 0 for the rows of a DataFrame, 1 for the columns
skipna	Whether to exclude missing values. The default is True, so missing values are skipped
level	If the axis has a hierarchical index (i.e. a MultiIndex), reduce by grouping on this level
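As a minimal sketch of these options (the DataFrame and its values are made up for illustration), the same data can be reduced along either axis, with or without skipping missing values:

```python
import numpy as np
import pandas as pd

# Hypothetical sample data containing missing values
df = pd.DataFrame(
    [[1.4, np.nan], [7.1, -4.5], [np.nan, np.nan], [0.75, -1.3]],
    index=["a", "b", "c", "d"],
    columns=["one", "two"],
)

col_sums = df.sum()                        # default axis=0: one value per column
row_sums = df.sum(axis=1)                  # axis=1: one value per row
row_means = df.mean(axis=1, skipna=False)  # a NaN anywhere in the row gives NaN

print(col_sums)
print(row_sums)
print(row_means)
```

With the default skipna=True, a NaN is simply left out of the calculation; with skipna=False, any row or column that contains a NaN produces NaN.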

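Indirect statistics (idxmin(), idxmax()): these return the index label at which the minimum or maximum occurs, not the value itself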
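A small sketch with made-up values: idxmax() and idxmin() report where the extreme value sits in each column.

```python
import pandas as pd

# Hypothetical sample data
df = pd.DataFrame(
    {"one": [1.4, 7.1, 0.75], "two": [-4.5, -1.3, 2.0]},
    index=["a", "b", "c"],
)

max_labels = df.idxmax()  # index label of the largest value in each column
min_labels = df.idxmin()  # index label of the smallest value in each column

print(max_labels)
print(min_labels)
```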

(2) Cumulative type [such as cumsum()]
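For illustration (the values are made up), a cumulative method returns an object of the same length as its input instead of a single value:

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4], index=["a", "b", "c", "d"])

running_total = s.cumsum()  # each entry is the sum of all values up to that point
print(running_total)
```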

(3) Summary statistics [such as describe()]
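A minimal sketch on made-up numeric data: describe() produces several summary statistics in one call.

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0])

summary = s.describe()  # count, mean, std, min, quartiles, max
print(summary)
```

The result is itself a Series, so individual statistics can be looked up by label, for example summary["mean"].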

(4) Common summary descriptive statistics

Method	Description
count	Count the number of non-NA values
describe	Calculate summary statistics for a Series or for each DataFrame column
min, max	Calculate the minimum and maximum values
argmin, argmax	Calculate the integer index position at which the minimum or maximum value occurs
idxmin, idxmax	Calculate the index label at which the minimum or maximum value occurs
quantile	Calculate a sample quantile, given as a fraction from 0 to 1
sum	Calculate the sum of the values
mean	Calculate the mean of the values
median	Calculate the median (50% quantile) of the values
mad	Calculate the mean absolute deviation from the mean
var	Calculate the sample variance of the values
std	Calculate the sample standard deviation of the values
skew	Calculate the sample skewness (third moment) of the values
kurt	Calculate the sample kurtosis (fourth moment) of the values
cumsum	Calculate the cumulative sum of the values
cummin, cummax	Calculate the cumulative minimum and cumulative maximum of the values
cumprod	Calculate the cumulative product of the values
diff	Calculate the first difference of the values (useful for time series)
pct_change	Calculate the percent change

2. Correlation coefficient and covariance
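As a hedged sketch (the columns x, y and z are made up so that the results are easy to check by hand), corr() and cov() work on a pair of Series or on a whole DataFrame:

```python
import pandas as pd

# Hypothetical data: y rises exactly with x, z falls exactly with x
df = pd.DataFrame(
    {
        "x": [1.0, 2.0, 3.0, 4.0],
        "y": [2.0, 4.0, 6.0, 8.0],
        "z": [4.0, 3.0, 2.0, 1.0],
    }
)

r_xy = df["x"].corr(df["y"])   # correlation between two Series
r_xz = df["x"].corr(df["z"])
cov_xy = df["x"].cov(df["y"])  # sample covariance between two Series
corr_matrix = df.corr()        # pairwise correlation matrix of all columns

print(r_xy, r_xz, cov_xy)
print(corr_matrix)
```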

3. Unique values, value counts, and membership

(1) Unique value (.unique())

The returned array is not sorted. If you need it sorted, sort the result afterwards, for example by storing it as uniques and calling uniques.sort()
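A small sketch with made-up integers: unique() keeps the order of first appearance, and the result can be sorted afterwards. The sketch sorts a copy so that the unsorted result stays available for comparison.

```python
import pandas as pd

obj = pd.Series([3, 1, 4, 1, 1, 2, 2, 3, 3])

uniques = obj.unique()           # NumPy array in order of first appearance
sorted_uniques = uniques.copy()
sorted_uniques.sort()            # in-place sort of the copy

print(uniques)
print(sorted_uniques)
```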

(2) Value counts (.value_counts() counts the number of occurrences of each value)

value_counts is also available as a top-level pandas function, i.e. pandas.value_counts(), which accepts any array or sequence
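For illustration (the letters are made up), the Series method is enough to show the behaviour. The top-level function has been deprecated in recent pandas releases, so the sketch sticks to the method.

```python
import pandas as pd

obj = pd.Series(["c", "a", "d", "a", "a", "b", "b", "c", "c"])

counts = obj.value_counts()  # index: unique values, values: frequencies, most frequent first
print(counts)
```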

(3) Membership (isin() performs a vectorized set-membership check)

(4) Summary of the three methods
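A minimal sketch on made-up data: the boolean result of isin() is typically used to filter the Series.

```python
import pandas as pd

obj = pd.Series(["c", "a", "d", "a", "a", "b", "b", "c", "c"])

mask = obj.isin(["b", "c"])  # True where the value is "b" or "c"
subset = obj[mask]           # keep only the matching entries

print(mask)
print(subset)
```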

Method	Description
isin	Computes a boolean array indicating whether each value of the Series is contained in the passed sequence of values
unique	Computes the array of unique values in the Series, returned in order of first appearance
value_counts	Returns a Series whose index holds the unique values and whose values are their frequencies, sorted by frequency in descending order

2. Handling missing data

1. Understanding of missing values

pandas uses the floating-point value NaN (Not a Number) to represent missing data in both floating-point and non-floating-point arrays.
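A short sketch with made-up values: both None and np.nan are treated as missing, and isna() detects them regardless of the kind of data.

```python
import numpy as np
import pandas as pd

floats = pd.Series([1, None, 3])                         # None is stored as NaN
words = pd.Series(["aardvark", None, np.nan, "avocado"])

float_missing = floats.isna()
word_missing = words.isna()

print(floats)
print(word_missing)
```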

2. How to deal with missing values

(1) Filter out missing data ( .dropna() )

On a Series, dropna returns a new Series that contains only the non-null data and their index values

1) Missing value filtering of Series
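As a small illustration (made-up values), dropna() on a Series gives the same result as boolean indexing with notna():

```python
import numpy as np
import pandas as pd

data = pd.Series([1, np.nan, 3.5, np.nan, 7])

cleaned = data.dropna()      # drop the missing entries
same = data[data.notna()]    # equivalent boolean-indexing form

print(cleaned)
```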

2) Missing value filtering of DataFrame

By default, dropna drops every row that contains at least one missing value

Passing how='all' to dropna() drops only the rows in which every value is missing

Passing axis=1 drops columns instead of rows
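The three variants above can be sketched on one made-up DataFrame:

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(
    [
        [1.0, 6.5, 3.0],
        [1.0, np.nan, np.nan],
        [np.nan, np.nan, np.nan],
        [np.nan, 6.5, 3.0],
    ]
)

any_dropped = data.dropna()           # rows with at least one NaN are dropped
all_dropped = data.dropna(how="all")  # only the all-NaN row is dropped

data[3] = np.nan                      # add a column that is entirely missing
cols_dropped = data.dropna(axis=1, how="all")  # drop the all-NaN column

print(any_dropped)
print(all_dropped)
print(cols_dropped)
```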

(2) Fill in missing data ( .fillna() )

1) fillna() with a constant: every missing value is replaced by that constant

2) fillna() with a dictionary: each column named in the dictionary is filled with its own value
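A minimal sketch of both forms; the column names a and b and the fill values are made up:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [np.nan, 5.0, np.nan]})

filled_const = df.fillna(0)                     # same constant everywhere
filled_by_col = df.fillna({"a": 0.5, "b": -1})  # a different value per column

print(filled_const)
print(filled_by_col)
```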

3) fillna() with the inplace keyword (the default is inplace=False, which returns a new object and leaves the original data unchanged; with inplace=True the original data is modified)
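The difference can be checked on a small made-up DataFrame by counting the missing values before and after each call:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1.0, np.nan, 3.0]})

new_df = df.fillna(0)                      # default: df itself is untouched
missing_before = int(df["a"].isna().sum())

df.fillna(0, inplace=True)                 # df itself is modified
missing_after = int(df["a"].isna().sum())

print(missing_before, missing_after)
```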

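4) fillna() with method='ffill' and limit: missing values are filled forward from the last valid value, and limit caps how many consecutive missing values are filled

5) fillna() with a computed statistic, such as the mean or median of the data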
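A hedged sketch of forward filling on made-up data. It uses the ffill() method, which performs the same forward fill as fillna(method='ffill') and avoids the method keyword that newer pandas releases have deprecated.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1.0, np.nan, np.nan, np.nan, 5.0]})

filled = df.ffill()                 # carry the last valid value forward
filled_limited = df.ffill(limit=2)  # fill at most 2 consecutive missing values

print(filled)
print(filled_limited)
```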
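For example (made-up values), the mean of the non-missing data can be passed directly as the fill value:

```python
import numpy as np
import pandas as pd

data = pd.Series([1.0, np.nan, 3.5, np.nan, 7.0])

filled = data.fillna(data.mean())  # mean() skips NaN by default
print(filled)
```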

6) fillna function parameters

Parameter	Description
value	Scalar value or dictionary-like object used to fill the missing values
method	Interpolation method, such as 'ffill' (forward fill: each missing value is replaced by the previous valid value) or 'bfill' (backward fill)
axis	The axis to fill along; the default is axis=0
inplace	Modify the calling object in place instead of returning a new object
limit	For forward and backward filling, the maximum number of consecutive missing values to fill

3. Hierarchical Index

Hierarchical indexing is an important feature of pandas that lets you have multiple (two or more) index levels on an axis.

1. Relevant knowledge

(1) Series related examples

1) Create
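A minimal sketch (the labels and values are made up): passing a list of two label arrays as the index creates a Series with a two-level MultiIndex.

```python
import numpy as np
import pandas as pd

data = pd.Series(
    np.arange(9.0),
    index=[
        ["a", "a", "a", "b", "b", "c", "c", "d", "d"],  # outer level
        [1, 2, 3, 1, 3, 1, 2, 2, 3],                    # inner level
    ],
)

print(data)
print(data.index)
```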

2) Selecting values
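A sketch of selection on a hierarchically indexed Series, using made-up data: the outer level can be selected by label or by slice, and the inner level by passing a second key to loc.

```python
import numpy as np
import pandas as pd

data = pd.Series(
    np.arange(9.0),
    index=[
        ["a", "a", "a", "b", "b", "c", "c", "d", "d"],
        [1, 2, 3, 1, 3, 1, 2, 2, 3],
    ],
)

outer = data.loc["b"]      # all entries under outer label "b"
span = data.loc["b":"c"]   # slice of outer labels, both ends included
inner = data.loc[:, 2]     # entries whose inner label is 2

print(outer)
print(span)
print(inner)
```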

3) Hierarchical index reshaping (stack(), unstack())
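As an illustration on made-up data, unstack() moves the inner index level into the columns and stack() moves it back. The dropna() call is an addition in this sketch: it keeps the round trip identical whether or not the installed pandas version drops the missing combinations during stack().

```python
import numpy as np
import pandas as pd

data = pd.Series(
    np.arange(9.0),
    index=[
        ["a", "a", "a", "b", "b", "c", "c", "d", "d"],
        [1, 2, 3, 1, 3, 1, 2, 2, 3],
    ],
)

wide = data.unstack()         # DataFrame: outer level as rows, inner level as columns
back = wide.stack().dropna()  # back to a Series with a two-level index

print(wide)
print(back)
```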

(2) DataFrame related examples [each axis can have a hierarchical index]

1) Create

2) Name the row and column index levels and select values by label
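A minimal sketch; the labels, the level names (key1, key2, state, color) and the values are all made up for illustration:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(
    np.arange(12).reshape((4, 3)),
    index=[["a", "a", "b", "b"], [1, 2, 1, 2]],
    columns=[["Ohio", "Ohio", "Colorado"], ["Green", "Red", "Green"]],
)

# Give each level a name
frame.index.names = ["key1", "key2"]
frame.columns.names = ["state", "color"]

ohio = frame["Ohio"]                          # partial selection on the outer column level
cell = frame.loc[("a", 2), ("Ohio", "Red")]   # one value addressed by full row and column keys

print(frame)
print(ohio)
```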

2. Reordering and sorting levels (swaplevel(), sort_index())

(1) swaplevel() [Takes two level numbers or names and returns a new object with those levels swapped; the data itself is unchanged]

(2) sort_index() [With the level argument, sorts the data by the values in a single level; the sort is stable]
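A short sketch on a made-up DataFrame whose row levels are named key1 and key2:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(
    np.arange(12).reshape((4, 3)),
    index=[["a", "a", "b", "b"], [1, 2, 1, 2]],
    columns=["x", "y", "z"],
)
frame.index.names = ["key1", "key2"]

swapped = frame.swaplevel("key1", "key2")  # levels change places, rows keep their order

print(swapped)
```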

Tip: sortlevel comes from older pandas versions. It has since been deprecated and replaced by sort_index(level=...)
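A sketch of level-based sorting with sort_index(); the DataFrame and the level names key1 and key2 are made up:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(
    np.arange(12).reshape((4, 3)),
    index=[["a", "a", "b", "b"], [1, 2, 1, 2]],
    columns=["x", "y", "z"],
)
frame.index.names = ["key1", "key2"]

by_key2 = frame.sort_index(level="key2")  # order rows by the inner level

# Swapping first and then sorting on the new outermost level
swapped_sorted = frame.swaplevel("key1", "key2").sort_index(level=0)

print(by_key2)
print(swapped_sorted)
```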

3. Summary statistics by level

Many descriptive and summary statistics on DataFrame and Series have a level option that specifies the level of an axis to aggregate by.

(1) Perform summary calculation based on the specified row index name

No axis argument is passed here, so the default axis=0 applies and the aggregation runs over the row index [Note: the axis must match the axis that the given level belongs to; key2, passed here, is a level of the row index]
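A hedged sketch on made-up data. The notes pass level to the reduction itself (for example frame.sum(level="key2")); recent pandas releases have removed that option, so the sketch uses groupby(level=...), which produces the same aggregation.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(
    np.arange(12).reshape((4, 3)),
    index=[["a", "a", "b", "b"], [1, 2, 1, 2]],
    columns=["x", "y", "z"],
)
frame.index.names = ["key1", "key2"]

by_key2 = frame.groupby(level="key2").sum()  # one row per distinct key2 value

print(by_key2)
```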

(2) Perform summary calculation based on the specified column index name
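For a column level, one possible sketch (made-up data and level names) transposes the DataFrame, groups on the level, and transposes back. This is a workaround chosen for this example because it avoids both the removed level option and the axis argument of groupby.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(
    np.arange(12).reshape((4, 3)),
    index=[["a", "a", "b", "b"], [1, 2, 1, 2]],
    columns=[["Ohio", "Ohio", "Colorado"], ["Green", "Red", "Green"]],
)
frame.index.names = ["key1", "key2"]
frame.columns.names = ["state", "color"]

# Sum the columns that share the same "color" label
by_color = frame.T.groupby(level="color").sum().T

print(by_color)
```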

4. Using a DataFrame's columns as the index

(1) set_index() [Creates a new DataFrame that uses one or more of its columns as the row index]

(2) reset_index() [The opposite of set_index: the levels of the row index are moved back into columns]
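A minimal sketch with made-up columns a, b, c and d: by default the columns used for the index are removed from the DataFrame, and drop=False keeps them.

```python
import pandas as pd

frame = pd.DataFrame(
    {
        "a": range(7),
        "b": range(7, 0, -1),
        "c": ["one", "one", "one", "two", "two", "two", "two"],
        "d": [0, 1, 2, 0, 1, 2, 3],
    }
)

frame2 = frame.set_index(["c", "d"])            # c and d become the row index
kept = frame.set_index(["c", "d"], drop=False)  # c and d stay as columns too

print(frame2)
```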
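Continuing with the same kind of made-up data, reset_index() turns the index levels back into ordinary columns:

```python
import pandas as pd

frame = pd.DataFrame(
    {
        "a": range(7),
        "b": range(7, 0, -1),
        "c": ["one", "one", "one", "two", "two", "two", "two"],
        "d": [0, 1, 2, 0, 1, 2, 3],
    }
)
frame2 = frame.set_index(["c", "d"])

restored = frame2.reset_index()  # c and d return as the first columns

print(restored)
```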

References:

[1] McKinney, W. Python for Data Analysis [M]. Translated by Tang Xuetao et al. Beijing: Machinery Industry Press, 2013.9.
