sklearn, NumPy and Pandas

Posted on 2022-03-28 by Admin

  Many pandas data operations, such as where, drop and dropna, take an inplace parameter, meaning "in place". With inplace=True the operation modifies the object itself and returns None. With inplace=False (the default) the original is left untouched and the operation is applied to a copy, which is returned. So if you want drop to change the original DataFrame, either pass inplace=True or assign the returned copy back.
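  To make the difference concrete, here is a minimal sketch using dropna. The DataFrame df and its values are made up for illustration only.

```python
import pandas as pd

# Toy data (hypothetical): one missing value in column "a"
df = pd.DataFrame({"a": [1.0, None, 3.0], "b": [4.0, 5.0, 6.0]})

# inplace=False (default): a cleaned copy is returned, df keeps all rows
cleaned = df.dropna()
rows_before = len(df)

# inplace=True: df itself is modified and the call returns None
result = df.dropna(inplace=True)
rows_after = len(df)

print(len(cleaned), rows_before, result, rows_after)
```

  A common mistake is writing df = df.dropna(inplace=True), which rebinds df to None.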

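  corr on a DataFrame computes the pairwise (linear) correlation between columns; by default it is the Pearson coefficient. What is correlation? It measures whether an increase in variable A tends to go together with an increase (positive correlation) or a decrease (negative correlation) in variable B.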
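  A small sketch of what corr reports, assuming toy columns named x, up and down (these names and values are invented for the example): a column that rises linearly with x has correlation 1, and one that falls linearly has correlation -1.

```python
import pandas as pd

df = pd.DataFrame({
    "x": [1, 2, 3, 4, 5],
    "up": [2, 4, 6, 8, 10],     # rises exactly linearly with x
    "down": [10, 8, 6, 4, 2],   # falls exactly linearly with x
})

corr_matrix = df.corr()            # Pearson correlation by default
x_up = corr_matrix.loc["x", "up"]
x_down = corr_matrix.loc["x", "down"]
print(corr_matrix)
```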

  When a DataFrame is aggregated (for example, when calculating the mean), the aggregation can run along either of two dimensions: axis=0 is vertical (down the rows) and axis=1 is horizontal (across the columns). For data with m rows and n columns, df.mean(axis=0) aggregates vertically and returns n values, one per column; df.mean(axis=1) aggregates horizontally and returns m values, one per row.
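  The two axes can be checked on a tiny table. This sketch assumes a made-up DataFrame with m=2 rows and n=3 columns.

```python
import pandas as pd

# m = 2 rows, n = 3 columns (toy values)
df = pd.DataFrame([[1, 2, 3], [4, 5, 6]], columns=["a", "b", "c"])

col_means = df.mean(axis=0)  # vertical: one mean per column (n values)
row_means = df.mean(axis=1)  # horizontal: one mean per row (m values)

print(col_means)
print(row_means)
```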

    num_attribs = list(housing_num)

  Calling list on a DataFrame iterates over its column labels, so num_attribs holds the column names of housing_num.
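  A quick check of this behaviour, assuming a stand-in housing_num with two invented columns (the real data set is not shown in this post):

```python
import pandas as pd

# Stand-in for housing_num; column names and values are hypothetical
housing_num = pd.DataFrame({"longitude": [-122.2], "latitude": [37.9]})

num_attribs = list(housing_num)           # iterating a DataFrame yields column labels
same_as_columns = list(housing_num.columns)
print(num_attribs)
```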

    TypeError                                 Traceback (most recent call last)
    <ipython-input-178-70bca072c4f7> in <module>()
          1 from sklearn.model_selection import GridSearchCV
          2 param_grid={
    ----> 3 {'n_estimators':[3, 10, 30], 'max_features':[2,4,6,8]},{'bootstrap':[False], 'n_estimators':[3,10], 'max_features':[2,3,4]}
          4 }
          5

    TypeError: unhashable type: 'dict'

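  This happens because param_grid must be a list of dictionaries, written with [], not {}. Wrapping the two dictionaries in {} creates a set literal, and a dict cannot be a set element because it is unhashable, so the TypeError is thrown.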
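  A sketch of the corrected grid follows. The estimator is an assumption on my part (RandomForestRegressor is used only because the parameter names fit it; the post does not say which model was searched), and ParameterGrid is used just to count the combinations.

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, ParameterGrid

# A set literal of dicts fails, because dicts are unhashable
try:
    bad_grid = {{"n_estimators": [3, 10]}, {"bootstrap": [False]}}
except TypeError as exc:
    error_message = str(exc)

# A list of dicts is what GridSearchCV expects
param_grid = [
    {"n_estimators": [3, 10, 30], "max_features": [2, 4, 6, 8]},
    {"bootstrap": [False], "n_estimators": [3, 10], "max_features": [2, 3, 4]},
]

# 3*4 combinations from the first dict plus 1*2*3 from the second
n_candidates = len(ParameterGrid(param_grid))

# Assumed estimator; nothing is fitted until grid_search.fit(X, y) is called
grid_search = GridSearchCV(RandomForestRegressor(random_state=42), param_grid, cv=5)
print(n_candidates)
```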

  About zip in Python

    a = [1, 2, 3]
    b = [4, 5, 6]
    c = [4, 5, 6, 7, 8]
    zipped = list(zip(a, b))  # pairs elements into tuples; in Python 3 zip returns an iterator, so wrap it in list()

  >> [(1, 4), (2, 5), (3, 6)]

    list(zip(a, c))  # the number of tuples matches the shortest input

  >> [(1, 4), (2, 5), (3, 6)]

    list(zip(*zipped))  # the inverse of zip: "unzips" the pairs into one tuple per original sequence

  >> [(1, 2, 3), (4, 5, 6)]

  drop on a DataFrame does not change the original data set (inplace=False by default); it returns a new data set with the operation applied. In the following example, strat_train_set still contains the median_house_value column afterwards, while housing does not: housing is a copy of strat_train_set with that column removed, and the drop only shows up in this copy.

    housing = strat_train_set.drop("median_house_value", axis=1)  # predictors only; strat_train_set is unchanged
    housing_labels = strat_train_set["median_house_value"].copy()  # labels kept as a separate copy
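  To verify that the original keeps its column, here is a self-contained sketch. The three rows in strat_train_set and the median_income column are invented stand-ins for the real training set.

```python
import pandas as pd

# Stand-in training set with made-up values
strat_train_set = pd.DataFrame({
    "median_income": [8.3, 7.2, 5.6],
    "median_house_value": [452600.0, 358500.0, 352100.0],
})

housing = strat_train_set.drop("median_house_value", axis=1)
housing_labels = strat_train_set["median_house_value"].copy()

original_columns = list(strat_train_set.columns)  # label column is still there
housing_columns = list(housing.columns)           # label column is gone
print(original_columns, housing_columns)
```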

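  Note the arguments of where on a DataFrame or Series. The first is the predicate (the condition). The second is the replacement value, used wherever the predicate is not satisfied (the condition evaluates to False). The third argument, inplace, controls whether the current data set is overwritten: with True the current data set is modified; with False it is left unchanged, and a copy is created and modified instead, so you need to capture the return value. Because the example below uses inplace=True, the modification happens on the current data set and there is no return value to capture.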

    housing["income_cat"].where(housing["income_cat"] > 5, 5.0, inplace=True) 
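  The behaviour of where can be tried on a standalone Series. This is a sketch with invented values; a plain Series is used instead of a DataFrame column so that the in-place call clearly acts on the object it is called on.

```python
import pandas as pd

income_cat = pd.Series([1.0, 3.0, 6.0, 8.0])  # toy values

# inplace=False (default): entries where the condition is False become 5.0 in a copy
replaced = income_cat.where(income_cat > 5, 5.0)
unchanged = list(income_cat)   # the original is still [1.0, 3.0, 6.0, 8.0]

# inplace=True: the Series itself is modified and None is returned
result = income_cat.where(income_cat > 5, 5.0, inplace=True)
print(list(replaced), unchanged, result, list(income_cat))
```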

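  StratifiedShuffleSplit is a combination of StratifiedKFold and ShuffleSplit, and is a cross-validation splitter. Look at the code below and note that train_index and test_index are arrays of row indices, not single values. This differs from the element-by-element loop one would write in Java: in Python all the indices arrive at once (train_index and test_index are both arrays), and the whole array is handed to housing.loc, which selects the matching rows and returns them as strat_train_set and strat_test_set.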

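    from sklearn.model_selection import StratifiedShuffleSplit

    split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
    for train_index, test_index in split.split(housing, housing["income_cat"]):
        print(str(train_index) + ";" + str(test_index) + "\n")  # each is a NumPy array of row positions
        # .loc matches these positions only when housing has the default 0..n-1 index
        strat_train_set = housing.loc[train_index]
        strat_test_set = housing.loc[test_index]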
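  A runnable sketch of the same split on toy data follows. The 20-row housing frame is invented, and iloc is used here instead of loc because split yields row positions; with the default index the two select the same rows.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit

# Toy data (hypothetical): two income categories with 10 rows each
housing = pd.DataFrame({
    "median_income": np.arange(20, dtype=float),
    "income_cat": [1.0] * 10 + [2.0] * 10,
})

split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_index, test_index in split.split(housing, housing["income_cat"]):
    # Both index objects are whole NumPy arrays, passed to iloc in one go
    strat_train_set = housing.iloc[train_index]
    strat_test_set = housing.iloc[test_index]

# Stratification keeps the 50/50 category ratio in the test set
test_counts = strat_test_set["income_cat"].value_counts().to_dict()
print(len(strat_train_set), len(strat_test_set), test_counts)
```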
