Feature Tools
===============

This part of the library is designed for feature selection.

quest_selection
_________________

It follows these steps:

1. Calculate the accuracy without feature selection, if it was not supplied.
2. For each feature in turn, ignore that feature and calculate the new accuracy. If the new accuracy is greater than or equal to the general accuracy, or the difference between them is smaller than or equal to flag_one_tol, carry the feature over to the next step.
3. Create random combinations of these features and calculate the new accuracy for each. If a combination gives an accuracy that is greater than or equal to the general accuracy, or their difference is smaller than or equal to fin_tol, print it to the console.

============ ====================== =============
Parameters   Datatype               Default Value
============ ====================== =============
model_class  any AI model class     -
X_train      multidimensional array -
y_train      1D array               -
X_test       multidimensional array -
y_test       1D array               -
features     list                   -
flag_one_tol float                  -
fin_tol      float                  -
params       dict                   None
normal_acc   float                  None
trials       integer                100
============ ====================== =============

.. note::
   features must contain the names of the columns in the X arrays.

.. note::
   params holds the model's parameters, so the model does not have to be created with its default settings.

.. note::
   normal_acc is the accuracy score for the dataset without feature selection.

.. attention::
   trials sets how many times the third step is repeated.

.. note::
   This function only prints to the console; it returns nothing.

list_deletings
_______________

It analyses the dataframe and shows or deletes columns that the algorithm finds improper.


================= ================== =============
Parameters        Datatype           Default Value
================= ================== =============
df                pandas dataframe   -
extra             list               None
del_null          boolean            True
null_tolerance    integer or float   20
del_single        boolean            True
del_almost_single boolean            False
almost_tolerance  integer or float   50
suggest_extra     boolean            True
return_extra      boolean            False
unique_tolerance  integer or boolean 10
================= ================== =============

.. note::
   Columns whose names are in the extra list are deleted directly.

.. note::
   If del_null is true, columns whose null values exceed null_tolerance% of the total number of samples are deleted.

.. note::
   If del_single is true, columns that contain only one distinct value are deleted.

.. note::
   If del_almost_single is true, columns in which a single value makes up more than almost_tolerance% of the total number of samples are deleted.

.. note::
   If suggest_extra is true, string columns whose unique values exceed unique_tolerance% of the total number of samples are printed to the console. The list of these columns is also returned if return_extra is true.

==================== ======= ================ ====================
Priority (in return) Returns Datatype         Condition
==================== ======= ================ ====================
1                    df      pandas dataframe always
2                    columns list             return_extra is True
==================== ======= ================ ====================

multi_split
____________

It creates different arrays for train-test splitting, one for each requested threshold value.

============= ================ =============
Parameters    Datatype         Default Value
============= ================ =============
df            pandas dataframe -
test_size     float            -
output        string           -
threshold_set list             -
============= ================ =============

.. note::
   test_size must be between 0 and 1.

.. attention::
   The first multidimensional array in each list is always created without any selection.

.. attention::
   The output arrays are created in the same order as the values in threshold_set.

.. attention::
   The output arrays (y_train and y_test) are the same for every output set, so that the success rates of different approaches can be compared correctly.

==================== ======== ======== =========
Priority (in return) Returns  Datatype Condition
==================== ======== ======== =========
1                    X_trains list     always
2                    X_tests  list     always
3                    y_train  1D array always
4                    y_test   1D array always
==================== ======== ======== =========

rand_arr
__________

It creates a one-dimensional random array according to a strategy.

============ ========== ===============
Parameters   Datatype   Default Value
============ ========== ===============
outputs      list       -
values       int list   None
strategy     string     equal
arr_size     int        1
============ ========== ===============

.. note::
   There are three defined strategies: 'equal', 'weighted' and 'piled'. In 'equal', each output has the same probability. In 'weighted', the ith output has probability values[i] / sum(values). 'piled' differs slightly: the ith output has probability (values[i] - values[i-1]) / values[-1], where values[i-1] is taken as 0 when i is zero. Because of this behavior, the values list must be sorted for 'piled'.

==================== ======= ======== =========
Priority (in return) Returns Datatype Condition
==================== ======= ======== =========
1                    column  1D Array always
==================== ======= ======== =========
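
The probabilities behind the three rand_arr strategies can be illustrated with a minimal sketch. The helper name strategy_probabilities is hypothetical and not part of the library, and the sampling uses Python's standard random.choices rather than whatever rand_arr uses internally. It only mirrors the formulas described in the note above.

```python
import random


def strategy_probabilities(outputs, values=None, strategy="equal"):
    """Hypothetical helper: probability of each output under a strategy."""
    n = len(outputs)
    if strategy == "equal":
        # Every output is equally likely.
        return [1 / n] * n
    if strategy == "weighted":
        # The ith output has probability values[i] / sum(values).
        total = sum(values)
        return [v / total for v in values]
    if strategy == "piled":
        # values act as sorted cumulative thresholds: the ith output gets
        # (values[i] - values[i-1]) / values[-1], with 0 before the first.
        previous = [0] + list(values[:-1])
        return [(v - p) / values[-1] for v, p in zip(values, previous)]
    raise ValueError(f"unknown strategy: {strategy!r}")


outputs = ["a", "b", "c"]
weighted = strategy_probabilities(outputs, [1, 1, 2], "weighted")
piled = strategy_probabilities(outputs, [1, 2, 4], "piled")

# Draw a sample of size 5 (an illustration of arr_size=5), not rand_arr itself.
sample = random.choices(outputs, weights=piled, k=5)
```

With these inputs both strategies give the same distribution: the weights [1, 1, 2] and the sorted cumulative thresholds [1, 2, 4] describe identical probabilities, which shows how 'piled' relates to 'weighted'.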