Feature Tools
This part of the library is designed for feature selection.
quest_selection
It follows these steps:
1. Calculate the accuracy without feature selection, if it has not already been calculated.
2. For each feature in turn, ignore that feature and calculate the new accuracy. If the new accuracy is greater than or equal to the general accuracy, or the difference between the two is at most flag_one_tol, carry the feature forward to the next step.
3. Create random combinations of these features and calculate the new accuracy for each combination. If a combination gives an accuracy that is greater than or equal to the general accuracy, or the difference between the two is at most fin_tol, print it to the console.
| Parameters | Datatype | Default Value |
|---|---|---|
| model_class | any AI model class | |
| X_train | multidimensional array | |
| y_train | 1D array | |
| X_test | multidimensional array | |
| y_test | 1D array | |
| features | list | |
| flag_one_tol | float | |
| fin_tol | float | |
| params | dict | None |
| normal_acc | float | None |
| trials | integer | 100 |
Note
features must contain the names of the columns in the X arrays.
Note
params holds the arguments passed to the model, so the model does not have to be created with its default settings.
Note
normal_acc is the accuracy score for the dataset without feature selection.
Attention
trials sets how many times the third step is repeated.
Note
This function only prints to the console; it does not return anything.
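The three steps of quest_selection can be illustrated with a minimal sketch. This is not the library's implementation: `quest_selection_sketch` and `score_fn` are hypothetical names, `score_fn` stands in for training model_class on the kept features and scoring it on the test set, and the sketch assumes that a "combination" in the third step is a set of features to ignore. Unlike the real function, it returns its findings instead of printing them so the result can be inspected.

```python
import random


def quest_selection_sketch(score_fn, features, flag_one_tol, fin_tol,
                           normal_acc=None, trials=100, seed=0):
    """Illustrative only. score_fn(kept_features) -> accuracy (hypothetical)."""
    rng = random.Random(seed)

    # Step 1: baseline accuracy with every feature, unless it was supplied.
    if normal_acc is None:
        normal_acc = score_fn(list(features))

    # Step 2: ignore one feature at a time. For a non-negative tolerance,
    # "acc >= normal_acc or the difference is within the tolerance" reduces
    # to the single comparison below.
    candidates = []
    for feature in features:
        kept = [f for f in features if f != feature]
        if normal_acc - score_fn(kept) <= flag_one_tol:
            candidates.append(feature)

    # Step 3: ignore random combinations of the candidates (assumed meaning).
    found = []
    for _ in range(trials):
        if not candidates:
            break
        size = rng.randint(1, len(candidates))
        dropped = sorted(rng.sample(candidates, size))
        kept = [f for f in features if f not in dropped]
        if not kept:
            continue  # nothing left to train on
        acc = score_fn(kept)
        if normal_acc - acc <= fin_tol:
            found.append((tuple(dropped), acc))
    return normal_acc, candidates, found


# Toy scorer: only features "a" and "b" matter for accuracy.
def toy_score(kept):
    return 0.9 if {"a", "b"} <= set(kept) else 0.6


normal_acc, candidates, found = quest_selection_sketch(
    toy_score, ["a", "b", "c", "d"], flag_one_tol=0.01, fin_tol=0.01, trials=20
)
print(candidates)
```

In the toy run, only "c" and "d" survive the second step, because ignoring "a" or "b" lowers the accuracy by more than flag_one_tol.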
list_deletings
It analyses the dataframe and shows or deletes the columns that the algorithm finds unsuitable.
| Parameters | Datatype | Default Value |
|---|---|---|
| df | pandas dataframe | |
| extra | list | None |
| del_null | boolean | True |
| null_tolerance | integer or float | 20 |
| del_single | boolean | True |
| del_almost_single | boolean | False |
| almost_tolerance | integer or float | 50 |
| suggest_extra | boolean | True |
| return_extra | boolean | False |
| unique_tolerance | integer or boolean | 10 |
Note
Columns whose names are listed in extra are deleted directly.
Note
If del_null is true, columns in which null values make up more than null_tolerance% of the total number of samples are deleted.
Note
If del_single is true, columns that contain only one distinct value are deleted.
Note
If del_almost_single is true, columns in which a single value makes up more than almost_tolerance% of the total number of samples are deleted.
Note
If suggest_extra is true, string columns whose number of unique values exceeds unique_tolerance% of the total number of samples are printed to the console. The list of these columns is also returned if return_extra is true.
| Priority (in return) | Returns | Datatype | Condition |
|---|---|---|---|
| 1 | df | pandas dataframe | always |
| 2 | columns | list | return_extra is True |
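The percentage thresholds above may be easier to follow as code. The sketch below is not list_deletings itself: `flag_columns` is a hypothetical helper that only reports which columns each rule would flag, and it assumes a non-empty dataframe and that null counts as a value when checking for single or dominant values.

```python
import pandas as pd


def flag_columns(df, null_tolerance=20, almost_tolerance=50, unique_tolerance=10):
    """Hypothetical helper: list the columns each rule would flag.

    All tolerances are percentages of the number of rows.
    """
    n = len(df)  # assumed to be greater than zero
    flagged = {"null": [], "single": [], "almost_single": [], "extra": []}
    for col in df.columns:
        s = df[col]
        # del_null rule: share of null values above null_tolerance percent.
        if s.isna().sum() / n * 100 > null_tolerance:
            flagged["null"].append(col)
        # del_single rule: only one distinct value (null counted as a value).
        if s.nunique(dropna=False) == 1:
            flagged["single"].append(col)
        # del_almost_single rule: the most frequent value dominates the column.
        top_share = s.value_counts(dropna=False).iloc[0] / n * 100
        if top_share > almost_tolerance:
            flagged["almost_single"].append(col)
        # suggest_extra rule: text columns with too many unique values.
        is_text = s.dtype == object or pd.api.types.is_string_dtype(s)
        if is_text and s.nunique() / n * 100 > unique_tolerance:
            flagged["extra"].append(col)
    return flagged


df = pd.DataFrame({
    "id": ["a", "b", "c", "d", "e"],
    "const": [1, 1, 1, 1, 1],
    "mostly": [0, 0, 0, 0, 1],
    "sparse": [None, None, 3.0, 4.0, 5.0],
    "ok": [1, 2, 3, 4, 5],
})
flagged = flag_columns(df)
print(flagged)
```

Note that a constant column is flagged by both the single-value rule and the almost-single rule, since its only value covers 100% of the rows.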
multi_split
It creates several sets of feature arrays for train-test splitting, one for each requested threshold value.
| Parameters | Datatype | Default Value |
|---|---|---|
| df | pandas dataframe | |
| test_size | float | |
| output | string | |
| threshold_set | list | |
Note
test_size must be between 0 and 1.
Attention
The first multidimensional array in each returned list is always created without any selection.
Attention
The output arrays are created in the same order as the values in threshold_set.
Attention
The output arrays (y_train and y_test) are always the same for every output set. This makes it possible to compare the success rates of the different approaches fairly.
| Priority (in return) | Returns | Datatype | Condition |
|---|---|---|---|
| 1 | X_trains | list | always |
| 2 | X_tests | list | always |
| 3 | y_train | 1D array | always |
| 4 | y_test | 1D array | always |
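The reason y_train and y_test can be shared is that the rows are split once and the same row split is reused for every feature set. The sketch below illustrates only that idea. `shared_split` and `column_sets` are hypothetical: how multi_split turns each value of threshold_set into a column selection is not described here, so the column subsets are passed in directly, and output is assumed to be the name of the target column.

```python
import random

import pandas as pd


def shared_split(df, test_size, output, column_sets, seed=0):
    """Hypothetical sketch: one row split reused for several column subsets."""
    # Split the row labels once so every feature set sees the same rows.
    idx = list(df.index)
    random.Random(seed).shuffle(idx)
    n_test = max(1, round(len(idx) * test_size))
    test_idx, train_idx = idx[:n_test], idx[n_test:]

    # The first entry uses every feature column, without any selection.
    all_features = [c for c in df.columns if c != output]
    X_trains, X_tests = [], []
    for cols in [all_features] + list(column_sets):
        X_trains.append(df.loc[train_idx, cols].to_numpy())
        X_tests.append(df.loc[test_idx, cols].to_numpy())

    # The targets are built once and shared by every feature set.
    y_train = df.loc[train_idx, output].to_numpy()
    y_test = df.loc[test_idx, output].to_numpy()
    return X_trains, X_tests, y_train, y_test


df = pd.DataFrame({
    "f1": range(10),
    "f2": range(10, 20),
    "f3": range(20, 30),
    "target": [0, 1] * 5,
})
X_trains, X_tests, y_train, y_test = shared_split(
    df, test_size=0.2, output="target", column_sets=[["f1"], ["f1", "f2"]]
)
print([x.shape for x in X_trains])
```

Because the row split is fixed before any columns are selected, differences in score between the feature sets come from the selected columns and not from a different partition of the data.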
rand_arr
It creates a one-dimensional random array according to the chosen strategy.
| Parameters | Datatype | Default Value |
|---|---|---|
| outputs | list | |
| values | int list | None |
| strategy | string | equal |
| arr_size | int | 1 |
Note
There are three predefined strategies: ‘equal’, ‘weighted’ and ‘piled’. In ‘equal’, each output has the same probability. In ‘weighted’, the ith output has a probability of values[i] / sum(values). ‘piled’ differs slightly: the values are treated as cumulative, so the ith output has a probability of (values[i] - values[i-1]) / values[-1], where values[i-1] is taken as 0 when i is zero. Because of this behavior, the values list must be sorted in ascending order when ‘piled’ is used.
| Priority (in return) | Returns | Datatype | Condition |
|---|---|---|---|
| 1 | column | 1D array | always |
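The probabilities of the three strategies can be written out as a short sketch. This is an illustration under stated assumptions, not the library's code: `strategy_weights` and `rand_arr_sketch` are hypothetical names, the seed argument is added only for reproducibility, and the result is a plain Python list instead of an array.

```python
import random


def strategy_weights(outputs, values=None, strategy="equal"):
    """Return the probability of each output under the given strategy."""
    if strategy == "equal":
        # Every output is equally likely.
        return [1 / len(outputs)] * len(outputs)
    if strategy == "weighted":
        # Each value is the relative weight of its output.
        total = sum(values)
        return [v / total for v in values]
    if strategy == "piled":
        # Values are cumulative, so each weight is the step from the
        # previous value (0 before the first one).
        if list(values) != sorted(values):
            raise ValueError("values must be sorted for the 'piled' strategy")
        previous = [0] + list(values[:-1])
        return [(v - p) / values[-1] for v, p in zip(values, previous)]
    raise ValueError(f"unknown strategy: {strategy!r}")


def rand_arr_sketch(outputs, values=None, strategy="equal", arr_size=1, seed=None):
    """Draw arr_size outputs at random with the strategy's probabilities."""
    rng = random.Random(seed)
    weights = strategy_weights(outputs, values, strategy)
    return rng.choices(outputs, weights=weights, k=arr_size)


outputs = ["a", "b", "c"]
weighted = strategy_weights(outputs, [2, 3, 5], "weighted")
piled = strategy_weights(outputs, [2, 5, 10], "piled")
sample = rand_arr_sketch(outputs, [2, 5, 10], "piled", arr_size=7, seed=1)
print(weighted, piled, sample)
```

In this example the ‘weighted’ values [2, 3, 5] and the ‘piled’ values [2, 5, 10] describe the same distribution, since the second list is the running total of the first.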