Data Tools
===========

This part of the library is designed for data analysis and data manipulation.

col_types
__________

This function returns the datatype of each column in the dataframe.

=============       ================    ==============
Parameters          Datatype            Default Value
=============       ================    ==============
df                  pandas dataframe    -
print_columns       boolean             False
=============       ================    ==============

.. attention::
    It prints out the results if print_columns is True.

====================    =======    ========    =========
Priority (in return)    Returns    Datatype    Condition
====================    =======    ========    =========
1                       types      list        always
====================    =======    ========    =========

.. code:: python

    import pandas as pd
    from wolta.data_tools import col_types

    df = pd.read_csv('data.csv')

    columns = list(df.columns)
    types = col_types(df)

    # prints out the datatype for each column
    for column, dtype in zip(columns, types):
        print('{}: {}'.format(column, dtype))

    # or just
    col_types(df, print_columns=True)

unique_amounts
______________

This function counts the unique values in each column of the dataframe.

==========    ================    =============
Parameters    Datatype            Default Value
==========    ================    =============
df            pandas dataframe    -
strategy      list                None
print_dict    boolean             False
==========    ================    =============

.. attention::
    To examine only certain columns, list their names in strategy. The function also prints the results if print_dict is True.

====================    =======    ========    =========
Priority (in return)    Returns    Datatype    Condition
====================    =======    ========    =========
1                       space      dict        always
====================    =======    ========    =========
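
As a rough illustration of the idea (not wolta's actual implementation), the sketch below counts distinct values per column for a plain dictionary of lists that stands in for a dataframe. The helper name count_unique and the shape of the returned dictionary (column name mapped to a count) are assumptions made for this example.

.. code:: python

    def count_unique(columns, strategy=None):
        """Return {column name: number of distinct values}.

        `columns` is a plain dict of lists standing in for a dataframe.
        """
        # Examine every column unless specific names are requested.
        names = strategy if strategy is not None else list(columns)
        return {name: len(set(columns[name])) for name in names}


    data = {
        'city': ['Ankara', 'Izmir', 'Ankara'],
        'age': [30, 41, 30],
    }

    all_counts = count_unique(data)
    city_only = count_unique(data, strategy=['city'])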


make_numerics
_____________

This function converts categorical data to numerical data.

===============    ========    =============
Parameters         Datatype    Default Value
===============    ========    =============
column             1D array    -
space_requested    boolean     False
===============    ========    =============

====================    =======    ========    =======================
Priority (in return)    Returns    Datatype    Condition
====================    =======    ========    =======================
1                       column     1D array    always
2                       space      dict        space_requested is True
====================    =======    ========    =======================
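
The conversion can be pictured with the minimal sketch below. It is not the library's code: the helper name encode_labels, the order in which codes are assigned, and the direction of the mapping stored in space (label to code) are all assumptions.

.. code:: python

    def encode_labels(column):
        """Map each distinct label to an integer code, in order of first appearance."""
        space = {}
        encoded = []
        for value in column:
            if value not in space:
                # The next free code equals the number of labels seen so far.
                space[value] = len(space)
            encoded.append(space[value])
        return encoded, space


    encoded, space = encode_labels(['red', 'blue', 'red', 'green'])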


scale_x
________

It scales the input values using scikit-learn's StandardScaler.

==========    =======================    =============
Parameters    Datatype                   Default Value
==========    =======================    =============
X_train       multi dimensional array    -
X_test        multi dimensional array    -
==========    =======================    =============

====================    =======    =======================    =========
Priority (in return)    Returns    Datatype                   Condition
====================    =======    =======================    =========
1                       X_train    multi dimensional array    always
2                       X_test     multi dimensional array    always
====================    =======    =======================    =========

examine_floats
_______________

It checks whether the requested columns, which are stored as float, actually contain only integer values.

=============    ================    =============
Parameters       Datatype            Default Value
=============    ================    =============
df               pandas dataframe    -
float_columns    list                -
get              string              float
=============    ================    =============

.. hint::
    float_columns contains the names of the columns that will be checked by the algorithm.

.. hint::
    get accepts two values: 'float' and 'int'. If it is 'float', the function returns the names of the columns that contain non-integer values. Otherwise, it returns the names of the columns that contain only integer values.

====================    =======    ========    =========
Priority (in return)    Returns    Datatype    Condition
====================    =======    ========    =========
1                       space      list        always
====================    =======    ========    =========
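
The check itself can be sketched in plain Python as follows. This is only an illustration under assumptions: the helper name split_float_columns is hypothetical, and a dictionary of lists stands in for the dataframe.

.. code:: python

    def split_float_columns(columns, float_columns, get='float'):
        """Return the columns that need float ('float') or hold only whole numbers ('int')."""
        whole = [
            name for name in float_columns
            if all(float(value).is_integer() for value in columns[name])
        ]
        if get == 'int':
            return whole
        return [name for name in float_columns if name not in whole]


    data = {'price': [1.5, 2.0], 'count': [3.0, 4.0]}
    needs_float = split_float_columns(data, ['price', 'count'], get='float')
    fits_int = split_float_columns(data, ['price', 'count'], get='int')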


calculate_bounds
_________________

It returns the datatypes required to hold the values, based on their min and max values.

.. note::
    Sometimes a dataset is too large for the system to handle. In that case, reducing the memory it requires can be essential. This function is designed for that purpose.

.. tip::
    In that case, I also suggest using the Dask library to get better results.

==========    ============    =============
Parameters    Datatype        Default Value
==========    ============    =============
gen_types     list            -
min_val       int or float    -
max_val       int or float    -
==========    ============    =============

.. hint::
    gen_types can be obtained easily with the col_types function.

====================    =======    ========    =========
Priority (in return)    Returns    Datatype    Condition
====================    =======    ========    =========
1                       types      list        always
====================    =======    ========    =========
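
To show the kind of decision involved, here is a minimal sketch that picks the narrowest signed integer type for a given range. It is not the library's algorithm: the helper name smallest_int_type is hypothetical, and whether wolta also considers unsigned or float types is not stated in this section.

.. code:: python

    # Value ranges of signed fixed-width integers, smallest first.
    INT_RANGES = [
        ('int8', -2**7, 2**7 - 1),
        ('int16', -2**15, 2**15 - 1),
        ('int32', -2**31, 2**31 - 1),
        ('int64', -2**63, 2**63 - 1),
    ]


    def smallest_int_type(min_val, max_val):
        """Return the name of the narrowest signed integer type that holds the range."""
        for name, low, high in INT_RANGES:
            if low <= min_val and max_val <= high:
                return name
        raise ValueError('range does not fit in a 64-bit signed integer')


    small = smallest_int_type(0, 100)
    medium = smallest_int_type(-200, 40000)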

calculate_min_max
__________________

It provides the min value, max value and datatype of each column.

.. note::
    This function is useful when the device cannot hold the whole dataset in memory, because it is designed to avoid loading all of the data at once.

.. hint::
    Splitting the dataset into multiple small data files is suggested. You may also use the glob library to collect the paths easily.

===============    ========    =============
Parameters         Datatype    Default Value
===============    ========    =============
paths              list        -
deleted_columns    list        None
===============    ========    =============

.. attention::
    The indicated columns inside deleted_columns will be excluded during the process.

====================    =======    ========    =========
Priority (in return)    Returns    Datatype    Condition
====================    =======    ========    =========
1                       columns    list        always
2                       types      list        always
3                       max_val    list        always
4                       min_val    list        always
====================    =======    ========    =========
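
The sketch below illustrates the general approach of reading one file at a time while keeping only running min and max values. It is an assumption-laden example, not wolta's code: the helper name running_min_max and the file names part-0.csv and part-1.csv are hypothetical, and every kept column is assumed to be numeric.

.. code:: python

    import csv
    import glob
    import os
    import tempfile


    def running_min_max(paths, deleted_columns=None):
        """Track per-column min and max while reading one file at a time."""
        skipped = set(deleted_columns or [])
        min_val, max_val = {}, {}
        for path in paths:
            with open(path, newline='') as handle:
                for row in csv.DictReader(handle):
                    for name, raw in row.items():
                        if name in skipped:
                            continue
                        value = float(raw)
                        min_val[name] = min(value, min_val.get(name, value))
                        max_val[name] = max(value, max_val.get(name, value))
        return min_val, max_val


    # Build two tiny part files so the example runs on its own.
    folder = tempfile.mkdtemp()
    parts = {'part-0.csv': 'a,b\n1,10\n5,20\n', 'part-1.csv': 'a,b\n-3,15\n2,40\n'}
    for file_name, text in parts.items():
        with open(os.path.join(folder, file_name), 'w', newline='') as handle:
            handle.write(text)

    # glob collects the paths of all part files in the folder.
    paths = sorted(glob.glob(os.path.join(folder, '*.csv')))
    min_val, max_val = running_min_max(paths, deleted_columns=['b'])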


load_by_parts
_______________

It loads multiple subsets of a dataset and combines them into a single dataframe, with extensive options.

=================    ========    =============
Parameters           Datatype    Default Value
=================    ========    =============
paths                list        -
strategy             string      default
deleted_columns      list        None
print_description    boolean     False
shuffle              boolean     False
encoding             string      utf-8
=================    ========    =============

.. note::
    strategy accepts two values: 'default' and 'efficient'. The only difference between them is that 'efficient' detects datatypes by using the calculate_bounds function. It is not recommended unless strictly required.

.. note::
    encoding accepts any value that is valid for the encoding parameter of pandas' read_csv function.

====================    =======    ================    =========
Priority (in return)    Returns    Datatype            Condition
====================    =======    ================    =========
1                       df         pandas dataframe    always
====================    =======    ================    =========

create_chunks
_______________

It splits the dataset into small CSV files.

=================    ========    =============
Parameters           Datatype    Default Value
=================    ========    =============
path                 string      -
sample_amount        integer     -
target_dir           string      None
print_description    boolean     False
chunk_name           string      part
=================    ========    =============

.. attention::
    This function does not return a value.

transform_data
________________

It transforms data by using some predetermined techniques. For further reading, see the :ref:`transformation` article.

==========    =======================    =============
Parameters    Datatype                   Default Value
==========    =======================    =============
X             multi dimensional array    -
y             1D array                   -
strategy      string                     log-m
==========    =======================    =============

.. note::
    strategy can have these values: 'log', 'log-m', 'log2', 'log2-m', 'log10', 'log10-m', 'sqrt', 'sqrt-m', 'cbrt'

====================    =======    =======================    =====================
Priority (in return)    Returns    Datatype                   Condition
====================    =======    =======================    =====================
1                       X          multi dimensional array    always
2                       y          1D array                   always
3                       amin_x     integer or float           strategy ends with -m
4                       amin_y     integer or float           strategy ends with -m
====================    =======    =======================    =====================

transform_pred
_______________

It works as the reverse of transform_data for predicted values. For further reading, see the :ref:`transformation` article.

==========    ================    =============
Parameters    Datatype            Default Value
==========    ================    =============
y_pred        1D array            -
strategy      string              log-m
amin_y        integer or float    0
==========    ================    =============

.. note::
    strategy can have these values: 'log', 'log-m', 'log2', 'log2-m', 'log10', 'log10-m', 'sqrt', 'sqrt-m', 'cbrt', 'cbrt-m'

====================    =======    ========    =========
Priority (in return)    Returns    Datatype    Condition
====================    =======    ========    =========
1                       y_pred     1D array    always
====================    =======    ========    =========
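
To illustrate why amin_y is needed to reverse a strategy ending with -m, the sketch below applies a shifted logarithm and then undoes it. This is an assumption about how such a strategy could work, not a description of wolta's code: the helper names are hypothetical, and the exact offset the library applies (the +1 used here) is not documented in this section.

.. code:: python

    import math


    def log_m_forward(values):
        """Shift values so the smallest becomes 1, then take the natural log."""
        amin = min(values)
        return [math.log(v - amin + 1) for v in values], amin


    def log_m_inverse(transformed, amin):
        """Undo log_m_forward using the stored minimum."""
        return [math.exp(t) + amin - 1 for t in transformed]


    y = [-4.0, 0.0, 12.5]
    y_log, amin_y = log_m_forward(y)
    y_back = log_m_inverse(y_log, amin_y)

Without the stored minimum, the shift could not be undone, which is why the inverse step takes it as an argument.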

make_categorical
____________________

It turns continuous values into discrete ones by using the normal distribution. With this approach, the function separates the given data into three groups. For further reading, see the :ref:`distribution` article.

==========    ========    =============
Parameters    Datatype    Default Value
==========    ========    =============
y             1D array    -
strategy      string      normal
==========    ========    =============

.. note::
    strategy can have these values: 'normal' and 'normal-extra'.

is_normal
______________

It checks whether the given set behaves like a normal distribution.

==========    ========    =============
Parameters    Datatype    Default Value
==========    ========    =============
y             1D array    -
==========    ========    =============

====================    =======    ========    =========
Priority (in return)    Returns    Datatype    Condition
====================    =======    ========    =========
1                       result     boolean     always
====================    =======    ========    =========

seek_null
___________

It checks each column in the dataframe for null values and counts them. It then returns a list of the names of the columns that contain null values.

=============    ================    =============
Parameters       Datatype            Default Value
=============    ================    =============
df               pandas dataframe    -
print_columns    boolean             False
=============    ================    =============

.. attention::
    When print_columns is True, the number of null values in each column is printed to the console.

====================    ============    ========    =========
Priority (in return)    Returns         Datatype    Condition
====================    ============    ========    =========
1                       null_columns    list        always
====================    ============    ========    =========

make_null
__________

Sometimes null values are represented in the dataset in other ways (for example, 'unknown' in string data) instead of being actual nulls. This function replaces the given representation with null.

==========    ===============================    =============
Parameters    Datatype                           Default Value
==========    ===============================    =============
matrix        pandas dataframe or numpy array    -
replace       anything                           -
type          string                             df
==========    ===============================    =============

.. attention::
    type declares whether matrix is a pandas dataframe or a numpy array. It has two valid values: 'df' for a pandas dataframe and 'np' for a numpy array.

====================    =======    ===============================    =========
Priority (in return)    Returns    Datatype                           Condition
====================    =======    ===============================    =========
1                       matrix     pandas dataframe or numpy array    always
====================    =======    ===============================    =========
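
The replacement can be pictured with the minimal sketch below, which works on a plain list of rows instead of a dataframe or numpy array. The helper name replace_with_null is hypothetical, and None is used here as the null marker, which is an assumption.

.. code:: python

    def replace_with_null(rows, replace):
        """Return a copy of `rows` where every cell equal to `replace` becomes None."""
        return [[None if cell == replace else cell for cell in row] for row in rows]


    rows = [['Ali', 'unknown'], ['unknown', 'Izmir']]
    cleaned = replace_with_null(rows, 'unknown')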

stat_sum
_________

It summarises statistics about the dataframe, similar to the describe method from pandas.

==========    ================    =============
Parameters    Datatype            Default Value
==========    ================    =============
df            pandas dataframe    -
requested     list                -
only          list                None
exclude       list                None
get_dict      boolean             False
verbose       boolean             True
==========    ================    =============

.. tip::
    If only is not None, only the columns in the list are examined.

.. tip::
    If exclude is not None, all columns except those in the list are examined.

Here is the list of valid keywords for requested:

=============    ===============================================================================
Valid Keyword    Meaning
=============    ===============================================================================
all              if the list has it at index zero then it presumes that it contains all keywords
min              minimum
max              maximum
width            the difference between max and min
mean             arithmetic mean
std              standard deviation
med              median
var              variance
=============    ===============================================================================

====================    ===========    ==========    ===================
Priority (in return)    Returns        Datatype      Condition
====================    ===========    ==========    ===================
1                       gen_results    dictionary    if get_dict is True
====================    ===========    ==========    ===================
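
As a rough sketch of how the keywords could map to calculations, the example below uses Python's statistics module on a single column. It is not wolta's implementation: the helper name summarise is hypothetical, and the choice of the population forms of standard deviation and variance is an assumption.

.. code:: python

    import statistics

    # Each keyword maps to a function over one numeric column.
    STATS = {
        'min': min,
        'max': max,
        'width': lambda values: max(values) - min(values),
        'mean': statistics.mean,
        'std': statistics.pstdev,      # population form; an assumption
        'med': statistics.median,
        'var': statistics.pvariance,   # population form; an assumption
    }


    def summarise(values, requested):
        """Compute the requested statistics for one column."""
        # 'all' at index zero stands for every keyword.
        keys = list(STATS) if requested and requested[0] == 'all' else requested
        return {key: STATS[key](values) for key in keys}


    result = summarise([2, 4, 4, 6], ['min', 'width', 'med'])
    everything = summarise([2, 4, 4, 6], ['all'])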

extract_float
_______________

Sometimes float data might be held with a different representation ('3.5$' for example). In that case, the unwanted symbols (the dollar sign in this example) can be deleted and the datatype of the list can be converted from string to float.

==========    ========    =============
Parameters    Datatype    Default Value
==========    ========    =============
column        1D array    -
symbols       list        -
==========    ========    =============

====================    =======    ========    =========
Priority (in return)    Returns    Datatype    Condition
====================    =======    ========    =========
1                       column     1D Array    always
====================    =======    ========    =========
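
The cleaning step described above can be sketched as follows. The helper name strip_to_float is hypothetical, and the example assumes that every item is a string that becomes a valid number once the symbols are removed.

.. code:: python

    def strip_to_float(column, symbols):
        """Remove every symbol from each string, then convert it to float."""
        cleaned = []
        for text in column:
            for symbol in symbols:
                text = text.replace(symbol, '')
            cleaned.append(float(text))
        return cleaned


    prices = strip_to_float(['3.5$', '$12', '7.25 $'], ['$', ' '])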

col_counts
____________

It returns the frequency of unique values in columns by using value_counts function from pandas.

==========    ================    =============
Parameters    Datatype            Default Value
==========    ================    =============
df            pandas dataframe    -
exclude       list                None
only          list                None
==========    ================    =============

.. tip::
    If only is not None, only the columns in the list are examined.

.. tip::
    If exclude is not None, all columns except those in the list are examined.

.. attention::
    This function only prints out the result to the console. It does not return anything.

check_similarity
__________________

Sometimes the very same information is stored in two different columns with different representations. For example, area codes can be stored as digits in one column and as their actual names in another, yet both columns hold the same information. This function checks whether two columns are related in this way.

==========    ========    =============
Parameters    Datatype    Default Value
==========    ========    =============
col1          1D array    -
col2          1D array    -
==========    ========    =============

====================    ==========    ========    =========
Priority (in return)    Returns       Datatype    Condition
====================    ==========    ========    =========
1                       similarity    boolean     always
====================    ==========    ========    =========

.. attention::
    If similarity is true, then it means that these columns have the same information.
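
One way to test for this kind of duplication is to check for a one-to-one mapping between the values of the two columns. The sketch below does that. The criterion wolta actually uses is not stated in this section, so both the criterion and the helper name same_information are assumptions.

.. code:: python

    def same_information(col1, col2):
        """True if each value in col1 always pairs with one value in col2, and vice versa."""
        if len(col1) != len(col2):
            return False
        forward, backward = {}, {}
        for a, b in zip(col1, col2):
            # setdefault stores the first pairing and returns it on later visits.
            if forward.setdefault(a, b) != b or backward.setdefault(b, a) != a:
                return False
        return True


    codes = [312, 232, 312]
    names = ['Ankara', 'Izmir', 'Ankara']
    matching = same_information(codes, names)
    broken = same_information(codes, ['Ankara', 'Izmir', 'Izmir'])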

find_broke
____________

If the datatype of a column is different than expected, this function can be used to examine the column and find the reason.

=============    ========    =============
Parameters       Datatype    Default Value
=============    ========    =============
column           1D array    -
dtype            datatype    float
get_indexes      boolean     True
get_words        boolean     False
verbose          boolean     True
verbose_limit    integer     10
=============    ========    =============

====================    =======    ========    ===================
Priority (in return)    Returns    Datatype    Condition
====================    =======    ========    ===================
1                       indexes    list        get_indexes is True
2                       words      list        get_words is True
====================    =======    ========    ===================
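
A minimal sketch of such an examination is shown below: it tries to convert every item and records the ones that fail. It is an illustration under assumptions, not the library's code, and the helper name find_unconvertible is hypothetical.

.. code:: python

    def find_unconvertible(column, dtype=float):
        """Return positions and values that cannot be converted to `dtype`."""
        indexes, words = [], []
        for index, value in enumerate(column):
            try:
                dtype(value)
            except (TypeError, ValueError):
                indexes.append(index)
                words.append(value)
        return indexes, words


    indexes, words = find_unconvertible(['1.5', 'n/a', '3', None])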


expand_df
___________

It oversamples the dataset by using SMOTE.

=================    ====================    =============
Parameters           Datatype                Default Value
=================    ====================    =============
df                   pandas dataframe        -
output               string                  -
sampling_strategy    string or dictionary    -
=================    ====================    =============

====================    =======    ================    =========
Priority (in return)    Returns    Datatype            Condition
====================    =======    ================    =========
1                       df         pandas dataframe    always
====================    =======    ================    =========

split_as_df
____________

It splits X and y arrays into train and test pandas dataframes instead of arrays.

============    ======================    =============
Parameters      Datatype                  Default Value
============    ======================    =============
X               multidimensional array    -
y               1D array                  -
features        list                      -
output          string                    -
test_size       float                     -
random_state    int                       42
shuffle         boolean                   True
stratify        1D array                  None
============    ======================    =============

.. note::
    features is the list of the column names in the X array. output is the name of the column in the y array.

====================    =======    ================    =========
Priority (in return)    Returns    Datatype            Condition
====================    =======    ================    =========
1                       dftrain    pandas dataframe    always
2                       dftest     pandas dataframe    always
====================    =======    ================    =========

train_test_val_split
_______________________

It splits the data into three groups: train, validation and test.

================    ======================    =============
Parameters          Datatype                  Default Value
================    ======================    =============
X                   multidimensional array    -
y                   1D array                  -
test_size           float                     -
val_size            float                     -
random_state        int                       42
shuffle             boolean                   True
stratify            1D array                  None
stratify_for_val    boolean                   True
================    ======================    =============

.. attention::
    test_size and val_size must both be given as ratios of the whole dataset.

====================    =======    ======================    =========
Priority (in return)    Returns    Datatype                  Condition
====================    =======    ======================    =========
1                       X_train    multidimensional array    always
2                       X_test     multidimensional array    always
3                       X_val      multidimensional array    always
4                       y_train    1D array                  always
5                       y_test     1D array                  always
6                       y_val      1D array                  always
====================    =======    ======================    =========
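
The arithmetic behind ratios of the whole dataset can be sketched as follows. The helper names are hypothetical. The second helper shows the ratio that would be needed if the validation set were cut from the data left after removing the test set; whether wolta splits in two stages like this is an assumption.

.. code:: python

    def split_counts(n_samples, test_size, val_size):
        """Sample counts per group when both ratios refer to the whole dataset."""
        n_test = round(n_samples * test_size)
        n_val = round(n_samples * val_size)
        return n_samples - n_test - n_val, n_test, n_val


    def second_stage_ratio(test_size, val_size):
        """Validation share of what remains once the test set is removed."""
        return val_size / (1 - test_size)


    n_train, n_test, n_val = split_counts(1000, test_size=0.2, val_size=0.1)
    ratio = second_stage_ratio(0.2, 0.1)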

synthetic_expand
___________________

It generates synthetic data based on actual data. It is designed to create datasets for educational purposes in data science.

============    ================    =============
Parameters      Datatype            Default Value
============    ================    =============
df              pandas dataframe    -
feature_info    dictionary          -
shape_zero      integer             -
============    ================    =============

.. attention::
    The keys of feature_info must include the name of every requested feature from the dataframe. Each value must be one of two options: continuous or discrete.

====================    =======    ================    =========
Priority (in return)    Returns    Datatype            Condition
====================    =======    ================    =========
1                       df         pandas dataframe    always
====================    =======    ================    =========

multi_split
_____________

It splits a multi-label dataset. The function shuffles the dataset as many times as given by times and, each time, looks for the same sample-amount distribution in the test set for the classes of every label.

==========    ================    =============
Parameters    Datatype            Default Value
==========    ================    =============
df            pandas dataframe    -
labels        list                -
test_size     float               -
times         integer             50
==========    ================    =============

.. note::
    labels holds the list of the label names in the dataframe.

====================    ========    ======================    =========
Priority (in return)    Returns     Datatype                  Condition
====================    ========    ======================    =========
1                       X_train     multidimensional array    always
2                       X_test      multidimensional array    always
3                       y_trains    dictionary                always
4                       y_tests     dictionary                always
====================    ========    ======================    =========

.. attention::
    y_trains and y_tests map each label name to a 1D array.


corr_analyse
______________

It calculates the correlation between each pair of features in the dataset. After the calculation, it also groups them with respect to their strength.

.. attention::
    Each correlation value must be between -1 and 1. In the grouping phase, the range is taken as 0 to 1, and negative values are treated as symmetric to positive ones. This range is then split into four groups: 'uncorrelated', 'weak', 'strong' and 'perfect'.

============    ===================
group           interval
============    ===================
uncorrelated    0 <= score <= un_w
weak            un_w < score <= w_s
strong          w_s < score <= s_p
perfect         s_p < score <= 1
============    ===================

==========    ========    =============
Parameters    Datatype    Default Value
==========    ========    =============
array         2D array    -
columns       list        -
un_w          float       0.1
w_s           float       0.5
s_p           float       0.9
verbose       boolean     True
get_matrix    boolean     False
csv_path      string      None
==========    ========    =============

.. note::
    If csv_path is not None, the scores are logged in a CSV file.

.. note::
    The function always returns a dictionary with keys 'columns' and 'score'.

====================    =======    ==========    ==================
Priority (in return)    Returns    Datatype      Condition
====================    =======    ==========    ==================
1                       results    dictionary    always
2                       matrix     2D array      get_matrix is True
====================    =======    ==========    ==================
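
The grouping rule from the interval table can be sketched as below, using the default thresholds from the parameter table. The helper name strength_group is hypothetical, and the code only illustrates the intervals, not how the library computes the correlations.

.. code:: python

    def strength_group(score, un_w=0.1, w_s=0.5, s_p=0.9):
        """Name the strength group of one correlation score in [-1, 1]."""
        magnitude = abs(score)  # negative scores mirror positive ones
        if magnitude <= un_w:
            return 'uncorrelated'
        if magnitude <= w_s:
            return 'weak'
        if magnitude <= s_p:
            return 'strong'
        return 'perfect'


    groups = [strength_group(s) for s in (0.05, -0.3, 0.9, -0.95)]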

scale_df
____________

It scales the values of the pandas dataframe and returns a pandas dataframe again. If requested, output label(s) might be excluded from this process.

============    =====================    ===============
Parameters      Datatype                 Default Value
============    =====================    ===============
df              pandas dataframe         -
output          string or string list    None
mode            string                   minmax
params          dictionary               None
============    =====================    ===============

.. note::
    The function supports four different scaling methods: MinMaxScaler (when mode is 'minmax'), StandardScaler (when mode is 'standard'), RobustScaler (when mode is 'robust') and MaxAbsScaler (when mode is 'maxabs').

====================    =======    ==================    ==================
Priority (in return)    Returns    Datatype              Condition
====================    =======    ==================    ==================
1                       df         pandas dataframe      always
====================    =======    ==================    ==================

corr_high
___________

It gets the names of the features that have a high correlation with the output label.

==========    ================    =============================
Parameters    Datatype            Default Value
==========    ================    =============================
df            pandas dataframe    -
output        string              -
strengths     string list         ['perfect', 'strong', 'weak']
verbose       boolean             True
==========    ================    =============================

.. note::
    strengths can only contain 'perfect', 'strong', 'weak' and 'uncorrelated'.

====================    ============    ==================    ==================
Priority (in return)    Returns         Datatype              Condition
====================    ============    ==================    ==================
1                       feature_high    string list           always
====================    ============    ==================    ==================