For this question, you will analyze the “tweets.csv” data set using the TF-IDF (term frequency…

For this question, you will analyze the “tweets.csv” data set using the TF-IDF (term frequency and inverse document frequency) Jupyter Lab notebook. As before, the text of the tweet is the predictor (independent) variable and “is_fa_complaint” (flight attendant complaint or not) is the dependent variable, or target. As before, develop 4 versions of the model based on i) binary vectorization, ii) counts, iii) TF-IDF, and iv) an N-gram vectorizer with N = 2 (2-gram). However, use ONLY logistic regression for all 4 versions of the feature vectors. Use the following parameters:

A) Use 40% of the tweets.csv data set as the TEST data set.

B) Set random_state = 35 to ensure replicability of your model.

Develop ROC curves for all 4 versions of the text mining model. Which of the following statements is true?

NOTES:

AUC_Binary = Area under the curve for binary vectorization

AUC_Count = Area under the curve for count vectorization

AUC_TFIDF = Area under the curve for TF-IDF vectorization

AUC_2GRAM = Area under the curve for 2-gram vectorization

Choose an answer:

AUC_Binary = 0.851

AUC_Binary = 0.95

AUC_Count = 0.951

AUC_Count = 0.851
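
The workflow described in the question can be sketched roughly as follows. This is only a minimal sketch using scikit-learn, not the course notebook itself: the function name `auc_by_vectorizer`, the use of `ngram_range=(2, 2)` for the 2-gram version (the notebook may use `(1, 2)` instead), `max_iter=1000`, and the tweet column name `"text"` are all assumptions. Only “tweets.csv”, “is_fa_complaint”, the 40% test split, and random_state = 35 come from the question.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split


def auc_by_vectorizer(texts, labels, test_size=0.4, random_state=35):
    """Fit logistic regression on four feature versions; return test-set AUCs."""
    # 40% of the data is held out as the TEST set, as the question specifies.
    X_train, X_test, y_train, y_test = train_test_split(
        texts, labels, test_size=test_size, random_state=random_state
    )

    # One vectorizer per feature version named in the question.
    vectorizers = {
        "AUC_Binary": CountVectorizer(binary=True),
        "AUC_Count": CountVectorizer(),
        "AUC_TFIDF": TfidfVectorizer(),
        # Assumption: bigrams only; the notebook may combine 1- and 2-grams.
        "AUC_2GRAM": CountVectorizer(ngram_range=(2, 2)),
    }

    aucs = {}
    for name, vectorizer in vectorizers.items():
        # Learn the vocabulary on the training tweets only, then reuse it.
        train_features = vectorizer.fit_transform(X_train)
        test_features = vectorizer.transform(X_test)

        # max_iter is raised (an assumption) so the solver converges.
        model = LogisticRegression(max_iter=1000, random_state=random_state)
        model.fit(train_features, y_train)

        # AUC is computed from the predicted probability of the positive class.
        positive_scores = model.predict_proba(test_features)[:, 1]
        aucs[name] = roc_auc_score(y_test, positive_scores)
    return aucs


# Hypothetical usage; "text" is an assumed column name for the tweet body:
#   import pandas as pd
#   df = pd.read_csv("tweets.csv")
#   print(auc_by_vectorizer(df["text"], df["is_fa_complaint"]))
```

Comparing the four returned values against the answer options identifies the true statement. The exact numbers depend on the notebook's preprocessing and vectorizer settings, so results from this sketch may differ slightly from the expected answer.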
