For this question, you will analyze the “tweets.csv” data set using the TF-IDF (term frequency and inverse document frequency) Jupyter Lab notebook. As before, the text of the tweet is the predictor (independent) variable and “is_fa_complaint” (flight attendant complaint or not) is the dependent variable, or target. Develop, as before, 4 versions of the model based on i) binary vectorization, ii) count vectorization, iii) TF-IDF vectorization, and iv) an N-gram vectorizer with N = 2 (2-gram). However, use ONLY logistic regression for all 4 versions of the feature vectors. Use the following parameters:
A) Use 40% of the tweets.csv data set as the TEST data set.
B) Set random_state = 35 to ensure replicability of your model.
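The setup described above could be sketched roughly as follows with scikit-learn. This is only an illustration under stated assumptions: the toy `tweets` and `labels` lists are hypothetical stand-ins for the text and “is_fa_complaint” columns of tweets.csv, the dictionary keys are made-up names, and `ngram_range=(2, 2)` (bigrams only) is one possible reading of "N = 2"; the course notebook may use `(1, 2)` or other settings.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Hypothetical toy data standing in for tweets.csv (not the real data set).
tweets = [
    "the flight attendant was rude to me",
    "great flight and smooth landing today",
    "flight attendant ignored my call button",
    "loved the legroom on this plane",
    "the flight attendant spilled coffee and laughed",
    "boarding was quick and easy today",
    "rude flight attendant refused to help",
    "my bag arrived on time thanks",
    "flight attendant was rude and dismissive",
    "nice view from the window seat",
]
labels = [1, 0, 1, 0, 1, 0, 1, 0, 1, 0]  # 1 = flight attendant complaint

# A) 40% of the data is held out for testing; B) random_state = 35.
X_train, X_test, y_train, y_test = train_test_split(
    tweets, labels, test_size=0.40, random_state=35
)

# The four feature representations named in the question.
vectorizers = {
    "binary": CountVectorizer(binary=True),        # presence/absence of a term
    "count": CountVectorizer(),                    # raw term counts
    "tfidf": TfidfVectorizer(),                    # TF-IDF weights
    "2gram": CountVectorizer(ngram_range=(2, 2)),  # assumed: bigrams only
}

models = {}
test_scores = {}
for name, vec in vectorizers.items():
    X_tr = vec.fit_transform(X_train)  # learn the vocabulary on training text only
    X_te = vec.transform(X_test)       # reuse that vocabulary for the test text
    clf = LogisticRegression(max_iter=1000)
    clf.fit(X_tr, y_train)
    models[name] = (vec, clf)
    # Probability of the positive class, used later for ROC curves.
    test_scores[name] = clf.predict_proba(X_te)[:, 1]
```

With the real data, `tweets` and `labels` would instead come from the corresponding columns of tweets.csv, and the AUC values will differ from anything this toy sample produces.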
Develop ROC curves for all 4 versions of the text mining model. Which of the following statements is true?
NOTES:
AUC_Binary = Area under the curve for binary vectorization
AUC_Count = Area under the curve for count vectorization
AUC_TFIDF = Area under the curve for TF-IDF vectorization
AUC_2GRAM = Area under the curve for 2-gram vectorization
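As a minimal sketch of how each AUC value above could be obtained, scikit-learn's `roc_curve` and `auc` can be applied to the test labels and the predicted positive-class probabilities. The `y_true` and `y_score` values below are hypothetical and only show the mechanics; they are not results from tweets.csv.

```python
from sklearn.metrics import auc, roc_curve

# Hypothetical labels and predicted probabilities for four test tweets.
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

# False positive rate and true positive rate at each score threshold.
fpr, tpr, thresholds = roc_curve(y_true, y_score)

# Area under the ROC curve via the trapezoidal rule.
roc_auc = auc(fpr, tpr)
print(round(roc_auc, 3))  # 0.75
```

Repeating this once per vectorizer, with `y_score` taken from `predict_proba(...)[:, 1]` of the matching logistic regression model, would give AUC_Binary, AUC_Count, AUC_TFIDF, and AUC_2GRAM; plotting `fpr` against `tpr` gives the ROC curve itself.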
Choose an answer:
AUC_Binary = 0.851
AUC_Binary = 0.95
AUC_Count = 0.951
AUC_Count = 0.851