Project 1

Prompt Sensitivity MVP

Project Overview

This project explored how AI and machine learning systems respond to short, vague, or ambiguous inputs. The goal was to evaluate when a model prediction is reliable and when human review should be required.

Problem

AI tools can produce confident answers even when the input is unclear. This creates risk in real-world settings where decisions may depend on incomplete or ambiguous language.

Tools Used

Python
pandas
scikit-learn
Logistic Regression
AI-assisted documentation

Process

I built a small testing workflow that runs short, ambiguous text inputs through the model and records each prediction alongside its confidence score. I reviewed examples where the model might classify text too confidently, then added a human review step for unclear cases.
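The Code Artifact below calls a fitted `model` without showing how it was built. The following is a minimal sketch of how such a model could be set up with the tools listed above, not the project's actual training code. The training sentences, their labels, and the choice of `TfidfVectorizer` are all assumptions made for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical toy training data, for illustration only (not the project's dataset).
train_texts = [
    "I love this", "This is great", "What a wonderful day", "Really happy with it",
    "I hate this", "This is terrible", "What an awful day", "Really unhappy with it",
]
train_labels = ["Positive"] * 4 + ["Negative"] * 4

# Turn raw text into TF-IDF features, then fit a logistic regression on top.
# The pipeline accepts raw strings, which is what the testing loop passes in.
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(train_texts, train_labels)

# predict_proba returns one probability per class; the largest one is
# the value the workflow treats as the model's confidence.
probabilities = model.predict_proba(["This is fine"])[0]
confidence = probabilities.max()
```

With so little training data the probabilities are not meaningful in themselves; the point is only to show where the prediction and the confidence score come from.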

Code Artifact

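import pandas as pd

# Assumes `model` is an already-fitted scikit-learn text classifier
# (for example, a vectorizer + LogisticRegression pipeline) that accepts raw strings.
REVIEW_THRESHOLD = 0.75  # predictions below this confidence go to a human

test_prompts = [
    "This is bad",
    "I am sick of this",
    "That was crazy",
    "I can’t deal with this anymore",
    "This is fine"
]

review_notes = []
for prompt in test_prompts:
    prediction = model.predict([prompt])[0]
    # Confidence is the highest class probability for this prompt.
    confidence = max(model.predict_proba([prompt])[0])
    if confidence < REVIEW_THRESHOLD:
        review_status = "Needs human review"
    else:
        review_status = "Model prediction accepted"
    review_notes.append({
        "Prompt": prompt,
        "Prediction": prediction,
        "Confidence": round(confidence, 2),
        "Review Status": review_status
    })

# Collect the per-prompt results into a table for review.
results_df = pd.DataFrame(review_notes)
print(results_df)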


Output

Prompt	Prediction	Confidence	Review Status
This is bad	Negative	0.68	Needs human review
I am sick of this	Negative	0.72	Needs human review
That was crazy	Negative	0.61	Needs human review
I can’t deal with this anymore	Negative	0.84	Model prediction accepted
This is fine	Positive	0.80	Model prediction accepted

AI Artifact

I used AI to help identify where model outputs could become misleading. The AI helped me turn technical results into review questions, such as:

Is the input too vague to classify confidently?
Could slang or tone change the meaning?
Should the prediction be accepted or flagged for review?
Does the confidence score support the model’s decision?

Human Review Decision

I decided that any prediction with a confidence score below 0.75 should be flagged for human review. This made the workflow more responsible because the model was not treated as automatically correct.
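To see how much the choice of threshold matters, the sketch below reapplies the flagging rule to the confidence scores reported in the Output table. Only the 0.75 threshold comes from the project's code; the other two values and the helper name `flagged_for_review` are hypothetical, added here for comparison.

```python
# Confidence scores copied from the Output table.
confidences = {
    "This is bad": 0.68,
    "I am sick of this": 0.72,
    "That was crazy": 0.61,
    "I can’t deal with this anymore": 0.84,
    "This is fine": 0.80,
}


def flagged_for_review(scores, threshold):
    """Return the prompts whose confidence falls below the threshold."""
    return [prompt for prompt, score in scores.items() if score < threshold]


# 0.75 is the project's threshold; 0.65 and 0.85 are illustrative alternatives.
for threshold in (0.65, 0.75, 0.85):
    flagged = flagged_for_review(confidences, threshold)
    print(f"threshold={threshold:.2f}: {len(flagged)} of {len(confidences)} flagged")
```

On these five prompts, a threshold of 0.65 flags one prediction, 0.75 flags three, and 0.85 flags all five, so the threshold directly sets how much work is routed to human reviewers.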

Key Takeaway

This project showed me that model performance is not only about accuracy. It is also about knowing when the model may be uncertain and designing a process where humans can review risky or unclear outputs.

Generative AI Role

How I Used Generative AI

While developing this project, I used generative AI as a thought partner to explore how machine learning systems should handle ambiguous inputs and uncertain predictions. Rather than using AI to make final decisions, I used it to challenge assumptions, identify potential risks, and improve the clarity of my documentation.

Prompt

“How should a machine learning workflow handle short, ambiguous text inputs where the model may not have enough context to make a reliable prediction? What factors should be reviewed before accepting the output?”

AI Suggestion

The AI suggested that ambiguous inputs should be treated differently from high-confidence predictions and recommended incorporating confidence thresholds, human review checkpoints, and error analysis into the evaluation process. It also emphasized that model accuracy alone may not be sufficient when dealing with uncertain language.

What I Changed

I agreed with the recommendation to include human review but adapted it to fit the scope of my project. Instead of focusing only on model accuracy, I added a review process that considered confidence levels and ambiguity in the input text. I also connected these ideas to my own testing observations and evaluation results.

Why Human Judgment Mattered

The AI provided useful suggestions, but it could not determine whether those recommendations were appropriate for my specific project goals. I reviewed the suggestions, selected the ideas that aligned with my findings, and rejected anything that was not supported by my analysis. This reinforced my belief that AI can improve productivity and generate ideas, but human oversight remains essential when evaluating results and making decisions.

Key Takeaway

This project strengthened my understanding that responsible AI development is not only about building accurate models. It is also about recognizing uncertainty, questioning outputs, and designing workflows that include meaningful human review.