Lack of transparency and bias with datasets
Large language models (LLMs) are trained on a wide variety of datasets, and their developers aren't always transparent about which datasets are included or excluded. As a researcher, it is important to continuously critique generated content for bias and inclusivity.
Fake citations
GenAI can combine fragments of its training data into citations for sources that don't actually exist. This is called a hallucination. As a researcher, verify generated citations for credibility and correctness before sharing or using them.
Plagiarism
Using information generated by LLMs without saying so is plagiarism. Since LLMs aren't always transparent about their sources, researchers must be careful not to present someone else's work without providing proper credit and acknowledgment.
Environmental Cost
Generative AI is a heavy consumer of water and energy. Data centers, which house servers to run generative AI technology, use fresh water and rely on community power grids to keep everything running. Data centers have also been found to release large amounts of CO2 into the atmosphere. Texas currently has over 400 data center facilities in five regions with 90 more under construction.
Houston Advanced Research Center. (2025). Powering Texas’ digital economy: Data centers and the future of the grid [Policy brief]. https://harcresearch.org/research/powering-texas-digital-economy-data-centers-and-the-future-of-the-grid/
Privacy
Information you input into generative AI tools becomes the property of the platform and can be used for LLM training or other purposes. AI companies aren't required to disclose how they use the data gathered through their tools, so there is no definitive answer about how inputs are used or whether they're sold to or shared with third parties. Never input personally identifiable information or original research ideas into any AI platform.
Labor Impacts
AI companies require human labor for sorting through discriminatory or illegal data that is used to inform LLMs. This human labor is often outsourced to other countries or assigned to people who are paid low wages and are not offered benefits. In 2023, OpenAI, the company that owns ChatGPT, used a Kenyan firm to filter through content for between $1.32 and $2 an hour. The content was so extreme that many employees reported trauma, causing the firm to end its contract with OpenAI eight months earlier than scheduled.
Perrigo, B. (2023, January 18). OpenAI used Kenyan workers on less than $2 per hour to make ChatGPT less toxic. Time. https://time.com/6247678/openai-chatgpt-kenya-workers/
Generative AI (GenAI) is a type of artificial intelligence that produces content such as text, images, or music. GenAI does not actually understand the content it produces. Instead, it makes predictions about the relationships between words, images, and sounds.
GenAI text tools are built on large language models, which are trained on massive datasets of text. This training teaches the model how grammar, vocabulary, and style contribute to writing, and the model mimics the language structures learned from the data to create coherent sentences.
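The idea that a language model predicts likely next words from statistical patterns, rather than understanding meaning, can be sketched with a toy bigram model. The corpus and word choices below are purely illustrative; real LLMs use vastly larger datasets and far more sophisticated architectures.

```python
from collections import Counter, defaultdict

# Tiny toy corpus standing in for the massive datasets real LLMs train on.
corpus = (
    "the cat sat on the mat . "
    "the cat chased the mouse . "
    "the dog sat on the rug ."
).split()

# Count how often each word follows each other word (a bigram model).
following = defaultdict(Counter)
for current, nxt in zip(corpus, corpus[1:]):
    following[current][nxt] += 1

def predict_next(word):
    """Return the most frequently observed next word for `word`."""
    return following[word].most_common(1)[0][0]

# The model "knows" nothing about cats; it only mirrors observed frequencies.
print(predict_next("the"))  # "cat" follows "the" more often than any other word
```

Because "the cat" occurs twice in the corpus while "the mat," "the mouse," "the dog," and "the rug" each occur once, the model predicts "cat" after "the" purely from counts, which is the same statistical principle (at a vastly smaller scale) behind LLM text prediction.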
Machine learning makes it possible for computers to learn patterns from large datasets without being explicitly programmed to do so. This means that performance continually improves as the system is exposed to more data.