While traditional adverse media screening tools rely on mainstream sources, anonymous forums remain largely untapped for crime intelligence. I recently explored classifying crimes mentioned on the Swedish forum Flashback.
[Figure: Process of building and analysing the corpus of data]

Why Apply LLMs to Online Forums?
Anonymous forums like 4chan and Flashback are often analysed for political sentiment, but their value as a source of crime intelligence is relatively underexplored. These platforms host raw, unfiltered discussions where users openly discuss ongoing criminal cases, share unreported incidents, and sometimes even reveal details before they appear in mainstream media.

Given this potential, I set out to explore whether these forums could serve as a useful alternative data source for crime analysis. Using Signal Sifter, I built a corpus of data from crime-related discussions on a well-known Swedish forum, Flashback.

Building a Crime Data Corpus with Signal Sifter
My goal was to apply Signal Sifter to a popular site with regular traffic and extensive discussions on crime in Sweden. After some research, I settled on Flashback Forum, which contains multiple boards dedicated to crime and court cases. These discussions offer a unique, crowdsourced view of crime trends and incidents.

Flashback, like 4chan, is structured as boards that host discussion threads. Each thread consists of posts and replies, making it a rich dataset for text analysis. By combining web scraping with natural language processing (NLP), I aimed to identify crime mentions in these discussions.

Data Schema and Key Insights
Crime-Related Data:
Metadata:
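Threads can be prioritised by engagement using their view and reply counts. The following is a minimal sketch of that idea, not Signal Sifter's actual schema or code: the `Thread` fields, the `engagement_score` function, and the reply weight of 10 are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Thread:
    # Hypothetical record; the real corpus schema may differ.
    title: str
    views: int
    replies: int


def engagement_score(thread: Thread, reply_weight: float = 10.0) -> float:
    # Assumption: a reply signals more engagement than a view,
    # so replies are weighted more heavily. The weight is arbitrary.
    return thread.views + reply_weight * thread.replies


def rank_threads(threads, top_n=None):
    # Sort from most to least engaged, optionally keeping only the top_n.
    ranked = sorted(threads, key=engagement_score, reverse=True)
    return ranked if top_n is None else ranked[:top_n]


# Placeholder data, not taken from the forum.
threads = [
    Thread("Thread A", views=12000, replies=40),
    Thread("Thread B", views=3000, replies=950),
    Thread("Thread C", views=500, replies=5),
]

for t in rank_threads(threads, top_n=2):
    print(t.title, engagement_score(t))
```

Ranking first and classifying only the top of the list is one way to keep a slow local model focused on the threads most likely to matter.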
I ranked threads by views and replies, on the assumption that higher engagement correlates with discussions containing significant crime-related information.

Evaluating LLM Effectiveness for Crime Identification
Once I had a corpus of 66,000 threads, I processed them using Llama 3.2B Instruct, running locally to avoid the token costs associated with cloud-based models. However, hardware limitations were a major bottleneck: parsing 3,700 thread titles on my 8GB RAM laptop took over eight hours, or roughly eight seconds per title.

I passed a few examples to the prompt and made it as hard as possible for the model to misunderstand:

Despite the speed limitations, the model performed well in classifying crime mentions. Notably:
Sample Output
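A few-shot prompt of the kind described above could be assembled and its response parsed along these lines. This is a minimal sketch, not the prompt or output from the experiment: the binary label set, the English example titles, and the `build_prompt` and `parse_label` helpers are all hypothetical, and the call to the model itself is deliberately left out.

```python
# Hypothetical label set; the original experiment may have used crime types.
LABELS = ("crime", "no_crime")

# Hypothetical few-shot examples (real Flashback titles would be in Swedish).
FEW_SHOT_EXAMPLES = [
    ("Man arrested after shooting in city centre", "crime"),
    ("Best hiking trails near the coast", "no_crime"),
]


def build_prompt(title: str) -> str:
    # Narrow instructions plus worked examples leave the model
    # little room to answer in an unexpected format.
    lines = [
        "Classify the forum thread title below.",
        "Answer with exactly one word: crime or no_crime.",
        "",
    ]
    for example_title, label in FEW_SHOT_EXAMPLES:
        lines.append(f"Title: {example_title}")
        lines.append(f"Label: {label}")
        lines.append("")
    lines.append(f"Title: {title}")
    lines.append("Label:")
    return "\n".join(lines)


def parse_label(response: str) -> str:
    # Take the first word of the raw model text and strip punctuation;
    # anything outside the label set is reported as "unknown".
    words = response.strip().lower().split()
    first = words[0].strip(".,:;!\"'") if words else ""
    return first if first in LABELS else "unknown"


print(build_prompt("Car fire reported overnight"))
print(parse_label(" Crime.\n"))
```

Mapping anything unexpected to "unknown" rather than guessing makes it easy to count how often a small model drifts from the requested format.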
Takeaways and Future Work
This experiment demonstrated that online forums can provide valuable crime-related insights. Using LLMs to classify crime discussions is effective but resource-intensive. Future improvements could include:
Sweden’s crime data challenges persist, but alternative sources like anonymous forums offer new opportunities for OSINT and risk analysis. By refining these methods, we can improve crime trend monitoring and enhance investigative research. This work is part of an ongoing effort to explore unconventional data sources for crime intelligence. If you're interested in OSINT, adverse media analysis, or data-driven crime research, feel free to connect!
Friday, February 14, 2025
Identifying Crime Related Data from Anonymous Social Media with AI