How to Counteract Forensic Linguistics
Stylometry is the study of writing style. No matter who you are, you have a unique, fingerprintable, and traceable writing style. This has been understood for a while now, and a branch of forensics is built on this principle: forensic linguistics. Within this field, forensic linguistics applied to internet crime is called “Writeprint”. Writeprint primarily aims to identify authors over the internet by comparing a suspect’s text to a known collection of writer-invariant (normally written) texts. Even without comparison texts, this forensic technique can yield personal information about an author such as gender, age, and personality.
What does an adversary look for when examining your writing?
- Lexical features: analysis of word choice.
- Syntactic features: analysis of writing style, sentence structure, punctuation, and hyphenation.
- Structural features: analysis of structure and organization of writing.
- Content-specific words: analysis of contextually significant writing such as acronyms.
- Idiosyncratic features: analysis of grammatical errors. This is the most important factor to consider because it provides relatively high accuracy in author identification.
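To make these feature categories more concrete, here is a minimal sketch of how a few of them could be measured. It is not the Writeprint method itself, only an illustration: the function name `writing_features` and the particular features chosen are assumptions made for this example.

```python
import re
from collections import Counter


def writing_features(text):
    """Compute a few simple, illustrative stylometric features of a text."""
    words = re.findall(r"[A-Za-z']+", text.lower())
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    punctuation = Counter(ch for ch in text if ch in ",;:!?-()\"'")
    return {
        # Lexical: vocabulary richness (distinct words / total words)
        "type_token_ratio": len(set(words)) / len(words) if words else 0.0,
        # Syntactic: average sentence length in words
        "avg_sentence_length": len(words) / len(sentences) if sentences else 0.0,
        # Syntactic: how often each punctuation mark appears
        "punctuation": dict(punctuation),
        # Idiosyncratic: a habit such as double exclamation marks
        "double_exclamations": text.count("!!"),
        # Lexical: the words the author leans on most
        "most_common_words": Counter(words).most_common(3),
    }


if __name__ == "__main__":
    print(writing_features("Cheers!! That was epic. Yea, epic indeed."))
```

Even a handful of numbers like these, compared across two sets of posts, can hint at whether both were written by the same person. Real tools use far more features than this.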
You might think that this is not something that an adversary pays attention to. Think again! There have been multiple cases where adversaries such as law enforcement have used Writeprint techniques to help catch and sentence people. Here are some examples:
- The OxyMonster case (https://arstechnica.com/tech-policy/2018/06/dark-web-vendor-oxymonster-turns-out-to-be-a-frenchman-with-luscious-beard/ [Archive.org]): Public data revealed that Vallerius (a.k.a. OxyMonster) has Instagram and Twitter accounts. Agents compared the writing style of “OxyMonster” on the Dream Market forum, where he held a senior Moderator role, to the writing style of Vallerius on his public Instagram and Twitter accounts. Agents discovered many similarities in the use of words and punctuation, including the word “cheers”, double exclamation marks, frequent use of quotation marks, and intermittent French posts.
Do not use the same writing style for your sensitive activities as for your normal activities. In particular, pay close attention to your use of common phrases and punctuation. Also, as a side note: limit the amount of reference material that an adversary can use as comparison text. You do not want to find yourself in trouble because of your political Twitter post, or that Reddit post you made years ago, do you?
- Here is another example from the book American Kingpin, about how a DEA agent investigated the writing style of DPR (Dread Pirate Roberts a.k.a Ross Ulbricht, founder of the Silk Road Dark Market) from a unique perspective: For one, Ross Ulbricht used the word “epic” a lot, which showed that he was likely young. He also used emoji smiley faces in his writing, though he never used a hyphen as the nose, writing them as “:)” rather than the old-fashioned “:-)”. Yet the one attribute about Ulbricht that stood out was that rather than writing “yes” or “yeah” on the site’s forums, Ulbricht instead always typed “yea”.
Pay attention to the little things that might add up. If you usually reply with “ok” to people, maybe try to reply with “okay” for your sensitive activities. You should NEVER use words or phrases from your sensitive activities (even if they are not in a public post) for normal purposes, and vice versa. Ross Ulbricht used “frosty” as the name for his Silk Road servers and for his YouTube account, which helped convince law enforcement that Dread Pirate Roberts was, in fact, Ross Ulbricht.
How to counteract the efforts of your adversary:
- Reduce the amount of comparison text for adversaries to compare you with. This goes with having a small online footprint for your normal activities.
- Use a word processor (such as LibreOffice Writer) to fix any grammatical or spelling errors that you regularly make.
- Reduce or change the idioms that you use while conducting sensitive activities.
- Understand how your identity affects your writing style: Is your alias younger? Older? More educated? Or less educated? If your identity is older, maybe write in a more formal, J.R.R. Tolkien-like style.
- Pay attention to how your slang and spelling might identify you. If you are from the UK, you would say “maths”, but if you are from the US, you would say “math”. It does not matter which one you say; all that matters is that it can be used to profile you. This also applies to slang, as many regions have their own extremely particular slang. For example, you cannot ask someone from the USA for a “rubber” and expect them to give you an “eraser”.
- Pay attention to your use of emoticons and emojis. In the previous example, the DEA agent was able to make a correct assumption that Ulbricht was likely young because he did not use a hyphen when making a smiley emoticon.
- Pay attention to how you structure your writing. Do you use two spaces after a period? Do you constantly use parentheses in your writing? Do you use the Oxford comma?
- Consider what symbols you use in your writing. Do you use €, £ or $? Do you use “dd-mm-yyyy” or “mm-dd-yyyy” for dates? Do you use “08:00 pm” or “20:00” for time?
What different linguistic choices could say about you:
- Russians for example use “)” instead of “:-)” or “:)” to express a smiley face.
- Scandinavians use “=)” instead of “:-)” or “:)” for a smiley face.
- Younger people generally do not use a hyphen in their smiley faces and just use “:)”.
- Two spaces after a period give the impression that you are older, because this is how typing was taught to people learning on typewriters.
- In the US, people write large numbers with commas as thousands separators and a period as the decimal mark (1,234.56). Many other countries, including much of continental Europe, do the opposite (1.234,56).
Spelling, slang, and symbols:
- Obviously, people in different nations use different slang. This is even more pronounced when you use slang that is not as well known in other places such as someone from the UK mentioning a “headmaster” when in other nations it is referred to as a “principal”.
- Spelling is another important factor that is similar to slang, except it is harder to control. If you want to pretend that you are from the USA, but you actually live in Australia, it only takes one instance of spelling “color” as “colour” to let people understand that something is up.
- Some people also spell words in a particular way that is not regional. For example, you might spell “ax” as “axe” or vice versa.
- Of course, the symbols you use on your keyboard can give a lot of information away, such as £’s or $’s.
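As a rough illustration of how easily such markers can be picked out of a draft, here is a small sketch that scans text for a few of the habits mentioned above. The marker names and regular expressions in `MARKERS` are assumptions chosen for this example and are far from exhaustive.

```python
import re

# Hypothetical marker patterns, chosen for illustration only.
MARKERS = {
    "emoticon with nose": r":-\)",
    "emoticon without nose": r":\)",
    "equals-sign emoticon": r"=\)",
    "two spaces after period": r"\.  +\S",
    "currency symbol": r"[€£$]",
    "12-hour time": r"\b\d{1,2}:\d{2}\s?[ap]m\b",
    "24-hour time": r"\b(?:1[3-9]|2[0-3]):\d{2}\b",
}


def find_markers(text):
    """Return {marker name: [matched strings]} for every marker found."""
    found = {}
    for name, pattern in MARKERS.items():
        hits = re.findall(pattern, text, flags=re.IGNORECASE)
        if hits:
            found[name] = hits
    return found


if __name__ == "__main__":
    draft = "Meet at 20:00 :)  It costs £5.  See you"
    for name, hits in find_markers(draft).items():
        print(f"{name}: {hits}")
```

Running something like this over your own draft before posting shows which habits are visible at a glance. An adversary can automate the same check just as easily.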
Techniques to prevent writeprinting:
Here are some techniques in order of use:
- Spelling and grammar checking: This helps prevent some fingerprinting done using your spelling and grammar mistakes.
- Offline using a word processor: Use a word processor such as LibreOffice Writer and use its spelling and grammar check features to fix mistakes you might have typed.
- Online using an online service: If you do not have a word processor available or do not want to use one, you can also use an online spelling and grammar checker such as Grammarly (this requires an e-mail and an account creation).
Once you are done with spelling and grammar fixes, use a website or software such as Google Translate (or for a more privacy-friendly version, https://translate.metalune.xyz) to translate your text through several different languages before translating it back to your original language. These back-and-forth translations will alter your messages and make fingerprinting more difficult.
Disclaimer: A study archived here: https://web.archive.org/web/20181125133942/https://www.cs.drexel.edu/~sa499/papers/adversarial_stylometry.pdf seems to indicate that the translation technique is ineffective at preventing stylometry. This step might be useless.
Search and replace:
Finally, and optionally, add some salt by purposefully adding some mistakes to your messages.
First, decide upon a list of words that you rarely misspell, maybe the words “grammatical”, “symbol”, and “pronounced” (this list should include more words), and deliberately misspell some of their occurrences. Do not use an AutoCorrect automatic replace option for this, as it might make a replacement where it does not make sense. Instead, use Search and Replace and do this manually for each word. Do not use “Replace All” either, and review each change. This is just the first step in providing misinformation against linguistic fingerprinting.
Next, find a list of words that you commonly use in your writing. Let us say that I love to use contractions when I write, and that I always use words such as “can’t”, “don’t”, “shouldn’t”, “won’t”, or “let’s”. In that case, go into LibreOffice Writer and use “Search and Replace” to replace contractions with the full versions of the words (“can’t” > “cannot”, “don’t” > “do not”, “shouldn’t” > “should not”, “won’t” > “will not”, “let’s” > “let us”). This can make a large difference in your writing and in how people, and most importantly your adversaries, perceive you. You can swap most words for alternatives; for example, you can change “huge” to “large”. Just make sure these words fit with your identity.
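For those who prefer a script to a word processor, the contraction replacement could be sketched as follows. This is only an illustration: the `CONTRACTIONS` mapping reuses the examples from the text, and the `approve` callback is a hypothetical stand-in for reviewing each change by hand rather than using “Replace All”.

```python
import re

# Mapping taken from the examples in the text; extend it for your own habits.
CONTRACTIONS = {
    "can't": "cannot",
    "don't": "do not",
    "shouldn't": "should not",
    "won't": "will not",
    "let's": "let us",
}

# Accept both the plain (') and the typographic (’) apostrophe.
PATTERN = re.compile(
    r"\b(?:" + "|".join(c.replace("'", "['’]") for c in CONTRACTIONS) + r")\b",
    flags=re.IGNORECASE,
)


def expand_contractions(text, approve=lambda found, proposed: True):
    """Replace contractions one occurrence at a time.

    `approve(found, proposed)` is called for every occurrence and decides
    whether that single replacement is applied.
    """
    def substitute(match):
        found = match.group(0)
        proposed = CONTRACTIONS[found.lower().replace("’", "'")]
        if found[0].isupper():
            # Keep the capitalization of a sentence-initial contraction
            proposed = proposed[0].upper() + proposed[1:]
        return proposed if approve(found, proposed) else found

    return PATTERN.sub(substitute, text)


if __name__ == "__main__":
    print(expand_contractions("Don't worry, we can’t lose."))
```

Passing an `approve` function that asks for confirmation (for example with `input()`) keeps a human in the loop for every change.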
Now, consider changing your word choices to fit a geographic location. Maybe you live in the US, and you want to give the impression that your identity is from the UK. For example, you can make use of location-based spelling and lexicon. This is risky, and one mistake can give it away.
First off, you need to decide which location you want your identity to appear to be from. Here is an example of giving the impression that you are from the US or the UK. First, you will need to understand a thing or two about where your identity is “from”. Do not pretend that you are from the UK while knowing nothing about it other than that it exists.
After you have decided upon a good location that your identity is from, research the differences between the two variants of the language (in this case between UK English and US English). Thanks to the internet, this is quite easy, and you can find Wikipedia pages conveniently highlighting the regional differences of a language between two nations. Pay attention to how certain words are spelled (“metre” > “meter”) and what words are exchanged with each other (“boot” > “trunk”). Now that you have a list of words that can be exchanged with each other, and a list of spellings that are different, use “Search and Replace” in your editor and change words such as “colour” into “color”, and “lorry” into “truck”. Again, do not use an AutoCorrect feature or “Replace All”, as some changes might not make sense. Review each proposed change. As an example, if you were to use AutoCorrect or “Replace All” to change the word “boot” into “trunk”, this would make perfect sense in the context of cars, but it would not make any sense in the context of shoes.
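The “review each proposed change” step can also be sketched in code. The example below only lists candidate replacements with their surrounding context and changes nothing itself, so the decision stays with you. The `UK_TO_US` mapping is a tiny illustrative list built from the words above, and `propose_changes` is a hypothetical helper name.

```python
import re

# Illustrative UK -> US pairs taken from the text; a real list would be longer.
UK_TO_US = {"colour": "color", "metre": "meter", "lorry": "truck", "boot": "trunk"}


def propose_changes(text, mapping=UK_TO_US, context=20):
    """List each candidate replacement with nearby text for manual review.

    Returns (position, found word, proposed word, snippet) tuples sorted by
    position. Nothing is replaced automatically.
    """
    proposals = []
    for word, replacement in mapping.items():
        # Whole-word matches only, so plurals such as "metres" need their own entry
        for m in re.finditer(rf"\b{re.escape(word)}\b", text, flags=re.IGNORECASE):
            start, end = m.span()
            snippet = text[max(0, start - context):end + context]
            proposals.append((start, m.group(0), replacement, snippet))
    return sorted(proposals)


if __name__ == "__main__":
    draft = "Put the bag in the boot. My left boot is wet."
    for position, found, proposed, snippet in propose_changes(draft):
        print(f"{position}: {found!r} -> {proposed!r} in: ...{snippet}...")
```

Here both occurrences of “boot” are flagged, but only the first one (the car) should become “trunk”. The second one (the shoe) should be left alone.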
Understand that you have to constantly think of what you type and how you type while conducting sensitive activities.
Understand that altering your writing style for such purposes can ultimately change your baseline writing style, ironically making your writing traceable over longer periods.
Proofread yourself at least one time after you are done writing anything to verify you made no mistakes in your process. Trust (yourself) but verify anyway.
- https://www.whonix.org/wiki/Surfing_Posting_Blogging#Stylometry [Archive.org]: Whonix documentation about stylometry.
- https://wikipedia.org/wiki/Forensic_linguistics [Wikiless] [Archive.org]: Gives a brief rundown of the basics of forensic linguistics, not too informative.
- https://wikipedia.org/wiki/Writeprint [Wikiless] [Archive.org]: Gives a brief and informative rundown of forensic linguistics applied to internet investigations.
- https://wikipedia.org/wiki/Stylometry [Wikiless] [Archive.org]: Gives a brief overview of Stylometry.
- https://wikipedia.org/wiki/Content_similarity_detection [Wikiless] [Archive.org]: I would recommend reading this, quite informative.
- https://wikipedia.org/wiki/Author_profiling [Wikiless] [Archive.org]: Read through this as well if you are interested in this topic.
- https://wikipedia.org/wiki/Native-language_identification [Wikiless] [Archive.org]: This is less important if you use a translator, but if you do not use a translator to communicate on forums that are not in your native language, consider giving this a quick read through.
- https://wikipedia.org/wiki/Computational_linguistics [Wikiless] [Archive.org]: Only read through this if this topic is interesting to you.
- https://regmedia.co.uk/2017/09/27/gal_vallerius.pdf [Archive.org]: Explains how authorities used forensic linguistics to help arrest OxyMonster (pages 13 – 14).
- https://wikipedia.org/wiki/Ted_Kaczynski#After_publication [Wikiless] [Archive.org]: Ted Kaczynski may have had an IQ of 167, but he was caught primarily based on forensic linguistics.
- https://i.blackhat.com/USA-19/Wednesday/us-19-Wixey-Im-Unique-Just-Like-You-Human-Side-Channels-And-Their-Implications-For-Security-And-Privacy.pdf [Archive.org]: Explains how your writing style can be used to track you, I highly recommend reading through these slides, or watching the accompanying presentation on YouTube.
- https://media.defcon.org/DEF%20CON%2026/DEF%20CON%2026%20presentations/DEFCON-26-Matt-Wixey-Betrayed-by-the-Keyboard-Updated.pdf [Archive.org]: Explains how your writing style can be used to track you, I highly recommend reading through these slides, or watching the accompanying presentation on YouTube, this is quite similar to the last presentation.
- https://i.blackhat.com/us-18/Wed-August-8/us-18-Wixey-Every-ROSE-Has-Its-Thorn-The-Dark-Art-Of-Remote-Online-Social-Engineering.pdf [Archive.org]: This goes over how to potentially spot deception through the internet, and presents a checklist to see how trustworthy someone is. I would advise reading the slides or watching the presentation on YouTube.
Source: The Hitchhiker’s Guide to Online Anonymity, written by AnonyPla © CC BY-NC 4.0