Please ignore the <Number> sets at the beginning of each email. Ironically, we broke word processing in BMC.

I've been parsing and indexing a lot of data leaks (email and password pairs), more than 100 million lines, and I'd like to share some insights I had while doing it. They are just curiosities, but maybe someone will find these points useful. In short, I'm going to talk from a defensive point of view: things that gave me trouble even when the passwords leaked in plain text. Think of these as actions you may take to gain another level of safety, at least against most of the indexers in the wild.

How Parsing and Indexing Works

The first thing to keep in mind: once your email and password have leaked, how will they be processed? If you thought of regex, you are right. Data leaks by the millions of lines at a time, so it is virtually impossible to handle by hand. That said, the criteria used by those regexes can be bypassed. Let's talk about email addresses. Take a look:

This is what a parser expects to find. No tricks, just human-readable lines. However, you can use almost any symbol you can imagine. So, when it comes to safety, maybe you want to make things difficult. See these:

Strange? Actually not; I've found a lot like these. In order, the first example has two "@". Believe me, a lot of parsers will break in this situation. Their regexes will match the first [email protected], then reach the second "@". Maybe they will decide that what follows is, in fact, a password; in that case, they will continue until they find a valid email. The second and third examples use pairs of "()" and "<>". The point here is that some databases wrap values in these pairs too, so the leaked data will look more or less like this:

That said, the parser will probably try to remove these symbols. In doing so it will also remove the original ones, resulting in invalid addresses. The last example includes a lot of things. Most of those symbols will not be accepted when the email is created, because they can cause problems there too. But if a quotation mark, for example, turns out to be accepted, you have almost certainly broken every single parser you can imagine. They will break and skip your address while parsing a file.
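To make the two failure modes above a bit more concrete, here is a minimal sketch. The regex, the helper names and the example.com / example.org addresses are all made up for illustration; real parsers use their own patterns, so take this as one plausible shape, not the way any particular tool works.

```python
import re

# Hypothetical pattern for a "human-readable" address; real parsers vary.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def find_email(field):
    """Return (email, leftover) for the first email-looking substring, or None."""
    match = EMAIL_RE.search(field)
    if match is None:
        return None
    return match.group(0), field[match.end():]


def strip_wrappers(field):
    """Remove () and <> on the assumption that the dump added them."""
    return re.sub(r"[()<>]", "", field)


# Two "@": the regex stops at the first complete address and the rest
# is left over, where a naive parser may mistake it for the password.
print(find_email("john@example.com@example.org"))

# Parentheses that belong to the address are removed together with the
# wrappers, so the stored address is no longer the real one.
print(strip_wrappers("<john(work)@example.com>"))
```

In the first case the leftover starts with "@", which is where a naive parser has to guess. In the second case the cleanup step cannot tell the wrapper symbols from the ones that were part of the address.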

The same applies to passwords. Including quotation marks, parentheses, and the like can let your password pass through a parser almost invisibly. You can also use blank spaces. And, if you feel very angry, use characters that are not valid UTF-8. Those will break the entire file at read time and will need special intervention. Those are the worst.
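Here is a small sketch of why a single badly encoded password hurts so much. It assumes the indexer decodes the dump as UTF-8, which is my assumption and not a rule; the function names and the sample dump are hypothetical.

```python
def read_lines_strict(raw: bytes):
    """Decode the whole dump as UTF-8; a single bad byte fails everything."""
    return raw.decode("utf-8").splitlines()


def read_lines_tolerant(raw: bytes):
    """Decode line by line and set undecodable lines aside for manual work."""
    good, bad = [], []
    for line in raw.split(b"\n"):
        try:
            good.append(line.decode("utf-8"))
        except UnicodeDecodeError:
            bad.append(line)
    return good, bad


# Hypothetical dump: the second password ends with byte 0xE9, which is
# "é" in Latin-1 but an incomplete sequence in UTF-8.
dump = b"a@example.com:secret\nb@example.com:caf\xe9\nc@example.com:pw"

try:
    read_lines_strict(dump)
except UnicodeDecodeError as error:
    print("whole file rejected:", error.reason)

good, bad = read_lines_tolerant(dump)
print(len(good), "lines indexed,", len(bad), "line set aside")
```

The tolerant version is the "special intervention": someone has to write it, and the line with the odd byte still ends up in the pile that needs a human.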

The Separators

Another thing you may try is to use common separators in your addresses and passwords. Every single file I've parsed used one of these:

, : ;

My final file uses the : separator, like this: <103>[email protected]:password. More than once I've found something like pass:word. It looks simple, but separators are not removed from the original sources because the parsers need them, and most parsers are not prepared to find more separators than columns in a line. This will break the line or index an incomplete password or address, especially if the separator is used at the beginning or at the end of the value: ;pass,word;
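A rough sketch of what happens with an extra separator, assuming a two-column email/password layout. The three functions are hypothetical strategies I made up to show the outcomes; the address is a placeholder.

```python
def split_strict(line, sep=":"):
    """Split on every separator; drop the line unless it has exactly 2 columns."""
    parts = line.split(sep)
    if len(parts) != 2:
        return None  # line is dropped
    return parts[0], parts[1]


def split_truncating(line, sep=":"):
    """Split on every separator and keep only the first two columns."""
    parts = line.split(sep)
    if len(parts) < 2:
        return None
    return parts[0], parts[1]


def split_first(line, sep=":"):
    """Split on the first separator only; everything after it is the password."""
    email, found, password = line.partition(sep)
    return (email, password) if found else None


line = "user@example.com:pass:word"
print(split_strict(line))      # dropped
print(split_truncating(line))  # incomplete password
print(split_first(line))       # full password, but only for this layout

# A separator at the start of the value leaves an empty password column.
print(split_truncating("user@example.com;;pass,word;", sep=";"))
```

Only the last strategy keeps the password whole, and it only helps when the extra separator is in the password, not in the address.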

Another helpful tip is to use regional characters, especially symbols not found on standard English keyboards. For example, accents (e.g. âãáàäь ུ), specific letters/symbols (çюи影མ), etc.
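One way regional characters can get a line skipped is a pattern written with only ASCII in mind. The sketch below assumes such a pattern; the regex and the function name are hypothetical, and plenty of parsers will be more permissive than this.

```python
import re

# Hypothetical pattern: address, ":" separator, then a password limited
# to printable ASCII (space through "~").
ASCII_LINE_RE = re.compile(r"([^\s:]+@[^\s:]+):([\x20-\x7e]+)")


def parse_ascii_only(line):
    """Return (email, password), or None if the line does not fit the pattern."""
    match = ASCII_LINE_RE.fullmatch(line)
    return match.groups() if match else None


print(parse_ascii_only("user@example.com:senha"))
print(parse_ascii_only("user@example.com:senhaç"))
```

The second line differs by a single "ç" and no longer fits the pattern, so this parser would skip it.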

Fun Fact About the Placeholders

To end our talk, let's look at placeholders. When an indexer finds an invalid line, it can take one of two actions: drop the line, or index the valid part and insert a placeholder in place of the broken part. Some indexers avoid including placeholders because the database has more value if all its lines are valid. So, when they find a line with a placeholder, they drop it. That said, if your password looks like one of the following, there is a chance that it will be ignored by an indexer:

  • Password

  • <password>

  • (passwd)

  • null

  • " "

  • x

  • xxx

  • ?

  • ???
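A minimal sketch of the kind of filter that would drop those lines. The set simply mirrors the list above; the case-insensitive comparison and the function names are my own assumptions about how such a filter might be written.

```python
# Values treated as placeholders, taken from the list above. The empty
# string covers a password made only of blank spaces after stripping.
PLACEHOLDERS = {
    "password", "<password>", "(passwd)", "null",
    '" "', "", "x", "xxx", "?", "???",
}


def is_placeholder(password):
    """True if the password looks like a placeholder (assumed case-insensitive)."""
    return password.strip().lower() in PLACEHOLDERS


def drop_placeholder_rows(records):
    """Keep only (email, password) rows whose password is not a placeholder."""
    return [(email, pw) for email, pw in records if not is_placeholder(pw)]


rows = [
    ("a@example.com", "null"),
    ("b@example.com", "Password"),
    ("c@example.com", "s3cret"),
]
print(drop_placeholder_rows(rows))
```

An indexer using a filter like this cannot tell a real password of "null" from one it wrote itself, so both get dropped.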

Mind you, I'm not recommending that you use weak passwords. I'm saying that during indexing, they might be discarded in the process. This should not be considered a security measure; it is just a coincidence.