From charlesreid1

m (Admin moved page Cewl/Wordlists to Cewl/Cleaning Wordlists)
[[Category:Security]]
[[Category:Passwords]]
[[Category:Kali]]

Revision as of 19:55, 16 August 2015

Cleaning Up Wikipedia Wordlists

This goes through some shortcuts for cleaning up Cewl results from Wikipedia pages.

Making a Wikipedia Wordlist

Make your wordlist from Wikipedia with Cewl:

#!/bin/sh

echo "Creating Seahawks wordlist..."
cewl -v en.wikipedia.org/wiki/Seattle_Seahawks -d 1 -w seahawks_wikipedia.txt
echo "Done."

The Problem

The problem with the resulting wordlist is that there's a lot of garbage. For example, all the language links on Wikipedia add non-sequiturs to your wordlist:

Hindi
replace
Indonesian
Italian
Hebrew
Hungarian
Marathi
Dutch
Japanese
Norwegian
Polish
Portuguese
Russian
Serbian
Serbo
Finnish
Swedish
Turkish
Ukrainian
Vietnamese
Chinese

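There are also lots of 3-letter words like "the", which I thought Cewl would ignore. (Cewl's default minimum word length is 3, so these are kept unless a larger minimum is set with the -m option.)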
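One possible way to strip known junk such as the language names is to keep them in a blocklist file and let grep remove exact matches. The sketch below is only an illustration: the file names (sample_wordlist.txt, language_names.txt, sample_wordlist_clean.txt) and the tiny sample wordlist are made up for the example and are not part of the original workflow.

```shell
#!/bin/sh
# Hypothetical file names, used only for illustration.
wordlist=sample_wordlist.txt
blocklist=language_names.txt
cleaned=sample_wordlist_clean.txt

# A tiny stand-in for Cewl output: one word per line.
printf '%s\n' Seahawks Hindi Seattle Italian football Russian > "$wordlist"

# Words to discard (a few of the language names listed above).
printf '%s\n' Hindi Italian Russian > "$blocklist"

# -v inverts the match, -x matches whole lines only, -F treats the
# patterns as fixed strings, -i ignores case, -f reads patterns from a file.
grep -v -x -i -F -f "$blocklist" "$wordlist" > "$cleaned"

cat "$cleaned"
```

Because -x only matches whole lines, a blocklist entry like "Serbo" would not remove a longer word that merely contains it.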

Cleaning Up Cewl Results

Here's a script that drops words shorter than 7 letters, removes lines starting with "wg" (Wikipedia-specific junk), and throws out the last 50 lines of the list:

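#!/bin/sh
# Only keep words of 7 letters or longer,
# drop lines starting with "wg" (Wikipedia-specific junk),
# and throw out the last 50 lines.
# Note: a negative count for head -n is a GNU extension.

grep "\w\{7,\}" seahawks_wikipedia.txt | grep -v "^wg" | head -n -50 > short_seahawks_wikipedia.txt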
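To see what each stage of the pipeline does, here is a small sketch that runs the same filters on a made-up sample. The file names and sample words are assumptions for illustration only, and the last stage drops 2 lines instead of 50 so the effect is visible on a short list. It assumes GNU grep and GNU head, as the script above does.

```shell
#!/bin/sh
# Hypothetical file names, used only for illustration.
sample=sample_cewl_output.txt
result=sample_cewl_short.txt

# Stand-in for Cewl output; the last 2 lines play the role of trailing junk.
printf '%s\n' the Seahawks wgPageName Seattle football championship Retrieved Navigation > "$sample"

# Stage 1: keep lines containing a run of 7 or more word characters.
# Stage 2: drop lines that start with "wg".
# Stage 3: drop the last 2 lines of whatever survived stages 1 and 2.
grep "\w\{7,\}" "$sample" | grep -v "^wg" | head -n -2 > "$result"

cat "$result"
```

Note that head runs last, so the lines it discards are counted from the end of the already-filtered list, not from the end of the raw Cewl output.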