From charlesreid1

Revision as of 19:42, 16 August 2015

Well, we made our wordlist with Cewl:

#!/bin/sh

echo "Creating Seahawks wordlist..."
cewl -v en.wikipedia.org/wiki/Seattle_Seahawks -d 1 -w seahawks_wikipedia.txt
echo "Done."

The only problem is, it's got a lot of garbage. For one, all of the language links from Wikipedia end up contaminating the end of the list:

Hindi
replace
Indonesian
Italian
Hebrew
Hungarian
Marathi
Dutch
Japanese
Norwegian
Polish
Portuguese
Russian
Serbian
Serbo
Finnish
Swedish
Turkish
Ukrainian
Vietnamese
Chinese

There are also lots of 3-letter words like "the", which I thought Cewl would ignore. In any case, here's a script to clean things up. It drops words shorter than 7 letters, and it cuts the last 50 lines, which removes some of the garbage at the end:

#!/bin/sh
# only keep 7-letter words or longer,
# and throw out the last 50 lines
# (a negative count for head -n needs GNU head)

grep "\w\{7,\}" seahawks_wikipedia.txt | head -n -50 > short_seahawks_wikipedia.txt
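
Cutting a fixed 50 lines off the end is a blunt tool, since it can also throw away real words. Here is a minimal sketch of an alternative: filter by length, then remove only the words that appear in a stoplist. The function name clean_wordlist and the stoplist file languages.txt are hypothetical, made up for illustration, and the sketch assumes the wordlist has one word per line.

```shell
#!/bin/sh
# Hypothetical helper (not part of the original script):
#   clean_wordlist INPUT MINLEN STOPLIST
# Prints the words from INPUT that are at least MINLEN characters long
# and that do not appear as a whole line in STOPLIST.
clean_wordlist() {
    input=$1
    minlen=$2
    stoplist=$3

    # Keep lines with at least MINLEN characters (one word per line assumed),
    # then drop exact, whole-line matches against the stoplist.
    awk -v min="$minlen" 'length($0) >= min' "$input" |
        grep -v -x -F -f "$stoplist"
}
```

With the language names from the end of the list saved one per line in languages.txt, it could be used as clean_wordlist seahawks_wikipedia.txt 7 languages.txt > short_seahawks_wikipedia.txt. The stoplist only matches whole lines, so a longer word that merely contains a language name is kept.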