Cewl/Cleaning Wordlists
From charlesreid1
Revision as of 19:42, 16 August 2015
Well, we made our wordlist with Cewl:
#!/bin/sh
echo "Creating Seahawks wordlist..."
cewl -v en.wikipedia.org/wiki/Seattle_Seahawks -d 1 -w seahawks_wikipedia.txt
echo "Done."
The only problem is, it's got a lot of garbage. For one, all of the language links from Wikipedia end up contaminating the end of the list:
Hindi
replace
Indonesian
Italian
Hebrew
Hungarian
Marathi
Dutch
Japanese
Norwegian
Polish
Portuguese
Russian
Serbian
Serbo
Finnish
Swedish
Turkish
Ukrainian
Vietnamese
Chinese
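One way to target these entries directly, rather than relying on where they sit in the file, is an exact-match blocklist. The sketch below is only an illustration: the function name `strip_language_names` is made up for this page, and the list of names is copied from the sample above, so it will miss any language not shown there. Note that it also removes a legitimate occurrence of any of these words.

```shell
#!/bin/sh
# Hypothetical helper (not part of Cewl): strip_language_names WORDLIST
# Prints WORDLIST without the lines that exactly match a language name
# taken from the sample output above.
strip_language_names() {
    # -v  print lines that do NOT match
    # -x  a pattern must match the whole line
    # -F  treat patterns as fixed strings, not regular expressions
    grep -v -x -F \
        -e Hindi -e Indonesian -e Italian -e Hebrew -e Hungarian \
        -e Marathi -e Dutch -e Japanese -e Norwegian -e Polish \
        -e Portuguese -e Russian -e Serbian -e Serbo -e Finnish \
        -e Swedish -e Turkish -e Ukrainian -e Vietnamese -e Chinese \
        "$1"
}
```

It could be called as `strip_language_names seahawks_wikipedia.txt > no_languages.txt`, where the output filename is just an example.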
There are also lots of 3-letter words like "the", which I thought Cewl would ignore; its minimum word length (the -m option) defaults to 3, so they are kept. In any case, here's a script to clean things up. It drops words shorter than 7 letters, and it filters out some of the garbage at the end:
#!/bin/sh
# only keep words of 7 or more letters,
# and throw out the last 50 lines (a negative count needs GNU head)
grep "\w\{7,\}" seahawks_wikipedia.txt | head -n -50 > short_seahawks_wikipedia.txt
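The negative line count for `head` is a GNU extension, so the script above may not work with other `head` implementations. A rough portable sketch of the same two steps using awk is shown here. The function name `clean_wordlist` and its arguments are invented for illustration. It also assumes one word per line and tests the length of the whole line, which is slightly stricter than the grep pattern above (that pattern matches any line containing a run of 7 or more word characters).

```shell
#!/bin/sh
# Hypothetical helper (not part of Cewl): clean_wordlist INPUT [MINLEN] [DROP]
# Keeps lines of at least MINLEN characters (default 7),
# then drops the last DROP lines of what is left (default 50).
clean_wordlist() {
    input=$1
    minlen=${2:-7}
    drop=${3:-50}
    # Step 1: length filter. Step 2: buffer the survivors and print
    # all but the final DROP of them.
    awk -v min="$minlen" 'length($0) >= min' "$input" |
        awk -v drop="$drop" '
            { buf[NR] = $0 }
            END { for (i = 1; i <= NR - drop; i++) print buf[i] }
        '
}
```

To reproduce the script above, this would be run as `clean_wordlist seahawks_wikipedia.txt 7 50 > short_seahawks_wikipedia.txt`.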