You want to build a word cloud, an infographic where the size of a word corresponds to how often it appears in the body of text.
To do this, you'll need data. Write code that takes a long string and builds its word cloud data in a hash map, where the keys are words and the values are the number of times the words occurred.
Think about capitalized words. For example, look at these sentences:
"After beating the eggs, Dana read the next step:"
"Add milk and eggs, then add flour and sugar."
What do we want to do with "After", "Dana", and "add"? In this example, your final hash map should include one "Add" or "add" with a value of 2. Make reasonable (not necessarily perfect) decisions about cases like "After" and "Dana".
Assume the input will only contain words and standard punctuation.
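One way to make the capitalization decision concrete is sketched below. It is only a heuristic, not the definitive answer: a word seen in both capitalized and lowercase form is assumed to be a lowercase word that happened to start a sentence, while a word seen only capitalized (like "After" or "Dana") is kept as-is. The helper name `add_word` is hypothetical.

```python
def add_word(counts, word):
    """Add one word to counts, merging capitalized and lowercase forms.

    Heuristic (an assumption, not a perfect rule): if a word shows up both
    capitalized and lowercase, count both under the lowercase form.
    """
    if word in counts:
        # This exact form has been seen before.
        counts[word] += 1
    elif word.lower() in counts:
        # A lowercase form already exists, so fold this one into it.
        counts[word.lower()] += 1
    elif word == word.lower() and word.capitalize() in counts:
        # Only a capitalized form exists so far; seeing the lowercase form
        # suggests the capital was just sentence-initial, so move the count.
        counts[word] = counts.pop(word.capitalize()) + 1
    else:
        counts[word] = 1


counts = {}
for w in ["Add", "milk", "and", "eggs", "then", "add", "flour", "and", "sugar"]:
    add_word(counts, w)
print(counts["add"])  # 2
```

With this rule "After" and "Dana" stay capitalized unless a lowercase "after" or "dana" appears elsewhere in the text, which is reasonable but not perfect.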
You could make a reasonable argument to use regex in your solution. We won't, mainly because performance is difficult to measure and can get pretty bad.
Are you sure your code handles hyphenated words and standard punctuation?
Are you sure your code reasonably handles the same word with different capitalization?
Try these sentences:
"We came, we saw, we conquered...then we ate Bill's (Mille-Feuille) cake."
"The bill came to five dollars."
We can do this in O(n) runtime and space.
The final hash map we return should be the only data structure whose length is tied to n.
We should only iterate through our input string once.
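The hints above can be sketched as a single pass that tracks only the start index and length of the current word, so the returned dictionary is the only structure that grows with the input. This is a minimal sketch under stated assumptions: every word is lowercased (the simplest capitalization decision), and a hyphen or apostrophe counts as part of a word only when a letter follows it. The function name `build_word_cloud` is hypothetical.

```python
def build_word_cloud(text):
    """Count words in one pass without regex (assumes lowercasing everything)."""
    counts = {}
    start = 0    # index where the current word begins
    length = 0   # number of characters in the current word

    def flush():
        # Record the current word, if any, and reset the word length.
        nonlocal length
        if length > 0:
            word = text[start:start + length].lower()
            counts[word] = counts.get(word, 0) + 1
            length = 0

    for i, ch in enumerate(text):
        if ch.isalpha():
            if length == 0:
                start = i
            length += 1
        elif (ch in "-'" and length > 0
              and i + 1 < len(text) and text[i + 1].isalpha()):
            # Hyphen or apostrophe inside a word, e.g. "Mille-Feuille", "Bill's".
            length += 1
        else:
            # Whitespace, ellipsis, parentheses and so on end the current word.
            flush()
    flush()  # the text may end in the middle of a word
    return counts


cloud = build_word_cloud(
    "We came, we saw, we conquered...then we ate Bill's (Mille-Feuille) cake."
)
print(cloud["we"])  # 4
```

Note that the ellipsis separates "conquered" from "then" without producing an empty word, because `flush` does nothing when no word is in progress.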
Runtime and memory cost are both O(n). This is the best we can do because we have to look at every character in the input string and we have to return a hash map of every unique word. We optimized to make only one pass over our input and to keep only one data structure.
We haven't explicitly talked about how to handle more complicated
character sets. How would you make your solution work with
more Unicode characters? What changes need to be made
to handle silly sentences like these:
I'm singing ♬ on a ☔ day.
☹ + ☕ = ☺.
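One possible direction for the sentences above, sketched with Python's standard `unicodedata` module: classify each character by its Unicode general category instead of testing for ASCII letters. The policy here is an assumption for illustration only: letters (categories starting with "L") build words, a pictographic symbol (category "So") counts as a one-character word, and everything else, including math symbols like "+" and "=", is a separator. The generator name `tokenize_unicode` is hypothetical.

```python
import unicodedata


def tokenize_unicode(text):
    """Yield words and symbol tokens, using Unicode general categories."""
    current = []  # characters of the word in progress
    for ch in text:
        category = unicodedata.category(ch)
        if category.startswith("L") or (ch == "'" and current):
            # Letters of any script, plus apostrophes inside a word ("I'm").
            current.append(ch)
        else:
            if current:
                # Drop any apostrophe left dangling at the end of the word.
                yield "".join(current).rstrip("'")
                current = []
            if category == "So":
                # Assumed policy: symbols such as musical notes or weather
                # signs are counted as words of their own.
                yield ch
    if current:
        yield "".join(current).rstrip("'")


for token in tokenize_unicode("I'm singing \u266c on a \u2614 day."):
    print(token)
```

Feeding these tokens into the same counting step as before would put the symbols in the word cloud alongside ordinary words. Whether that is desirable is a design decision the question leaves open.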
We limited our input to letters, hyphenated words and
punctuation. How would you expand your functionality to
include numbers, email addresses, Twitter handles, etc.?
How would you add functionality to identify phrases or words
that belong together but aren't hyphenated? ("Fire truck" or
How could you improve your capitalization algorithm?
How would you avoid having duplicate words that are just
plural or singular possessives?
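For the possessives question, one naive starting point is to strip the possessive ending before counting, so that "Bill's" and "Bill" share an entry. This is a rough sketch and not a full answer: it cannot tell possessives from contractions ("it's" becomes "it"), and it does not merge plurals with singulars ("dogs" and "dog" stay separate). The helper name `strip_possessive` is hypothetical.

```python
def strip_possessive(word):
    """Remove a trailing possessive marker (naive; also hits contractions)."""
    if word.endswith("'s") or word.endswith("'S"):
        # Singular possessive: "Bill's" -> "Bill"
        return word[:-2]
    if word.endswith("'"):
        # Plural possessive: "dogs'" -> "dogs"
        return word[:-1]
    return word


print(strip_possessive("Bill's"))  # Bill
print(strip_possessive("dogs'"))   # dogs
```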