<?xml version="1.0"?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
	<id>https://charlesreid1.com/w/index.php?action=history&amp;feed=atom&amp;title=Data_Engineering%2FTwitter_Example</id>
	<title>Data Engineering/Twitter Example - Revision history</title>
	<link rel="self" type="application/atom+xml" href="https://charlesreid1.com/w/index.php?action=history&amp;feed=atom&amp;title=Data_Engineering%2FTwitter_Example"/>
	<link rel="alternate" type="text/html" href="https://charlesreid1.com/w/index.php?title=Data_Engineering/Twitter_Example&amp;action=history"/>
	<updated>2026-06-20T04:36:18Z</updated>
	<subtitle>Revision history for this page on the wiki</subtitle>
	<generator>MediaWiki 1.39.12</generator>
	<entry>
		<id>https://charlesreid1.com/w/index.php?title=Data_Engineering/Twitter_Example&amp;diff=21833&amp;oldid=prev</id>
		<title>Admin at 08:22, 20 October 2017</title>
		<link rel="alternate" type="text/html" href="https://charlesreid1.com/w/index.php?title=Data_Engineering/Twitter_Example&amp;diff=21833&amp;oldid=prev"/>
		<updated>2017-10-20T08:22:21Z</updated>

		<summary type="html">&lt;p&gt;&lt;/p&gt;
&lt;table style=&quot;background-color: #fff; color: #202122;&quot; data-mw=&quot;interface&quot;&gt;
				&lt;col class=&quot;diff-marker&quot; /&gt;
				&lt;col class=&quot;diff-content&quot; /&gt;
				&lt;col class=&quot;diff-marker&quot; /&gt;
				&lt;col class=&quot;diff-content&quot; /&gt;
				&lt;tr class=&quot;diff-title&quot; lang=&quot;en&quot;&gt;
				&lt;td colspan=&quot;2&quot; style=&quot;background-color: #fff; color: #202122; text-align: center;&quot;&gt;← Older revision&lt;/td&gt;
				&lt;td colspan=&quot;2&quot; style=&quot;background-color: #fff; color: #202122; text-align: center;&quot;&gt;Revision as of 08:22, 20 October 2017&lt;/td&gt;
				&lt;/tr&gt;&lt;tr&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot; id=&quot;mw-diff-left-l1&quot;&gt;Line 1:&lt;/td&gt;
&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot;&gt;Line 1:&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;−&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;=Data Engineering Examples&lt;del style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;=&lt;/del&gt;&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;+&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;=&lt;/ins&gt;=Data Engineering Examples&lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;: &lt;/ins&gt;Twitter==&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;−&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt; &lt;/div&gt;&lt;/td&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-side-added&quot;&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;−&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&lt;del style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;==&lt;/del&gt;Twitter==&lt;/div&gt;&lt;/td&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-side-added&quot;&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br/&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br/&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;Real time data:&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;Real time data:&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot; id=&quot;mw-diff-left-l172&quot;&gt;Line 172:&lt;/td&gt;
&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot;&gt;Line 170:&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br/&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br/&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br/&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br/&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;−&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;=Flags=&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;+&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;=&lt;/ins&gt;=Flags&lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;=&lt;/ins&gt;=&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br/&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br/&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;[[Category:Data Engineering]]&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;[[Category:Data Engineering]]&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;/table&gt;</summary>
		<author><name>Admin</name></author>
	</entry>
	<entry>
		<id>https://charlesreid1.com/w/index.php?title=Data_Engineering/Twitter_Example&amp;diff=21832&amp;oldid=prev</id>
		<title>Admin: Created page with &quot;=Data Engineering Examples=  ==Twitter==  Real time data: * online queries for a web request * offline computations with very low latency * latency and throughput equally impo...&quot;</title>
		<link rel="alternate" type="text/html" href="https://charlesreid1.com/w/index.php?title=Data_Engineering/Twitter_Example&amp;diff=21832&amp;oldid=prev"/>
		<updated>2017-10-20T08:22:02Z</updated>

		<summary type="html">&lt;p&gt;Created page with &amp;quot;=Data Engineering Examples=  ==Twitter==  Real time data: * online queries for a web request * offline computations with very low latency * latency and throughput equally impo...&amp;quot;&lt;/p&gt;
&lt;p&gt;&lt;b&gt;New page&lt;/b&gt;&lt;/p&gt;&lt;div&gt;=Data Engineering Examples=&lt;br /&gt;
&lt;br /&gt;
==Twitter==&lt;br /&gt;
&lt;br /&gt;
Real time data:&lt;br /&gt;
* online queries for a web request&lt;br /&gt;
* offline computations with very low latency&lt;br /&gt;
* latency and throughput equally important&lt;br /&gt;
* Hadoop is too high-latency!&lt;br /&gt;
&lt;br /&gt;
Four data problems at twitter:&lt;br /&gt;
* Tweets&lt;br /&gt;
* Timelines&lt;br /&gt;
* Social graphs&lt;br /&gt;
* Search indices&lt;br /&gt;
&lt;br /&gt;
===Tweets===&lt;br /&gt;
&lt;br /&gt;
Definitions of data:&lt;br /&gt;
* Tweet is primary key id, user id, text, timestamp (and replies)&lt;br /&gt;
* Row storage&lt;br /&gt;
* Initially: single table vertically scaled&lt;br /&gt;
* Initially: master-worker replication (writes to master, replication to workers)&lt;br /&gt;
* Initially: Memcached for reads (Rails reads the real database, populates memcached instances)&lt;br /&gt;
&lt;br /&gt;
Issues:&lt;br /&gt;
* Scaling disk space: disk arrays &amp;gt; 800 GB problematic&lt;br /&gt;
* At 3 billion tweets, disk space 90% utilized&lt;br /&gt;
&lt;br /&gt;
Solutions:&lt;br /&gt;
* Partition: partition by primary key (one cluster holds one set of keys, another holds different)&lt;br /&gt;
* Partition: tweet IDs partitioned by user&lt;br /&gt;
* Partition: tweets by time&lt;br /&gt;
* Try each partition in order, until enough data accumulated&lt;br /&gt;
&lt;br /&gt;
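A minimal sketch of the "try each partition in order" idea, assuming time-based partitions scanned newest first. The in-memory dictionaries, the user names, and the recent_tweets helper are hypothetical stand-ins, not Twitter's actual storage layer:&lt;br /&gt;
&lt;br /&gt;
```python
# Hypothetical in-memory stand-in for time-partitioned tweet storage.
# Each partition maps tweet_id to (user_id, text); newest partition first.
partitions = [
    {30: ("alice", "newest"), 29: ("bob", "hi")},
    {20: ("alice", "older"), 19: ("alice", "older still")},
    {10: ("alice", "oldest")},
]


def recent_tweets(user_id, limit):
    """Scan partitions newest-first, stopping once `limit` tweets are found."""
    found = []
    checked = 0
    for partition in partitions:
        checked += 1
        for tweet_id in sorted(partition, reverse=True):
            owner, _text = partition[tweet_id]
            if owner == user_id:
                found.append(tweet_id)
            if len(found) == limit:
                return found, checked
    return found, checked


ids, partitions_checked = recent_tweets("alice", 2)
```
&lt;br /&gt;
Small requests tend to stop after the newest partitions, while a request for more tweets than a user has ends up visiting every partition.&lt;br /&gt;
&lt;br /&gt;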
Locality of databases:&lt;br /&gt;
* Memcached: primary key lookup is 1 ms&lt;br /&gt;
* MySQL: primary key lookup is &amp;lt; 10 ms&lt;br /&gt;
* Exploit locality speed by organizing tweets by timestamp (usually only 1 partition checked)&lt;br /&gt;
&lt;br /&gt;
Problems:&lt;br /&gt;
* write throughput&lt;br /&gt;
* Deadlocks in MySQL (if tweet volume gets crazy)&lt;br /&gt;
* Temporal shard creation is manual process&lt;br /&gt;
&lt;br /&gt;
More solutions:&lt;br /&gt;
* Cassandra (NoSQL, non-relational)&lt;br /&gt;
* primary key partition (tweet ID)&lt;br /&gt;
* but also, secondary key on user ID&lt;br /&gt;
&lt;br /&gt;
===Timelines===&lt;br /&gt;
&lt;br /&gt;
Definitions:&lt;br /&gt;
* Timeline is series of tweet ids&lt;br /&gt;
* Query pattern: organized by user ID&lt;br /&gt;
* Operations: append, merge, truncate&lt;br /&gt;
* high velocity bounded vector&lt;br /&gt;
* Space-based&lt;br /&gt;
&lt;br /&gt;
Primitive approach: SQL query:&lt;br /&gt;
* Use the SQL query below&lt;br /&gt;
* The subquery collects the relevant user IDs from the followers table and passes them into the search for tweets&lt;br /&gt;
* BUT - what happens if you have lots of friends, or if the number of source tweet IDs can't fit in RAM?&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
SELECT * FROM tweets WHERE user_id IN (SELECT source_id FROM followers WHERE destination_id = ?) ORDER BY created_at DESC LIMIT 20&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Offline vs online computation:&lt;br /&gt;
* Sequences can be stored in memcached (individual timelines)&lt;br /&gt;
* You pass a status to Fanout &lt;br /&gt;
* Fanout is offline, but has a low-latency SLA&lt;br /&gt;
* Truncate at random intervals, ensuring bounded length&lt;br /&gt;
* What to do on a cache miss?&lt;br /&gt;
* Merge the user timelines&lt;br /&gt;
&lt;br /&gt;
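A rough sketch of fanout into bounded per-user timelines. The follower map, the MAX_TIMELINE bound, and the use of a Python deque in place of memcached are all assumptions for illustration. Unlike the notes, which describe truncating at random intervals, this sketch truncates on every append for simplicity:&lt;br /&gt;
&lt;br /&gt;
```python
from collections import defaultdict, deque

MAX_TIMELINE = 3  # hypothetical bound; a real limit would be far larger

# Hypothetical follower map: author to the users who follow that author.
followers = {"alice": ["bob", "carol"], "bob": ["carol"]}

# Stand-in for memcached: one bounded sequence of tweet ids per user.
timelines = defaultdict(lambda: deque(maxlen=MAX_TIMELINE))


def fanout(author, tweet_id):
    """Offline step: append the new tweet id to every follower's timeline."""
    for follower in followers.get(author, []):
        # appendleft keeps newest first; maxlen silently drops the oldest id.
        timelines[follower].appendleft(tweet_id)


for tweet_id, author in enumerate(["alice", "bob", "alice", "alice"], start=1):
    fanout(author, tweet_id)
```
&lt;br /&gt;
Reading a timeline is then a single key lookup, which is the point of doing the work offline at write time.&lt;br /&gt;
&lt;br /&gt;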
Stats:&lt;br /&gt;
* 2008: 30 TPS, 120 TPS peak&lt;br /&gt;
* 2010: 700 TPS, 2,000 TPS peak&lt;br /&gt;
* 1.2M deliveries per second&lt;br /&gt;
&lt;br /&gt;
Memory hierarchy: &lt;br /&gt;
* Possibilities:&lt;br /&gt;
* Fanout to disk (Lots of IOPS required, even with buffering; cost of rebuilding data from other data stores is reasonable; fanout to memory)&lt;br /&gt;
&lt;br /&gt;
Principles:&lt;br /&gt;
* Offline vs online computation&lt;br /&gt;
* Some problems can be pre-computed (if amount of work bounded, and query pattern limited)&lt;br /&gt;
* Must keep memory hierarchy in mind&lt;br /&gt;
* Efficiency of system includes cost of generating data from another source, multiplied by the probability of needing to do so&lt;br /&gt;
&lt;br /&gt;
===Social Graphs===&lt;br /&gt;
&lt;br /&gt;
Who follows you? Who do you follow? Who have you blocked, etc.&lt;br /&gt;
&lt;br /&gt;
Operations:&lt;br /&gt;
* Enumerate by time&lt;br /&gt;
* Set operations: intersection, difference, union&lt;br /&gt;
* Inclusion, cardinality&lt;br /&gt;
&lt;br /&gt;
Spam problems:&lt;br /&gt;
* Need mass delete ability&lt;br /&gt;
&lt;br /&gt;
Temporal enumeration:&lt;br /&gt;
* Who you followed, listed by when you followed them&lt;br /&gt;
* Inclusion: do they follow you too?&lt;br /&gt;
* Cardinality: How many followers do they have? How many people are they following?&lt;br /&gt;
&lt;br /&gt;
If person A tweets at Person B:&lt;br /&gt;
* Want to deliver tweet to people who follow both person A and person B&lt;br /&gt;
* Original implementation: single table, source_id and destination_id, and each contains ID of source/destination&lt;br /&gt;
* Single table, master-worker replication&lt;br /&gt;
&lt;br /&gt;
What problems did this cause?&lt;br /&gt;
* Write throughput problem&lt;br /&gt;
* Inputs could not be kept in RAM&lt;br /&gt;
&lt;br /&gt;
Solution:&lt;br /&gt;
* Partition by User ID&lt;br /&gt;
* Edges stored in BOTH forward AND backward direction (same edge stored twice)&lt;br /&gt;
* Indexed by time&lt;br /&gt;
* Indexed by element (for doing set algebra)&lt;br /&gt;
* Partitioned by user: source_id of one and destination_id of other are identical&lt;br /&gt;
&lt;br /&gt;
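A small sketch of storing each edge in both directions and answering the "follows both A and B" question with set algebra. The forward and backward dictionaries, the follow helper, and the integer user IDs are hypothetical. Partitioning and time indexing are left out:&lt;br /&gt;
&lt;br /&gt;
```python
from collections import defaultdict

forward = defaultdict(set)   # source_id to the ids that source follows
backward = defaultdict(set)  # destination_id to the ids that follow it


def follow(source_id, destination_id):
    """Store the edge in both directions; repeating the call changes nothing."""
    forward[source_id].add(destination_id)
    backward[destination_id].add(source_id)


# The last pair repeats the first one to mimic a retried write.
for src, dst in [(3, 1), (3, 2), (4, 1), (5, 2), (3, 1)]:
    follow(src, dst)

# A tweet from user 1 at user 2 goes to users who follow both of them.
recipients = backward[1].intersection(backward[2])

# Cardinality: how many followers user 1 has.
follower_count = len(backward[1])
```
&lt;br /&gt;
Because adding to a set is idempotent, retrying a write until it succeeds leaves the graph in the same state.&lt;br /&gt;
&lt;br /&gt;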
Challenges:&lt;br /&gt;
* Consistency in the presence of failures&lt;br /&gt;
* Write operations: idempotent (retry until success)&lt;br /&gt;
* Last-write wins for edges&lt;br /&gt;
* Commutative strategies for mass writes&lt;br /&gt;
* Low latency - ms time scale&lt;br /&gt;
&lt;br /&gt;
Principles:&lt;br /&gt;
* Impossible to precompute set algebra queries&lt;br /&gt;
* Simple distributed coordination techniques&lt;br /&gt;
* Partition, replicate, and index&lt;br /&gt;
* Many scalability problems are solved efficiently this way: Partition, Replicate, Index&lt;br /&gt;
&lt;br /&gt;
===Search Indices===&lt;br /&gt;
&lt;br /&gt;
Results of searching for, e.g., &amp;quot;mountain dew cheetos&amp;quot;&lt;br /&gt;
&lt;br /&gt;
Search index:&lt;br /&gt;
* Find all tweets with these words in it&lt;br /&gt;
* Posting list&lt;br /&gt;
* Boolean&lt;br /&gt;
* Queries&lt;br /&gt;
* Complex queries&lt;br /&gt;
* Ad hoc queries&lt;br /&gt;
* Relevancy is recency (this ignores the non-real-time component to search...)&lt;br /&gt;
&lt;br /&gt;
Searching for, e.g., mountain dew cheetos is the intersection of three posting lists&lt;br /&gt;
* Original implementation: single table, vertically scaled&lt;br /&gt;
* One column: term ID&lt;br /&gt;
* Another column: document ID&lt;br /&gt;
* Master-worker replication for read throughput&lt;br /&gt;
&lt;br /&gt;
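A minimal sketch of a multi-term search as an intersection of posting lists. The sample tweets and the search helper are made up for illustration. Treating a higher tweet ID as more recent is an assumption standing in for "relevancy is recency":&lt;br /&gt;
&lt;br /&gt;
```python
from collections import defaultdict

# Hypothetical sample data: tweet id to tweet text.
tweets = {
    1: "mountain dew and cheetos for lunch",
    2: "hiking the mountain today",
    3: "cheetos mountain dew forever",
    4: "morning dew",
}

# Build a posting list per term: term to the set of tweet ids containing it.
postings = defaultdict(set)
for tweet_id, text in tweets.items():
    for term in text.split():
        postings[term].add(tweet_id)


def search(query):
    """AND query: intersect posting lists, then order highest id first."""
    lists = [postings.get(term, set()) for term in query.split()]
    if not lists:
        return []
    return sorted(set.intersection(*lists), reverse=True)


results = search("mountain dew cheetos")
```
&lt;br /&gt;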
Problems: index cannot be kept in memory&lt;br /&gt;
&lt;br /&gt;
Current implementation:&lt;br /&gt;
* Partitioned by TIME first&lt;br /&gt;
* Then partitioned by term ID...&lt;br /&gt;
* Use delayed key-write&lt;br /&gt;
&lt;br /&gt;
What problems does the solution create:&lt;br /&gt;
* Write throughput issues&lt;br /&gt;
* Queries for rare terms have to search MANY partitions&lt;br /&gt;
* Space efficiency and recall&lt;br /&gt;
* MySQL requires loooots of memory&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=Flags=&lt;br /&gt;
&lt;br /&gt;
[[Category:Data Engineering]]&lt;/div&gt;</summary>
		<author><name>Admin</name></author>
	</entry>
</feed>